Announcing hashabler: like hashable only more so
Last updated: Dec 11, 2022
I’ve just released the first version of a haskell library for principled, cross-platform & extensible hashing of types, which includes an implementation of the FNV-1a algorithm. It is available on hackage, and can be installed with:
cabal install hashabler
hashabler
is a rewrite of the hashable
library by Milan Straka and Johan Tibell, having the following goals:
-
Extensibility; it should be easy to implement a new hashing algorithm on any Hashable type, for instance if one needed more hash bits
-
Honest hashing of values, and principled hashing of algebraic data types (see e.g. #30)
-
Cross-platform consistent hash values, with a versioning guarantee. Where possible we ensure morally identical data hashes to indentical values regardless of processor word size and endianness.
-
Make implementing identical hash routines in other languages as painless as possible. We provide an implementation of a simple hashing algorithm (FNV-1a) and make an effort define Hashable instances in a way that is well-documented and sensible, so that e.g. one can (hopefully) easily implement string hashing routine in JavaScript that will match the way we hash strings here.
Motivation
I started writing a fast concurrent bloom filter variant, but found none of the
existing libraries fit my needs. In particular hashable
was deficient in a
number of ways:
-
The number of hash bits my data structure requires can vary based on user parameters, and possibly be more than the 64-bits supported by hashable
-
Users might like to serialize their bloomfilter and store it, pass it to other machines, or work with it in a different language, so we need
- hash values that are consistent across platforms
- some guarantee of consistency across library versions
I was also very concerned about the general approach taken for algebraic types,
which results in collision, the use of “hashing” numeric values to themselves,
dubious combining functions, etc. It wasn’t at all clear to me how to ensure my
data structure wouldn’t be broken if I used hashable
. See below for a very
brief investigation into hash goodness of the two libraries.
There isn’t interest in supporting my use case or addressing these issues in
hashable
(see e.g. #73, #30, and #74)
and apparently hashable is working in practice for people, but maybe this new
package will be useful for some other folks.
Hash goodness of hashable and hashabler, briefly
Hashing-based data structures assume some “goodness” of the underlying hash function, and may depend on the goodness of the hash function in ways that aren’t always clear or well-understood. “Goodness” also seems to be somewhat subjective, but can be expressed statistically in terms of bit-independence tests, and avalanche properties, etc.; various things that e.g. smhasher looks at.
I thought for fun I’d visualize some distributions, as that’s easier for my puny brain to understand than statistics. We visualize 32-bit hashes by quantizing by 64x64 and mapping that to a pixel following a hilbert curve to maintain locality of hash values. Then when multiple hash values fall within the same 64x64 pixel, we darken the pixel, and finally mark it red if we can’t go any further to indicate clipping.
It’s easy to cherry-pick inputs that will result in some bad behavior by
hashable, but below I’ve tried to show some fairly realistic examples of
strange or less-good distributions in hashable
. I haven’t analysed these
at all. Images are cropped 1/4 size, but are representative of the whole 32-bit
range.
First, here’s a hash of all [Ordering]
of size 10 (~59K distinct values):
Hashabler:
Hashable:
Next here’s the hash of one million (Word8,Word8,Word8)
(having a domain ~ 16 mil):
Hashabler:
Hashable:
I saw no difference when hashing english words, which is good news as that’s probably a very common use-case.
Please help
If you could test the library on a big endian machine and let me know how it goes, that would be great. See here.
You can also check out the TODOs scattered throughout the code and send pull requests. I mayb not be able to get to them until June, but will be very grateful!
P.S. hire me
I’m always open to interesting work or just hearing about how companies are using haskell. Feel free to send me an email at brandon.m.simmons@gmail.com