Translating Some Stateful Bit-twiddling to Haskell

I just started implementing SipHash in hashabler and wanted to share a nice way I found to translate stateful bit-twiddling code in C (which makes heavy use of bitwise assignment operators) to haskell.

I was working from the reference implementation. As you can see, statefulness and mutability are an implicit part of how the algorithm is defined: it modifies the state of the v variables as it goes.

#define SIPROUND                                        \
  do {                                                  \
    v0 += v1; v1=ROTL(v1,13); v1 ^= v0; v0=ROTL(v0,32); \
    v2 += v3; v3=ROTL(v3,16); v3 ^= v2;                 \
    v0 += v3; v3=ROTL(v3,21); v3 ^= v0;                 \
    v2 += v1; v1=ROTL(v1,17); v1 ^= v2; v2=ROTL(v2,32); \
  } while(0)

int  siphash( uint8_t *out, const uint8_t *in, uint64_t inlen, const uint8_t *k )
{

  /* ... */

  for ( ; in != end; in += 8 )
  {
    m = U8TO64_LE( in );
    v3 ^= m;

    TRACE;
    for( i=0; i<cROUNDS; ++i ) SIPROUND;

    v0 ^= m;
  }

I wanted to translate this sort of code as directly as possible (I’d already decided if it didn’t work on the first try I would burn my laptop and live in the woods, rather than debug this crap).

First we’ll use name shadowing to “fake” our mutable variables, making it easy to ensure we’re always dealing with the freshest values.

{-# OPTIONS_GHC -fno-warn-name-shadowing #-}

We’ll also use RecordWildCards to make it easy to capture the “current state” of these values, through folds and helper functions.

{-# LANGUAGE RecordWildCards #-}

And finally we use the trivial Identity monad (this trick I learned from Oleg) which gets us the proper scoping we want for our v values:

import Data.Functor.Identity

Here’s a bit of the haskell:

siphash :: Hashable a => SipKey -> a -> Word64
siphash (k0,k1) = \a-> runIdentity $ do
    let v0 = 0x736f6d6570736575
        v1 = 0x646f72616e646f6d
        v2 = 0x6c7967656e657261
        v3 = 0x7465646279746573

    ...

    v3 <- return $ v3 `xor` k1;
    v2 <- return $ v2 `xor` k0;
    v1 <- return $ v1 `xor` k1;
    v0 <- return $ v0 `xor` k0;

    ...

    -- Initialize rest of SipState:
    let mPart          = 0
        bytesRemaining = 8
        inlen          = 0
    SipState{ .. } <- return $ hash (SipState { .. }) a

    let !b = inlen `unsafeShiftL` 56

    v3 <- return $ v3 `xor` b
    -- for( i=0; i<cROUNDS; ++i ) SIPROUND;
    (v0,v1,v2,v3) <- return $ sipRound v0 v1 v2 v3
    (v0,v1,v2,v3) <- return $ sipRound v0 v1 v2 v3
    v0 <- return $ v0 `xor` b

    ...

    (v0,v1,v2,v3) <- return $ sipRound v0 v1 v2 v3

    return $! v0 `xor` v1 `xor` v2 `xor` v3
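
For reference, sipRound itself can be written with the same shadowing trick (and the same pragma from above), translating the SIPROUND macro line by line. Here's a sketch of what that might look like, not necessarily the exact code in hashabler:

import Data.Bits (rotateL, xor)
import Data.Functor.Identity
import Data.Word (Word64)

sipRound :: Word64 -> Word64 -> Word64 -> Word64 -> (Word64, Word64, Word64, Word64)
sipRound v0 v1 v2 v3 = runIdentity $ do
    -- v0 += v1; v1=ROTL(v1,13); v1 ^= v0; v0=ROTL(v0,32);
    v0 <- return $ v0 + v1
    v1 <- return $ v1 `rotateL` 13
    v1 <- return $ v1 `xor` v0
    v0 <- return $ v0 `rotateL` 32
    -- v2 += v3; v3=ROTL(v3,16); v3 ^= v2;
    v2 <- return $ v2 + v3
    v3 <- return $ v3 `rotateL` 16
    v3 <- return $ v3 `xor` v2
    -- v0 += v3; v3=ROTL(v3,21); v3 ^= v0;
    v0 <- return $ v0 + v3
    v3 <- return $ v3 `rotateL` 21
    v3 <- return $ v3 `xor` v0
    -- v2 += v1; v1=ROTL(v1,17); v1 ^= v2; v2=ROTL(v2,32);
    v2 <- return $ v2 + v1
    v1 <- return $ v1 `rotateL` 17
    v1 <- return $ v1 `xor` v2
    v2 <- return $ v2 `rotateL` 32
    return (v0, v1, v2, v3)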

If you were really doing a lot of this sort of thing, you could even make a simple quasiquoter that could translate bitwise assignment into code like the above.

Announcing Hashabler: Like Hashable Only More So

I’ve just released the first version of a haskell library for principled, cross-platform & extensible hashing of types, which includes an implementation of the FNV-1a algorithm. It is available on hackage, and can be installed with:

cabal install hashabler

hashabler is a rewrite of the hashable library by Milan Straka and Johan Tibell, with the following goals:

  • Extensibility; it should be easy to implement a new hashing algorithm on any Hashable type, for instance if one needed more hash bits

  • Honest hashing of values, and principled hashing of algebraic data types (see e.g. #30)

  • Cross-platform consistent hash values, with a versioning guarantee. Where possible we ensure that morally identical data hashes to identical values regardless of processor word size and endianness.

  • Make implementing identical hash routines in other languages as painless as possible. We provide an implementation of a simple hashing algorithm (FNV-1a, sketched just below) and make an effort to define Hashable instances in a way that is well-documented and sensible, so that e.g. one can (hopefully) easily implement a string hashing routine in JavaScript that will match the way we hash strings here.
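
For a sense of how little there is to port, here's a sketch of the bare 32-bit FNV-1a recipe over a list of bytes; the library's actual instances layer more on top of this (e.g. how each type is broken down into bytes), so see the docs for the details:

import Data.Bits (xor)
import Data.List (foldl')
import Data.Word (Word8, Word32)

-- 32-bit FNV-1a: start from the offset basis, then for each byte
-- xor it in and multiply by the FNV prime.
fnv1a32 :: [Word8] -> Word32
fnv1a32 = foldl' step 2166136261
  where step h b = (h `xor` fromIntegral b) * 16777619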

Motivation

I started writing a fast concurrent bloom filter variant, but found none of the existing libraries fit my needs. In particular hashable was deficient in a number of ways:

  • The number of hash bits my data structure requires can vary based on user parameters, and can possibly be more than the 64 bits supported by hashable

  • Users might like to serialize their bloomfilter and store it, pass it to other machines, or work with it in a different language, so we need

    • hash values that are consistent across platforms
    • some guarantee of consistency across library versions

I was also very concerned about the general approach taken for algebraic types (which results in collisions), the use of “hashing” numeric values to themselves, dubious combining functions, etc. It wasn’t at all clear to me how to ensure my data structure wouldn’t be broken if I used hashable. See below for a very brief investigation into the hash goodness of the two libraries.

There isn’t interest in supporting my use case or addressing these issues in hashable (see e.g. #73, #30, and #74), and apparently hashable is working in practice for people, but maybe this new package will be useful for some other folks.

Hash goodness of hashable and hashabler, briefly

Hashing-based data structures assume some “goodness” of the underlying hash function, and may depend on that goodness in ways that aren’t always clear or well-understood. “Goodness” also seems to be somewhat subjective, but can be expressed statistically in terms of bit-independence tests, avalanche properties, etc.: the sorts of things that e.g. smhasher looks at.

I thought for fun I’d visualize some distributions, as that’s easier for my puny brain to understand than statistics. We visualize 32-bit hashes by quantizing by 64x64 and mapping that to a pixel along a Hilbert curve, to maintain locality of hash values. When multiple hash values fall within the same 64x64 pixel we darken the pixel, and finally mark it red if we can’t darken it any further, to indicate clipping.

It’s easy to cherry-pick inputs that will result in some bad behavior by hashable, but below I’ve tried to show some fairly realistic examples of strange or less-good distributions in hashable. I haven’t analysed these at all. Images are cropped ¼ size, but are representative of the whole 32-bit range.

First, here’s a hash of all [Ordering] of size 10 (~59K distinct values):

Hashabler:

Hashable:

Next, here’s the hash of one million (Word8,Word8,Word8) values (out of a domain of ~16 million):

Hashabler:

Hashable:

I saw no difference when hashing English words, which is good news as that’s probably a very common use-case.

Please help

If you could test the library on a big endian machine and let me know how it goes, that would be great. See here.

You can also check out the TODOs scattered throughout the code and send pull requests. I may not be able to get to them until June, but will be very grateful!

P.S. hire me

I’m always open to interesting work or just hearing about how companies are using haskell. Feel free to send me an email at brandon.m.simmons@gmail.com

Benchmarking Very Fast Things With Criterion

There’s a pervasive myth that Bryan O’Sullivan’s excellent haskell benchmarking library criterion is only useful for benchmarks that take some significant chunk of time (I’ve even heard some people claim it’s only good down to the millisecond scale). In fact criterion is useful for almost anything you’d want to benchmark.

At a high level, criterion makes your benchmark the inner loop of a function, runs that loop a bunch of times, measures the result, and then divides by the number of iterations it performed. The approach is useful both for comparing alternative implementations and as probably the only meaningful way of answering “how long does this code take to run”, short of looking at the assembly, counting the instructions, and consulting your processor’s manual.

If you’re skeptical, here’s a benchmark we’d expect to be very fast:

import Criterion.Main

main :: IO ()
main = do
    defaultMain [ 
        bench "sum2" $ nf sum [1::Int,2]
      , bench "sum4" $ nf sum [1::Int,2,3,4]
      , bench "sum5" $ nf sum [1::Int,2,3,4,5]
      ]

And indeed it’s on the order of nanoseconds:

benchmarking sum2
time                 27.20 ns   (27.10 ns .. 27.35 ns)
                     0.994 R²   (0.984 R² .. 1.000 R²)
mean                 28.72 ns   (27.29 ns .. 32.44 ns)
std dev              6.730 ns   (853.1 ps .. 11.71 ns)
variance introduced by outliers: 98% (severely inflated)

benchmarking sum4
time                 58.45 ns   (58.31 ns .. 58.59 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 58.47 ns   (58.26 ns .. 58.66 ns)
std dev              654.6 ps   (547.1 ps .. 787.8 ps)
variance introduced by outliers: 11% (moderately inflated)

benchmarking sum5
time                 67.08 ns   (66.84 ns .. 67.33 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 67.04 ns   (66.85 ns .. 67.26 ns)
std dev              705.5 ps   (596.3 ps .. 903.5 ps)

The results are consistent with each other; sum seems to be linear, taking 13-14ns per list element, across our different input sizes.

Trying to measure even faster things

This is what I was doing today which motivated this post. I was experimenting with measuring the inner loop of a hash function:

fnvInnerLoopTest :: Word8 -> Word32
{-# INLINE fnvInnerLoopTest #-}
fnvInnerLoopTest b = (2166136261 `xor` fromIntegral b) * 16777619

These were the results criterion gave me:

benchmarking test
time                 9.791 ns   (9.754 ns .. 9.827 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.798 ns   (9.759 ns .. 9.862 ns)
std dev              167.3 ps   (117.0 ps .. 275.3 ps)
variance introduced by outliers: 24% (moderately inflated)

These are the sorts of timescales at which we may well be measuring the overhead of function calls, boxing/unboxing, etc., which should make you skeptical of criterion’s result. So I unrolled 4- and 8-iteration versions and measured those:

main :: IO ()
main = do
    defaultMain [ 
        bench "test"  $ nf fnvInnerLoopTest   7
      , bench "test4" $ nf fnvInnerLoopTest4 (7,8,9,10)
      , bench "test8" $ nf fnvInnerLoopTest8 (7,8,9,10,11,12,13,14)
      ]
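
The unrolled versions simply chain the same xor/multiply step. Here's a sketch of what the 4-way version looks like (the 8-way version is analogous; the exact code I benchmarked may have differed slightly):

import Data.Bits (xor)
import Data.Word (Word8, Word32)

fnvInnerLoopTest4 :: (Word8,Word8,Word8,Word8) -> Word32
{-# INLINE fnvInnerLoopTest4 #-}
fnvInnerLoopTest4 (b1,b2,b3,b4) =
    -- four chained FNV-1a steps, one per byte
    let h1 = (2166136261 `xor` fromIntegral b1) * 16777619
        h2 = (h1 `xor` fromIntegral b2) * 16777619
        h3 = (h2 `xor` fromIntegral b3) * 16777619
    in (h3 `xor` fromIntegral b4) * 16777619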

benchmarking test
time                 9.380 ns   (9.346 ns .. 9.418 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.448 ns   (9.399 ns .. 9.567 ns)
std dev              240.4 ps   (137.9 ps .. 418.6 ps)
variance introduced by outliers: 42% (moderately inflated)

benchmarking test4
time                 12.66 ns   (12.62 ns .. 12.72 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 12.68 ns   (12.64 ns .. 12.73 ns)
std dev              158.8 ps   (126.9 ps .. 215.7 ps)
variance introduced by outliers: 15% (moderately inflated)

benchmarking test8
time                 17.88 ns   (17.82 ns .. 17.94 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 17.89 ns   (17.81 ns .. 17.97 ns)
std dev              262.7 ps   (210.3 ps .. 349.7 ps)
variance introduced by outliers: 19% (moderately inflated)

So this seems to give a clearer picture of how good our bit twiddling in that inner loop really is. I was curious whether I could measure the overhead directly in criterion, though. Somewhat surprisingly to me, it seems I could!

I added the following benchmark to my list:

  , bench "baseline32" $ nf (\x-> x) (777::Word32)

The idea is to isolate the overhead of applying the most trivial function and calling nf on an example value of our output type (Word32 in this case).

benchmarking baseline32
time                 9.485 ns   (9.434 ns .. 9.543 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.509 ns   (9.469 ns .. 9.559 ns)
std dev              155.8 ps   (122.6 ps .. 227.8 ps)
variance introduced by outliers: 23% (moderately inflated)

If we consider this value the baseline for the measurements initially reported, the new results are both linear-ish, as we would expect, and the resulting absolute measurements fall about where we’d expect from the assembly we’d hope for (I still need to verify that this is actually the case); e.g. our initial test comes out in the ~1 ns range, about what we’d expect from an inner loop of a couple of instructions.

I thought this was compelling enough to open an issue to see whether this technique might be incorporated into criterion directly. It’s at least a useful technique that I’ll keep playing with.

Anyway, benchmark your code.

Announcing Unagi-chan

Today I released version 0.2 of unagi-chan, a haskell library implementing fast and scalable FIFO queues with a nice and familiar API. It is available on hackage and you can install it with:

$ cabal install unagi-chan

This version provides a bounded queue variant (and closes issue #1!) that has performance on par with the other variants in the library. This is something I’m somewhat proud of, considering that the standard TBQueue is not only significantly slower than e.g. TQueue, but also was seen to livelock at a fairly low level of concurrency (and so is not included in the benchmark suite).
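
The API is meant to feel like the familiar Chan-style interface of the other Unagi variants. Here's a rough usage sketch; I'm assuming the bounded newChan takes the bound as an argument and hands back an (InChan, OutChan) pair, so check the haddocks for the exact signatures:

import Control.Concurrent (forkIO)
import Control.Monad (forM_, replicateM_)
import qualified Control.Concurrent.Chan.Unagi.Bounded as UB

main :: IO ()
main = do
    -- a bounded queue; the bound argument is assumed here
    (inC, outC) <- UB.newChan 1024
    _ <- forkIO $ forM_ [1..100000 :: Int] $ \n ->
           UB.writeChan inC n      -- blocks (backpressure) when the queue is full
    replicateM_ 100000 (UB.readChan outC >>= print)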

Here are some example benchmarks. Please do try the new bounded version and see how it works for you.

Benchmarks

What follows are a few random thoughts more or less generally applicable to the design of bounded FIFO queues, especially in a high-level garbage-collected language. These might be obvious, uninteresting, or unintelligible.

What is Bounding For?

I hadn’t really thought much about this before: a bounded queue limits memory consumption because the queue is restricted from growing beyond some size.

But this isn’t quite right. If, for instance, we implement a bounded queue by pre-allocating an array of size bounds, then a write operation need not consume any additional memory; indeed, the value to be written has already been allocated on the heap before the write even begins, and it will persist whether the write blocks or returns immediately.

Instead, constraining memory usage is a knock-on effect of what we really care about: backpressure. When the ratio of “producers” to their writes is high (the usual scenario), blocking a write may limit memory usage by delaying the heap allocations associated with the elements of future writes.

So bounded queues with blocking writes let us:

  • when threads are “oversubscribed”, transparently indicate to the runtime which work has priority
  • limit future resource usage (CPU time and memory) by producer threads

We might also like our bounded queue to support a non-blocking write which returns immediately with success or failure. This might be thought of (depending on the capabilities of your language’s runtime) as more general than a blocking write, but it also supports a distinctly different notion of bounding, namely bounding message latency: a producer may choose to drop messages when a consumer falls behind, in exchange for lower latency on future writes.
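
Concretely, that latency-bounding mode of use looks something like the sketch below, assuming a non-blocking write of type tryWriteChan :: InChan a -> a -> IO Bool (the name and type here are my assumptions; check the haddocks):

import Control.Monad (unless)
import qualified Control.Concurrent.Chan.Unagi.Bounded as UB

-- Drop messages rather than block when the consumer falls behind,
-- trading completeness for bounded latency on future writes.
publish :: UB.InChan Int -> Int -> IO ()
publish inC msg = do
    ok <- UB.tryWriteChan inC msg   -- assumed: returns immediately with success or failure
    unless ok $
        putStrLn "queue full; dropping message to keep latency bounded"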

Unagi.Bounded Implementation Ideas

Trying to unpack the ideas above helped in a few ways when designing Unagi.Bounded. Here are a few observations I made.

We need not block before “writing”

When implementing blocking writes, my intuition was to (when the queue is “full”) have writers block before “making the message available” (whatever that means for your implementation). For Unagi that means blocking on an MVar, and then writing a message to an assigned array index.

But this ordering presents a couple of problems: first, we need to be able to handle async exceptions raised while a writer is blocked; if its message isn’t yet “in place”, then we need to somehow coordinate with the reader that would have received that message, telling it to retry.

By unpacking the purpose of bounding it became clear that we’re free to block at any point during the write (because the write per se does not have the memory-usage implications we originally, naively assumed it had), so in Unagi.Bounded writes proceed exactly as in our other variants until the end of writeChan, at which point we decide when to block.

This is certainly also better for performance: if a wave of readers comes along, they need not wait (themselves blocking) for previously blocked writers to make their messages available.

One hairy detail of this approach: an async exception raised in a blocked writer does not cause that write to be aborted; i.e. once entered, writeChan always succeeds. Reasoning in terms of linearizability, this only affects situations in which a writer thread is known to be blocked and we would like to abort that write.

Fine-grained writer unblocking is probably unnecessary and harmful

In Unagi.Bounded I relax the bounds constraint to “somewhere between bounds and bounds*2”. This allows me to eliminate a lot of coordination between readers and writers by using a single reader to unblock up to bounds writers. This constraint (along with the requirement that bounds be a power of two, for fast modulo) seemed like something everyone could live with.
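
The power-of-two requirement is what makes the “fast modulo” possible: indexing into the backing array can use a bitwise AND instead of a comparatively slow mod. A sketch of the idea only, not unagi-chan's actual internals:

import Data.Bits ((.&.))

-- When bounds is a power of two, i .&. (bounds - 1) == i `mod` bounds.
arrayIndex :: Int -> Int -> Int
arrayIndex bounds i = i .&. (bounds - 1)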

I also guess that this “cohort unblocking” behavior could result in some nicer stride behavior, with more consecutive non-blocking reads and writes, rather than having a situation where the queue is almost always either completely full or empty.

One-shot MVars and Semaphores

This has nothing to do with queues in particular; it’s just a place to put this observation: garbage-collected languages permit some interesting non-traditional concurrency patterns. For instance, I use MVars and IORefs that only ever go from empty to full, or that follow a single linear progression of three or four states over their lifetime. Often it’s easier to design algorithms this way than with long-lived mutable variables (for instance, I struggled to come up with a blocking bounded queue design that used a circular buffer and could be made async-exception-safe).
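
To make the “one-shot” idea concrete, here's a minimal sketch of an MVar that only ever makes the single empty-to-full transition, acting as a write-once signal. This is just the pattern, not code from unagi-chan:

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar

oneShotDemo :: IO ()
oneShotDemo = do
    done <- newEmptyMVar
    _ <- forkIO $ do
        -- ... do some work ...
        putMVar done ()   -- the single empty -> full transition; never emptied again
    readMVar done         -- any number of readers can block here until it's filled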

Similarly, the CAS operation (which I get from atomic-primops) turns out to be surprisingly versatile, far beyond the traditional read/CAS/retry loop, and to have very useful semantics when used on short-lived variables. For instance, throughout unagi-chan I do both of the following:

  • CAS without inspecting the return value, content that we or any other competing thread succeeded.

  • CAS using a known initial state, avoiding an initial read.