Benchmarking Very Fast Things With Criterion

There’s a pervasive myth that Bryan O’Sullivan’s excellent Haskell benchmarking library criterion is only useful for benchmarks that take some significant chunk of time (I’ve even heard people claim it’s only trustworthy down to the millisecond scale). In fact criterion is useful for almost anything you’d want to benchmark.

At a high level, criterion makes your benchmark the inner loop of a function, runs that loop a bunch of times, measures the total, and then divides by the number of iterations it performed. The approach is useful both for comparing alternative implementations and as probably the only meaningful way of answering “how long does this code take to run”, short of looking at the assembly, counting the instructions, and consulting your processor’s manual.
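
To make that concrete, here’s a rough sketch of the batching idea. This is not criterion’s implementation and the helper is made up for illustration; criterion actually runs batches of increasing size and fits a linear regression to estimate the per-iteration time (that’s where the R² in its output comes from), which is what lets it resolve nanosecond-scale work.

import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import Data.Time.Clock (diffUTCTime, getCurrentTime)

-- Toy version of the idea, not criterion: evaluate `f x` to normal form
-- n times, time the whole batch, and report the average seconds per call.
-- (A real harness must also defeat sharing of `f x` across iterations,
-- which criterion takes care of.)
naiveBench :: NFData b => Int -> (a -> b) -> a -> IO Double
naiveBench n f x = do
    start <- getCurrentTime
    mapM_ (\_ -> evaluate (force (f x))) [1 .. n]
    end   <- getCurrentTime
    return (realToFrac (diffUTCTime end start) / fromIntegral n)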

If you’re skeptical, here’s a benchmark we’d expect to be very fast:

import Criterion.Main

main :: IO ()
main = do
    defaultMain [ 
        bench "sum2" $ nf sum [1::Int,2]
      , bench "sum4" $ nf sum [1::Int,2,3,4]
      , bench "sum5" $ nf sum [1::Int,2,3,4,5]
      ]

And indeed it’s on the order of nanoseconds:

benchmarking sum2
time                 27.20 ns   (27.10 ns .. 27.35 ns)
                     0.994 R²   (0.984 R² .. 1.000 R²)
mean                 28.72 ns   (27.29 ns .. 32.44 ns)
std dev              6.730 ns   (853.1 ps .. 11.71 ns)
variance introduced by outliers: 98% (severely inflated)

benchmarking sum4
time                 58.45 ns   (58.31 ns .. 58.59 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 58.47 ns   (58.26 ns .. 58.66 ns)
std dev              654.6 ps   (547.1 ps .. 787.8 ps)
variance introduced by outliers: 11% (moderately inflated)

benchmarking sum5
time                 67.08 ns   (66.84 ns .. 67.33 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 67.04 ns   (66.85 ns .. 67.26 ns)
std dev              705.5 ps   (596.3 ps .. 903.5 ps)

The results are consistent with each other: sum appears to be linear, taking roughly 13-15ns per list element (27.20/2 ≈ 13.6ns, 58.45/4 ≈ 14.6ns, 67.08/5 ≈ 13.4ns) across our different input sizes.

Trying to measure even faster things

This is what I was doing today that motivated this post. I was experimenting with measuring the inner loop of a hash function:

import Data.Bits (xor)
import Data.Word (Word8, Word32)

fnvInnerLoopTest :: Word8 -> Word32
{-# INLINE fnvInnerLoopTest #-}
fnvInnerLoopTest b = (2166136261 `xor` fromIntegral b) * 16777619
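
For context, this is the xor/multiply step of 32-bit FNV-1a: 2166136261 is the offset basis and 16777619 is the FNV prime. A full hash folds that step over every input byte; a sketch of that (just for illustration, not what I benchmarked) might look like:

import Data.List (foldl')

fnv1a :: [Word8] -> Word32
fnv1a = foldl' (\h b -> (h `xor` fromIntegral b) * 16777619) 2166136261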

These were the results criterion gave me:

benchmarking test
time                 9.791 ns   (9.754 ns .. 9.827 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.798 ns   (9.759 ns .. 9.862 ns)
std dev              167.3 ps   (117.0 ps .. 275.3 ps)
variance introduced by outliers: 24% (moderately inflated)

These are the sorts of timescales where we may mostly be measuring the overhead of function calls, boxing/unboxing, etc., and that should make you skeptical of criterion’s result. So I unrolled 4- and 8-iteration versions of the loop and measured those as well.
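
The unrolled definitions aren’t reproduced here, but as a rough sketch (a reconstruction, so the real code may have differed) the 4-way version could look like the following, taking a tuple so that nf forces a single Word32 result:

-- Hypothetical reconstruction of the unrolled variant: four chained
-- rounds of the same xor/multiply step (reusing the imports above).
fnvInnerLoopTest4 :: (Word8, Word8, Word8, Word8) -> Word32
{-# INLINE fnvInnerLoopTest4 #-}
fnvInnerLoopTest4 (b1, b2, b3, b4) = h4
  where
    h1 = (2166136261 `xor` fromIntegral b1) * 16777619
    h2 = (h1 `xor` fromIntegral b2) * 16777619
    h3 = (h2 `xor` fromIntegral b3) * 16777619
    h4 = (h3 `xor` fromIntegral b4) * 16777619

With those in place, the benchmark setup and results: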

main :: IO ()
main = do
    defaultMain [ 
        bench "test"  $ nf fnvInnerLoopTest   7
      , bench "test4" $ nf fnvInnerLoopTest4 (7,8,9,10)
      , bench "test8" $ nf fnvInnerLoopTest8 (7,8,9,10,11,12,13,14)
      ]

benchmarking test
time                 9.380 ns   (9.346 ns .. 9.418 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.448 ns   (9.399 ns .. 9.567 ns)
std dev              240.4 ps   (137.9 ps .. 418.6 ps)
variance introduced by outliers: 42% (moderately inflated)

benchmarking test4
time                 12.66 ns   (12.62 ns .. 12.72 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 12.68 ns   (12.64 ns .. 12.73 ns)
std dev              158.8 ps   (126.9 ps .. 215.7 ps)
variance introduced by outliers: 15% (moderately inflated)

benchmarking test8
time                 17.88 ns   (17.82 ns .. 17.94 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 17.89 ns   (17.81 ns .. 17.97 ns)
std dev              262.7 ps   (210.3 ps .. 349.7 ps)
variance introduced by outliers: 19% (moderately inflated)

So this seems to give a clearer picture of how good our bit twiddling is in that inner loop. I was curious whether I could measure the overhead directly in criterion, though. Somewhat surprisingly to me, it seems I could!

I added the following benchmark to my list:

  , bench "baseline32" $ nf (\x-> x) (777::Word32)

The idea is to isolate the overhead of applying the most trivial function and calling nf on an example value of our output type (Word32 in this case).
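
In other words, the full benchmark list is just the main from before with the baseline appended:

main :: IO ()
main = do
    defaultMain [
        bench "test"       $ nf fnvInnerLoopTest   7
      , bench "test4"      $ nf fnvInnerLoopTest4 (7,8,9,10)
      , bench "test8"      $ nf fnvInnerLoopTest8 (7,8,9,10,11,12,13,14)
      , bench "baseline32" $ nf (\x-> x) (777::Word32)
      ]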

benchmarking baseline32
time                 9.485 ns   (9.434 ns .. 9.543 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.509 ns   (9.469 ns .. 9.559 ns)
std dev              155.8 ps   (122.6 ps .. 227.8 ps)
variance introduced by outliers: 23% (moderately inflated)

If we treat this value as the baseline for the measurements reported above, the new results are both roughly linear, as we would expect, and the absolute numbers land about where we’d hope given the assembly we’d like to see generated (I still need to verify that this is actually the case); e.g. our initial test comes out in the ~1ns range, about what we’d expect from an inner loop of a couple of instructions.
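
To put rough numbers to that: taking the ~9.5ns baseline as a fixed cost shared by all of these measurements, test4 works out to (12.66 − 9.5) / 4 ≈ 0.8ns per round and test8 to (17.88 − 9.5) / 8 ≈ 1.0ns per round, consistent with a couple of cheap ALU instructions per byte.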

I thought this was compelling enough to open an issue to see whether the technique might be incorporated into criterion directly. At the very least it’s something useful that I’ll keep playing with.

Anyway, benchmark your code.
