Benchmarking very fast things with criterion
There’s a pervasive myth that Bryan O’Sullivan’s excellent Haskell benchmarking library criterion is only useful for benchmarks that take some significant chunk of time (I’ve even heard people claim it’s only trustworthy at the millisecond scale and up). In fact criterion is useful for almost anything you’d want to benchmark.
At a high level, criterion makes your benchmark the body of an inner loop, runs that loop a bunch of times, measures the total elapsed time, and then divides by the number of iterations performed. The approach is useful both for comparing alternative implementations and as probably the only meaningful way of answering “how long does this code take to run?”, short of looking at the assembly, counting the instructions, and consulting your processor’s manual.
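To make that concrete, here is roughly the naive version of what criterion automates. This sketch is not criterion’s implementation (it skips the regression over batch sizes, outlier analysis, clock-overhead calibration, and the care needed to defeat sharing); it just shows the loop-and-divide idea:
import Control.DeepSeq (force)
import Control.Exception (evaluate)
import Control.Monad (replicateM_)
import Data.Time.Clock (diffUTCTime, getCurrentTime)

-- Time a whole batch of iterations, then divide by the batch size.
-- Note: with -O2, GHC may evaluate the constant sum once and share it across
-- iterations; a real harness (like criterion's nf machinery) works to prevent
-- exactly that kind of thing.
naiveBench :: Int -> IO ()
naiveBench iters = do
    start <- getCurrentTime
    replicateM_ iters $ evaluate (force (sum [1::Int,2,3,4,5]))
    end <- getCurrentTime
    let perIter = realToFrac (diffUTCTime end start) / fromIntegral iters :: Double
    putStrLn ("seconds per iteration: " ++ show perIter)

main :: IO ()
main = naiveBench 1000000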
If you’re skeptical, here’s a benchmark we’d expect to be very fast:
import Criterion.Main

main :: IO ()
main = do
    defaultMain [
        bench "sum2" $ nf sum [1::Int,2]
      , bench "sum4" $ nf sum [1::Int,2,3,4]
      , bench "sum5" $ nf sum [1::Int,2,3,4,5]
      ]
And indeed it’s on the order of nanoseconds:
benchmarking sum2
time 27.20 ns (27.10 ns .. 27.35 ns)
0.994 R² (0.984 R² .. 1.000 R²)
mean 28.72 ns (27.29 ns .. 32.44 ns)
std dev 6.730 ns (853.1 ps .. 11.71 ns)
variance introduced by outliers: 98% (severely inflated)
benchmarking sum4
time 58.45 ns (58.31 ns .. 58.59 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 58.47 ns (58.26 ns .. 58.66 ns)
std dev 654.6 ps (547.1 ps .. 787.8 ps)
variance introduced by outliers: 11% (moderately inflated)
benchmarking sum5
time 67.08 ns (66.84 ns .. 67.33 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 67.04 ns (66.85 ns .. 67.26 ns)
std dev 705.5 ps (596.3 ps .. 903.5 ps)
The results are consistent with each other; sum looks linear in the length of the list, taking 13-14ns per element across our different input sizes (27.20/2 ≈ 13.6ns, 58.45/4 ≈ 14.6ns, 67.08/5 ≈ 13.4ns).
Trying to measure even faster things
This is what I was doing today that motivated this post. I was experimenting with measuring the inner loop of a hash function (an FNV-style xor-then-multiply round):
fnvInnerLoopTest :: Word8 -> Word32
{-# INLINE fnvInnerLoopTest #-}
fnvInnerLoopTest b = (2166136261 `xor` fromIntegral b) * 16777619
These were the results criterion gave me:
benchmarking test
time 9.791 ns (9.754 ns .. 9.827 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 9.798 ns (9.759 ns .. 9.862 ns)
std dev 167.3 ps (117.0 ps .. 275.3 ps)
variance introduced by outliers: 24% (moderately inflated)
These are the sorts of timescales where we may well be measuring the overhead of function calls, boxing/unboxing, etc., which should make you skeptical of criterion’s result. So I unrolled 4- and 8-iteration versions of the loop and measured the results:
main :: IO ()
main = do
    defaultMain [
        bench "test"  $ nf fnvInnerLoopTest 7
      , bench "test4" $ nf fnvInnerLoopTest4 (7,8,9,10)
      , bench "test8" $ nf fnvInnerLoopTest8 (7,8,9,10,11,12,13,14)
      ]
benchmarking test
time 9.380 ns (9.346 ns .. 9.418 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 9.448 ns (9.399 ns .. 9.567 ns)
std dev 240.4 ps (137.9 ps .. 418.6 ps)
variance introduced by outliers: 42% (moderately inflated)
benchmarking test4
time 12.66 ns (12.62 ns .. 12.72 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 12.68 ns (12.64 ns .. 12.73 ns)
std dev 158.8 ps (126.9 ps .. 215.7 ps)
variance introduced by outliers: 15% (moderately inflated)
benchmarking test8
time 17.88 ns (17.82 ns .. 17.94 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 17.89 ns (17.81 ns .. 17.97 ns)
std dev 262.7 ps (210.3 ps .. 349.7 ps)
variance introduced by outliers: 19% (moderately inflated)
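The unrolled versions aren’t shown above; they just chain the xor-then-multiply step once per input byte, something like this sketch (fnvInnerLoopTest8 follows the same pattern over an 8-tuple):
import Data.Bits (xor)
import Data.Word (Word8, Word32)

fnvInnerLoopTest4 :: (Word8,Word8,Word8,Word8) -> Word32
{-# INLINE fnvInnerLoopTest4 #-}
fnvInnerLoopTest4 (b0,b1,b2,b3) =
    step (step (step (step 2166136261 b0) b1) b2) b3
  where
    -- one FNV-style round: xor in the byte, then multiply by the prime
    step h b = (h `xor` fromIntegral b) * 16777619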
So this seems to give a clearer picture of how good our bit twiddling is in that inner loop. I was curious whether I could measure the overhead directly in criterion, though. Somewhat surprisingly to me, it seems I could!
I added the following benchmark to my list:
, bench "baseline32" $ nf (\x-> x) (777::Word32)
The idea is to isolate the overhead of applying the most trivial function and calling nf on an example value of our output type (Word32 in this case).
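Put together, the benchmark list then looks something like this (just the pieces already shown, assembled into one main):
main :: IO ()
main = do
    defaultMain [
        bench "test"       $ nf fnvInnerLoopTest 7
      , bench "test4"      $ nf fnvInnerLoopTest4 (7,8,9,10)
      , bench "test8"      $ nf fnvInnerLoopTest8 (7,8,9,10,11,12,13,14)
      , bench "baseline32" $ nf (\x -> x) (777::Word32)
      ]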
benchmarking baseline32
time 9.485 ns (9.434 ns .. 9.543 ns)
1.000 R² (1.000 R² .. 1.000 R²)
mean 9.509 ns (9.469 ns .. 9.559 ns)
std dev 155.8 ps (122.6 ps .. 227.8 ps)
variance introduced by outliers: 23% (moderately inflated)
If we consider this value the baseline for the measurements initially reported, the new results are both linear-ish, as we would expect, and the resulting absolute measurements fall about where we’d expect from the assembly we’d hope for (I still need to verify that this is actually the case); e.g. the actual work in our initial test comes out in the ~1ns range, about what we’d expect from an inner loop with a couple of instructions.
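Spelling out that arithmetic with the time estimates above (just back-of-the-envelope division on the numbers already reported, not part of the benchmark code):
main :: IO ()
main = do
    -- Subtract the ~9.5ns baseline, then divide by the number of unrolled
    -- iterations to get an estimated per-byte cost (all figures in ns).
    let baseline = 9.485 :: Double   -- "baseline32"
        t4       = 12.66             -- "test4", four iterations
        t8       = 17.88             -- "test8", eight iterations
    print ((t4 - baseline) / 4)      -- ~0.79 ns per byte
    print ((t8 - baseline) / 8)      -- ~1.05 ns per byte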
I thought this was compelling enough to open an issue to see whether this technique might be incorporated into criterion directly. It’s at least a useful technique that I’ll keep playing with.
Anyway, benchmark your code.