Choosing a Binary-to-text Encoding

I had occasion to think about text-friendly binary encoding schemes for the first time at work. The obvious choice is Base64, but my immediate subsequent thought was “there must be something much more efficient for our purposes”, and a quick google led here, in which OP echoes the same feeling:

It seems to me that we can greatly improve since on my keyboard I already see 94 easily distinguishable printable keys.

But of course if our goal is to “greatly improve” over Base64, then with a little thought we might conclude that the answer is “no, we can’t”. In Base64 we use 2^6 = 64 tokens, each of which represents a 6-bit string. Since those tokens are ASCII they take up 8 bits. So with Base64 we’re already at 75% efficiency (or “25% bloat”, or “33% larger than binary”), which seems… not so bad.

You can read about other binary encoding schemes here. From that table, we can see that Base85, which @rene suggests, is modestly better at 80% efficient. Base122 (the best on the list that can reasonably be called a “text encoding”) is 86% efficient.
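As a quick sanity check on those numbers: an encoding with n distinct ASCII tokens conveys log2(n) bits per 8-bit character. Here’s the arithmetic in Haskell (a sketch of mine, not from any of the linked posts):

-- Payload bits conveyed per 8-bit ASCII character, for a base-n encoding:
efficiency :: Double -> Double
efficiency n = logBase 2 n / 8

-- efficiency 16  == 0.5      (hex)
-- efficiency 64  == 0.75     (Base64)
-- efficiency 85  ~= 0.801    (Base85)
-- efficiency 122 ~= 0.866    (Base122)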

Decision criteria

So you can make your messages ~13% smaller by ditching Base64 for the most exotic Base122, but what about other considerations?

Things you really want in a binary-to-text encoding, aside from size-efficiency, are:

  • correct cross-platform compatible encoding/decoding; easy to get right
  • fast to encode/decode
  • compresses well (gzip is everywhere)

Other things that would be nice are that the encoding make some sense to the naked eye, be easily diffable, and maybe even support some operations without requiring decoding (a binary search, say).

It’s possible that all of these are more important to you than the efficiency of the raw encoding. With that in mind let’s consider a third (in addition to Base64 and BaseExoticInteger): the venerable hex encoding.

Hexadecimal (base-16) encoding requires two ASCII characters to represent a byte, so it’s 50% efficient. But as we’ll see, it’s arguably better than the other two according to every one of our remaining criteria!

Base64 is not sensible to the naked eye

Base64 encodes 6 bits per character. This means 3 octets (bytes) of input become 4 characters of output. In a world where our data is overwhelmingly organized around bytes and words, our Base64 encoding is horribly out of phase!

When we see the two strings:

MTEyMlhYWVk=
WFhZWTExMjI=

…our senses don’t tell us anything. Whereas in hex the lizard brain is able to perceive patterns, symmetry and magnitude right away:

3131323258585959
5858595931313232

There must be value in being able to use our eyes (especially when it’s the only sense we haven’t abandoned for the work we’re doing). The Base64 pair might represent an obscured bug in an encryption routine, for instance.
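If you want to reproduce the strings above, here’s a minimal sketch using the base16-bytestring and base64-bytestring packages (the package choice is mine; any equivalent encoder would do):

{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8  as B8
import qualified Data.ByteString.Base16 as B16  -- base16-bytestring
import qualified Data.ByteString.Base64 as B64  -- base64-bytestring

main :: IO ()
main = do
    -- The same two payloads as above, in each encoding:
    mapM_ (B8.putStrLn . B64.encode) ["1122XXYY", "XXYY1122"]
    mapM_ (B8.putStrLn . B16.encode) ["1122XXYY", "XXYY1122"]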

Interestingly a Base85 encoding is also superior to Base64 in this respect: every 5 characters represent 4 bytes of the input, so we retain the ability to recognize and compare 32- and 64-bit word chunks.

Base85 is tricky, but Base64 is the worst kind of tricky

Base64 is a nearly-alphanumeric encoding, reserving for the last two code words (in some cases, more rarely used) the + and / characters. Furthermore, the choice of these last two characters varies among implementations. I have no doubt that this has caused bugs, e.g. a validation regex that assumed an alphanumeric encoding.

Similarly, the encoded output must itself be URL-encoded if the + and / scheme is used, which has certainly caused bugs. The same goes for the = padding rule (quite possible to misunderstand, fail to test against, or never observe in examples).

Base85 schemes are of course more complicated (and likely slower). We’d hope to find well-tested implementations on all the platforms we require but even so we should be prepared for the possibility that we’d need to implement it ourselves in the future.

More compact encodings compress worse

Much of the data flying around the internet is gzipped at some layer of a protocol. Because Base64/85 etc. are out of phase with byte and word sizes, they tend to frustrate compression schemes by obscuring patterns in block-oriented data. Here are examples of gzip applied to the same tortured Hobbes quote (270 bytes of ASCII text, compressing to 194 bytes):

Encoding | Original size (bytes) | Compressed size (bytes)
-------- | --------------------- | -----------------------
hex      | 541                   | 249
Base64   | 361                   | 289
Base85   | 342                   | 313

So for uncompressed binary data we can probably expect a more compact encoding to result in more bloat over the wire in a gzipped payload.

Two other things that were interesting to me:

  • all of the compression tools I tried did worse on the hex-encoded string than on the original ASCII. Maybe that’s due to the size required for the decoding table? We could test on larger strings.
  • gzip was able to compress a 361-byte Base64 encoding of bytes drawn from /dev/urandom down to 316 bytes, so it’s clear Base64 doesn’t wholly obscure the structure of the data from our compression algorithm
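If you want to reproduce this sort of measurement from Haskell, here’s a sketch using the zlib package (my library choice; the numbers above came from command-line tools):

import qualified Codec.Compression.GZip as GZip  -- zlib package
import qualified Data.ByteString.Lazy.Char8 as BL

-- Size of a payload after gzip compression:
compressedSize :: BL.ByteString -> Int
compressedSize = fromIntegral . BL.length . GZip.compress

main :: IO ()
main = do
    quote <- BL.readFile "hobbes.txt"  -- hypothetical input file
    print (BL.length quote, compressedSize quote)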

Other alternatives and conclusions

It probably doesn’t matter, so just use Base64. If size is the only thing that matters then I’d suggest zipping first and then using the most gnarly encoding you can stomach. But maybe you should be using a proper binary protocol in that case.

In a world where it was totally ubiquitous I would suggest using either the terminal-friendly ZeroMQ Base85 flavor or a simple hex encoding.

I also like that encodings like this one exist though. It’s worth stepping back, doing some quick math, and making sure that you’re optimizing for the right thing.

Almost Inline ASM in Haskell With Foreign Import Prim

With help from Reid Barton in questions here and here, I discovered it’s pretty easy to call assembly from GHC haskell with minimal overhead, so I cleaned up an example of the technique and posted it here:

https://github.com/jberryman/almost-inline-asm-haskell-example

This is especially useful if you want to return multiple values from a foreign procedure, where otherwise with the traditional FFI approach you would have to do some allocation and stuff the values into a struct or something. I find the above more understandable in any case.
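The heart of the technique is GHC’s foreign import prim, which calls a foreign symbol using GHC’s own calling convention, so multiple results can come back in registers as an unboxed tuple. Here’s a sketch of what the import looks like, using the sipRound_s_x3 symbol visible in the dump below (the exact types in the repo may differ):

{-# LANGUAGE GHCForeignImportPrim, MagicHash, UnboxedTuples, UnliftedFFITypes #-}

import GHC.Exts (Word(W#), Word#)

-- The foreign routine takes four words in argument registers and
-- returns four words the same way:
foreign import prim "sipRound_s_x3"
  sipRound# :: Word# -> Word# -> Word# -> Word# -> (# Word#, Word#, Word#, Word# #)

-- A boxed wrapper for convenience:
sipRound :: Word -> Word -> Word -> Word -> (Word, Word, Word, Word)
sipRound (W# a) (W# b) (W# c) (W# d) =
    case sipRound# a b c d of
      (# a', b', c', d' #) -> (W# a', W# b', W# c', W# d')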

Here’s an example of the dumped ASM from the Main in the example above:

...
    call newCAF
    addq $8,%rsp
    testq %rax,%rax
    je _c73k
_c73j:
    movq $stg_bh_upd_frame_info,-16(%rbp)
    movq %rax,-8(%rbp)
    movq $block_info,-24(%rbp)
    movl $4,%edi
    movl $3,%esi
    movl $2,%r14d
    movl $1,%ebx
    addq $-24,%rbp
    jmp sipRound_s_x3
_c73z:
    movq $104,904(%r13)
    movq $block_info,-32(%rbp)
    movq %r14,-24(%rbp)
    movq %rsi,-16(%rbp)
    movq %rdi,-8(%rbp)
    movq %rbx,(%rbp)
    addq $-32,%rbp
...

You can see we just prepare argument registers, do whatever with the stack pointer, do a jump, and then push the return values onto the stack. For my purposes this was almost too much overhead to make this worthwhile (you can look at notes in the code).

I thought about sketching out a GHC proposal for a way to formalize this, maybe make it safer, and maybe somehow more efficient, but I don’t have the time right now and don’t really have the expertise to know whether this is even a good idea or how it could work.


Announcing: Unagi-bloomfilter

I just released a new Haskell library called unagi-bloomfilter that is up now on hackage. You can install it with:

$ cabal install unagi-bloomfilter

The library uses the bloom-1 variant from “Fast Bloom Filters and Their Generalization” by Yan Qiao, et al. I’ll try to write more about it when I have the time. Also, last night I gave a talk at the New York Haskell User Group on things I learned while working on the project:

http://www.meetup.com/NY-Haskell/events/233372271/

It was quite rough, but I was happy to hear from folks that found some interesting things to take away from it.

Thanks to Gershom for inviting me to speak, for my company Signal Vine for sponsoring my trip out, and to Yan Qiao for generously answering my silly questions and helping me understand the paper.

P.S. We’re hiring haskell developers

Signal Vine is an awesome group of people, with interesting technology and problems to solve, and we’re looking to grow the small development team. If you have some experience with haskell (you don’t have to be a guru) and are interested, please reach out to Jason or me at:

brandon@signalvine.com
jason@signalvine.com

Announcing: Hashabler 1.0. Now Even More Hashy With SipHash

I’ve just released version 1.0 of a haskell library for principled, cross-platform & extensible hashing of types. It is available on hackage, and can be installed with:

cabal install hashabler

(see my initial announcement post which has some motivation and pretty pictures)

You can see the CHANGELOG, but the main change is an implementation of SipHash. It’s about as fast as our implementation of FNV-1a for bytestrings of length fifty, and slightly faster by the time you get to length 1000 or so, so you should use it unless you want a hash with a simple implementation.

If you’re implementing a new hashing algorithm or hash-based data structure, please consider using hashabler instead of hashable.

Translating Some Stateful Bit-twiddling to Haskell

I just started implementing SipHash in hashabler and wanted to share a nice way I found to translate stateful bit-twiddling code in C (which makes heavy use of bitwise assignment operators) to haskell.

I was working from the reference implementation. As you can see, statefulness and mutability are an implicit part of how the algorithm is defined; it modifies the states of the v variables.

#define SIPROUND                                        \
  do {                                                  \
    v0 += v1; v1=ROTL(v1,13); v1 ^= v0; v0=ROTL(v0,32); \
    v2 += v3; v3=ROTL(v3,16); v3 ^= v2;                 \
    v0 += v3; v3=ROTL(v3,21); v3 ^= v0;                 \
    v2 += v1; v1=ROTL(v1,17); v1 ^= v2; v2=ROTL(v2,32); \
  } while(0)

int  siphash( uint8_t *out, const uint8_t *in, uint64_t inlen, const uint8_t *k )
{

  /* ... */

  for ( ; in != end; in += 8 )
  {
    m = U8TO64_LE( in );
    v3 ^= m;

    TRACE;
    for( i=0; i<cROUNDS; ++i ) SIPROUND;

    v0 ^= m;
  }

I wanted to translate this sort of code as directly as possible (I’d already decided if it didn’t work on the first try I would burn my laptop and live in the woods, rather than debug this crap).

First we’ll use name shadowing to “fake” our mutable variables, making it easy to ensure we’re always dealing with the freshest values.

{-# OPTIONS_GHC -fno-warn-name-shadowing #-}

We’ll also use RecordWildCards to make it easy to capture the “current state” of these values, through folds and helper functions.

{-# LANGUAGE RecordWildCards #-}

And finally, we use the trivial Identity monad (a trick I learned from Oleg), which gets us the proper scoping we want for our v values:

import Data.Functor.Identity

Here’s a bit of the haskell:

siphash :: Hashable a => SipKey -> a -> Word64
siphash (k0,k1) = \a-> runIdentity $ do
    let v0 = 0x736f6d6570736575
        v1 = 0x646f72616e646f6d
        v2 = 0x6c7967656e657261
        v3 = 0x7465646279746573

    ...

    v3 <- return $ v3 `xor` k1;
    v2 <- return $ v2 `xor` k0;
    v1 <- return $ v1 `xor` k1;
    v0 <- return $ v0 `xor` k0;

    ...

    -- Initialize rest of SipState:
    let mPart          = 0
        bytesRemaining = 8
        inlen          = 0
    SipState{ .. } <- return $ hash (SipState { .. }) a

    let !b = inlen `unsafeShiftL` 56

    v3 <- return $ v3 `xor` b
    -- for( i=0; i<cROUNDS; ++i ) SIPROUND;
    (v0,v1,v2,v3) <- return $ sipRound v0 v1 v2 v3
    (v0,v1,v2,v3) <- return $ sipRound v0 v1 v2 v3
    v0 <- return $ v0 `xor` b

    ...

    (v0,v1,v2,v3) <- return $ sipRound v0 v1 v2 v3

    return $! v0 `xor` v1 `xor` v2 `xor` v3
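To make the pattern concrete, here’s how the SIPROUND macro itself can be transcribed (compiled with the pragmas above), with each `v <- return $ ...` line playing the role of one C assignment. This is a sketch; the actual sipRound in hashabler may differ in details:

import Data.Bits (rotateL, xor)
import Data.Functor.Identity (runIdentity)
import Data.Word (Word64)

-- Direct transcription of SIPROUND, one line per C statement:
sipRound :: Word64 -> Word64 -> Word64 -> Word64 -> (Word64, Word64, Word64, Word64)
sipRound v0 v1 v2 v3 = runIdentity $ do
    v0 <- return $ v0 + v1
    v1 <- return $ v1 `rotateL` 13
    v1 <- return $ v1 `xor` v0
    v0 <- return $ v0 `rotateL` 32
    v2 <- return $ v2 + v3
    v3 <- return $ v3 `rotateL` 16
    v3 <- return $ v3 `xor` v2
    v0 <- return $ v0 + v3
    v3 <- return $ v3 `rotateL` 21
    v3 <- return $ v3 `xor` v0
    v2 <- return $ v2 + v1
    v1 <- return $ v1 `rotateL` 17
    v1 <- return $ v1 `xor` v2
    v2 <- return $ v2 `rotateL` 32
    return (v0, v1, v2, v3)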

If you were really doing a lot of this sort of thing, you could even make a simple quasiquoter that could translate bitwise assignment into code like the above.

Announcing Hashabler: Like Hashable Only More So

I’ve just released the first version of a haskell library for principled, cross-platform & extensible hashing of types, which includes an implementation of the FNV-1a algorithm. It is available on hackage, and can be installed with:

cabal install hashabler

hashabler is a rewrite of the hashable library by Milan Straka and Johan Tibell, having the following goals:

  • Extensibility; it should be easy to implement a new hashing algorithm on any Hashable type, for instance if one needed more hash bits

  • Honest hashing of values, and principled hashing of algebraic data types (see e.g. #30)

  • Cross-platform consistent hash values, with a versioning guarantee. Where possible we ensure morally identical data hashes to identical values regardless of processor word size and endianness.

  • Make implementing identical hash routines in other languages as painless as possible. We provide an implementation of a simple hashing algorithm (FNV-1a) and make an effort to define Hashable instances in a way that is well-documented and sensible, so that e.g. one can (hopefully) easily implement a string hashing routine in JavaScript that will match the way we hash strings here. (A sketch of the FNV-1a core follows this list.)
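For reference, the FNV-1a core is tiny. Here’s a sketch of the 32-bit variant over a ByteString, using the standard FNV parameters (not necessarily byte-for-byte what hashabler does internally):

import Data.Bits (xor)
import Data.Word (Word32)
import qualified Data.ByteString as B

-- 32-bit FNV-1a: start from the offset basis; for each byte,
-- XOR it in and multiply by the FNV prime.
fnv1a32 :: B.ByteString -> Word32
fnv1a32 = B.foldl' step 2166136261
  where step h b = (h `xor` fromIntegral b) * 16777619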

Motivation

I started writing a fast concurrent bloom filter variant, but found none of the existing libraries fit my needs. In particular hashable was deficient in a number of ways:

  • The number of hash bits my data structure requires can vary based on user parameters, and can possibly be more than the 64 bits supported by hashable

  • Users might like to serialize their bloomfilter and store it, pass it to other machines, or work with it in a different language, so we need

    • hash values that are consistent across platforms
    • some guarantee of consistency across library versions

I was also very concerned about the general approach taken for algebraic types, which results in collisions, the use of “hashing” numeric values to themselves, dubious combining functions, etc. It wasn’t at all clear to me how to ensure my data structure wouldn’t be broken if I used hashable. See below for a very brief investigation into the hash goodness of the two libraries.

There isn’t interest in supporting my use case or addressing these issues in hashable (see e.g. #73, #30, and #74) and apparently hashable is working in practice for people, but maybe this new package will be useful for some other folks.

Hash goodness of hashable and hashabler, briefly

Hashing-based data structures assume some “goodness” of the underlying hash function, and may depend on the goodness of the hash function in ways that aren’t always clear or well-understood. “Goodness” also seems to be somewhat subjective, but can be expressed statistically in terms of bit-independence tests, and avalanche properties, etc.; various things that e.g. smhasher looks at.

I thought for fun I’d visualize some distributions, as that’s easier for my puny brain to understand than statistics. We visualize 32-bit hashes by quantizing them into 64x64 buckets, mapping each bucket to a pixel along a Hilbert curve to maintain locality of hash values. When multiple hash values fall within the same bucket we darken the pixel, finally marking it red when we can’t go any darker, to indicate clipping.

It’s easy to cherry-pick inputs that will result in some bad behavior by hashable, but below I’ve tried to show some fairly realistic examples of strange or less-good distributions in hashable. I haven’t analysed these at all. Images are cropped ¼ size, but are representative of the whole 32-bit range.

First, here’s a hash of all [Ordering] of size 10 (~59K distinct values):

Hashabler: [image]

Hashable: [image]

Next here’s the hash of one million (Word8,Word8,Word8) (having a domain ~ 16 mil):

Hashabler: [image]

Hashable: [image]

I saw no difference when hashing English words, which is good news as that’s probably a very common use-case.

Please help

If you could test the library on a big endian machine and let me know how it goes, that would be great. See here.

You can also check out the TODOs scattered throughout the code and send pull requests. I may not be able to get to them until June, but will be very grateful!

P.S. hire me

I’m always open to interesting work or just hearing about how companies are using haskell. Feel free to send me an email at brandon.m.simmons@gmail.com

Benchmarking Very Fast Things With Criterion

There’s a pervasive myth that Bryan O’Sullivan’s excellent haskell benchmarking library criterion is only useful for benchmarks that take some significant chunk of time (I’ve even heard some people claim it can only be trusted at the ms scale). In fact criterion is useful for almost anything you’d want to benchmark.

At a high level, criterion makes your benchmark the inner loop of a function, runs that loop a bunch of times, measures the result, and then divides by the number of iterations performed. The approach is useful for comparing alternative implementations, and is probably the only meaningful way of answering “how long does this code take to run”, short of looking at the assembly, counting the instructions, and consulting your processor’s manual.

If you’re skeptical, here’s a benchmark we’d expect to be very fast:

import Criterion.Main

main :: IO ()
main = do
    defaultMain [ 
        bench "sum2" $ nf sum [1::Int,2]
      , bench "sum4" $ nf sum [1::Int,2,3,4]
      , bench "sum5" $ nf sum [1::Int,2,3,4,5]
      ]

And indeed it’s on the order of nanoseconds:

benchmarking sum2
time                 27.20 ns   (27.10 ns .. 27.35 ns)
                     0.994 R²   (0.984 R² .. 1.000 R²)
mean                 28.72 ns   (27.29 ns .. 32.44 ns)
std dev              6.730 ns   (853.1 ps .. 11.71 ns)
variance introduced by outliers: 98% (severely inflated)

benchmarking sum4
time                 58.45 ns   (58.31 ns .. 58.59 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 58.47 ns   (58.26 ns .. 58.66 ns)
std dev              654.6 ps   (547.1 ps .. 787.8 ps)
variance introduced by outliers: 11% (moderately inflated)

benchmarking sum5
time                 67.08 ns   (66.84 ns .. 67.33 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 67.04 ns   (66.85 ns .. 67.26 ns)
std dev              705.5 ps   (596.3 ps .. 903.5 ps)

The results are consistent with each other; sum seems to be linear, taking 13-14ns per list element, across our different input sizes.

Trying to measure even faster things

This is what I was doing today that motivated this post: experimenting with measuring the inner loop of a hash function:

fnvInnerLoopTest :: Word8 -> Word32
{-# INLINE fnvInnerLoopTest #-}
fnvInnerLoopTest b = (2166136261 `xor` fromIntegral b) * 16777619

These were the results criterion gave me:

benchmarking test
time                 9.791 ns   (9.754 ns .. 9.827 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.798 ns   (9.759 ns .. 9.862 ns)
std dev              167.3 ps   (117.0 ps .. 275.3 ps)
variance introduced by outliers: 24% (moderately inflated)

These are the sorts of timescales at which we might be measuring the overhead of function calls, boxing/unboxing, etc., and that should make you skeptical of criterion’s result. So I unrolled 4- and 8-iteration versions of these and measured the results:

main :: IO ()
main = do
    defaultMain [ 
        bench "test"  $ nf fnvInnerLoopTest   7
      , bench "test4" $ nf fnvInnerLoopTest4 (7,8,9,10)
      , bench "test8" $ nf fnvInnerLoopTest8 (7,8,9,10,11,12,13,14)
      ]
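The unrolled definitions aren’t shown here, but as a hypothetical sketch, fnvInnerLoopTest4 might simply chain the same FNV-1a step across four bytes (and similarly for fnvInnerLoopTest8):

fnvInnerLoopTest4 :: (Word8,Word8,Word8,Word8) -> Word32
{-# INLINE fnvInnerLoopTest4 #-}
fnvInnerLoopTest4 (b1,b2,b3,b4) =
    let h1 = (2166136261 `xor` fromIntegral b1) * 16777619
        h2 = (h1 `xor` fromIntegral b2) * 16777619
        h3 = (h2 `xor` fromIntegral b3) * 16777619
    in  (h3 `xor` fromIntegral b4) * 16777619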

benchmarking test
time                 9.380 ns   (9.346 ns .. 9.418 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.448 ns   (9.399 ns .. 9.567 ns)
std dev              240.4 ps   (137.9 ps .. 418.6 ps)
variance introduced by outliers: 42% (moderately inflated)

benchmarking test4
time                 12.66 ns   (12.62 ns .. 12.72 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 12.68 ns   (12.64 ns .. 12.73 ns)
std dev              158.8 ps   (126.9 ps .. 215.7 ps)
variance introduced by outliers: 15% (moderately inflated)

benchmarking test8
time                 17.88 ns   (17.82 ns .. 17.94 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 17.89 ns   (17.81 ns .. 17.97 ns)
std dev              262.7 ps   (210.3 ps .. 349.7 ps)
variance introduced by outliers: 19% (moderately inflated)

So this seems to give a clearer picture of how good our bit-twiddling is in that inner loop. I was curious whether I could measure the overhead directly in criterion though. Somewhat surprisingly to me, it seems I could!

I added the following benchmark to my list:

  , bench "baseline32" $ nf (\x-> x) (777::Word32)

The idea being to isolate the overhead of applying the most trivial function and calling nf on an example value of our output type (Word32 in this case).

benchmarking baseline32
time                 9.485 ns   (9.434 ns .. 9.543 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.509 ns   (9.469 ns .. 9.559 ns)
std dev              155.8 ps   (122.6 ps .. 227.8 ps)
variance introduced by outliers: 23% (moderately inflated)

If we consider this value the baseline for the measurements initially reported, the new results are both linear-ish, as we would expect, and the resulting absolute measurements fall about where we’d expect from the assembly we’d hope for (I still need to verify that this is actually the case): e.g. our initial test lands in the ~1ns range, about what we’d expect from an inner loop with a couple of instructions.

I thought this was compelling enough to open an issue to see whether this technique might be incorporated into criterion directly. It’s at least a useful technique that I’ll keep playing with.

Anyway, benchmark your code.