Choosing a binary-to-text encoding
I had occasion at work to think about text-friendly binary encoding schemes for the first time. The obvious choice is Base64, but my immediate next thought was “there must be something much more efficient for our purposes”, and a quick google led here, in which the OP echoes the same feeling:
> It seems to me that we can greatly improve since on my keyboard I already see 94 easily distinguishable printable keys.
But of course if our goal is to “greatly improve” over Base64, then with a little thought we might conclude that the answer is “no, we can’t”. In Base64 we use 2^6 = 64 tokens, each of which represents a 6-bit string. Since those tokens are ASCII they take up 8 bits, so with Base64 we’re already at 75% efficiency (or “25% bloat”, or “33% larger than binary”), which seems… not so bad.
You can read about other binary encoding schemes here. From that table we can see that the Base85 which @rene suggests is modestly better, at 80% efficiency. Base122 (the best on the list that can reasonably be called a “text encoding”) is 86% efficient.
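As a sanity check, here’s a back-of-the-envelope calculation (a Python sketch; the percentages are theoretical upper bounds for a radix-N encoding carried in 8-bit characters, so real schemes come in slightly lower after padding and reserved characters):

```python
from math import log2

# Each output character of a radix-N encoding can carry at most log2(N)
# bits of payload, but costs 8 bits on the wire as an ASCII character.
for n in (16, 64, 85, 122):
    print(f"base{n}: {log2(n) / 8:.1%} efficient")
# base16: 50.0%, base64: 75.0%, base85: 80.1%, base122: 86.6%
```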
Decision criteria
So you can make your messages ~13% smaller by ditching Base64 for the most exotic Base122, but what about other considerations?
Things you really want in a binary-to-text encoding, aside from size-efficiency, are:
- correct cross-platform compatible encoding/decoding; easy to get right
- fast to encode/decode
- compresses well (gzip is everywhere)
Other things that would be nice are for the encoding to make some sense to the naked eye, be easily diffable, and maybe even support some operations without requiring decoding (a binary search, say).
It’s possible that all of these are more important to you than the efficiency of the raw encoding. With that in mind let’s consider a third (in addition to Base64 and BaseExoticInteger): the venerable hex encoding.
Hexadecimal (base-16) encoding requires two ASCII characters to represent a byte, so it’s 50% efficient. But as we’ll see, it’s arguably better than these other two by every one of our other criteria!
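That 50%/75%/80% ranking is easy to confirm empirically; here’s a minimal sketch using Python’s standard library (the 60-byte input size is arbitrary, chosen so each encoding divides it evenly):

```python
import base64, binascii, os

raw = os.urandom(60)                   # 60 arbitrary bytes
print(len(binascii.hexlify(raw)))      # 120 characters: 50% efficient
print(len(base64.b64encode(raw)))      #  80 characters: 75% efficient
print(len(base64.b85encode(raw)))      #  75 characters: 80% efficient
```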
Base64 is not sensible to the naked eye
Base64 encodes 6 bits per character. This means 3 octets (bytes) of input become 4 characters of output. In a world where our data is overwhelmingly based on bytes and words, our Base64 encoding is horribly out of phase!
When we see the two strings:
MTEyMlhYWVk=
WFhZWTExMjI=
…our senses don’t tell us anything. Whereas in hex the lizard brain is able to perceive patterns, symmetry and magnitude right away:
3131323258585959
5858595931313232
There must be value in being able to use our eyes (especially when it’s the only sense we haven’t abandoned for the work we’re doing). The former might represent an obscured bug in an encryption routine, for instance.
Interestingly, a Base85 encoding is also superior to Base64 in this respect: every 5 characters of output represent 4 bytes of input, so we retain the ability to recognize and compare 32- and 64-bit word chunks.
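Both observations are easy to reproduce; here’s a small sketch using Python’s standard library (the inputs are just the ASCII strings 1122XXYY and XXYY1122 from the example above):

```python
import base64, binascii

for s in (b"1122XXYY", b"XXYY1122"):
    print(base64.b64encode(s))    # b'MTEyMlhYWVk=' then b'WFhZWTExMjI='
    print(binascii.hexlify(s))    # b'3131323258585959' then b'5858595931313232'
    print(base64.b85encode(s))    # 10 characters each: 5 output chars per 4 input bytes
```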
Base85 is tricky, but Base64 is the worst kind of tricky
It’s a nearly-alphanumeric encoding, which reserves for the (in some cases, more rare) last two code words the + and / characters. Furthermore the choice of these last two characters varies among implementations. I have no doubt that this has caused bugs, e.g. a validation regex that assumed an alphanumeric encoding.
Similarly, the encoding must itself be url-encoded if the + and / scheme is used, which has certainly caused bugs. Same story for the = padding rule (quite possible to misunderstand, fail to test against, or never observe in examples).
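A quick illustration of both quirks using Python’s standard library (the three input bytes are arbitrary, picked so that the output lands on the last two code words):

```python
import base64

data = bytes([0xfb, 0xef, 0xff])           # arbitrary bytes that hit code words 62 and 63
print(base64.standard_b64encode(data))     # b'++//'  (the '+' and '/' alphabet)
print(base64.urlsafe_b64encode(data))      # b'--__'  (the URL-safe '-' and '_' alphabet)

# Padding is part of the encoding, not the data; a decoder (or validation
# regex) that forgets about the trailing '=' is a classic source of bugs.
print(base64.b64encode(b"1122XXYY"))       # b'MTEyMlhYWVk='
```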
Base85 schemes are of course more complicated (and likely slower). We’d hope to find well-tested implementations on all the platforms we require, but even so we should be prepared for the possibility that we’d need to implement it ourselves in the future.
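Even within one language the flavors diverge; Python’s standard library, for instance, ships two mutually incompatible Base85 variants:

```python
import base64

data = b"1122XXYY"
print(base64.a85encode(data))   # Ascii85 (Adobe-style) alphabet
print(base64.b85encode(data))   # RFC 1924-style alphabet (the one git uses)
# Same idea, different alphabets: the two outputs don't match each other,
# and neither matches ZeroMQ's Z85 without translating the alphabet.
```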
More compact encodings compress worse
Much of the data flying around the internet is gzipped at some layer of a protocol. Because Base64/85 etc. are out of phase with byte and word sizes, they tend to frustrate compression schemes by obscuring patterns in block-oriented data. Here are examples of gzip applied to the same tortured Hobbes quote (270 bytes of ASCII text, compressing to 194 bytes):
Encoding | Original size (bytes) | Compressed size (bytes)
-------- | --------------------- | -----------------------
hex | 541 | 249
Base64 | 361 | 289
Base85 | 342 | 313
So for uncompressed binary data we can probably expect a more compact encoding to result in more bloat over the wire in a gzipped payload.
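The experiment is easy to reproduce; here’s a rough sketch (the stand-in text below is not the exact 270-byte quote used above, so expect the numbers to differ a little):

```python
import base64, binascii, gzip

# Stand-in for the article's tortured Hobbes quote.
text = b"... and the life of man, solitary, poore, nasty, brutish, and short. " * 4

for name, encoded in [
    ("original", text),
    ("hex",      binascii.hexlify(text)),
    ("Base64",   base64.b64encode(text)),
    ("Base85",   base64.b85encode(text)),
]:
    print(f"{name:8} {len(encoded):4} bytes -> {len(gzip.compress(encoded))} gzipped")
```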
Two other things that were interesting to me:
- all of the compression tools I tried did worse on the hex-encoded string than on the original ASCII. Maybe that’s due to the size required for the decoding table? We could test on larger strings
- gzip was able to compress 361 bytes of Base64-encoded data drawn from /dev/urandom down to 316 bytes (versus 289 bytes for our encoded quote), so it’s clear Base64 doesn’t wholly obscure the structure of the underlying data from our compression algorithm (there’s a sketch of this check below)
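Here’s a sketch of that last comparison (using 270 random bytes so that the Base64 output is roughly the same size as the encoded quote):

```python
import base64, gzip, os

raw = os.urandom(270)                            # random bytes, same size as the quote
b64 = base64.b64encode(raw)                      # ~360 bytes of Base64 text
print(len(b64), "->", len(gzip.compress(b64)))   # shrinks a little: only 64 symbols in use
print(len(raw), "->", len(gzip.compress(raw)))   # raw random bytes don't compress at all
```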
Other alternatives and conclusions
It probably doesn’t matter, so just use Base64. If size is the only thing that matters then I’d suggest zipping first and then using the most gnarly encoding you can stomach. But maybe you should be using a proper binary protocol in that case.
In a world where it was totally ubiquitous, I would suggest using either the terminal-friendly ZeroMQ Base85 flavor (Z85) or a simple hex encoding.
I also like that encodings like this one exist though. It’s worth stepping back, doing some quick math, and making sure that you’re optimizing for the right thing.