I had occasion to think about text-friendly binary encoding schemes for the
first time at work. The obvious choice is
Base64, but my immediate subsequent
thought was “there must be something much more efficient for our purposes”, and a quick google led
to a discussion in which the OP echoes the same feeling:
It seems to me that we can greatly improve since on my keyboard I already see
94 easily distinguishable printable keys.
But of course if our goal is to “greatly improve” over Base64, then with a
little thought we might conclude that the answer is “no, we can’t”. In
Base64 we use
2^6 = 64 tokens, each of which represents a 6-bit string. Since
those tokens are ASCII characters they take up 8 bits each. So with Base64 we’re already at 75%
efficiency (or “25% bloat”, or “33% larger than binary”), which seems… not so bad.
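The arithmetic is easy to check empirically, e.g. in Python:

```python
import base64
import os

data = os.urandom(300)             # 300 bytes of arbitrary binary data
encoded = base64.b64encode(data)   # 6 bits per ASCII character

print(len(encoded))                # 400 characters: 4/3 the input size
print(len(data) / len(encoded))    # 0.75, i.e. 75% efficient
```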
You can read about other binary encoding schemes
here. From that table,
we can see that Base85, which @rene suggests, is modestly better at 80%
efficiency. Base122 (the best on the list that can reasonably be called a
“text encoding”) is 86% efficient.
So you can make your messages ~13% smaller by ditching Base64 for the most
exotic Base122, but what about other considerations?
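These back-of-the-envelope numbers fall directly out of the alphabet sizes; a quick sanity check:

```python
# Base85: five ASCII characters can cover four bytes, since 85**5 >= 2**32.
print(85 ** 5 >= 2 ** 32)   # True
print(4 / 5)                # 0.8 -> 80% efficient

# Savings from switching a Base64 payload (75% efficient) to
# Base122 (86% efficient): the message shrinks by 1 - 75/86, about 13%.
print(1 - 0.75 / 0.86)
```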
Things you really want in a binary-to-text encoding, aside from size-efficiency:
- correct cross-platform compatible encoding/decoding; easy to get right
- fast to encode/decode
- compresses well (gzip is everywhere)
Other things that would be nice are that the encoding make some sense to the
naked eye, be easily diffable, and maybe even support some operations without
requiring decoding (a binary search, say).
It’s possible that all of these are more important to you than the efficiency
of the raw encoding. With that in mind let’s consider a third option (in addition to
Base64 and BaseExoticInteger): the venerable hex encoding.
Hexadecimal (base-16) encoding requires two ASCII characters to represent one
byte, so it’s 50% efficient. But as we’ll see, it’s arguably better than the
other two by every one of our other criteria!
Base64 is not sensible to the naked eye
Base64 encodes 6 bits per character. This means 3 octets (bytes) of input
become 4 characters of the output. In a world where our data is overwhelmingly
based on bytes and words our Base64 encoding is horribly out of phase!
When we see two Base64-encoded strings side by side, our senses don’t tell us
much of anything, whereas in hex the lizard brain is able to perceive patterns,
symmetry and magnitude right away.
There must be value to being able to use our eyes (especially when it’s the
only sense we haven’t abandoned for the work we’re doing). The former might
represent an obscured bug in an encryption routine for instance.
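The original example strings aren’t reproduced here, but the effect is easy to recreate with made-up data. Below, two 16-byte inputs differ in a single 32-bit word; notice that in the Base64 output the identical `deadbeef` words render as different character sequences depending on their byte offset, while in hex every word sits at a fixed column:

```python
import base64

# two made-up 16-byte strings differing only in the third 32-bit word
a = bytes.fromhex("deadbeefdeadbeef00000000deadbeef")
b = bytes.fromhex("deadbeefdeadbeef00000001deadbeef")

print(base64.b64encode(a).decode())  # 3q2+796tvu8AAAAA3q2+7w==
print(base64.b64encode(b).decode())  # 3q2+796tvu8AAAAB3q2+7w==
print(a.hex())  # deadbeefdeadbeef00000000deadbeef
print(b.hex())  # deadbeefdeadbeef00000001deadbeef
```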
Interestingly a Base85 encoding is also superior to Base64 in this respect:
every 5 characters represent 4 bytes of the input, so we retain the ability to
recognize and compare 32- and 64-bit word chunks.
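Python’s standard library has shipped a Base85 variant since 3.4, which makes the alignment easy to verify:

```python
import base64

word = bytes.fromhex("deadbeef")         # one 32-bit word
print(len(base64.b85encode(word)))       # 5 characters per 4-byte word
print(len(base64.b85encode(word * 4)))   # 20: word boundaries stay in phase
```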
Base85 is tricky, but Base64 is the worst kind of tricky
Base64 is a nearly-alphanumeric encoding, which reserves for the (in some cases,
rarer) last two code words the + and
/ characters. Furthermore the
choice of these last two characters varies among implementations. I have no
doubt that this has caused bugs, e.g. a validation regex that assumed an
alphanumeric input.
Similarly, the encoding must itself be URL-encoded if the + and
/ scheme is
used, which has certainly caused bugs. Same story for the
= padding rule
(quite possible to misunderstand, fail to test against, or never observe in
testing).
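A quick demonstration of both sharp edges, using input bytes contrived so the standard alphabet emits only its two non-alphanumeric code words:

```python
import base64

data = bytes([0xFB, 0xFF, 0xFE])  # contrived so every output char is + or /
print(base64.b64encode(data))          # b'+//+' -- breaks inside a URL
print(base64.urlsafe_b64encode(data))  # b'-__-' -- RFC 4648 url-safe variant
print(base64.b64encode(b"hi"))         # b'aGk=' -- note the = padding
```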
Base85 schemes are of course more complicated (and likely slower). We’d hope to
find well-tested implementations on all the platforms we require but even so we
should be prepared for the possibility that we’d need to implement it ourselves
in the future.
More compact encodings compress worse
Much of the data flying around the internet is gzipped at some layer of a
protocol. Because Base64, Base85, etc. are out of phase with byte and word
boundaries, they tend to frustrate compression schemes by obscuring patterns in
block-oriented data. Here are examples of gzip applied to the same tortured
Hobbes quote (270 bytes of
ASCII text, compressing to 194 bytes):
Encoding | Original size | Compressed size
-------- | ------------- | ---------------
hex | 541 | 249
Base64 | 361 | 289
Base85 | 342 | 313
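The experiment is easy to rerun. A sketch (the Hobbes quote isn’t reproduced here, so this uses a stand-in string, and the exact byte counts will differ from the table):

```python
import base64
import binascii
import gzip

# stand-in for the quote: any repetitive ASCII text shows the same trend
text = b"the quick brown fox jumps over the lazy dog; " * 6

for name, payload in [("ascii", text),
                      ("hex", binascii.hexlify(text)),
                      ("base64", base64.b64encode(text))]:
    print(name, len(payload), len(gzip.compress(payload)))
```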
So for uncompressed binary data we can probably expect a more compact
encoding to result in more bloat over the wire in a gzipped payload.
Two other things that were interesting to me:
- all of the compression tools I tried did worse on the hex-encoded string than
on the original ASCII. Maybe that’s due to the size required for the decoding
table? We could test on larger strings
- gzip was able to compress 361 bytes of Base64-encoded data drawn from
/dev/urandom down to 316 bytes, so it’s clear Base64 doesn’t wholly obscure the
structure of the data from the compressor
Other alternatives and conclusions
It probably doesn’t matter, so just use Base64. If size is the only thing that
matters then I’d suggest zipping first and then using the most gnarly encoding
you can stomach. But maybe you should be using a proper binary protocol in that
case.
In a world where it was totally ubiquitous I would suggest using either the terminal-friendly
ZeroMQ Base85 flavor
or a simple hex encoding.
I also like that encodings like this one
exist though. It’s worth stepping back, doing some quick math, and making sure
that you’re optimizing for the right thing.