
GHC LLVM LTO Experiments Scratch Notes

This is a follow-up to this previous post, where I played with GCC’s link-time optimization in GHC-compiled programs. I noted there that this is limited, since it only works on sources compiled from C (crucially the RTS). Here I’ve compiled rough notes on doing the same thing with LLVM LTO, where I attempt to get true whole-program optimization cooking.

The goals were to play with some new tech and possibly discover some significant performance gains (without thinking very hard) that could be ported back or realized in a less hacky way. Results were underwhelming and so I don’t do much real digging into assembly (the other thing besides good results that would have made for an interesting post). So after waffling I decided to basically dump my notes here, lightly edited, in case it’s helpful to someone who wants to take up similar work later.

The GHC branch with scratch work is also a mess since it took a lot of fiddling and some manual intervention (below) before a full build completed. If you’re really interested in this work shoot me an email and I can try to help.

Building GHC for LLVM LTO

I used shims as before to quickly manipulate clang, lld, etc. for LTO. The first issue I ran into was that the symbols StgReturn and StgRun (both defined with inline asm) weren’t making it into the RTS library archive (in particular StgCRun.thr_o), possibly due to dead-code elimination; note all objects at this point are actually LLVM bitcode, not native object code.

I worked around this by adding the functions back directly to the LLVM IR which was quite nasty. See commit message.

The next big issue was that we weren’t able to produce LLVM IR bitcode from haskell code (e.g. in libraries) due to GHC’s LLVM mangler, which fiddles directly with the assembly and is supposedly necessary for tables-next-to-code.

Normally this is what happens when you compile with -fllvm

opt-6.0 -O2 -enable-tbaa -tbaa '-relocation-model=static' hello.ll -o /tmp/ghc15832_0/ghc_2.bc
llc-6.0 -O2 -enable-tbaa '-relocation-model=static' '-mcpu=x86-64' '-mattr=+sse2,+sse' /tmp/ghc15832_0/ghc_2.bc -o /tmp/ghc15832_0/ghc_3.lm_s
*** LLVM Mangler
clang -fno-stack-protector -DTABLES_NEXT_TO_CODE -iquote. -Qunused-arguments -x assembler -c /tmp/ghc15832_0/ghc_4.s -o hello.o
Linking blaaaah.lto ...
*** C Compiler:
clang -fno-stack-protector -DTABLES_NEXT_TO_CODE -c /tmp/ghc15832_0/ghc_5.c -o /tmp/ghc15832_0/ghc_6.o ...
*** C Compiler:
clang -fno-stack-protector -DTABLES_NEXT_TO_CODE -c /tmp/ghc15832_0/ghc_8.s -o /tmp/ghc15832_0/ghc_9.o ...
*** Linker:
clang -fno-stack-protector -DTABLES_NEXT_TO_CODE -o blaaaah.lto -lm -Wl,--gc-sections hello.o ...
link: done

But I found an (undocumented/experimental?) flag, -fast-llvm, that bypasses the mangler like so:

opt-6.0 -O2 -enable-tbaa -tbaa '-relocation-model=static' hello.ll -o /tmp/ghc18922_0/ghc_2.bc
llc-6.0 -O2 -enable-tbaa '-relocation-model=static' '-filetype=obj' '-mcpu=x86-64' '-mattr=+sse2,+sse' /tmp/ghc18922_0/ghc_2.bc -o hello.o
Linking blaaaah.lto ...
*** C Compiler:
clang -fno-stack-protector -DTABLES_NEXT_TO_CODE -c /tmp/ghc18922_0/ghc_3.c -o /tmp/ghc18922_0/ghc_4.o ...
*** C Compiler:
clang -fno-stack-protector -DTABLES_NEXT_TO_CODE -c /tmp/ghc18922_0/ghc_6.s -o /tmp/ghc18922_0/ghc_7.o ...
*** Linker:
clang -fno-stack-protector -DTABLES_NEXT_TO_CODE -o blaaaah.lto -lm -Wl,--gc-sections hello.o ...

So I manually added -fast-llvm to all the cabal files in the GHC tree, and shimmed llc-6.0 to be just a cp (so hello.o stays LLVM IR bitcode that can participate in LTO).

It’s not clear to me why the mangler doesn’t seem to actually be necessary (for the binaries I tested, on my machine).

The shims contained a fallback to compiling to object code, but eventually I got a GHC source tree of mostly LLVM bitcode:

$ for f in `find libraries -name '*.o'`; do grep -q LLVM <(file "$f") && echo "$f"; done  2>/dev/null | wc -l
$ for f in `find libraries -name '*.o'`; do grep -q  ELF <(file "$f") && echo "$f"; done  2>/dev/null | wc -l
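The `file`-based census above can also be done directly: plain LLVM bitcode files (as produced by opt) start with the magic bytes `BC 0xC0 0xDE`, and ELF objects with `0x7F 'E' 'L' 'F'`. A small sketch, with a hypothetical helper name:

```python
LLVM_BC_MAGIC = b'BC\xc0\xde'   # wrapper-less LLVM bitcode magic
ELF_MAGIC = b'\x7fELF'          # ELF object/executable magic

def object_kind(header_bytes):
    """Classify an object file from its first few bytes (sketch)."""
    if header_bytes.startswith(LLVM_BC_MAGIC):
        return 'llvm-bitcode'
    if header_bytes.startswith(ELF_MAGIC):
        return 'elf'
    return 'unknown'
```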


I played with a few things in addition to just LTO (again timing hlint).


PGO

This was basically the same process as before, using AutoFDO. It wasn’t clear whether it was really working as well as it could; I got a lot of warnings like:

ld.lld: warning: No debug information found in function r3UWn_info$def: Function profile not used
ld.lld: warning: No debug information found in function r3UUK_info$def: Function profile not used
ld.lld: warning: No debug information found in function hlintzm2zi0zi12zminplace_HintziPattern_zdwpatternHint_info$def: Function profile not used

I should have tried again with -fprofile-sample-accurate, which treats functions with no call data as cold rather than unknown. This isn’t in clang 6, but I learned later that newer clang/LLVM seem to work pretty well with IR from GHC’s LLVM backend.


Polly

Polly is a research optimization framework targeting LLVM IR, which is included/installed by default with my llvm-6 distribution; it’s straightforward to try from opt or clang.

As an aside, to see what opt -polly (or other flags) actually does, you can do:

    $ touch fake
    $ opt fake  -O3 -polly -debug-pass=Arguments   # and compare to...
    $ opt fake  -O2 -debug-pass=Arguments

It’s a bit sketchy since optimization passes are not generally idempotent, but I decided to just run opt -O3 -polly over all the (optimized) LLVM IR in the GHC tree and ~/.cabal/store. There are a couple scripts for unpacking/repacking library archives in the SHIMS directory, if you ever need something like that.


Souper

Souper is an experimental tool from Google that was annoying to install but actually pretty straightforward to use. It tries to discover new peephole-style optimization opportunities using an SMT solver (there’s like a whole paper and stuff), so it’s more something for compiler writers than something to include in an application compilation pipeline.

I first focused on Evac.c, a hotspot in the RTS. This found a few optimizations:

$ souper -z3-path=/home/me/.local/bin/z3  Evac_thr.thr_o

This ran much longer and found a few more:

$ souper -z3-path=/home/me/.local/bin/z3  -souper-exhaustive-synthesis   Evac_thr.thr_o

Here’s one example picked at random that you can stare at for a while:


Timing Results

Again an hlint run over the lens code tree, like:

hlint lens-4.17 --report=/dev/null -q --threads=4 +RTS -s -A32M
  • Stock GHC 8.6 (native codegen):

     30,029,597,382 instructions
     3.577023660 seconds time elapsed
  • Stock LLVM (stock ghc + fllvm)

     29,361,619,562 instructions:u            ( +-  0.45% )
     3.412444661 seconds time elapsed      ( +-  1.49% )
  • LLVM LTO after successfully compiling libraries to LTO objs:

     28,858,041,897 instructions:u            ( +-  0.03% )
     3.372028439 seconds time elapsed      ( +-  0.19% )
  • LLVM LTO (-O3):

     27,912,731,892 instructions:u            ( +-  0.17% )
     3.330403689 seconds time elapsed      ( +-  1.00% )
  • First attempt at PGO (lots of warnings about lack of profile data (mangling issue?)):

     27,965,192,528 instructions:u            ( +-  0.25% )
     3.396919779 seconds time elapsed      ( +-  1.97% )
  • PGO with O3:

     27,882,344,880 instructions:u            ( +-  0.06% )
     3.332535151 seconds time elapsed      ( +-  0.88% )

     27,509,027,644 instructions:u            ( +-  0.41% )
     3.370770830 seconds time elapsed      ( +-  1.51% )
  • LLVM LTO O3 + Souper over full cabal new-build:

     27,799,718,133 instructions:u            ( +-  0.18% )
     3.292543353 seconds time elapsed      ( +-  0.73% )
  • LLVM LTO O3 + souper on Evac (RTS):

     27,905,099,478 instructions:u            ( +-  0.16% )
     3.308439822 seconds time elapsed      ( +-  0.47% )

Stripped hlint executable sizes for LLVM lto experiments…

    20945776    llvm_lto
    20966256    llvm_lto_pgo1
    21113712    llvm_lto_O3
    21336792    llvm_lto_souper_evac_lld9
    21349120    llvm_lto_O3_souper_all
    21349184    llvm_lto_O3_souper_all_exhaustive
    21357304    llvm_lto_souper_evac
    22039256    llvm_lto_polly

…and for previous gcc LTO work:

    22415344    lto_pgo
    22423424    lto_Os
    22431616    lto_basic_O2
    22431616    lto_march_native
    22431616    lto_march_native2
    22431616    lto_march_native_O3
    22432216    gcc_stock
    24612696    llvm_stock

A Little Tool for Visualizing GHC Verbose Timing

I wrote a little tool to graph the fine-grained timing logs produced by ghc when the -v (verbose) flag is given. These logs, written to STDERR, look like e.g.:

!!! Liberate case [Data.Binary.Class]: finished in 32.06 milliseconds, allocated 14.879 megabytes
!!! Simplifier [Data.Binary.Class]: finished in 873.97 milliseconds, allocated 563.867 megabytes
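These lines are regular enough that parsing them is only a few lines of work; a sketch (the regex and function name are my own guesses, not code from the actual tool):

```python
import re

# Matches ghc -v lines like:
#   !!! Simplifier [Data.Binary.Class]: finished in 873.97 milliseconds,
#   allocated 563.867 megabytes
TIMING_RE = re.compile(
    r'^!!! (?P<phase>.+?) \[(?P<module>[^\]]+)\]: '
    r'finished in (?P<ms>[\d.]+) milliseconds, '
    r'allocated (?P<mb>[\d.]+) megabytes$'
)

def parse_timing_line(line):
    """Return (phase, module, milliseconds, megabytes), or None."""
    m = TIMING_RE.match(line.strip())
    if m is None:
        return None
    return (m.group('phase'), m.group('module'),
            float(m.group('ms')), float(m.group('mb')))
```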

The project is on GitHub at jberryman/ghc-timing-treemap, along with a screenshot.

Navigating within the graph is pretty slow, at least in Firefox, and there are some other improvements that could be made. For instance, some of the phases are run multiple times on the same module, and it would be nice to see these grouped where the module name is logged.

Initial Hacking of GHC for GCC Link-time Optimization

I spent some time hacking GHC to use GCC’s link-time optimization, and wanted to share the results. The idea was to see whether we could get performance gains or other interesting results from:

  • cross module inlining, e.g. between C libraries from base and the RTS
  • GCC build flags that can only be added later, like -march=native to tailor optimizations to our microarchitecture
  • possibly nice interactions at the barrier between haskell code and LTO’d C?

This was all very speculative, and the results you see below aren’t ultimately very interesting.

The change is pushed to this branch and you can check out the commit or cherry-pick it here: 6f369534. It’s full of hacky workarounds, notes and scratch work (some of which isn’t relevant at all), etc.

Notably hacky is the use of shim executables for gcc, etc. after I found GHC’s build system wasn’t respecting flags or doing what I was expecting.

Hacking GHC

All the ugly details are in the changeset and I won’t go over most of it, but essentially the process was:

  • compile with -flto -fno-fat-lto-objects (see shim), and keep fixing or working around things until the build completes
  • verify that LTO seemed to have worked by looking for .gnu.lto sections, e.g. with for f in $(find . -name '*.o'); do grep -q gnu.lto <(readelf -Wa "$f") && echo "$f"; done 2>/dev/null
  • finally recompile GHC again with -ffat-lto-objects so that we can compile our own haskell programs, choosing to use LTO or not

“Fat LTO objects” are regular object files which also have a special section containing gcc’s GIMPLE IR and other data it needs, so that they can be linked normally (even by e.g. clang) or can take advantage of LTO by invoking gcc from the linker. The downside is they make it difficult to be sure LTO is really happening.

Compiling haskell using hacked GHC

You can use a local GHC tree from cabal as follows:

$ cabal --with-compiler=SOMEDIR/ghc/inplace/bin/ghc-stage2  --package-db=SOMEDIR/ghc/inplace/lib/package.conf.d ...

I found it was pretty obvious when LTO was working and when we were (successfully) falling back to normal compilation using the “fat” part of our “fat objects”, because the LTO build would bizarrely spit out a -Werror warning from the ghc libraries (one that I couldn’t figure out how to silence, but which didn’t actually cause an error):

libraries/ghc-prim/cbits/atomic.c:219:10: error:
     note: ‘__sync_fetch_and_nand’ changed semantics in GCC 4.4

Also interestingly, linking took noticeably longer with LTO, but only by a few seconds.

To build using the hacked GHC but without doing LTO this worked:

$ cabal new-build --ghc-option=-optl-fno-lto ...

…but again we need the shims in our path.

…and Using PGO

GCC can perform profile-guided optimization by recompiling and optimizing with a performance profile collected from the program to be (re)compiled. The result might be better decisions re. inlining or locality. This was a good PGO tutorial that covers some gotchas that I encountered.

We can generate a profile using AutoFDO from data collected by perf, which means the binary doesn’t have to be built with expensive instrumentation (which also might throw off the resulting performance profile).

The general steps looked like:

$ cabal new-build ...
$ perf record -b <hlint ...>
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 5.125 MB (12379 samples) ]
$ create_gcov --binary=dist-newstyle/build/x86_64-linux/ghc-8.6.3/hlint-2.0.12/x/hlint/build/hlint/hlint --gcov=/tmp/ -gcov_version 1

Then we rebuild hlint after first modifying our gcc shim by adding -fauto-profile=/tmp/, and removing -flto.


I used hlint as a test case, compiled in various ways, running over the lens codebase. The variations I measured (again by passing different args in the gcc shim, which stayed in my PATH) were:

  • -O2 (a flag already in the shim when we built GHC)
  • -Os (optimize for small code size)
  • -O2 with profile-guided optimization (see above for a walk-through)
  • -march=native -O2, and
  • -march=native -O3 for auto-vectorization and magic ponies

I was hopeful the last two results would be interesting, since for a very slight increase in compile time users could tune the RTS to their microarchitecture in a way that GHC HQ cannot when preparing binary distributions, for instance.

As I mentioned the results were not too exciting:

bench results

Some notes:

  • timing and metric collection was done with several runs of perf stat --repeat=5 -e event1,event2,..., limiting the number of events we passed to perf so it didn’t need to multiplex
  • standard deviation for minor-faults was 3 - 5%, all others were < 1.5%
  • number of allocations and GCs were identical
  • I suspect that the regular O2 and two march=native versions were essentially the same (they all have exactly the same stripped size), although I didn’t look more closely at it.

On the last point, the manual states that “it is generally ineffective to specify an optimization level option only at link time and not at compile time”, though I can’t find much detailed guidance on that.

Interpreting Benchmarks

We’re interested in elapsed-time and how the other metrics might be correlated with it. The percentage improvements are relative to hlint compiled with stock ghc. The no_lto benchmark is hlint compiled with our hacked GHC but without LTO enabled (see above).

We can see a very small improvement in runtime. This might be attributable to the smaller number of branch misses, or something else entirely…

For the most part no_lto is acting as a useful control. It mostly tracks very closely to the performance of stock-ghc-compiled-hlint.

I’m not sure what to make of the instruction cache misses (this tends to be somewhat high in GHC code so I was interested). In sum the perf metrics don’t provide me much insight.

It’s not surprising that LTO doesn’t provide a huge benefit since only a small (but important) part of a haskell program originated from C. I’m not sure whether any optimizations become possible for haskell code linking with fat LTO objects, but likely not.

Final note: ghc itself seems to dynamically link its runtime and other libraries, so I’m not sure it could fully benefit from LTO (if there were any significant benefit).

Inspecting the code

I wanted to find at least one example of LTO that I could understand and present. I took the opportunity to start learning a new tool for me: radare2. This tutorial walkthrough was really helpful.

First since I was mostly interested in the runtime I compiled two “Hello world” programs, both with my patched GHC, one with LTO and one without.

Here are the stripped sizes:

821016 hello_lto
768280 hello_no_lto

I spent some time trying to use radiff2 to compare the files, which claimed to find thousands of differences while printing screenfuls of unpleasant messages about “anal threshold”.

I ended up trying to diff various hot functions:

$ radiff2 -AC hello_lto hello_no_lto
                  sym.evacuate1   189 0x405bd0 | UNMATCH  (0.888889) | 0x408110    189 sym.evacuate1
sym.scavenge_block1.lto_priv.90  3327 0x408c10 |     NEW  (0.000000)
            sym.scavenge_block1  3117 0x4050a0 |     NEW  (0.000000)

$ radiff2  -g sym.evacuate1 hello_lto hello_no_lto | xdot -

I saw some instruction reordering (maybe due to an implicit -march=native making different pipelining decisions?), but nothing very interesting.

I went back and browsed the radiff2 instruction list, this time paying attention to size differences (hoping to see inlining). This gave me a jumping off point to start poking with r2 (the main radare forensics program), and I found getMonotonicNSec and getProcessElapsedTime formed a nice example.

In rts/ in the GHC tree we can see this code:

StgWord64 getMonotonicNSec(void)
{
    struct timespec ts;
    int res;

    res = clock_gettime(CLOCK_ID, &ts);
    if (res != 0) {
        // ... SNIP ...
    }
    return (StgWord64)ts.tv_sec * 1000000000 +
           (StgWord64)ts.tv_nsec;
}

Time getProcessElapsedTime(void)
{
    return NSToTime(getMonotonicNSec());
}

Asking r2 to print what it thinks is the disassembly of this function in the non-LTO version of our hello.hs:

We can see a few things here:

  • radare seems to be confused and thinks that getMonotonicNSec is recursive, because the function that follows (getProcessElapsedTime) is optimized into a jmp. So it just displays the label corresponding to that function as a turquoise little comment.
  • r2 is telling us about two different callers of getMonotonicNSec and a few different callers of getProcessElapsedTime (…which it doesn’t think is a function…?)

We can jump (like ls in a shell) to the first of the callers of getProcessElapsedTime, and again ask r2 to print the function we’re in. Again, this is the non-LTO binary:

So r2 thinks we’re in a function called fdReady() and we call sym.getProcessElapsedTime towards the bottom there.

We can again reference the GHC source and find in libraries/:

int fdReady(int fd, bool write, int64_t msecs, bool isSock)
{
    bool infinite = msecs < 0;
    // ... SNIP ...
            if (!infinite) {
                Time now = getProcessElapsedTime();
                remaining = endTime - now;
            }
    // ... SNIP ...
}

Okay, so that all makes some sense; the libraries (and the C bits in them) shipped with GHC are compiled separately from (and before?) the RTS code. While we can see there is some indirection here, it makes sense that this couldn’t be optimized away (but see conclusion for more).

Let’s look at the LTO version of our hello.hs:

So we can see again (after some head scratching already accomplished above) that r2 can’t figure out where getMonotonicNSec ends. But we also don’t see getProcessElapsedTime at all. Indeed, if we ask r2 to print it we see it’s not there:

[0x00409a20]> pdf@getProcessElapsedTime
Invalid address (getProcessElapsedTime)
|ERROR| Invalid command 'pdf@getProcessElapsedTime' (0x70)

Looking for the fdReady function that, in the non-LTO code, called getProcessElapsedTime we see:

So fdReady is calling getMonotonicNSec and the jmp indirection that we had in the non-LTO version has been optimized away. And across modules that were compiled completely separately!

Conclusions and Future Work

Without demonstrating significant performance improvements I don’t think you can motivate supporting this in GHC.

It is nice though that this change could be added without breaking support for other C compilers (by using fat LTO objects), and (if using LTO) with very little increase in compile time.

Some misc possibilities:

  • gcc LTO simply doesn’t have much to do since all the haskell code is (I think) opaque to the linker-compiler
  • there are optimization barriers in the ghc build system; if we fixed them we would see more improvement
  • better optimizations on the initial build side might yield improvements during LTO (i.e. it’s not clear to me what gcc does with flags passed during LTO vs in the original compilation)
  • a different test case (maybe something less IO-bound? Or something more IO-bound…?) might demonstrate more improvements
  • maybe PGO is not effective during LTO, and we’d see some benefits to, say, recompiling the RTS using a generic profile
  • maybe things would work much better with new GCC (I’m using 6.3)
  • maybe there are other arch-specific optimizations I didn’t try that are beneficial, but which can’t be performed when compiling the generic bindist; i.e. LTO is beneficial not for whole-program optimization, but just for deferring some compilation decisions.

The work also uncovered some quirks or possible improvements to some of the C code in the GHC tree. For instance there might be benefits to hiding symbols in the RTS code; in our shim we use -fno-semantic-interposition as a big hammer to do this, though I’m not sure what effect it had.

I’m not going to work more on this, but do plan on playing with LLVM’s LTO (which is how this project started out); I think I may have worked out the issue that blocked me in the course of this GCC LTO work. The interesting thing there is that theoretically we could get actual whole-program LTO on LLVM IR bitcode (since haskell has an LLVM backend).

Hacking Around on an FV-1 Based Guitar Pedal

This is a diary-style post where I chronicle what I learn digging into a digital guitar effects pedal for the first time. I’m not sure how useful or interesting it will be to others.

I picked up an oddball modulation pedal from a “boutique” maker (being coy here) and decided there were a bunch of things I could improve about it, so I thought I’d try to tweak it. I know almost nothing about electronics.

I took the guts out of the enclosure and first googled the chips on the board. The largest was labeled “SpinSemiconductor FV-1”. This is a “batteries included” DSP chip, and the only real product from Spin.


One of the pleasures of this project has been learning a little about the designer of the FV-1, Keith Barr. He passed away in 2010 and some of his contributions are summarized in this tribute: cofounder of MXR and creator of the Phase 90, founder of Alesis, pioneer in bringing digital recording and then digital reverb to the masses, etc.

The FV-1 was developed and released in the mid-2000s and is responsible for the boom in “boutique” reverb and (along with the PT2399) non-BBD delay pedals, being used by the likes of: Old Blood Noise, Catalinbread, Neunaber, Earthquaker, Red Panda, Keeley, Dr. Scientist, Walrus Audio, etc, etc.

I’m new to the space but it seems like Keith was the only person doing this sort of work (cheap DSP chip that acts like another analog component, accessible to hobbyists) and with his death there is sadly no “next-gen” FV-1 coming along.

I also noticed what turned out to be an EEPROM chip, and a spot marked “pgm” where a 5-pin connector header could be soldered, presumably for programming. I guessed the EEPROM was where the code itself was stored (having read that this was a capability).

I traced the “pgm” solder pads carefully (using the setting on my multimeter that beeps when a circuit is formed), and found they all connected to the EEPROM. I drew up a little picture showing how the header and EEPROM chip relate.

At this point I was feeling pretty hopeful that I might be able to both write and possibly dump/decompile code from the EEPROM and tweak the functionality to my liking (maybe… eventually).

Dumping the dang ROM

I didn’t have any kind of special programmer, but found that reading and writing an EEPROM is easily done with an arduino.

However I got concerned looking at the hardware setup: they connect 5v to pin A2, but all of mine are grounded.

I looked at the data sheet for the EEPROM and was surprised to find it really readable, even for someone without much embedded background! I learned (sec 5.0 of the manual) that these pins are for configuring the I²C address of the chip (this is clear in the tutorial, but I didn’t get it at first). So my EEPROM was hardwired to communicate at a different address than the tutorial’s (0x50).
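Concretely, these 24-series EEPROMs answer on a 7-bit I²C address whose top four bits are fixed (1010) and whose low three bits come from the A2/A1/A0 pins. A tiny sketch with a hypothetical helper name, just to make the datasheet’s table concrete:

```python
# 7-bit I2C address for a 24-series EEPROM: fixed 0b1010 prefix, then the
# A2/A1/A0 address pins (each 0 if grounded, 1 if tied to Vcc).
def eeprom_i2c_address(a2, a1, a0):
    return 0b1010 << 3 | a2 << 2 | a1 << 1 | a0
```

With all three pins grounded (as on my board) that gives 0x50; with A2 tied high, as in the tutorial’s wiring, it gives 0x54.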

Another thing that confused me for a second: the manual states the EEPROM’s write-protect pin needs to be connected to either Vcc or Vss (ground), but it didn’t seem to be connected to either (according to my multimeter). After looking closer I found it was connected to Vcc via a 1k resistor (I guess too much resistance for my multimeter to see continuity). I think the idea is that the corresponding pin from the header could be connected to ground, with current flowing there and overriding the write-protect. I’m sure this is probably a common technique.

I have an older “Duemilanove” arduino and had to look up the SDA and SCL pins on my board (pretty easy to find on the arduino site).

I hoped the pedal’s circuitboard would have the necessary pullup resistors soldered on so I didn’t have to use a breadboard at all but that didn’t seem to be the case. The arduino’s ATmega’s built-in pull-up resistors also apparently won’t work reliably for i2c communication, so I just wired things up like the SparkFun article above shows.

Here’s what that looked like (mostly for my own future reference); mystery pedal on the left, Duemilanove in the background:

wired up for flashing EEPROM

(the dangling green wire can be connected to Gnd on the arduino to make the EEPROM writeable).

I used the arduino sketches from the sparkfun tutorial above, just tweaking some basics:

  • modified EEPROM_ADR (zeroing out the least-significant bits, since all three address pins on my EEPROM were grounded, i.e. hardwired to 000)
  • loop over 32k bits with #define EEPROM_SIZE 4096

For developing and uploading the arduino sketch I used this vim plugin. I just had to add to my .vimrc:

let g:arduino_board = 'arduino:avr:diecimila'

…then an :ArduinoUpload in VIM builds and uploads the sketch to the board.

And the following seemed to be the simplest way to listen on the arduino serial port for the dumped EEPROM bytes. Note the baud passed to -s here should match Serial.begin(<baud>) in the arduino code:

$ busybox microcom -s 115200 /dev/ttyUSB0 > pedal.rom

After running this for the first time and inspecting the output with hexdump, the project went off the rails for a while: the output repeated every 512 bytes (so was this actually a 4096-bit EEPROM?). Microchip’s docs describe it as “a single block of 4K x 8-bit memory”.

IYI: in fact “4K” here means 4096; they also write “kbit” to mean 1024 bits, which is not what I understand the term to mean. Bafflingly, nowhere in the datasheet are they unambiguous about the actual size of the EEPROM. Also, AFAIU there is no way to determine the size of an EEPROM without writing to it and reading back.

Or was it communicating with the arduino’s internal EEPROM on the same i2c address or something? (I unplugged one of the lines going to the EEPROM and got garbage, so ruled this out)

Or is the reader program actually buggy and skipping every 8 bytes, and then wrapping?

After inspecting the code I became more and more convinced the latter was what was happening:

  • Arduino’s Wire library doesn’t look like i2c at all (in fact it’s communicating with a separate chip that does i2c asynchronously (although the Wire library doesn’t take advantage of this))
  • the library is an undocumented mess, though this helped a little
  • most of the EEPROM code I tried didn’t actually match the documented spec as far as I could tell (e.g. no repeated START)

In short fertile ground for wasting a bunch of time chasing red herrings…

What was really happening is that the chip actually contained 8 banks of identical programs, which is what the FV-1 in fact expects (they can be eight different programs, obviously). Had I done a little more basic initial research about the FV-1, or taken the time to quickly rule out the idea that the EEPROM dump was correct despite thinking that was unlikely (easily done by writing to the last byte and reading it back, which is what I eventually did), I would have saved myself a lot of time. This is like bayesian reasoning, and it’s really easy to not do.
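The bank structure is easy to test for directly: the FV-1 stores eight 512-byte programs (128 instructions of 32 bits each) in the 4 KB EEPROM, so a dump of identical banks is just its first 512 bytes repeated. A sketch of the sanity check I should have run (hypothetical function name):

```python
# True if `rom` is just its first `bank_size` bytes repeated verbatim,
# i.e. all banks hold the same program.
def bank_periodicity(rom, bank_size=512):
    first = rom[:bank_size]
    return all(rom[i:i + bank_size] == first
               for i in range(0, len(rom), bank_size))
```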

Also, oops. I had 5v running to the EEPROM (connected to the FV-1), although FV-1 takes 3.3v. Luckily this didn’t seem to fry the board…

Assembling / Disassembling

Someone named Igor has written an FV-1 decompiler, presumably as a PHP script. Unfortunately the code doesn’t seem to be open source, and I wasn’t immediately sure how to contact the author. After uploading my ROM I got back a legitimate-looking but at-this-point-totally-meaningless-to-me bunch of annotated assembly, like:

   CHO     RDAL, SIN0      ; { ACC = SIN0 } ; Get LFO value
   WRAX    REG7 , 0        ; { REG7 = ACC; ACC = 0 * ACC ; PACC = ACC } ; * Note: C=0(16 bit) - s1.14 format
   OR      0x20000         ; { ACC = ACC | %00000000000000100000000000000000 }
   RDAX    REG7 , 1        ; { ACC = ACC + REG7 * 1 ; PACC = ACC } ; * Note: C=1(16 bit) - s1.14 format

So that’s cool. Now I need to get an assembler and I’m off to the races.

The standard SpinASM assembler is Windows-only (I haven’t tried it with Wine yet), so I tried an alternative one that github user ndf-zz has written in python, called asfv1:

$ pip3 install asfv1
$ asfv1 -b pedal.rom.disassembled pedal.rom.reassembled

I got some errors which seemed related to a documented quirk that I didn’t totally understand. After adding decimals to the problematic literals I was able to successfully assemble my disassembled program!

Unfortunately, comparing the first 512 bytes dumped from the EEPROM with my disassembled/reassembled output from asfv1 showed some differences. I uploaded the new ROM to the disassembler service again, looked at a diff, and saw that many immediate values of 1 had been turned into e.g. 6.103515625E-5 or 0.001953125 after going through asfv1 (and the mystery blackbox disassembler).

I re-read the asfv1 README more carefully (I’d read “fixed-point” as “floating point”), did a little research, looked at a diff of the hexdumps of the ROMs, and what was happening became pretty obvious:

asfv1 compiled the first line of assembly below to 0x0001, while the binary output for the original dumped ROM was achieved by using the decimal literal, as in the second line below:

                 `1` as unsigned 16-bit int

                          /   \
RDAX   ADCL , 1           00 01 02 84 ...etc
RDAX   ADCL , 1.0         40 00 02 84 ...etc
                          \   /

               `1` as s1.14 fixed point value:
                       \ fractional /

I wasn’t familiar with the notation “s1.14” used in the asfv1 and FV-1 ASM docs, but it quite simply means a fixed-point real value represented by: a sign bit, followed by 1 integer bit, followed by 14 fractional bits (totalling 16 bits).
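So encoding a real into s1.14 is just multiplying by 2^14 and rounding; decoding is the reverse. A sketch with hypothetical helper names (this is not asfv1’s actual code):

```python
# s1.14: a sign bit, 1 integer bit, 14 fractional bits (16 bits total),
# covering the range [-2, 2).

def real_to_s1_14(x):
    """Encode a real in [-2, 2) as a 16-bit s1.14 word."""
    return round(x * (1 << 14)) & 0xFFFF

def s1_14_to_real(w):
    """Decode a 16-bit s1.14 word back to a float."""
    if w & 0x8000:          # sign bit set: take two's complement
        w -= 0x10000
    return w / (1 << 14)
```

This reproduces the mystery values from the diff: 1.0 encodes to 0x4000 (the `40 00` bytes above), while the raw word 0x0001 read back as s1.14 is exactly 1/16384 = 6.103515625E-5.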

I dug into the asfv1 code and tweaked things so that, for real arguments, we treat decimal literals as real literals, and values entered in hexadecimal or binary notation as raw bit values.

With my fork I successfully assembled the output I got from the blackbox disassembler, and miraculously the output from the new asfv1 matches the original program we dumped from the EEPROM (head -c 512 pedal.rom)!

$ md5sum * | sort                                                                                                    
38092c4673ff63f561ad3413c732db43  pedal.rom.reassembled_with_my_asfv1_fork
38092c4673ff63f561ad3413c732db43  pedal.rom.1
9d13dcb79754603e85eca19cbf405c4a  pedal.rom.reassembled

I did a quick hack job on asfv1 without understanding the code or SpinASM very deeply, so beware; but if you want to try out the fork you can do:

$ pip3 install git+

Building SpinCAD Designer

I get the impression many if not most pedal makers are developing their algorithms with the GPL’d drag-and-drop editor SpinCAD Designer.

It seems difficult to build, but luisfcorreia has a fork where they’ve made it buildable with maven. Here’s what I had to do to get it to build and run successfully on my debian stretch box:

$ git clone
$ apt install maven
$ git branch maven
$ git pull dev
$ cd spincad-designer  # seems this work was done on a copy of the codebase to a different dir in the repo...?
$  mvn package
$ java -classpath ./target/spincad-1.0-SNAPSHOT.jar:./lib/elmGen-0.5.jar  com.holycityaudio.SpinCAD.SpinCADFrame

SpinCAD is able to import “Spin Hex” which apparently means the Intel HEX encoded ROM data. This is I guess a common format to feed to EEPROM writers, programmers, etc.

After some trial and error I was able to convert the binary image into HEX in such a way that “File > Open Hex” in SpinCAD didn’t choke:

$ sudo apt install srecord
$ srec_cat pedal.rom.1 -binary -output pedal.rom.1.hex -Intel --line-length=19
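The Intel HEX format srec_cat emits here is simple enough to sketch by hand. A minimal Python encoder might look like the following (data records plus an EOF record, no extended addressing, which is fine for a 512-byte image; this is a sketch of the format, not a replacement for srec_cat):

```python
def intel_hex(data, record_len=16):
    """Minimal Intel HEX encoder: fixed-length data records followed
    by an end-of-file record."""
    lines = []
    for addr in range(0, len(data), record_len):
        chunk = data[addr:addr + record_len]
        # byte count, 16-bit address, record type 00 (data), payload
        rec = bytes([len(chunk), addr >> 8, addr & 0xFF, 0x00]) + chunk
        checksum = (-sum(rec)) & 0xFF  # two's complement of the byte sum
        lines.append(":" + (rec + bytes([checksum])).hex().upper())
    lines.append(":00000001FF")  # EOF record
    return "\n".join(lines)
```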

I was curious if SpinCAD would be able to disassemble and recognize the idioms from a spin ROM but unsurprisingly it does not. I probably won’t be using SpinCAD for this project, but the library of “modules” in the source code might be really valuable to learn from, or maybe I’ll build a visual “disassembler” myself using them some day.

Appendix: links, etc

A nice write-up/experience report from a first-time FV-1 developer, with good advice and links:

Easy DIY dev boards for FV-1 (why aren’t there more of these? What is the cheapest commercial FV-1 based pedal available?)

Alternatives to the FV-1, and some history:

Great resources for DIY pedal building generally are to be found here. I found discussions of the FV-1 on all of these, though most builders focus on analog electronics:

Choosing a Binary-to-text Encoding

I had an occasion to think about text-friendly binary encoding schemes for the first time at work. The obvious choice is Base64, but my immediate subsequent thought was “there must be something much more efficient for our purposes”, and a quick google led here, in which the OP echoes the same feeling:

It seems to me that we can greatly improve since on my keyboard I already see 94 easily distinguishable printable keys.

But of course if our goal is to “greatly improve” over Base64, then with a little thought we might conclude that the answer is “no, we can’t”. In Base64 we use 2^6 = 64 tokens, each of which represents a 6-bit string. Since those tokens are ASCII characters they take up 8 bits each. So with Base64 we’re already at 75% efficiency (or “25% bloat”, or “33% larger than binary”), which seems… not so bad.

You can read about other binary encoding schemes here. From that table, we can see that Base85, which @rene suggests, is modestly better at 80% efficiency. Base122 (the best on the list that can reasonably be called a “text encoding”) is 86% efficient.
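These efficiency figures fall straight out of a logarithm. A back-of-the-envelope calculation (log2 of the alphabet size gives a theoretical upper bound; real schemes round to whole characters, e.g. Base85 packs exactly 4 bytes into 5 characters for exactly 80%):

```python
import math

def efficiency(alphabet_size):
    """Payload bits carried per 8-bit output character: an upper
    bound on the efficiency of an encoding with this alphabet."""
    return math.log2(alphabet_size) / 8

print(f"Base64:  {efficiency(64):.1%}")   # 75.0%
print(f"Base85:  {efficiency(85):.1%}")
print(f"Base122: {efficiency(122):.1%}")
```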

Decision criteria

So you can make your messages ~13% smaller by ditching Base64 for the most exotic Base122, but what about other considerations?

Things you really want in a binary-to-text encoding, aside from size-efficiency are:

  • correct cross-platform compatible encoding/decoding; easy to get right
  • fast to encode/decode
  • compresses well (gzip is everywhere)

Other things that would be nice are that the encoding make some sense to the naked eye, be easily diffable, maybe even support some operations without requiring decoding (a binary search say).

It’s possible that all of these are more important to you than the efficiency of the raw encoding. With that in mind let’s consider a third option (in addition to Base64 and BaseExoticInteger): the venerable hex encoding.

Hexadecimal (base-16) encoding requires two ASCII characters to represent a byte, so it’s 50% efficient. But as we’ll see it’s arguably better than these other two according to every one of our other criteria!

Base64 is not sensible to the naked eye

Base64 encodes 6 bits per character. This means 3 octets (bytes) of input become 4 characters of the output. In a world where our data is overwhelmingly based on bytes and words our Base64 encoding is horribly out of phase!

When we see the two strings:


…our senses don’t tell us anything. Whereas in hex the lizard brain is able to perceive patterns, symmetry and magnitude right away:


There must be value to being able to use our eyes (especially when it’s the only sense we haven’t abandoned for the work we’re doing). The former might represent an obscured bug in an encryption routine for instance.
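To illustrate the phase problem with made-up data: hex preserves byte boundaries, so a repeated word and a single-byte difference are visible at a glance, while Base64 smears every 3 bytes across 4 characters:

```python
import base64

# Two 16-byte strings that differ in a single byte (illustrative data):
a = bytes.fromhex("00010203040506070001020304050607")
b = bytes.fromhex("00010203040506070001020304ff0607")

# Hex: the repeated 8-byte word and the "ff" are easy to spot.
print(a.hex())  # 00010203040506070001020304050607
print(b.hex())  # 00010203040506070001020304ff0607

# Base64: the repetition falls out of phase with the 4-character
# groups, and the differing byte is hard to localize.
print(base64.b64encode(a).decode())
print(base64.b64encode(b).decode())
```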

Interestingly a Base85 encoding is also superior to Base64 in this respect: every 5 characters represent 4 bytes of the input, so we retain the ability to recognize and compare 32- and 64-bit word chunks.

Base85 is tricky, but Base64 is the worst kind of tricky

It’s a nearly-alphanumeric encoding that reserves the + and / characters for the (in some cases rarer) last two code words. Furthermore the choice of these last two characters varies among implementations. I have no doubt that this has caused bugs, e.g. a validation regex that assumed an alphanumeric encoding.

Similarly, the encoding must itself be url-encoded if the + and / scheme is used, which has certainly caused bugs. Same story for the = padding rule (quite possible to misunderstand, fail to test against, or never observe in examples).

Base85 schemes are of course more complicated (and likely slower). We’d hope to find well-tested implementations on all the platforms we require but even so we should be prepared for the possibility that we’d need to implement it ourselves in the future.

More compact encodings compress worse

Much of the data flying around the internet is gzipped at some layer of a protocol. Because Base64/85 etc. are out of phase with byte and word sizes, they tend to frustrate compression schemes by obscuring patterns in block-oriented data. Here are examples of gzip applied to the same tortured Hobbes quote (270 bytes of ASCII text, compressing to 194 bytes):

Encoding | Original size | Compressed size
-------- | ------------- | ---------------
hex      | 541           | 249
Base64   | 361           | 289
Base85   | 342           | 313

So for uncompressed binary data we can probably expect a more compact encoding to result in more bloat over the wire in a gzipped payload.
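Here’s a sketch of the experiment, using zlib rather than the gzip CLI and an arbitrary repetitive ASCII sample in place of the original 270-byte quote (so the exact numbers will differ from the table above):

```python
import base64, zlib

# A repetitive ASCII sample standing in for the Hobbes quote:
text = 2 * (b"No arts; no letters; no society; and which is worst of "
            b"all, continual fear and danger of violent death; and the "
            b"life of man, solitary, poor, nasty, brutish, and short. ")

# Compare sizes before and after compression for each encoding.
for name, enc in [("raw", text),
                  ("hex", text.hex().encode()),
                  ("base64", base64.b64encode(text))]:
    gz = zlib.compress(enc, 9)
    print(f"{name:6} {len(enc):4} -> {len(gz)}")
```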

Two other things that were interesting to me:

  • all of the compression tools I tried did worse on the hex-encoded string than on the original ASCII. Maybe that’s due to the size required for the decoding table? We could test on larger strings
  • gzip was able to compress 361 bytes drawn from /dev/urandom to 316 bytes, so it’s clear Base64 doesn’t wholly obscure the structure of the data to our compression algorithm

Other alternatives and conclusions

It probably doesn’t matter, so just use Base64. If size is the only thing that matters then I’d suggest zipping first and then using the most gnarly encoding you can stomach. But maybe you should be using a proper binary protocol in that case.

In a world where it was totally ubiquitous I would suggest using either the terminal-friendly ZeroMQ Base85 flavor or a simple hex encoding.

I also like that encodings like this one exist though. It’s worth stepping back, doing some quick math, and making sure that you’re optimizing for the right thing.

Almost Inline ASM in Haskell With Foreign Import Prim

With help from Reid Barton in questions here and here I discovered it’s pretty easy to call assembly from GHC haskell with minimal overhead, so I cleaned up an example of this technique and posted it here:

This is especially useful if you want to return multiple values from a foreign procedure, where otherwise with the traditional FFI approach you would have to do some allocation and stuff the values into a struct or something. I find the above more understandable in any case.

Here’s an example of the dumped ASM from the Main in the example above:

    call newCAF
    addq $8,%rsp
    testq %rax,%rax
    je _c73k
    movq $stg_bh_upd_frame_info,-16(%rbp)
    movq %rax,-8(%rbp)
    movq $block_info,-24(%rbp)
    movl $4,%edi
    movl $3,%esi
    movl $2,%r14d
    movl $1,%ebx
    addq $-24,%rbp
    jmp sipRound_s_x3
    movq $104,904(%r13)
    movq $block_info,-32(%rbp)
    movq %r14,-24(%rbp)
    movq %rsi,-16(%rbp)
    movq %rdi,-8(%rbp)
    movq %rbx,(%rbp)
    addq $-32,%rbp

You can see we just prepare argument registers, do whatever with the stack pointer, do a jump, and then push the return values onto the stack. For my purposes this was almost too much overhead to make this worthwhile (you can look at notes in the code).

I thought about sketching out a GHC proposal for a way to formalize this, maybe make it safer, and maybe somehow more efficient, but I don’t have the time right now and don’t really have the expertise to know whether this is even a good idea or how it could work.




Announcing: Unagi-bloomfilter

I just released a new Haskell library called unagi-bloomfilter that is up now on hackage. You can install it with:

$ cabal install unagi-bloomfilter

The library uses the bloom-1 variant from “Fast Bloom Filters and Their Generalization” by Yan Qiao, et al. I’ll try to write more about it when I have the time. Also I just gave a talk on things I learned working on the project last night at the New York Haskell User Group:

It was quite rough, but I was happy to hear from folks that found some interesting things to take away from it.

Thanks to Gershom for inviting me to speak, for my company Signal Vine for sponsoring my trip out, and to Yan Qiao for generously answering my silly questions and helping me understand the paper.

P.S. We’re hiring haskell developers

Signal Vine is an awesome group of people, with interesting technology and problems to solve, and we’re looking to grow the small development team. If you have some experience with haskell (you don’t have to be a guru) and are interested, please reach out to Jason or me at:

Announcing: Hashabler 1.0. Now Even More Hashy With SipHash

I’ve just released version 1.0 of a haskell library for principled, cross-platform & extensible hashing of types. It is available on hackage, and can be installed with:

cabal install hashabler

(see my initial announcement post which has some motivation and pretty pictures)

You can see the CHANGELOG, but the main change is an implementation of SipHash. It’s about as fast as our implementation of FNV-1a for bytestrings around length 50, and slightly faster by the time you get to length 1000 or so, so you should use it unless you want a hash with a simple implementation.

If you’re implementing a new hashing algorithm or hash-based data structure, please consider using hashabler instead of hashable.