Initial hacking of GHC for GCC link-time optimization
I spent some time hacking GHC to use GCC's link-time optimization, and wanted to share the results. The idea was to see whether we could get performance gains or other interesting results from:
- cross-module inlining, e.g. between C libraries from base and the RTS
- GCC build flags that can only be added later, like -march=native, to tailor optimizations to our microarchitecture
- possibly nice interactions at the boundary between haskell code and LTO'd C?
This was all very speculative, and the results you see below aren’t ultimately very interesting.
The change is pushed to this branch
and you can check out the commit or cherrypick it here: 6f369534.
It’s full of hacky workarounds, notes and scratch work (some of which isn’t
relevant at all), my build.mk
is included, etc.
Notably hacky is the use of shim executables for gcc, etc., after I found GHC's build system wasn't respecting flags or doing what I was expecting.
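For concreteness, a shim here is just a tiny script named gcc, placed earlier in PATH than the real compiler, that forces our flags no matter what the build system passes. A minimal sketch (the real compiler's path and the exact flag set below are assumptions; see the branch for what I actually committed):
#!/bin/sh
# Hypothetical gcc shim: every compile the build system runs picks up our flags.
# /usr/bin/gcc and the flag list are illustrative, not the exact shim from the branch.
exec /usr/bin/gcc -O2 -flto -fno-fat-lto-objects -fno-semantic-interposition "$@"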
Hacking GHC
All the ugly details are in the changeset and I won't go over most of it, but essentially the process was:
- compile with -flto -fno-fat-lto-objects (see shim), and keep fixing or working around things until the build completes
- verify that LTO seemed to have worked by looking for gnu.lto sections, e.g. with
  for f in $(find . -name '*.o'); do grep -q gnu.lto <(readelf -Wa "$f") && echo "$f"; done 2>/dev/null
- finally recompile GHC again with -ffat-lto-objects so that we can compile our own haskell programs, choosing to use LTO or not
“Fat LTO objects” are regular object files which contain special sections holding gcc's GIMPLE IR and other data it needs, so that they can either be linked normally (even by e.g. clang) or take advantage of LTO by invoking gcc from the linker. The downside is they make it difficult to be sure LTO is really happening.
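To make that concrete, here's a quick way to see both halves of a fat object (foo.c is just a stand-in file, not something from the GHC tree):
$ gcc -c -O2 -flto -ffat-lto-objects foo.c -o foo.o
$ readelf -WS foo.o | grep -E '\.text|gnu\.lto'
You should see both ordinary machine-code sections like .text and the .gnu.lto_* sections; with -fno-fat-lto-objects there's no usable machine code in the object, only the GIMPLE, which is why the find/readelf loop above looked for gnu.lto sections.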
Compiling haskell using hacked GHC
You can use a local GHC tree from cabal as follows:
$ cabal --with-compiler=SOMEDIR/ghc/inplace/bin/ghc-stage2 --package-db=SOMEDIR/ghc/inplace/lib/package.conf.d ...
I found it was pretty obvious when LTO was working and when we were
(successfully) falling back to normal compilation using the “fat” part of our
“fat objects” because the LTO build would bizarrely spit out a Werror
warning
from ghc libraries (that I couldn’t figure out how to silence, but which didn’t
actually cause an error).
libraries/ghc-prim/cbits/atomic.c:219:10: error:
note: ‘__sync_fetch_and_nand’ changed semantics in GCC 4.4
Also of interest: linking took noticeably longer with LTO, but only by a few seconds.
To build using the hacked GHC but without doing LTO this worked:
$ cabal new-build --ghc-option=-optl-fno-lto ...
…but again we need the shims in our path.
…and Using PGO
GCC can perform profile-guided optimization: you collect a performance profile from a run of the program, then recompile with that profile, which might result in better decisions about inlining or code locality. This was a good PGO tutorial that covers some gotchas that I encountered.
We can generate a profile using AutoFDO from data collected by perf, which means the binary doesn't have to be built with expensive instrumentation (which also might throw off the resulting performance profile).
The general steps looked like:
$ cabal new-build ...
$ perf record -b <hlint ...>
[ perf record: Woken up 20 times to write data ]
[ perf record: Captured and wrote 5.125 MB perf.data (12379 samples) ]
$ create_gcov --binary=dist-newstyle/build/x86_64-linux/ghc-8.6.3/hlint-2.0.12/x/hlint/build/hlint/hlint --profile=perf.data --gcov=/tmp/perf.data.gcov -gcov_version 1
Then we rebuild hlint after first modifying our gcc shim by adding -fauto-profile=/tmp/perf.data.gcov and removing -flto.
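Concretely, the shim for the PGO rebuild ends up looking something like this (again the compiler path is an assumption, and the profile path is just the one from the create_gcov step above):
#!/bin/sh
# Hypothetical shim for the PGO pass: no -flto this time, just feed gcc the
# AutoFDO profile produced by create_gcov.
exec /usr/bin/gcc -O2 -fauto-profile=/tmp/perf.data.gcov "$@"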
Results
As a test I used hlint, compiled in various ways, running over the lens codebase. The variations I measured (again by passing different args in the gcc shim, which stayed in my PATH) were:
- -O2 (a flag already in the shim when we built GHC)
- -Os (optimize for small code size)
- -O2 with profile-guided optimization (see above for a walk-through)
- -march=native -O2, and
- -march=native -O3 for auto-vectorization and magic ponies
I was hopeful the last two results would be interesting, since for a very slight increase in compile time users could tune the RTS to their microarchitecture in a way that GHC HQ cannot when preparing binary distributions, for instance.
As I mentioned the results were not too exciting:
Some notes:
- timing and metric collection was done with several runs of perf stat --repeat=5 -e event1,event2,..., limiting the number of events we passed to perf so it didn't need to sample
- standard deviation for minor-faults was 3-5%; all others were < 1.5%
- number of allocations and GCs were identical
- I suspect that the regular -O2 and the two -march=native versions were essentially the same (they all have exactly the same stripped size), although I didn't look more closely at it.
On the last point, the manual states that “it is generally ineffective to specify an optimization level option only at link time and not at compile time”, though I can’t find much detailed guidance on that.
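As a sketch of what that guidance means (file names here are placeholders), the optimization level wants to be present when the objects are first compiled, and is then typically repeated on the link command that drives the LTO step:
$ gcc -c -O2 -flto a.c b.c
$ gcc -O2 -flto a.o b.o -o prog
Passing -O2 only on the second command is the "only at link time" case the manual warns about.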
Interpreting Benchmarks
We’re interested in elapsed-time
and how the other metrics might be
correlated with it. The percentage improvements are relative to hlint compiled
with stock ghc. The no_lto
benchmark is hlint compiled with our hacked GHC
but without LTO enabled (see above).
We can see a very small improvement in runtime. This might be attributable to the smaller number of branch misses, or something else entirely…
For the most part no_lto
is acting as a useful control. It mostly tracks very
closely to the performance of stock-ghc-compiled-hlint.
I’m not sure what to make of the instruction cache misses (this tends to be
somewhat high in GHC code so I was interested). In sum the perf
metrics don’t
provide me much insight.
It's not surprising that LTO doesn't provide a huge benefit, since only a small (but important) part of a haskell program originates from C. I'm not sure whether any optimizations become possible for haskell code linking with fat LTO objects, but likely not.
Final note: ghc itself seems to dynamically link its runtime and other libraries, so I’m not sure it could fully benefit from LTO (if there were any significant benefit).
Inspecting the code
I wanted to find at least one example of LTO that I could understand and present. I took the opportunity to start learning a tool that was new to me: radare2. This tutorial walkthrough was really helpful.
First, since I was mostly interested in the runtime, I compiled two "Hello world" programs, both with my patched GHC, one with LTO and one without.
Here are the stripped sizes:
821016 hello_lto
768280 hello_no_lto
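(A sketch of one way to get numbers like these; the exact commands I used aren't shown here, so treat this as illustrative:)
$ strip -o hello_lto.stripped hello_lto
$ strip -o hello_no_lto.stripped hello_no_lto
$ stat -c '%s %n' hello_lto.stripped hello_no_lto.stripped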
I spent some time trying to use radiff2 to compare the files, which claimed to find thousands of differences while printing screenfuls of unpleasant messages about "anal threshold".
I ended up trying to diff various hot functions:
$ radiff2 -AC hello_lto hello_no_lto
...
sym.evacuate1 189 0x405bd0 | UNMATCH (0.888889) | 0x408110 189 sym.evacuate1
sym.scavenge_block1.lto_priv.90 3327 0x408c10 | NEW (0.000000)
sym.scavenge_block1 3117 0x4050a0 | NEW (0.000000)
...
$ radiff2 -g sym.evacuate1 hello_lto hello_no_lto | xdot -
I saw some instruction reordering (maybe due to an implicit -march=native
making different pipelining decisions?), but nothing very interesting.
I went back and browsed the radiff2
instruction list, this time paying
attention to size differences (hoping to see inlining). This gave me a jumping
off point to start poking with r2
(the main radare forensics program), and I
found getMonotonicNSec
and getProcessElapsedTime
formed a nice example.
In rts/
in the GHC tree we can see this code:
StgWord64 getMonotonicNSec(void)
{
    struct timespec ts;
    int res;

    res = clock_gettime(CLOCK_ID, &ts);
    if (res != 0) {
        sysErrorBelch("clock_gettime");
        stg_exit(EXIT_FAILURE);
    }
    return (StgWord64)ts.tv_sec * 1000000000 +
           (StgWord64)ts.tv_nsec;
}

Time getProcessElapsedTime(void)
{
    return NSToTime(getMonotonicNSec());
}
Asking r2 to print what it thinks is the disassembly of this function in the non-LTO version of our hello.hs:
We can see a few things here:
- radare seems to be confused and thinks that getMonotonicNSec is recursive, because the function that follows (getProcessElapsedTime) is optimized into a jmp. So it just displays the label corresponding to that function as a turquoise little comment.
- r2 is telling us about two different callers of getMonotonicNSec and a few different callers of getProcessElapsedTime (…which it doesn't think is a function…?)
We can jump (like ls in a shell) to the first of the callers of getProcessElapsedTime, and again ask r2 to print the function we're in. Again, this is the non-LTO binary:
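For reference, roughly the same views can be obtained non-interactively; the symbol names come from the listings above, and the rest of the incantation is illustrative:
$ r2 -q -c 'aa; axt sym.getProcessElapsedTime' hello_no_lto
$ r2 -q -c 'aa; s sym.fdReady; pdf' hello_no_lto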
So r2 thinks we're in a function called fdReady() and we call sym.getProcessElapsedTime towards the bottom there.
We can again reference the GHC source and find in libraries/:
int
fdReady(int fd, bool write, int64_t msecs, bool isSock)
{
    bool infinite = msecs < 0;

    // ... SNIP ...

    if (!infinite) {
        Time now = getProcessElapsedTime();
        remaining = endTime - now;
    }

    // ... SNIP ...
}
Okay, so that all makes some sense; the libraries (and the C
bits in them)
shipped with GHC are compiled separately from (and before?) the RTS code.
While we can see there is some indirection here, it makes sense that this
couldn’t be optimized away (but see conclusion for more).
Let's look at the LTO version of our hello.hs:
So we can see again (after the head scratching we already did above) that r2 can't figure out where getMonotonicNSec ends. But we also don't see getProcessElapsedTime at all. Indeed if we ask r2 to print it we see it's not there:
[0x00409a20]> pdf@getProcessElapsedTime
Invalid address (getProcessElapsedTime)
|ERROR| Invalid command 'pdf@getProcessElapsedTime' (0x70)
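We can double-check this outside of r2 with the symbol tables (assuming unstripped copies of the binaries; these commands are illustrative rather than from the original session):
$ nm hello_no_lto | grep getProcessElapsedTime
$ nm hello_lto | grep getProcessElapsedTime || echo "no such symbol"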
Looking for the fdReady function that, in the non-LTO code, called getProcessElapsedTime, we see:
So fdReady
is calling getMonotonicNSec
and the jmp
indirection that we
had in the non-LTO version has been optimized away. And across modules that
were compiled completely separately!
Conclusions and Future Work
Without demonstrating significant performance improvements I don’t think you can motivate supporting this in GHC.
It is nice though that this change could be added without breaking support for other C compilers (by using fat LTO objects), and (if using LTO) with very little increase in compile time.
Some misc possibilities:
- gcc LTO simply doesn’t have much to do since all the haskell code is (I think) opaque to the linker-compiler
- there are optimization barriers in the ghc build system; if we fixed them we would see more improvement
- better optimizations on the initial build side might yield improvements during LTO (i.e. it’s not clear to me what gcc does with flags passed during LTO vs in the original compilation)
- a different test case (maybe something less IO-bound? Or something more IO-bound…?) might demonstrate more improvements
- maybe PGO is not effective during LTO, and we'd see some benefits to, say, recompiling the RTS using a generic profile
- maybe things would work much better with new GCC (I’m using 6.3)
- maybe there are other arch-specific optimizations I didn’t try that are beneficial, but which can’t be performed when compiling the generic bindist; i.e. LTO is beneficial not for whole-program optimization, but just for deferring some compilation decisions.
The work also uncovered some quirks or possible improvements to some of the C code in the GHC tree. For instance there might be benefits to hiding symbols in the RTS code; in our shim we use -fno-semantic-interposition as a big hammer to do this, though I'm not sure what effect it had.
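For contrast, a sketch of the big hammer versus a more surgical alternative on a single RTS file (the flags and Schedule.c here are for illustration only; I didn't try the visibility route):
# what the shim effectively does:
$ gcc -O2 -flto -fno-semantic-interposition -c Schedule.c
# a narrower alternative: hide everything by default and export only the
# public RTS entry points explicitly
$ gcc -O2 -flto -fvisibility=hidden -c Schedule.c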
I’m not going to work more on this, but do plan on playing with LLVM’s LTO (which is how this project started out); I think I may have worked out the issue that blocked me in the course of this GCC LTO work. The interesting thing there is that theoretically we could get actual whole-program LTO on LLVM IR bitcode (since haskell has an LLVM backend).