Cutting the fat in Haskell executables
- use bloaty (example just below)
- link to helpful blog post
- try kcov on the pandoc-test binary; we would expect full coverage, right?
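On the bloaty point, something like this for a first look at where the bytes go (bloaty from https://github.com/google/bloaty; the binary path here is hypothetical):

BIN=dist-newstyle/build/x86_64-linux/ghc-8.6.3/hlint-2.0.12/x/hlint/build/hlint/hlint
bloaty -d sections "$BIN"        # size broken down by ELF section
bloaty -d symbols -n 20 "$BIN"   # 20 biggest symbols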
ICF (identical code folding):
Since we can't seem to specify lld-9 directly in the cabal invocation, put it on PATH instead:
export PATH=/usr/lib/llvm-9/bin:$PATH
cabal new-build --ghc-option=-optl-fuse-ld=lld --ghc-option=-optl-Wl,--icf=all,--ignore-data-address-equality,--ignore-function-address-equality,--print-icf-sections,--print-gc-sections --ghc-option=-threaded -j --ghc-option=-g --ghc-option=-v --disable-profiling --disable-library-profiling hlint-2.0.12
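To double-check that lld actually did the link (LLD stamps its version into the .comment section, afaik):

readelf -p .comment "$BIN"   # expect a "Linker: LLD 9..." line among the GCC/GHC entries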
We see e.g.:
selected section /usr/lib/gcc/x86_64-linux-gnu/6/../../../x86_64-linux-gnu/crti.o:(.text)
removing identical section /home/me/Code/BOLT-experiments/dist-newstyle/build/x86_64-linux/ghc-8.6.3/hlint-2.0.12/build/HSE/Type.dyn_o:(.text)
removing identical section /usr/lib/gcc/x86_64-linux-gnu/6/crtendS.o:(.text)
removing identical section /usr/lib/gcc/x86_64-linux-gnu/6/../../../x86_64-linux-gnu/crtn.o:(.text)
...
selected section /usr/local/lib/ghc-8.6.3/base-4.12.0.0/libHSbase-4.12.0.0.a(Float.o):(.text.slD4_info)
removing identical section /usr/local/lib/ghc-8.6.3/base-4.12.0.0/libHSbase-4.12.0.0.a(Float.o):(.text.slH0_info)
…so I guess this isn’t impeded by split-objects (since it actually works on sections?)
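A rough measure of how much gets folded: save the verbose link output and count those messages (wording verbatim from the log above):

cabal new-build ... 2>&1 | tee icf-build.log   # same invocation as above
grep -c 'removing identical section' icf-build.log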
Add --ghc-option=-split-sections:
cabal new-build --ghc-option=-split-sections --ghc-option=-optl-fuse-ld=lld --ghc-option=-optl-Wl,--icf=all,--ignore-data-address-equality,--ignore-function-address-equality,--print-icf-sections,--print-gc-sections --ghc-option=-threaded -j --ghc-option=-g --ghc-option=-v --disable-profiling --disable-library-profiling hlint-2.0.12
NOTE: these sections disappear during the final link. (TODO: is that right? Yes.)
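To confirm -split-sections took effect, look for per-function .text sections in one of the objects (path as in the ICF log above):

readelf -WS dist-newstyle/build/x86_64-linux/ghc-8.6.3/hlint-2.0.12/build/HSE/Type.dyn_o | grep -F '.text.' | head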
Now split the C code as well:
cabal new-build --ghc-option=-split-sections --ghc-option=-optc-fdata-sections --ghc-option=-optc-ffunction-sections --ghc-option=-optl-fuse-ld=lld --ghc-option=-optl-Wl,--icf=all,--ignore-data-address-equality,--ignore-function-address-equality,--print-icf-sections,--print-gc-sections --ghc-option=-threaded -j --ghc-option=-g --ghc-option=-v --disable-profiling --disable-library-profiling hlint-2.0.12
- compile on stock ghc (verbose); ensure split-objects is necessary, that the flags pass through cabal, and that gc-sections is called automatically
- compile hlint using just split-objects / gc-sections (-fllvm and native)
  - check that functions etc. end up in their own sections using readelf (before and after)
  - does it even work for the LLVM backend? (it did not seem to for the LTO branch)
- then try ICF only (native and LLVM)
  - figure out if it's safe
- then combine them (probably just on native)
- (do the same for a hello.hs)
- inspect what code is actually folded out / GC'd
- try with gold as well
- try with call-graph PGO (gold or lld)
- metrics (measurement sketch after this list):
  - stripped size
  - running time
  - instruction cache metrics
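A minimal sketch for measuring those (binary path hypothetical; icache event set as in the perf wrapper further down):

BIN=dist-newstyle/build/x86_64-linux/ghc-8.6.3/hlint-2.0.12/x/hlint/build/hlint/hlint
stat -c %s "$BIN"                               # size, with symbols
strip -o /tmp/hlint.stripped "$BIN"
stat -c %s /tmp/hlint.stripped                  # stripped size
perf stat -e instructions:u,L1-icache-load-misses,iTLB-load-misses -- "$BIN" --help >/dev/null   # perf stat also reports elapsed time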
lld-9 options:
- identical code folding (with -ffunction-sections) (THESE ALSO AVAILABLE ON GOLD)
- and --icf=all --ignore-data-address-equality --ignore-function-address-equality
- …with -split-sections
  See https://phabricator.haskell.org/D1242; nice post: https://www.vidarholen.net/contents/blog/?p=729
- Gold: --orphan-handling [place,discard,warn,error]
- call-graph PGO x --gc-sections (already passed in clang!)
- --lto-O3
ld.lld-9 --icf=all --ignore-data-address-equality --ignore-function-address-equality --print-icf-sections
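A standalone sanity check of those flags on a tiny C file (nothing Haskell-specific; f and g have identical bodies, so their sections should get folded and reported):

cat > icf-demo.c <<'EOF'
int f(int x) { return x * 2 + 1; }
int g(int x) { return x * 2 + 1; }  /* identical to f */
int main(void) { return f(1) + g(2); }
EOF
clang -O1 -ffunction-sections -fuse-ld=lld -o icf-demo icf-demo.c \
  -Wl,--icf=all,--ignore-function-address-equality,--print-icf-sections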
The perf-wrapper event set for the instruction cache metrics (full wrapper setup further down):
# EVENT_SET=instructions:u,L1-icache-load-misses,iTLB-load-misses,stalled-cycles-frontend,cache-misses
exec perf stat -e $EVENT_SET --append -o "/tmp/ghc8.6.3_perf_$EVENT_SET" -- "$executablename" -B"$topdir" ${1+"$@"}
OLD POST STUFF, TO RECYCLE:
Trying BOLT on the runtime, i.e. on:
$HOME/.stack/programs/x86_64-linux/ghc-8.6.3/lib/ghc-8.6.3/rts/libHSrts_thr-ghc8.6.3.so
With llvm-bolt -relocs, apparently: “relocations against code are missing from the input file. Cannot proceed in relocations mode (-relocs)”
Checked whether the bindist might let us re-link just the RTS, but no: it comes with prebuilt .so's.
BUILD GHC
make a couple of mods (esp. to rts/ghc.mk)
./boot
./configure --enable-dwarf-unwind
...hm, LDFLAGS doesn't seem right, as it gets passed to gcc; hopefully our mod works for linking the RTS
This got us a runtime with .rela.text sections.
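Quick check that the relocations are really there:

readelf -WS libHSrts_thr-ghc8.6.3.so | grep -F '.rela.text'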
BOLT made some progress but got an error in llvm-bolt:
LLVMSymbolizer: error reading file: No such file or directory
…followed by a segfault
Filed here: https://github.com/facebookincubator/BOLT/issues/32#issuecomment-466684836
TODO: [x] experiment with hello world
- segfaults similarly when trying to use llvm-bolt on the runtime directly
- segfaults with perf2bolt using -ignore-build-id and running on rts.so
- try with -fno-jump-tables … ?
- ???
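For reference, the workflow being attempted (flags as in the BOLT README of the time; a sketch only, since it segfaults here):

perf record -e cycles:u -j any,u -o perf.data -- ./hello
perf2bolt -p perf.data -o hello.fdata ./hello        # plus -ignore-build-id, per above
llvm-bolt ./hello -o hello.bolt -data=hello.fdata \
  -reorder-blocks=cache+ -reorder-functions=hfsort+ -dyno-stats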
I’ve never really looked closely at instruction cache misses with Haskell code.
I edited the ghc wrapper script (see stack path | grep compiler-exe) to count instructions and instruction cache misses like so:
# ...last line:
exec perf stat -e instructions,L1-icache-misses --append -o /tmp/ghc_perf -- "$executablename" -B"$topdir" ${1+"$@"}
## ACTUALLY THIS IS WHAT WE USED:
# We need to run these in batches, so we get reliable numbers (avoid muxing):
# EVENT_SET=instructions:u,major-faults,minor-faults,branch-misses,cache-misses,stalled-cycles-frontend
EVENT_SET=instructions:u,L1-dcache-load-misses,L1-dcache-prefetch-misses,L1-dcache-store-misses
# EVENT_SET=instructions:u,L1-icache-load-misses,dTLB-load-misses,dTLB-store-misses,iTLB-load-misses
exec perf stat -e $EVENT_SET --append -o "/tmp/ghc8.6.3_perf_$EVENT_SET" -- "$executablename" -B"$topdir" ${1+"$@"}
Light event docs, for perf:
man perf_event_open
TODO: record more stats and show correlation with runtime (normalized by instruction count):
- context switches
- various cache misses?
- major and minor page faults
- basic loads and stores
- PERF_COUNT_HW_CACHE_MISSES / PERF_COUNT_HW_CACHE_REFERENCES: LLC misses pct
- PERF_COUNT_HW_BRANCH_MISSES
- PERF_COUNT_HW_STALLED_CYCLES_FRONTEND and BACKEND: instruction cache stuff
- PERF_COUNT_SW_PAGE_FAULTS_MIN and MAJ
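In perf's symbolic event names that would be roughly (split into batches to avoid muxing, as above):

perf stat -e cache-references,cache-misses,branch-misses -- ./hello
perf stat -e stalled-cycles-frontend,stalled-cycles-backend -- ./hello
perf stat -e minor-faults,major-faults,context-switches -- ./hello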
scatter plot with different colors for each metric:
- N.B.: make polling interval faster (check until compiling hello.hs gives stable results)
- check how timings are affected by additional event logging
- filter out samples below some short runtime and see if it looks different
TODO: use perf to try to profile ghc (hot spots); flamegraph
Had to do this for perf to work without complaint:
sudo sysctl kernel.nmi_watchdog=0
Still no dice; events get multiplexed: https://perf.wiki.kernel.org/index.php/Tutorial#multiplexing_and_scaling_events
Install papi-tools:
jberryman Code/BOLT-experiments ‹master*› » papi_avail | grep Counter
Number Hardware Counters : 11
Max Multiplex Counters : 384
It seems I can only do about 4 events w/out muxing
Note: no one fucking knows what these events actually mean: https://www.spinics.net/lists/linux-perf-users/msg04444.html
Most people, including perf devs, assume: “you have doubtless heard what results from assuming? you make butt-cheese of the world’s civilizations.”
https://download.01.org/perfmon/index/ivybridge.html
Some perf runs for code that is fast/good, for comparison:
Performance counter stats for 'ag -l rtti /home/me':
28,688,822,054 instructions # 0.63 insn per cycle
# 1.20 stalled cycles per insn
45,614,603,074 cpu-cycles
32,755,085 cache-misses # 0.0010
34,452,661,609 stalled-cycles-frontend # 75.53% frontend cycles idle
910,616,196 L1-icache-load-misses # 0.0317
25,868,972 iTLB-load-misses # 0.0010 NOTE: bafflingly, on my processor iTLB-loads counts second level TLB cache _hits_
496,019,792 L1-dcache-load-misses # 0.0170
123,184,879 L1-dcache-store-misses # 0.0043
85,146,310 branch-misses # 0.0030
26.771462891 seconds time elapsed
= 1.07e9 instrs/sec (28.7e9 instructions / 26.77 s)
Performance counter stats for 'make' of kubernetes (golang project)
3,897,432,670,709 instructions # 0.94 insn per cycle
# 0.72 stalled cycles per insn
4,131,498,018,917 cpu-cycles
8,153,293,029 cache-misses # 0.0021 (.0045 - .0085)
------ Two i cache metrics are both ~5x worse in `ag`
2,798,822,970,981 stalled-cycles-frontend # 67.74% frontend cycles idle
26,685,898,617 L1-icache-load-misses # 0.0068 (.02 - .034) 3-5x !!
847,355,581 iTLB-load-misses # 0.0002 (.0004 - .0011) 2-5x !!
------- These similar to ag run:
43,651,117,729 L1-dcache-load-misses # 0.0112 (.019 - .028)
12,732,485,616 L1-dcache-store-misses # 0.0033 (.0084 - .0102)
12,353,654,215 branch-misses # 0.0032 (.0145 - .0185) 5x !!
533.539062708 seconds time elapsed
= 7.30e9 instrs/sec (3.90e12 instructions / 533.5 s)
NOTE: 33 sec ghc run was: 2.1e9 instrs/sec
HYPOTHESIS: bad code locality; this maybe makes branch misses hurt even worse
Digression: meaning of stalled-cycles-frontend … ?
jberryman Code/BOLT-experiments ‹master*› » more /sys/bus/event_source/devices/cpu/events/stalled-cycles-frontend
event=0x0e,umask=0x01,inv,cmask=0x01
https://software.intel.com/sites/default/files/managed/7c/f1/253669-sdm-vol-3b.pdf (pg 394):
UOPS_ISSUED.ANY
Increments each cycle the # of uops issued by the RAT to RS.
Set Cmask = 1, Inv = 1, Any = 1 to count stalled cycles of this core. Set Cmask = 1, Inv = 1 to count stalled cycles.
80H 04H ICACHE.IFETCH_STALL
Cycles where a code-fetch stalled due to L1 instruction-cache miss or an iTLB miss.
https://sites.utexas.edu/jdm4372/2014/06/04/counting-stall-cycles-on-the-intel-sandy-bridge-processor/
https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/506515
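One way to convince yourself the alias really is that encoding: perf accepts the same format terms raw (cf. /sys/bus/event_source/devices/cpu/format), so the two counts below should roughly agree:

perf stat -e stalled-cycles-frontend,cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ -- ./hello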
LATER:
- optimizing GHC, looking at:
  - results of the treemap analysis we performed
  - think about how much of that work is duplicated when heavy inlining happens, i.e. can we be faster by doing whole-program compilation?
  - concurrent or improved GC
  - implementing key changes we found in our GCC LTO experiment
    - Evac.c:
      - look at code hotspots; no-inline cold paths
      - port over Souper findings, maybe
- Survey/explore additional backend project ideas:
  - try the Mesh allocator on GHC and other shit.
  - try enabling/disabling THP (transparent huge pages) for TLB cache shit
  - look more carefully at LLVM passes: