Cutting the fat in Haskell executables
- use bloaty (example just below)
- link to helpful blog post
- try kcov on the pandoc-test binary; we would expect full coverage, right?
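On the bloaty point, something like this for a first look at where the bytes go (bloaty from https://github.com/google/bloaty; the binary path here is hypothetical):

BIN=dist-newstyle/build/x86_64-linux/ghc-8.6.3/hlint-2.0.12/x/hlint/build/hlint/hlint
bloaty -d sections "$BIN"        # size broken down by ELF section
bloaty -d symbols -n 20 "$BIN"   # 20 biggest symbols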
ICF (identical code folding):
Since we can't seem to specify lld-9 directly in the cabal invocation, put it on PATH instead:
export PATH=/usr/lib/llvm-9/bin:$PATH
cabal new-build --ghc-option=-optl-fuse-ld=lld --ghc-option=-optl-Wl,--icf=all,--ignore-data-address-equality,--ignore-function-address-equality,--print-icf-sections,--print-gc-sections --ghc-option=-threaded -j --ghc-option=-g --ghc-option=-v --disable-profiling --disable-library-profiling hlint-2.0.12
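To double-check that lld actually did the link (LLD stamps its version into the .comment section, afaik):

readelf -p .comment "$BIN"   # expect a "Linker: LLD 9..." line among the GCC/GHC entries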
We see e.g.:
selected section /usr/lib/gcc/x86_64-linux-gnu/6/../../../x86_64-linux-gnu/crti.o:(.text)
removing identical section /home/me/Code/BOLT-experiments/dist-newstyle/build/x86_64-linux/ghc-8.6.3/hlint-2.0.12/build/HSE/Type.dyn_o:(.text)
removing identical section /usr/lib/gcc/x86_64-linux-gnu/6/crtendS.o:(.text)
removing identical section /usr/lib/gcc/x86_64-linux-gnu/6/../../../x86_64-linux-gnu/crtn.o:(.text)
...
selected section /usr/local/lib/ghc-8.6.3/base-4.12.0.0/libHSbase-4.12.0.0.a(Float.o):(.text.slD4_info)
removing identical section /usr/local/lib/ghc-8.6.3/base-4.12.0.0/libHSbase-4.12.0.0.a(Float.o):(.text.slH0_info)
…so I guess this isn’t impeded by split-objects (since it actually works on sections?)
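A rough measure of how much gets folded: save the verbose link output and count those messages (wording verbatim from the log above):

cabal new-build ... 2>&1 | tee icf-build.log   # same invocation as above
grep -c 'removing identical section' icf-build.log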
Add --ghc-option=-split-sections:
cabal new-build --ghc-option=-split-sections --ghc-option=-optl-fuse-ld=lld --ghc-option=-optl-Wl,--icf=all,--ignore-data-address-equality,--ignore-function-address-equality,--print-icf-sections,--print-gc-sections --ghc-option=-threaded -j --ghc-option=-g --ghc-option=-v --disable-profiling --disable-library-profiling hlint-2.0.12
NOTE: these sections disappear during the final link. (TODO: is that right? Yes.)
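To confirm -split-sections took effect, look for per-function .text sections in one of the objects (path as in the ICF log above):

readelf -WS dist-newstyle/build/x86_64-linux/ghc-8.6.3/hlint-2.0.12/build/HSE/Type.dyn_o | grep -F '.text.' | head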
Now split the C code as well:
cabal new-build --ghc-option=-split-sections --ghc-option=-optc-fdata-sections --ghc-option=-optc-ffunction-sections --ghc-option=-optl-fuse-ld=lld --ghc-option=-optl-Wl,--icf=all,--ignore-data-address-equality,--ignore-function-address-equality,--print-icf-sections,--print-gc-sections --ghc-option=-threaded -j --ghc-option=-g --ghc-option=-v --disable-profiling --disable-library-profiling hlint-2.0.12
- compile on stock ghc (verbose); ensure split-objects is necessary, that the flags pass through cabal, and that gc-sections is called automatically
- compile hlint using just split-objects / gc-sections (-fllvm and native)
  - check that functions etc. end up in their own sections using readelf (before and after)
  - does it even work for the LLVM backend? (it did not seem to for the LTO branch)
- then try ICF only (native and LLVM)
  - figure out if it's safe
- then combine them (probably just on native)
- (do the same for a hello.hs)
- inspect what code is actually folded out / GC'd
- try with gold as well
- try with call-graph PGO (gold or lld)
- metrics (measurement sketch after this list):
  - stripped size
  - running time
  - instruction cache metrics
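A minimal sketch for measuring those (binary path hypothetical; icache event set as in the perf wrapper further down):

BIN=dist-newstyle/build/x86_64-linux/ghc-8.6.3/hlint-2.0.12/x/hlint/build/hlint/hlint
stat -c %s "$BIN"                               # size, with symbols
strip -o /tmp/hlint.stripped "$BIN"
stat -c %s /tmp/hlint.stripped                  # stripped size
perf stat -e instructions:u,L1-icache-load-misses,iTLB-load-misses -- "$BIN" --help >/dev/null   # perf stat also reports elapsed time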
lld-9 options:
- identical code folding (with -ffunction-sections) (THESE ALSO AVAILABLE ON GOLD)
- and --icf=all --ignore-data-address-equality --ignore-function-address-equality
- …with -split-sections
  See https://phabricator.haskell.org/D1242; nice post: https://www.vidarholen.net/contents/blog/?p=729
- Gold: --orphan-handling [place,discard,warn,error]
- call-graph PGO x --gc-sections (already passed in clang!)
- --lto-O3
ld.lld-9 --icf=all --ignore-data-address-equality --ignore-function-address-equality --print-icf-sections
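A standalone sanity check of those flags on a tiny C file (nothing Haskell-specific; f and g have identical bodies, so their sections should get folded and reported):

cat > icf-demo.c <<'EOF'
int f(int x) { return x * 2 + 1; }
int g(int x) { return x * 2 + 1; }  /* identical to f */
int main(void) { return f(1) + g(2); }
EOF
clang -O1 -ffunction-sections -fuse-ld=lld -o icf-demo icf-demo.c \
  -Wl,--icf=all,--ignore-function-address-equality,--print-icf-sections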
The perf-wrapper event set for the instruction cache metrics (full wrapper setup further down):
# EVENT_SET=instructions:u,L1-icache-load-misses,iTLB-load-misses,stalled-cycles-frontend,cache-misses
exec perf stat -e $EVENT_SET --append -o "/tmp/ghc8.6.3_perf_$EVENT_SET" -- "$executablename" -B"$topdir" ${1+"$@"}
OLD POST STUFF, TO RECYCLE:
Trying BOLT on the runtime, i.e. on:
$HOME/.stack/programs/x86_64-linux/ghc-8.6.3/lib/ghc-8.6.3/rts/libHSrts_thr-ghc8.6.3.so
With llvm-bolt -relocs, apparently: “relocations against code are missing from the input file. Cannot proceed in relocations mode (-relocs)”
Checked whether the bindist might let us re-link just the RTS, but no: it comes with prebuilt .so's.
BUILD GHC
make a couple of mods (esp. to rts/ghc.mk)
./boot
./configure --enable-dwarf-unwind
...hm, LDFLAGS doesn't seem right, as it gets passed to gcc; hopefully our mod works for linking the RTS
This got us a runtime with .rela.text sections.
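Quick check that the relocations are really there:

readelf -WS libHSrts_thr-ghc8.6.3.so | grep -F '.rela.text'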
BOLT made some progress but got an error in llvm-bolt:
LLVMSymbolizer: error reading file: No such file or directory
…followed by a segfault
Filed here: https://github.com/facebookincubator/BOLT/issues/32#issuecomment-466684836
TODO: [x] experiment with hello world
- segfaults similarly when trying to use llvm-bolt on the runtime directly
- segfaults with perf2bolt using -ignore-build-id and running on rts.so
- try with -fno-jump-tables … ?
- ???
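For reference, the workflow being attempted (flags as in the BOLT README of the time; a sketch only, since it segfaults here):

perf record -e cycles:u -j any,u -o perf.data -- ./hello
perf2bolt -p perf.data -o hello.fdata ./hello        # plus -ignore-build-id, per above
llvm-bolt ./hello -o hello.bolt -data=hello.fdata \
  -reorder-blocks=cache+ -reorder-functions=hfsort+ -dyno-stats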
I’ve never really looked closely at instruction cache misses with Haskell code.
I edited the ghc wrapper script (see stack path | grep compiler-exe) to count instructions and instruction cache misses like so:
# ...last line:
exec perf stat -e instructions,L1-icache-misses --append -o /tmp/ghc_perf -- "$executablename" -B"$topdir" ${1+"$@"}
## ACTUALLY THIS IS WHAT WE USED:
# We need to run these in batches, so we get reliable numbers (avoid muxing):
# EVENT_SET=instructions:u,major-faults,minor-faults,branch-misses,cache-misses,stalled-cycles-frontend
EVENT_SET=instructions:u,L1-dcache-load-misses,L1-dcache-prefetch-misses,L1-dcache-store-misses
# EVENT_SET=instructions:u,L1-icache-load-misses,dTLB-load-misses,dTLB-store-misses,iTLB-load-misses
exec perf stat -e $EVENT_SET --append -o "/tmp/ghc8.6.3_perf_$EVENT_SET" -- "$executablename" -B"$topdir" ${1+"$@"}
Light event docs, for perf:
man perf_event_open
TODO: record more stats and show correlation with runtime (normalized by instruction count):
- context switches
- various cache misses?
- major and minor page faults
- basic loads and stores
- PERF_COUNT_HW_CACHE_MISSES / PERF_COUNT_HW_CACHE_REFERENCES: LLC misses pct
- PERF_COUNT_HW_BRANCH_MISSES
- PERF_COUNT_HW_STALLED_CYCLES_FRONTEND and BACKEND: instruction cache stuff
- PERF_COUNT_SW_PAGE_FAULTS_MIN and MAJ
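In perf's symbolic event names that would be roughly (split into batches to avoid muxing, as above):

perf stat -e cache-references,cache-misses,branch-misses -- ./hello
perf stat -e stalled-cycles-frontend,stalled-cycles-backend -- ./hello
perf stat -e minor-faults,major-faults,context-switches -- ./hello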
scatter plot with different colors for each metric:
- N.B.: make polling interval faster (check until compiling hello.hs gives stable results)
- check how timings are affected by additional event logging
- filter out samples below some short runtime and see if it looks different
TODO: use perf to try to profile ghc (hot spots); flamegraph
Had to do this for perf to work without complaint:
sudo sysctl kernel.nmi_watchdog=0
Still no dice; events get multiplexed: https://perf.wiki.kernel.org/index.php/Tutorial#multiplexing_and_scaling_events
Install papi-tools:
jberryman Code/BOLT-experiments ‹master*› » papi_avail | grep Counter
Number Hardware Counters : 11
Max Multiplex Counters : 384
It seems I can only do about 4 events w/out muxing
Note: no one fucking knows what these events actually mean: https://www.spinics.net/lists/linux-perf-users/msg04444.html
Most people, including perf devs, assume: “you have doubtless heard what results from assuming? you make butt-cheese of the world’s civilizations.”
https://download.01.org/perfmon/index/ivybridge.html
Some perf runs for code that is fast/good, for comparison:
Performance counter stats for 'ag -l rtti /home/me':
28,688,822,054 instructions # 0.63 insn per cycle
# 1.20 stalled cycles per insn
45,614,603,074 cpu-cycles
32,755,085 cache-misses # 0.0010
34,452,661,609 stalled-cycles-frontend # 75.53% frontend cycles idle
910,616,196 L1-icache-load-misses # 0.0317
25,868,972 iTLB-load-misses # 0.0010 NOTE: bafflingly, on my processor iTLB-loads counts second level TLB cache _hits_
496,019,792 L1-dcache-load-misses # 0.0170
123,184,879 L1-dcache-store-misses # 0.0043
85,146,310 branch-misses # 0.0030
26.771462891 seconds time elapsed
= 1.07e9 instrs/sec (28.7e9 instructions / 26.77 s)
Performance counter stats for 'make' of kubernetes (golang project)
3,897,432,670,709 instructions # 0.94 insn per cycle
# 0.72 stalled cycles per insn
4,131,498,018,917 cpu-cycles
8,153,293,029 cache-misses # 0.0021 (.0045 - .0085)
------ Two i cache metrics are both ~5x worse in `ag`
2,798,822,970,981 stalled-cycles-frontend # 67.74% frontend cycles idle
26,685,898,617 L1-icache-load-misses # 0.0068 (.02 - .034) 3-5x !!
847,355,581 iTLB-load-misses # 0.0002 (.0004 - .0011) 2-5x !!
------- These similar to ag run:
43,651,117,729 L1-dcache-load-misses # 0.0112 (.019 - .028)
12,732,485,616 L1-dcache-store-misses # 0.0033 (.0084 - .0102)
12,353,654,215 branch-misses # 0.0032 (.0145 - .0185) 5x !!
533.539062708 seconds time elapsed
= 7.30e9 instrs/sec (3.90e12 instructions / 533.5 s)
NOTE: 33 sec ghc run was: 2.1e9 instrs/sec
HYPOTHESIS: bad code locality; this maybe makes branch misses hurt even worse
Digression: meaning of stalled-cycles-frontend … ?
jberryman Code/BOLT-experiments ‹master*› » more /sys/bus/event_source/devices/cpu/events/stalled-cycles-frontend
event=0x0e,umask=0x01,inv,cmask=0x01
https://software.intel.com/sites/default/files/managed/7c/f1/253669-sdm-vol-3b.pdf (pg 394):
UOPS_ISSUED.ANY
Increments each cycle the # of uops issued by the RAT to RS.
Set Cmask = 1, Inv = 1, Any = 1 to count stalled cycles of this core. Set Cmask = 1, Inv = 1 to count stalled cycles.
80H 04H ICACHE.IFETCH_STALL
Cycles where a code-fetch stalled due to L1 instruction-cache miss or an iTLB miss.
https://sites.utexas.edu/jdm4372/2014/06/04/counting-stall-cycles-on-the-intel-sandy-bridge-processor/
https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/506515
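One way to convince yourself the alias really is that encoding: perf accepts the same format terms raw (cf. /sys/bus/event_source/devices/cpu/format), so the two counts below should roughly agree:

perf stat -e stalled-cycles-frontend,cpu/event=0x0e,umask=0x01,inv,cmask=0x01/ -- ./hello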
LATER:
- optimizing GHC, looking at:
  - results of the treemap analysis we performed
  - think about how much of that work is duplicated when heavy inlining happens, i.e. can we be faster by doing whole-program compilation?
  - concurrent or improved GC
  - implementing key changes we found in our GCC LTO experiment
    - Evac.c:
      - look at code hotspots; no-inline cold paths
      - port over Souper findings, maybe
- Survey/explore additional backend project ideas:
  - try the Mesh allocator on GHC and other shit.
  - try enabling/disabling THP (transparent huge pages) for TLB cache shit
  - look more carefully at LLVM passes: