
Does profile guided optimisation work and is it worth the hassle?

Recently, while exploring Python’s code base, I stumbled upon these cryptic lines in its build system. I had never used profile guided optimisations before, so that sent me on a wild goose chase to learn more about them. This post is about how to use clang to employ these optimisation techniques in practice and how much performance can be gained (if any).

What are profile guided optimisations (PGO)?

The main principles behind PGO can be described as:

  1. Build your code with instrumentation in place
  2. Collect runtime profiling data from instrumented code
  3. Rebuild your code once again with collected profiling data to apply optimisations

Why is this effective? At runtime, the instrumentation can collect data about which branches in the code are more likely to be taken, and the compiler can then optimise the code to favour them. That is, of course, only one of the techniques; there are many more. In general, the compiler collects runtime data that can later be used to drive optimisation heuristics in subsequent compile runs.

LLVM documentation provides a great reference with a lot more details.
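Concretely, with clang the whole cycle boils down to a handful of commands. Here’s a minimal sketch for a hypothetical single-file program main.c (the file names and the workload argument are placeholders, not something from the experiments below):

# 1. build with instrumentation
clang -O2 -fprofile-instr-generate main.c -o prog_instrumented

# 2. run the instrumented binary on a representative workload; on exit it writes
#    raw profile data (default.profraw unless LLVM_PROFILE_FILE says otherwise)
LLVM_PROFILE_FILE="prog-%p.profraw" ./prog_instrumented representative_input

# 3. merge the raw profiles and rebuild with the profile applied
llvm-profdata merge -o prog.profdata prog-*.profraw
clang -O2 -fprofile-instr-use=prog.profdata main.c -o prog_optimised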

Is this effective at all?

Well… it depends. The instrumented application generates profiling data which the compiler later uses to optimise the application itself. So the more specific the use case you exercise while generating the profiling data, the better your application will be optimised for that specific use case. Conversely, the more code you cover while generating the profiling data, the less specific the profile becomes, and the less effective the resulting optimisations will be.

Experiments

Just to find out for myself, I decided to play around with the bash and Lua source code. There’s no specific reason for the choice; I simply had the repos cloned already from working with this code before.

Bash experiments

There isn’t really a bash benchmark suite as far as I know, so I decided to just test the waters with a couple of simple scripts. The first one simply counts up to a million.

counter() {
    local n="${1:-1}"
    local i=0
    for ((i=0; i<n; i++)); do
        true
    done
    echo $i
}

counter "${1:-1000000}"

The second one splits a large dataset (about 50MB of text data - I used GBvideos.csv from the Trending YouTube Videos dataset) by commas and concatenates all collected fields into one large string using a “|” delimiter.

declare -r IN="GBvideos.csv"

join_with() {
    local IFS="$1"
    shift
    echo "$*"
}

declare -a all_records=()

while IFS="," read -ra record; do
    all_records+=("${record[@]}")
done < "$IN"

all_records_str="$(join_with "|" "${all_records[@]}")"

echo "all records concatenated length: ${#all_records_str}"

The point was to test two non-overlapping pieces of bash functionality and see how PGO would affect their performance.

Building bash with PGO

First, bash needs to be built with PGO instrumentation enabled:

./configure CC=clang CFLAGS="-fprofile-instr-generate"
make -j
mv bash bash_instrumented
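
As a quick sanity check that the instrumentation is actually active, a single run of the instrumented binary should leave a raw profile behind (clang’s default file name is default.profraw unless LLVM_PROFILE_FILE is set):

./bash_instrumented -c 'echo instrumentation check'
ls *.profraw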

Now, it’s time for data collection with each of the test scripts:

LLVM_PROFILE_FILE="bash_counter-%p.profraw" hyperfine "./bash_instrumented counter.sh"
LLVM_PROFILE_FILE="bash_csv_split-%p.profraw" hyperfine "./bash_instrumented csv_split.sh"

I’m using hyperfine as a benchmarking tool, but I don’t really care about the results in this run - I just use it to conveniently run each script 10 times in a row and generate data from the instrumentation. With the raw instrumentation data collected, it’s time to merge it into profiling datasets:

llvm-profdata merge -o bash_counter.profdata bash_counter-*profraw
llvm-profdata merge -o bash_csv.profdata bash_csv_split-*profraw
llvm-profdata merge -o bash_both.profdata bash*profraw
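
The merged profiles can also be inspected with llvm-profdata show, which is a handy way to confirm that counters were actually recorded (the exact output and supported flags vary between LLVM versions):

llvm-profdata show bash_both.profdata
# list the hottest functions, if your LLVM version supports it
llvm-profdata show -topn=10 bash_both.profdata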

With the profiling data in hand, I’ll rebuild bash three more times and actually use the data to apply optimisations:

./configure CC=clang CFLAGS="-fprofile-instr-use=$PWD/bash_counter.profdata -O2"
make -j
mv bash bash_pgo_counter

./configure CC=clang CFLAGS="-fprofile-instr-use=$PWD/bash_csv.profdata -O2"
make -j
mv bash bash_pgo_csv_split

./configure CC=clang CFLAGS="-fprofile-instr-use=$PWD/bash_both.profdata -O2"
make -j
mv bash bash_pgo_both
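
The vanilla build, for comparison, is just a regular clang build with no profiling flags; I’m assuming -O2 here so it’s on an equal footing with the PGO builds:

./configure CC=clang CFLAGS="-O2"
make -j
mv bash bash_vanilla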

I now have four executables to test with:

  • bash_vanilla - vanilla build with no PGO,
  • bash_pgo_counter - build with PGO with data from counter.sh runs,
  • bash_pgo_csv_split - build with PGO with data from csv_split.sh runs,
  • bash_pgo_both - build with PGO with data from both counter.sh and csv_split.sh runs.

Testing bash binaries with PGO

Let’s generate some benchmark results now:

hyperfine --export-csv bash_vanilla_counter.csv "./bash_vanilla counter.sh"
hyperfine --export-csv bash_vanilla_csv_split.csv "./bash_vanilla csv_split.sh"

hyperfine --export-csv bash_pgo_both_counter.csv "./bash_pgo_both counter.sh"
hyperfine --export-csv bash_pgo_both_csv_split.csv "./bash_pgo_both csv_split.sh"

hyperfine --export-csv bash_pgo_counter_counter.csv "./bash_pgo_counter counter.sh"
hyperfine --export-csv bash_pgo_csv_split_csv_split.csv "./bash_pgo_csv_split csv_split.sh"

hyperfine --export-csv bash_pgo_counter_csv_split.csv "./bash_pgo_counter csv_split.sh"
hyperfine --export-csv bash_pgo_csv_split_counter.csv "./bash_pgo_csv_split counter.sh"

Here are the raw results for counter.sh:

Benchmark 1: ./bash_vanilla ~/testscripts/bash/counter.sh
  Time (mean ± σ):      3.124 s ±  0.034 s    [User: 3.120 s, System: 0.002 s]
  Range (min … max):    3.088 s …  3.210 s    10 runs

Benchmark 1: ./bash_pgo_counter ~/testscripts/bash/counter.sh
  Time (mean ± σ):      2.431 s ±  0.008 s    [User: 2.428 s, System: 0.002 s]
  Range (min … max):    2.413 s …  2.441 s    10 runs

Benchmark 1: ./bash_pgo_both ~/testscripts/bash/counter.sh
  Time (mean ± σ):      2.460 s ±  0.013 s    [User: 2.457 s, System: 0.001 s]
  Range (min … max):    2.441 s …  2.488 s    10 runs

Benchmark 1: ./bash_pgo_csv_split ~/testscripts/bash/counter.sh
  Time (mean ± σ):      2.850 s ±  0.016 s    [User: 2.847 s, System: 0.001 s]
  Range (min … max):    2.820 s …  2.872 s    10 runs

And here is the same set of benchmark results for csv_split.sh:

Benchmark 1: ./bash_vanilla ~/testscripts/bash/csv_split.sh
  Time (mean ± σ):     20.989 s ±  0.076 s    [User: 19.859 s, System: 1.201 s]
  Range (min … max):   20.881 s … 21.119 s    10 runs

Benchmark 1: ./bash_pgo_csv_split ~/testscripts/bash/csv_split.sh
  Time (mean ± σ):     19.776 s ±  0.044 s    [User: 18.669 s, System: 1.170 s]
  Range (min … max):   19.720 s … 19.836 s    10 runs

Benchmark 1: ./bash_pgo_both ~/testscripts/bash/csv_split.sh
  Time (mean ± σ):     19.698 s ±  0.088 s    [User: 18.559 s, System: 1.204 s]
  Range (min … max):   19.557 s … 19.815 s    10 runs

Benchmark 1: ./bash_pgo_counter ~/testscripts/bash/csv_split.sh
  Time (mean ± σ):     20.284 s ±  0.083 s    [User: 19.154 s, System: 1.201 s]
  Range (min … max):   20.183 s … 20.414 s    10 runs

Additionally, let’s graph these results.

/pgo/pgo_counter.png
counter.sh benchmark

/pgo/pgo_csv_split.png
csv_split.sh benchmark

The results look promising. PGO has managed to shave off ~21% of the runtime for counter.sh and ~5% for csv_split.sh. An interesting detail is that the impact is smaller when running a workload other than the one originally used to generate the profiling data - this is the case when running counter.sh with bash_pgo_csv_split, or csv_split.sh with bash_pgo_counter. The results are still better than with vanilla bash, but not as good as with the binary profiled on the matching script.

Experimenting with Lua

Since my bash test programs were extremely trivial, I decided to try something more representative with the Lua interpreter. I chose Lua because it’s nice and small and easy to build. I also found a site with programming language and compiler benchmarks that includes Lua, so it was a perfect opportunity to try out the test code used there.

Just as before, I’m going to prepare a vanilla version of the interpreter, one with instrumentation enabled, and an optimised version. Judging by the results from the previous experiment, it doesn’t seem worth focusing on profiling data for individual test programs, so I’m going to apply PGO using profiling data collected from all test programs at once.
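
Lua doesn’t use autoconf, so the flags go straight to make. A rough sketch of the builds, assuming a Lua 5.4-style source tree where extra compiler flags are passed via MYCFLAGS (the exact variable names may differ between Lua versions, and the profile file names are hypothetical):

# instrumented build
make clean
make CC=clang MYCFLAGS="-fprofile-instr-generate"
mv src/lua lua_instrumented

# vanilla build for comparison
make clean
make CC=clang
mv src/lua lua_vanilla

# ...run the test programs (listed below) against lua_instrumented with
# LLVM_PROFILE_FILE="lua-%p.profraw" set, then merge and rebuild:
llvm-profdata merge -o lua_all.profdata lua-*.profraw
make clean
make CC=clang MYCFLAGS="-fprofile-instr-use=$PWD/lua_all.profdata"
mv src/lua lua_pgo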

The entire test suite contains only 5 Lua test programs. Out of these five, I’m going to use the following four, invoked with the following parameters:

binarytrees/1.lua 18
merkletrees/1.lua 17
nbody/4.lua 5000000
spectral-norm/1.lua 2000
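
Each of these is wrapped in a small shell script (binarytree.sh, merkletree.sh, and so on - visible in the benchmark output below) that takes the interpreter binary as its first argument, so the same workload can be benchmarked against different builds. The wrappers aren’t shown here, but a hypothetical binarytree.sh boils down to:

#!/usr/bin/env bash
# run the binary-trees benchmark with whichever interpreter is passed in
"$1" binarytrees/1.lua 18

which makes invocations like hyperfine "./binarytree.sh ./lua_pgo" possible.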

The parameters were chosen to ensure a sufficient running time. After running the benchmarks, I’m left with the following results:

Benchmark 1: /home/tomasz/testscripts/lua/binarytree.sh ./lua_pgo
  Time (mean ± σ):     13.902 s ±  0.070 s    [User: 13.766 s, System: 0.110 s]
  Range (min … max):   13.821 s … 14.029 s    10 runs

Benchmark 1: /home/tomasz/testscripts/lua/merkletree.sh ./lua_pgo
  Time (mean ± σ):     13.471 s ±  0.200 s    [User: 13.336 s, System: 0.103 s]
  Range (min … max):   13.232 s … 13.954 s    10 runs

Benchmark 1: /home/tomasz/testscripts/lua/nbody.sh ./lua_pgo
  Time (mean ± σ):     17.965 s ±  0.467 s    [User: 17.944 s, System: 0.003 s]
  Range (min … max):   17.484 s … 18.814 s    10 runs

Benchmark 1: /home/tomasz/testscripts/lua/spectral_norm.sh ./lua_pgo
  Time (mean ± σ):     11.486 s ±  0.038 s    [User: 11.473 s, System: 0.004 s]
  Range (min … max):   11.418 s … 11.535 s    10 runs

Benchmark 1: /home/tomasz/testscripts/lua/binarytree.sh ./lua_vanilla
  Time (mean ± σ):     14.708 s ±  0.052 s    [User: 14.561 s, System: 0.123 s]
  Range (min … max):   14.623 s … 14.793 s    10 runs

Benchmark 1: /home/tomasz/testscripts/lua/merkletree.sh ./lua_vanilla
  Time (mean ± σ):     15.046 s ±  0.260 s    [User: 14.920 s, System: 0.097 s]
  Range (min … max):   14.671 s … 15.510 s    10 runs

Benchmark 1: /home/tomasz/testscripts/lua/nbody.sh ./lua_vanilla
  Time (mean ± σ):     14.872 s ±  0.448 s    [User: 14.857 s, System: 0.002 s]
  Range (min … max):   14.232 s … 15.717 s    10 runs

Benchmark 1: /home/tomasz/testscripts/lua/spectral_norm.sh ./lua_vanilla
  Time (mean ± σ):     10.195 s ±  0.117 s    [User: 10.181 s, System: 0.005 s]
  Range (min … max):   10.094 s … 10.491 s    10 runs

/pgo/lua_pgo.png
Lua benchmarks

These results are a bit surprising. PGO worked for the first two test programs, but it made things worse for nbody and spectral_norm. This suggests that the overall set of optimisations applied is biased towards some of the test programs at the expense of others.

As a last experiment, I wondered whether applying PGO with profiling data collected only from the programs that were negatively affected would make any difference. Let’s find out.
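
The procedure mirrors the earlier one, just with a narrower profile. A sketch for spectral_norm, assuming the raw profiles from its instrumented runs are named lua_spectral-*.profraw (hypothetical names):

LLVM_PROFILE_FILE="lua_spectral-%p.profraw" hyperfine "./spectral_norm.sh ./lua_instrumented"
llvm-profdata merge -o lua_spectral.profdata lua_spectral-*.profraw
make clean
make CC=clang MYCFLAGS="-fprofile-instr-use=$PWD/lua_spectral.profdata"
mv src/lua lua_pgo_spectral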

After rebuilding and re-running the spectral_norm benchmark, the raw results are actually still slightly worse than for an unoptimised Lua build:

Benchmark 1: /home/tomasz/testscripts/lua/spectral_norm.sh ./lua_pgo_spectral
  Time (mean ± σ):     10.952 s ±  0.028 s    [User: 10.938 s, System: 0.004 s]
  Range (min … max):   10.904 s … 10.982 s    10 runs

My suspicion is that spectral_norm exercises a substantial portion of the Lua interpreter itself, so there is simply no profile-guided optimisation that can make a positive impact. But this is just a working theory, and I haven’t explored it well enough to understand the details.
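
One way to probe this theory would be clang’s source-based code coverage, which rides on the same instrumentation; I haven’t done it here, but a rough sketch would be to rebuild with -fcoverage-mapping added and let llvm-cov report how much of the interpreter a single benchmark touches:

make clean
make CC=clang MYCFLAGS="-fprofile-instr-generate -fcoverage-mapping"
mv src/lua lua_cov
LLVM_PROFILE_FILE="cov.profraw" ./lua_cov spectral-norm/1.lua 2000
llvm-profdata merge -o cov.profdata cov.profraw
llvm-cov report ./lua_cov -instr-profile=cov.profdata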

Conclusion

PGO can be a mixed bag. It’s definitely not a one-size-fits-all solution that will unequivocally improve performance every time it’s applied. It might make things slightly better for a very particular use case, but at the same time sacrifices might be made outside of that use case. Achieving stable, reproducible gains may require a lot of trial and error.

Just for my own reference, here are the links to the gnuplot scripts.