Contents

Profiling and code instrumentation

I’ve recently watched a video from TheCherno about visual benchmarking.

In short, he’s using a simple timer class in C++ to collect execution timings of his code and then generate a JSON file containing profiling data which can be visualised using chrome tracing.

He’s using a set of macros to automatically get function names when instrumenting the code.

This gave me an idea: use gcc instrumentation to inject the timers globally into the entire program and collect the data without having to manually instrument the code. Additionally, I’m gonna experiment with LLVM XRay instrumentation which makes the whole process even simpler.

Timers

Just like “Cherno” did, let’s start with an implementation of a simple timer.

#include <chrono>
#include <string>
#include <utility>

template <typename ClockT>
class NamedTimer {
 public:
  using clock_t = ClockT;
  using time_point_t = typename std::chrono::time_point<ClockT>;
  using duration_t = typename ClockT::duration;

  explicit NamedTimer(std::string name)
      : startedAt{clock_t::now()},
        stoppedAt{startedAt},
        name{std::move(name)} {}

  time_point_t stop() {
    stoppedAt = clock_t::now();
    return stoppedAt;
  }

  duration_t elapsed() const { return stoppedAt - startedAt; }

  std::string getName() const { return name; }

  time_point_t getStartedAt() const { return startedAt; }

  time_point_t getStoppedAt() const { return stoppedAt; }

 private:
  time_point_t startedAt;
  time_point_t stoppedAt;
  std::string name;
};

Additionally, I’m gonna write a simple class that uses this timer, RAII style. It takes a name (used to instantiate the timer) and two callbacks: the first is called in the constructor and the second in the destructor. This allows probing the timestamps on scope entry and exit. Here’s the implementation:

#include <functional>

template <typename ClockT>
class ScopedProbe {
 public:
  using timer_t = NamedTimer<ClockT>;
  using callback_t =
      std::function<void(std::string name, typename timer_t::time_point_t ts)>;

  ScopedProbe(std::string name, callback_t entry_cb, callback_t exit_cb)
      : t{name}, entry_cb{std::move(entry_cb)}, exit_cb{std::move(exit_cb)} {
    this->entry_cb(t.getName(), t.getStartedAt());
  }

  ~ScopedProbe() { exit_cb(t.getName(), t.stop()); }

 private:
  timer_t t;
  callback_t entry_cb;
  callback_t exit_cb;
};

This can be used the following way (the callbacks are just to demonstrate the principle - they’re not exactly useful in their current shape and form):

using HrScopedProbe = ScopedProbe<std::chrono::high_resolution_clock>;

std::ostream &operator<<(
    std::ostream &os,
    const std::chrono::time_point<std::chrono::high_resolution_clock> &ts) {
  const std::time_t tc = std::chrono::system_clock::to_time_t(ts);
  os << std::put_time(std::localtime(&tc), "%F %T");
  return os;
}


void foo() {
    // Note: the probe must be a named object. An unnamed temporary would be
    // destroyed at the end of the full expression instead of at scope exit.
    HrScopedProbe probe("foo", [](auto name, auto ts) {
        std::cout << "Entered: " << name << "@" << ts << std::endl;
    },
    [](auto name, auto ts) {
        std::cout << "Exited: " << name << "@" << ts << std::endl;
    });
    // foo's code
}

This is a bit too involved and can easily be simplified to:

#define CONCAT_IMPL(a, b) a##b
#define CONCAT(a, b) CONCAT_IMPL(a, b)

// __LINE__ needs the CONCAT indirection to expand before pasting; __func__
// can't be pasted at all, since it's a variable, not a macro.
#define SCOPED_PROFILE()                                                 \
  ScopedProbe<std::chrono::high_resolution_clock> CONCAT(st_, __LINE__)( \
      __func__,                                                          \
      [](auto n, auto ts) {                                              \
        std::cout << "Entered: " << n << "@" << ts << std::endl;         \
      },                                                                 \
      [](auto n, auto ts) {                                              \
        std::cout << "Exited: " << n << "@" << ts << std::endl;          \
      })


void foo() {
    SCOPED_PROFILE();
    // foo's code
}

This simple code can measure the execution of any scope. Introducing this to any code base shouldn’t pose a challenge at all as there are literally no dependencies. Let’s take this a step further though.

Trace file format

The file format that about:tracing accepts is described in detail here.

Now, I’m gonna need a couple more functions to generate event data that the chrome tracing tools can accept:

#include <sstream>
#include <string>
#include <thread>

#include <unistd.h>  // ::getpid

#include <nlohmann/json.hpp>

using json = nlohmann::json;

json events;

std::string getTid() {
  std::stringstream ss;
  auto tid = std::this_thread::get_id();
  ss << tid;
  return ss.str();
}

template <typename TimePointT>
void emitEvent(std::string name, std::string phase, TimePointT ts) {
  using namespace std::chrono;

  // some hard-coded data
  std::string category = "PERF";

  json j;
  j["cat"] = category;
  j["name"] = name;
  j["ph"] = phase;
  j["pid"] = ::getpid();
  j["tid"] = getTid();
  j["ts"] = duration_cast<microseconds>(ts.time_since_epoch()).count();

  events.push_back(j);
}
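
One caveat worth noting: `events` is a plain global, so calling `emitEvent` concurrently from several threads is a data race. A minimal thread-safe variant could serialise access with a mutex. Here’s a sketch (using a `std::vector<std::string>` stand-in for the json array so the snippet stays dependency-free; `emitEventSafe` is my name, not part of the code above):

```cpp
#include <chrono>
#include <mutex>
#include <sstream>
#include <string>
#include <vector>

// Stand-ins for the globals above.
std::vector<std::string> safe_events;
std::mutex events_mtx;

template <typename TimePointT>
void emitEventSafe(const std::string &name, const std::string &phase,
                   TimePointT ts) {
  using namespace std::chrono;

  // Format the record outside the critical section.
  std::ostringstream line;
  line << "{\"name\":\"" << name << "\",\"ph\":\"" << phase << "\",\"ts\":"
       << duration_cast<microseconds>(ts.time_since_epoch()).count() << "}";

  // Serialise only the container mutation.
  std::lock_guard<std::mutex> lock(events_mtx);
  safe_events.push_back(line.str());
}
```

The same `std::lock_guard` wrapped around `events.push_back(j)` would make the nlohmann version safe too.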

Now, with emitEvent, ScopedProbe callbacks can actually do something useful.

#define CONCAT_IMPL(a, b) a##b
#define CONCAT(a, b) CONCAT_IMPL(a, b)

#define SCOPED_PROFILE()                                                 \
  ScopedProbe<std::chrono::high_resolution_clock> CONCAT(st_, __LINE__)( \
      __func__, [](auto n, auto ts) { emitEvent(n, "B", ts); },          \
      [](auto n, auto ts) { emitEvent(n, "E", ts); })

This will produce “B” and “E” events, which indicate the beginning and end of an event. For the purpose of profiling, an event spans the duration of a function’s scope. I’m using nlohmann::json to generate the JSON array containing the event data. To test this, I’m gonna use this very simple program:
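
As an aside, the trace event format also supports complete events: a single record with "ph": "X" plus a dur field carrying the duration in microseconds, which halves the number of records compared to separate "B"/"E" pairs. A rough sketch of such an emitter (plain string output here to stay dependency-free; emitCompleteEvent is my own name, not part of the format):

```cpp
#include <chrono>
#include <sstream>
#include <string>

// Emits a single complete ("X") trace event covering [begin, end).
// "ts" and "dur" are both expressed in microseconds, as the format requires.
template <typename TimePointT>
std::string emitCompleteEvent(const std::string &name, TimePointT begin,
                              TimePointT end) {
  using namespace std::chrono;
  const auto ts = duration_cast<microseconds>(begin.time_since_epoch()).count();
  const auto dur = duration_cast<microseconds>(end - begin).count();
  std::ostringstream j;
  j << "{\"cat\":\"PERF\",\"name\":\"" << name << "\",\"ph\":\"X\","
    << "\"ts\":" << ts << ",\"dur\":" << dur << "}";
  return j.str();
}
```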

#include <chrono>
#include <iostream>
#include <thread>

void func1() {
  SCOPED_PROFILE();
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
}

void func2() {
  SCOPED_PROFILE();
  std::this_thread::sleep_for(std::chrono::milliseconds(200));
}

void func3() {
  SCOPED_PROFILE();
  func1();
  func2();
}

int main() {
  {
    SCOPED_PROFILE();
    func3();
  }

  std::cout << events.dump() << std::endl;
  return 0;
}

The program should output a JSON array of event objects looking like so:

[
  {
    "cat": "PERF",
    "name": "main",
    "ph": "B",
    "pid": 6542,
    "tid": "127067738916096",
    "ts": 1712828677098843
  },
  {
    "cat": "PERF",
    "name": "func3",
    "ph": "B",
    "pid": 6542,
    "tid": "127067738916096",
    "ts": 1712828677098997
  },
  {
    "cat": "PERF",
    "name": "func1",
    "ph": "B",
    "pid": 6542,
    "tid": "127067738916096",
    "ts": 1712828677099066
  },
  {
    "cat": "PERF",
    "name": "func1",
    "ph": "E",
    "pid": 6542,
    "tid": "127067738916096",
    "ts": 1712828677199298
  },
...
]

When visualised using about:tracing this should resemble the following execution timeline:

/profiling/profile1.png

The code for this example can be found here.

Gcc function instrumentation

This is all cool when you want to focus on a specific subsystem. The problem is that it would require a lot of work to inject these profiling probes all over the place in a large code base. My first thought was to try gcc’s function instrumentation (after all, this is how LTTng’s function tracing works). Gcc can insert special calls to __cyg_profile_func_enter and __cyg_profile_func_exit at build time for every function in the program. All you have to do is compile your code with -finstrument-functions. That way, execution timings can be collected without much effort. More information is available in gcc’s manuals.

Your program needs to define two functions:

extern "C" {

void __cyg_profile_func_enter(void *this_fn, void *call_site) {}
void __cyg_profile_func_exit(void *this_fn, void *call_site) {}

}

There are two important things to bear in mind though:

  1. Symbol names
  2. Infinite recursion

Symbol names

All we get in these callbacks is the address of the function that is about to be called. No symbols at all. This can be solved with libdl and dladdr. To do that, I’ve written a small helper function:

#include <dlfcn.h>

#include <sstream>
#include <string>

std::string resolveSym(void *addr) {
  ::Dl_info info;
  std::string sym;
  if (auto dlres = ::dladdr(addr, &info);
      dlres != 0 && info.dli_sname != nullptr) {
    sym = info.dli_sname;
  } else {
    std::stringstream ss;
    ss << std::hex << addr;
    sym = ss.str();
  }
  return sym;
}
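
One more wrinkle: for C++ code, dli_sname will hold the mangled symbol name (e.g. _Z3foov rather than foo()). It can be prettified with abi::__cxa_demangle from <cxxabi.h>. Here’s a sketch of a helper (my own addition, not part of the original code):

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <memory>
#include <string>

// Demangles an Itanium-ABI C++ symbol; returns the input unchanged on failure
// (e.g. for plain C symbols, which aren't mangled).
std::string demangle(const std::string &mangled) {
  int status = 0;
  std::unique_ptr<char, decltype(&std::free)> demangled{
      abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status),
      &std::free};
  return (status == 0 && demangled) ? demangled.get() : mangled;
}
```

resolveSym could then return demangle(info.dli_sname) instead of the raw name.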

There’s a small gotcha with dladdr: it works for shared libraries, which contain a dynamic symbol table. To make it somewhat work with standalone executables, inclusion of the executable’s symbols can be forced with a linker flag. From ld’s documentation on -E / --export-dynamic / --no-export-dynamic:

When creating a dynamically linked executable, using the -E option or the --export-dynamic option causes the linker to add all symbols to the dynamic symbol table. The dynamic symbol table is the set of symbols which are visible from dynamic objects at run time.

Having that, I’m ready to define the callbacks:

extern "C" {

void __cyg_profile_func_enter(void *this_fn, void *call_site) {
  auto ts = std::chrono::high_resolution_clock::now();
  emitEvent(::resolveSym(this_fn), "B", ts);
}

void __cyg_profile_func_exit(void *this_fn, void *call_site) {
  auto ts = std::chrono::high_resolution_clock::now();
  emitEvent(::resolveSym(this_fn), "E", ts);
}

} // extern "C"

Infinite recursion

The problem with -finstrument-functions is that it’s somewhat tedious to get it to work. What do I mean by that? Well, as I said, the compiler will instrument all functions. That includes __cyg_profile_func_enter and __cyg_profile_func_exit themselves (unless you put them in a separate library built with instrumentation turned off), which of course leads to infinite recursion. Additionally, any function called within these callbacks will trigger a callback as well - another infinite recursion entry point. Fortunately, gcc provides attributes and compiler flags to form an ignore list and break these cycles. Specifically:

  • __attribute__((no_instrument_function)) - which can be added to function declarations for functions that should not be instrumented,
  • -finstrument-functions-exclude-file-list=file,file,... - takes a comma separated list of file path patterns to ignore (all functions defined in files that match the pattern will not be instrumented),
  • -finstrument-functions-exclude-function-list=sym,sym,... - takes a comma separated list of function names to ignore (important detail: the symbol names listed here shouldn’t be mangled).
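
For illustration, the attribute route applied to the callbacks themselves could look roughly like this (a sketch; the counters are only there to make the example observable and are not part of the real callbacks):

```cpp
#include <cstddef>

static std::size_t enter_count = 0;
static std::size_t exit_count = 0;

extern "C" {

// The attribute keeps gcc from instrumenting the hooks themselves, which
// would otherwise recurse forever under -finstrument-functions.
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site) {
  (void)this_fn;
  (void)call_site;
  ++enter_count;
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site) {
  (void)this_fn;
  (void)call_site;
  ++exit_count;
}

}  // extern "C"
```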

In my test project, I’ve put the instrumentation callbacks in a separate shared library (libcyg.so). The test executable will just link this library in. To make things simple, I’ve added the following compiler flags to ignore the symbols of functions used within instrumentation callbacks:

meson setup --buildtype debugoptimized bld

meson configure -Dstrip=false bld

meson configure \
    -Dcpp_args="-finstrument-functions \
    -finstrument-functions-exclude-file-list=/usr/lib,/usr/include,json,/lib" \
    bld

meson configure -Dcpp_link_args="-Wl,-E" bld

meson compile -C bld

Within the library, I’ve marked all functions that are used within __cyg* callbacks with __attribute__((no_instrument_function)).

After building the project, I’m left with the same results as with manual instrumentation, but I didn’t have to manually insert any profiling probes. I’m on the right path.

The code for this example can be found here.

Instrumenting bash

I was wondering what kind of problems and obstacles I’d have to deal with when applying this approach to production software. It’s also a good test of whether this technique is of any use for a real-life purpose. Completely arbitrarily, I’ve chosen bash as my test subject to find out. I’m gonna use libcyg from my previous test project. To avoid compiling an extensive list of functions which shouldn’t be instrumented, I’m gonna build libcyg with instrumentation turned off. To inject the instrumentation, I’ll have to link with this library and add the required compiler flags to bash’s build system.

In my test scenario, I’m gonna collect profiling data from a bash -c invocation. To do that, I’ll have to add a dumpEvents function call (from libcyg) to shell.c. Here’s the patch:

diff --git a/shell.c b/shell.c
index ebd8965..2bb052c 100644
--- a/shell.c
+++ b/shell.c
@@ -25,6 +25,8 @@
 */
 #define INSTALL_DEBUG_MODE
 
+#include "cyg.hpp"
+
 #include "config.h"
 
 #include "bashtypes.h"
@@ -761,6 +763,7 @@ main (argc, argv, env)
 #if defined (ONESHOT)
       executing = 1;
       run_one_command (command_execution_string);
+      dumpEvents();
       exit_shell (last_command_exit_value);
 #else /* ONESHOT */
       with_input_from_string (command_execution_string, "-c");
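
For context, dumpEvents on the libcyg side just needs to wrap the collected array in valid JSON and write it to disk. Something along these lines would do (a sketch with a std::vector<std::string> stand-in for the nlohmann array; the actual libcyg implementation may differ, and "trace.json" is my placeholder filename):

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Stand-in for the global events array (nlohmann::json in the real code).
std::vector<std::string> collected;

// Declared extern "C" so that C code (bash's shell.c) can call it.
extern "C" void dumpEvents(void) {
  std::ofstream out("trace.json");
  out << "[";
  for (std::size_t i = 0; i < collected.size(); ++i) {
    if (i != 0) out << ",";
    out << collected[i];
  }
  out << "]\n";
}
```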

I’ve copied both cyg.hpp and the prebuilt libcyg.so from my previous experiment. bash doesn’t use automake (in the sense that there’s no Makefile.am). It comes with Makefile.in, which is a template populated by configure. Initially I thought I’d have to inject the instrumentation flags there, but they can just be given in a call to configure:

export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH
export CFLAGS="-g -O0 -I$PWD -finstrument-functions -finstrument-functions-exclude-function-list=free,malloc,signal_is_trapped,pagealign,morecore,bsplit,calloc,bcoalesce -finstrument-functions-exclude-file-list=/usr,/usr/include,json,/lib,/lib64,libc,/usr/lib,libthread"
export LDFLAGS="-Wl,-E -L$PWD -lcyg"
./configure

As you can see, both the file-exclude patterns and the symbol list are quite elaborate. This is the biggest drawback of this technique. Determining and defining the set of functions which shouldn’t be instrumented requires some trial and error. I imagine that in a really complex project, where deployment is a multi-step process, it might even be a deal breaker. Once I managed to break all the recursive loops, I was able to collect some profiling data.

Here’s a bird’s-eye view of ./bash -c "echo hello":

/profiling/bash_profile0.png

It’s possible to zoom into specific execution fragments.

/profiling/bash_profile1.png

Zoomed in even more:

/profiling/bash_profile2.png

In general, this technique is very useful to get a complete overview of what’s happening within the code. But… there’s something even more convenient.

clang XRay instrumentation

There’s actually a much easier and quicker way to instrument the code and collect profiling data. clang’s XRay instrumentation is a batteries-included, all-in-one solution which both instruments the code automatically and provides the necessary tooling to work with the collected data. LLVM provides a great example of how to get started with XRay, and I very much recommend that document as a starting point.

clang XRay bash

This is gonna be surprisingly easy.

./configure CC=clang CFLAGS="-g -O0 -fxray-instrument"
make

After rebuilding, I’m gonna check for XRay instrumentation section presence (just as the example suggests):

objdump -h -j xray_instr_map ./bash

./bash:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
 17 xray_instr_map 0000e400  000000000019ad60  000000000019ad60  0019ad60  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA

There are no further prerequisites. I just need to run the program:

$ XRAY_OPTIONS="patch_premain=true xray_mode=xray-basic verbosity=1" ./bash -c "echo hello"
==197281==XRay: Log file in 'xray-log.bash.oA36XX'
hello
==197281==Cleaned up log for TID: 197281

LLVM comes with some useful tools. It’s possible to get the classic gprof-like output:

llvm-xray account ./xray-log.bash.oA36XX --instr_map=./bash --top=10 --sortorder=dsc 
Functions with latencies: 63
   funcid      count [      min,       med,       90p,       99p,       max]       sum  function    
      912          2 [ 0.000005,  0.000017,  0.000017,  0.000017,  0.000017]  0.000022  malloc.c:400:0: bcoalesce
      911          4 [ 0.000006,  0.000035,  0.000051,  0.000051,  0.000051]  0.000105  malloc.c:469:0: bsplit
      910          1 [ 0.000029,  0.000029,  0.000029,  0.000029,  0.000029]  0.000029  malloc.c:762:0: pagealign
      909        980 [ 0.000006,  0.000009,  0.000013,  0.000021,  0.000058]  0.009126  malloc.c:627:0: morecore
      908          6 [ 0.000005,  0.000006,  0.000007,  0.000007,  0.000007]  0.000035  malloc.c:954:0: internal_free
      907          1 [ 0.000013,  0.000013,  0.000013,  0.000013,  0.000013]  0.000013  malloc.c:1151:0: internal_realloc
      906        993 [ 0.000005,  0.000017,  0.000022,  0.000042,  0.000069]  0.016910  malloc.c:832:0: internal_malloc
      708          3 [ 0.000007,  0.000009,  0.000017,  0.000017,  0.000017]  0.000033  mbschr.c:49:0: mbschr
      655          1 [ 0.000039,  0.000039,  0.000039,  0.000039,  0.000039]  0.000039  pathcanon.c:107:0: sh_canonpath
      648          1 [ 0.000007,  0.000007,  0.000007,  0.000007,  0.000007]  0.000007  getenv.c:52:0: getenv

… or to generate a detailed call graph in dot format:

llvm-xray graph --instr_map=./bash xray-log.bash.oA36XX >callgraph.dot

/profiling/xray_callgraph.png

Or even directly generate trace event compatible execution profile that can be viewed in chrome tracing tool:

llvm-xray convert --instr_map=./bash --symbolize --output-format=trace_event xray-log.bash.oA36XX >trace_events.json

/profiling/xray_profile.png

The output is even better than with gcc’s instrumentation as clang has resolved all symbols for us!

Conclusion

The chrome tracing tool is a great data visualiser, and I’m happy to have learned about it completely by accident. It’s something I’d definitely recommend trying out, not only to measure program execution details but also as a quick way to understand the program flow in general. I’ve barely scratched the surface of both chrome’s tracing tool and clang’s XRay instrumentation, but hopefully I’ve managed to give a good example of how useful these tools can be.

Additionally, it seems that the chrome tracing tool also accepts timing data in ftrace format. That’s a kernel-specific tool, but according to some discussions referring to Android, it’s possible to use it from user space as well.