Perf - Perfect Profiling of C/C++ on Linux

#profiling #linux #cpp #c

I've seen power-Visual Studio users do amazing things with the profiler and debugger. It looks amazing, but there is one small problem. I'm on Linux. No visual studio for me. How do I profile now? There are some profilers out there for Linux too, each with varying degrees of usability.

Perf is a neat little tool that I just found for profiling programs. Perf uses statistical profiling, where it polls the program and sees what function is working. This is less accurate, but has less of a performance hit than something like Callgrind, which tracks every call. The results are still reasonable accurate, and even with fewer samples, it will show which functions are taking a lot of time, even if it misses functions that are very fast (which are probably not the ones you are looking for while profiling anyway).

With a rather contrived example, we implement the infamous Fibonacci sequence in the most pedagogical way possible. Basic Recursion.



// fib.c
#include <stdio.h>
#include <stdlib.h>

int fib(int x) {
  if (x == 0) return 0;
  else if (x == 1) return 1;
  return fib(x - 1) + fib(x - 2);
}

int main(int argc, char *argv[]) {

  for (size_t i = 0; i < 45; ++i) {
    printf("%d\n", fib(i));
  }
  return 0;
}

Running the program with time gives us:



$ time ./fib

...

./fib  15.05s user 0.00s system 99% cpu 15.059 total

It seems that the program stops outputting fast enough shortly after 40 numbers. My machine is able to run 40 numbers in ~ 2 seconds, but if you bump that up to 45, it takes a whopping 15 seconds. But being completely naive, I can't spot the issue. I need to profile it.



$ perf record ./fib

Running the program again using this command generates the perf.data file, which contains all of the time information of our program. Other than creating this new file, nothing exciting shows up. It says how many times perf woke up, the number of samples, and the size of the perf.data file.



[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 2.336 MB perf.data (60690 samples) ]

Things get interesting when we run:



perf report

which gives us the following view

Okay, great. So basically, it doesn't tell us too much. It does say that we're spending all of our time in the fib function, but not much else. We can hit enter with the fib function highlighted, we get a few options. One is to annotate the function. This shows us a disassembly of the instructions in the function, with the percentage of time being taken by each one. So... we can optimize at the assembly level I guess? Seems a bit extreme. That's a little too low-level for most of us mortals.

There is actually some helpful information in this view if we're clever though. We'll get back on that later.

By default, perf only collects time information. We probably want to see a callgraph since we're making calls. Maybe the base case is the culprit (returning a single number is always suspicious after all (laugh with me here, that's a joke)).

To get the call graph, we pass the -g option to perf record.



$ perf record -g ./fib

And wait for it to run again... Then run perf report again to see the new view.

Not super different. Neat! Okay, so there are some differences between the first run and the second. Namely, instead of having an overhead column, it now has a children and self column, as well as some + signs next to some function calls.

Children refers to the overhead of subsequent calls, while self refers to the cost of the function itself. A high children overhead, and a low self overhead indicates that the function makes a call that is expensive, but a high self overhead indicates that the function itself is expensive.

The + indicates that the function makes a call that is accessible. Not all functions are follow-able, likely due to the optimizer striping the function names (or strip -d which strips debugging symbols) to shrink the binary.

Highlighting a row and hitting enter will expand it, showing the functions called. Note that some older versions of perf show the callees by default, not the the functions being called from the function we're looking at. If you are using an older version it might be necessary to change this behaviour by calling record with perf record -g 'graph,0.5,caller'.

In my case, I expanded the call to the main function, which doesn't spend much time inside of itself. Inside, there are a bunch of calls the the 'fib' function. The bottom one has a + symbol next to it, indicating that it can be expanded. expanding it shows another call to fib, with a + symbol.

So.. eventually, we get this big long chain of calls to fib from fib. So recursion strikes again!
Going back to the disassembly, most of the time is spent doing the mov %rsp,%rbp which are the registers responsible for the stack and frame pointers in the x86 architecture. Since we see that there is a chain of calls to the fib function, and this particular instruction is quite hot, there's a good indication that something about the recursion is to blame. Now we know what to fix. Fixing it is up to you. Some solutions might include memoization, or emulating recursion with a stack and a while loop, or even a combination of both, but that is beyond the scope of this post.

While this is a contrived example, it gives an introduction to profiling using the perf tool. There are many additional features that are not covered here, but this should be enough to get you started with profiling using perf. The results from perf are not perfect, but they are usable. There are other tools out there, but I rather enjoy the interface with perf, and that it doesn't hurt runtime performance very much. What do you think? What are your favourite profilers and tools for profiling?

Top comments (4)

Jules Pénuchot • Nov 18 '17

Intel IACA is the tool I used the most to develop high-performance SIMD optimized kernels. It calculates the throughput of a portion of code and shows critical sections of your resulting assembly (saturated execution ports, register usage graph etc.).

It's a great way to compare two versions of a same code, especially when you want to fine tune your program by trying to push your compiler to use more optimized instructions.

software.intel.com/en-us/articles/...

But let's be honest: at the end, only benchmark numbers matter, which is why I also use Google Benchmark.

github.com/google/benchmark

Evan Wilde • Nov 21 '17

I was not aware of these tools and they look pretty good. I'll have to take a look into them.

Jules Pénuchot • Nov 26 '17

If you don't want to rely on Intel you can just surround the part of your program you want to inspect with a few asm("nop"); in your source and look for a bunch of nop instructions in your disassembled program. However you won't get any information about the throughput of your program.

I made a quick project to compare ASM programs very fast. The shell and CMake scripts take care of building all the programs in every "src__*" folder they find, disassemble them and eventually do the IACA inspection and dependency graph generation in one command. It's not a big thing but it helps when you want to inspect more than one program using IACA without modifying your program, etc...