I am currently working on a C++ project and recently ran into some strange performance behavior.
The gist of it is that two very similar pieces of my code performed differently. The major difference between them is that one is encapsulated in a function, which makes me suspect the overhead of function calls.
If you have ever dumped the assembly of your C++ code or learned some basic assembly, you may know that there is a bunch of stack pushing and popping around each function call. To visualize this, let me show you an example below.
```cpp
void func(int *x)
{
    (*x)++;
}
```
Given the function `func` above, compiled with `g++ -O0`, using `objdump -D` to dump its assembly gives the following code (annotated):
```asm
push %rbp                # save the caller's frame pointer
mov  %rsp,%rbp           # set up this function's stack frame
mov  %rdi,-0x8(%rbp)     # spill the argument x onto the stack
mov  -0x8(%rbp),%rax     # reload x from the stack
mov  (%rax),%eax         # load *x
lea  0x1(%rax),%edx      # compute *x + 1
mov  -0x8(%rbp),%rax     # reload x once more
mov  %edx,(%rax)         # store the result back to *x
nop
pop  %rbp                # restore the caller's frame pointer
retq                     # return to the caller
```
In short, those stack instructions save register contents, such as data belonging to the caller (e.g., `main`), that we don't want this function to overwrite. You can see that in order to do `(*x)++`, which is represented by `lea 0x1(%rax),%edx`, we need a lot of extra instructions, which I will consider overhead.
My machine's microarchitecture is Haswell. According to Agner Fog's instruction tables, each `mov` typically consumes 1 cycle on my machine. However, that cycle can be amortized (hidden by pipelining, superscalar execution, and out-of-order execution) if there is no data hazard. In this case, `mov %rdi,-0x8(%rbp)`, `mov -0x8(%rbp),%rax`, and `mov (%rax),%eax` depend on each other and create data hazards, so the `mov`s before the `lea` consume roughly 3 cycles. The other `mov`s appear to be independent (no RAW hazards), so let's amortize them to 0 cycles. The `lea` consumes 1 cycle and cannot be amortized, since it depends on the previous `mov`. Therefore, this function call requires at least 4 cycles.
How about without the function call? Well, it is just the code below.

```cpp
x++;
```
Its binary dumps to the following assembly.

```asm
addl $0x1,-0x4(%rbp)     # increment x directly in memory
```
Pretty impressive. Only a single instruction, with no overhead! According to Agner Fog's instruction tables, `addl` typically consumes 1 cycle.
From both analyses, we find that the without-function-call method should theoretically be about 4 times faster than the function-call method.
Let's see how this works in action!
I simply implemented the code below.
```cpp
for (size_t i = 0; i < N; i++)
{
#ifdef USE_FUNC
    func(&x);
#else
    x++;
#endif
}
```
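For reference, here is a minimal, self-contained sketch of such a benchmark. The `std::chrono` timing and the final `printf` are assumptions for illustration, not necessarily the exact timing code behind the measurements below.

```cpp
// Build: g++ -O0 [-D USE_FUNC] bench.cpp
#include <chrono>
#include <cstddef>
#include <cstdio>

static const std::size_t N = 1000000;

void func(int *x)
{
    (*x)++;
}

int main()
{
    int x = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; i++)
    {
#ifdef USE_FUNC
        func(&x);   // function-call method
#else
        x++;        // without-function-call method
#endif
    }
    auto end = std::chrono::steady_clock::now();
    long long us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::printf("x = %d, time = %lld us\n", x, us);
    return 0;
}
```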
There is timing code before and after this for loop, as in the sketch above. I compiled once with `-D USE_FUNC` and once without, where `N` is 1000000, and ran both. As a result, I got a 1.26x speedup with the without-function-call method. To be precise, the numbers below show each execution time in microseconds.
Method | Execution Time (in µs) |
---|---|
Function Calls | ~2900 |
Without Function Calls | ~2300 |
Given these numbers, the next question is: why not 4 times?
The reason is loop overhead. When I timed only an empty loop, I got approximately 2300 µs, which is remarkably close to the without-function-call method. The reason is that `x++;` gets amortized by `i++`, the loop counter increment: the two are independent, so the superscalar core executes them in parallel and the increment is effectively free.
It seems the benchmark is not accurate. We need to make sure that `x++;` does not get amortized. One way is to not use a loop at all, as sketched below.
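One way to repeat a statement many times without writing it out by hand is nested preprocessor macros. A sketch; the macro names here are made up for illustration:

```cpp
// Repeat a statement 10, 100, or 1000 times without a loop.
#define REP10(stmt)   stmt stmt stmt stmt stmt stmt stmt stmt stmt stmt
#define REP100(stmt)  REP10(REP10(stmt))
#define REP1000(stmt) REP10(REP100(stmt))

// Inside the timed region:
//   REP1000(x++;)        // expands to 1000 copies of x++;
//   REP1000(func(&x);)   // expands to 1000 copies of func(&x);
```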
To be honest, it is hard to duplicate `x++;` or `func(&x);` 1000000 times (the source and the binary would get huge), so I reduced the count to 1000. The result is a 3.35x speedup with the without-function-call method. The table below shows more precise numbers in nanoseconds (note: ns this time, not µs).
Method | Execution Time (in ns) |
---|---|
Function Calls | ~9400 |
Without Function Calls | ~2800 |
Since I timed in nanoseconds, the timing overhead matters: it is about 800 ns on my machine. Subtracting it from both measurements, I get approximately a 4.3x speedup with the without-function-call method, as shown in the table below.
Method | Execution Time (in ns) |
---|---|
Function Calls | ~8600 |
Without Function Calls | ~2000 |
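As a side note, one way to estimate the timing overhead itself is to time an empty region. A sketch, assuming `std::chrono` as before (the ~800 ns figure above was not necessarily obtained exactly this way):

```cpp
// Estimate the fixed cost of the timing calls themselves.
#include <chrono>
#include <cstdio>

int main()
{
    auto start = std::chrono::steady_clock::now();
    // empty: only the two clock reads are measured
    auto end = std::chrono::steady_clock::now();
    long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    std::printf("timing overhead: ~%lld ns\n", ns);
    return 0;
}
```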
Now this makes sense, since the overhead analysis above predicted at least a 4x speedup for the without-function-call method.
Then, let's recap all of this with the question.
Do function calls hinder performance?
Yes, they do. The reason is the overhead of the function calls. However, if the function does a lot of work, that work can amortize the overhead. You can also largely eliminate these overheads by increasing the compiler's optimization level.
Erratum
Thanks to Pierre Gradot for pointing this out:
- All of these numbers were gathered with `g++` at `-O0` (without compiler optimization). However, `g++` actually optimizes the function `func` in the example into a single `add` instruction, which is the same as the without-function-call method. So, if you compile the example with optimization enabled, you will see no difference between with and without function calls.
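As a side note, if you want the call itself to survive optimization (for instance, to keep measuring call overhead at `-O1` and above), GCC and Clang provide a `noinline` attribute. A minimal sketch, not part of the original experiment:

```cpp
// Prevent the compiler from inlining the call away under optimization.
__attribute__((noinline)) void func(int *x)
{
    (*x)++;
}
```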
And this should conclude this blog...
Top comments (3)
What option are you using to control your compiler's optimization level?
I believe your code isn't optimized at all. See on Compiler Explorer.
Thank you for pointing this out. I was using `-O0` when I wrote this blog. When I applied `-O1`, the compiler optimized `func` to just a single `add` instruction, as on the Compiler Explorer link you attached.
Honestly, the flag `-O1` also turns the bunch of `func(&x)` and `x++` statements in both cases into a single `mov` instruction with an immediate, so at the end of the day both cases use the same instructions with `-O1` (and, I believe, with `-O2` and `-O3` as well).
, I believe) at the end of the day.You should do your timing measures again, I guess you see almost no difference with
-O1