Yuttapichai Kerdcharoen

Do function calls hinder performance?

I am currently working on a C++ project and recently encountered some strange performance behavior.

The gist is that two very similar pieces of my code performed differently. The major difference between them is that one is encapsulated in a function. This made me suspect the overhead of function calls.

If you have disassembled your C++ code or learned some basic assembly before, you may know that there is a bunch of stack pushes and pops around each function call. To visualize them, let me show you an example below.

void func(int *x)
{
  (*x)++;
}

Given the function func above, when I use objdump -D to dump its assembly, I get the following code.

push   %rbp
mov    %rsp,%rbp
mov    %rdi,-0x8(%rbp)
mov    -0x8(%rbp),%rax
mov    (%rax),%eax
lea    0x1(%rax),%edx
mov    -0x8(%rbp),%rax
mov    %edx,(%rax)
nop
pop    %rbp
retq   

In short, those stack instructions save data held in registers, such as data belonging to the caller (e.g., main), so that this function does not overwrite it.

You can see that in order to do (*x)++, which is represented by lea 0x1(%rax),%edx, we need a lot of extra instructions, which I will consider overhead.

My machine's microarchitecture is Haswell. According to Agner's Instruction Tables, each mov typically takes 1 cycle on my machine. However, its cost can be amortized (by pipelining, superscalar execution, and out-of-order execution) if there is no data hazard. In this case, mov %rdi,-0x8(%rbp), mov -0x8(%rbp),%rax, and mov (%rax),%eax are dependent and create data hazards, so the movs before the lea take roughly 3 cycles. The other movs appear independent (no RAW hazards), so let's amortize them to 0 cycles. The lea takes 1 cycle and cannot be amortized, since it depends on the previous mov. Therefore, this function call requires at least 4 cycles.

How about without the function call? Well, it is just the code below.

x++;

Dumping the binary gives the following assembly.

addl   $0x1,-0x4(%rbp)

Pretty impressive: only a single instruction, with no overhead!

According to Agner's Instruction Tables, each addl typically consumes 1 cycle.

From both analyses, the without-function-call method should theoretically be about 4 times faster than the function-call method.

Let's see how this works in action!

I simply implemented the code below.

for (size_t i = 0; i < N; i++)
{
#ifdef USE_FUNC
    func(&x);
#else
    x++;
#endif
}

Also, there is timing code before and after this for loop. Then I compiled with and without -D USE_FUNC, where N is 1000000, and ran both versions. As a result, I got a 1.26x speedup with the without-function-call method. To be precise, the numbers below show each execution time in microseconds.

Method                   Execution Time (in μs)
Function Calls           ~2900
Without Function Calls   ~2300
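The timing code itself is not shown in the post. Below is my own reconstruction of such a harness using std::chrono (an assumption, not the post's actual code). For testability it selects the path with a runtime flag instead of #ifdef USE_FUNC; at -O0 that branch adds a small overhead of its own.

```cpp
#include <chrono>
#include <cstddef>

void func(int *x) { (*x)++; }

// Runs N increments either through a function call or as a plain x++,
// stores the final value of x in *out, and returns the elapsed time in
// microseconds. steady_clock is used because it is monotonic.
long long bench(std::size_t N, bool use_func, int *out) {
    int x = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; i++) {
        if (use_func)
            func(&x);
        else
            x++;
    }
    auto end = std::chrono::steady_clock::now();
    *out = x;
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start)
        .count();
}
```

Calling bench(1000000, true, &x) and bench(1000000, false, &x) then compares the two methods in a single binary.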

From these numbers, the next question is: why not 4 times?

The reason is loop overhead. When I timed only the loop (with an empty body), I got approximately 2300 μs. That number is very close to the without-function-call method. The reason is that x++; gets amortized by i++, the loop-counter update.

It seems the benchmark is not accurate. We need to make sure that x++; is not amortized by the loop. One way is to not use a loop at all.
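One way to repeat a statement many times without a loop is nested preprocessor macros. This is my own sketch (the post does not say how the duplication was done); REP1000 pastes 1000 literal copies of its argument into the code:

```cpp
// Nested macros: each level multiplies the repetition count by 10,
// so REP1000(stmt) expands into 1000 back-to-back copies of stmt.
#define REP10(x)   x x x x x x x x x x
#define REP100(x)  REP10(REP10(x))
#define REP1000(x) REP10(REP100(x))

void func(int *p) { (*p)++; }

// 1000 straight-line increments with no loop counter to hide behind.
int run_unrolled() {
    int x = 0;
#ifdef USE_FUNC
    REP1000(func(&x);)
#else
    REP1000(x++;)
#endif
    return x;
}
```

Expanding 1,000,000 copies this way would make compilation painfully slow, which is consistent with reducing the repetition count.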

To be honest, it is hard for me to duplicate x++; or func(&x); 1000000 times, so I reduced it to 1000 repetitions. The result: a 3.35x speedup with the without-function-call method. The table below shows more precise numbers, in nanoseconds this time (not μs).

Method                   Execution Time (in ns)
Function Calls           ~9400
Without Function Calls   ~2800

Since I timed in nanoseconds, the overhead of the timing itself matters; it is about 800 ns on my machine. Subtracting it, I get approximately a 4.3x speedup with the without-function-call method, as shown in the table below.

Method                   Execution Time (in ns)
Function Calls           ~8600
Without Function Calls   ~2000
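The timing overhead can be estimated by taking two clock readings back to back with nothing in between; the difference is roughly the fixed cost of the timing calls themselves. A sketch (my own; the ~800 ns figure is specific to my setup, and actual values depend on the machine and clock source):

```cpp
#include <chrono>

// Two back-to-back readings of the same clock: the delta approximates
// the fixed cost of one timing measurement.
long long timing_overhead_ns() {
    auto t0 = std::chrono::steady_clock::now();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0)
        .count();
}
```

In practice you would call this many times and take the minimum, since individual readings are noisy.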

Now this makes sense, since the overhead analysis predicted at least a 4x speedup for the without-function-call method.

Then, let's recap all of this by returning to the question.

Do function calls hinder performance?

Yes, they do. The reason is the overhead of function calls. However, if the function executes many instructions, they can amortize that overhead. You can also largely eliminate this overhead by increasing the compiler's optimization level. 🙂

Erratum

Thanks to Pierre Gradot for pointing this out:

  • All of these numbers were gathered with g++ at -O0 (no compiler optimization). With optimization enabled, g++ actually optimizes the function func in the example into a single add instruction, the same as the without-function-call method. So, if you run the example with compiler optimization, you will see no difference between with and without function calls.

And this should conclude this blog...

Top comments (3)

Pierre Gradot

What option are you using to control your compiler's optimization level?

I believe your code isn't optimized at all. See on Compiler Explorer.

Yuttapichai Kerdcharoen

Thank you for pointing this out. I was using -O0 when I wrote this blog.

When I applied -O1, the compiler optimized func to just a single add instruction as on the Compiler Explorer you attached.

In fact, -O1 also turns the whole series of func(&x) and x++ calls in both cases into a single mov of an immediate, so both versions end up with the same instructions at -O1 (and, I believe, at -O2 and -O3 as well).

Pierre Gradot

You should do your timing measurements again; I guess you'll see almost no difference with -O1.