<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yuttapichai Kerdcharoen</title>
    <description>The latest articles on DEV Community by Yuttapichai Kerdcharoen (@pnx).</description>
    <link>https://dev.to/pnx</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F932390%2F25bd5198-d962-44ba-9cd7-a2b37a8704a1.jpeg</url>
      <title>DEV Community: Yuttapichai Kerdcharoen</title>
      <link>https://dev.to/pnx</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pnx"/>
    <language>en</language>
    <item>
      <title>Do function calls hinder performance?</title>
      <dc:creator>Yuttapichai Kerdcharoen</dc:creator>
      <pubDate>Mon, 26 Sep 2022 02:04:56 +0000</pubDate>
      <link>https://dev.to/pnx/do-function-calls-hinder-performance-a1j</link>
      <guid>https://dev.to/pnx/do-function-calls-hinder-performance-a1j</guid>
      <description>&lt;p&gt;I am currently working on a C++ project and suddenly encountering a strange performance behavior.&lt;/p&gt;

&lt;p&gt;A gist of the behavior is that two very similar codes of mine performed differently. The major dissimilarity is that I encapsulate one of them as a function. This makes me suspect that it might be because of the overheads from &lt;strong&gt;function calls&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you have dumped your C++ code or learned about basic assembly before, you may know that there will be a bunch of stack pushes and pops before and after each function call. To visualize them, let me show you an example below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Given the function &lt;code&gt;func&lt;/code&gt; above. When I used &lt;code&gt;objdump -D&lt;/code&gt; to dump its assembly out, I will get the following code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;push   %rbp
mov    %rsp,%rbp
mov    %rdi,-0x8(%rbp)
mov    -0x8(%rbp),%rax
mov    (%rax),%eax
lea    0x1(%rax),%edx
mov    -0x8(%rbp),%rax
mov    %edx,(%rax)
nop
pop    %rbp
retq   
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In short, those stack instructions are used for saving data in registers like data in the main function that we don't want this function to overwrite them.&lt;/p&gt;

&lt;p&gt;You can see that in order to do &lt;code&gt;(*x)++&lt;/code&gt;, which is represented as &lt;code&gt;lea 0x1(%rax),%edx&lt;/code&gt;, we need to do a lot of unnecessary instructions, which I will consider as overheads.&lt;/p&gt;

&lt;p&gt;My machine's microarchitecture is &lt;a href="https://en.wikipedia.org/wiki/Haswell_(microarchitecture)"&gt;Haswell&lt;/a&gt;. According to &lt;a href="https://www.agner.org/optimize/instruction_tables.pdf"&gt;Agner's Instruction Tables&lt;/a&gt;, each &lt;code&gt;mov&lt;/code&gt; typically consumes 1 cycle in my machine. However, it can be amortized (with pipelining, superscalar, OoO) if there is no &lt;a href="https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Data_hazards"&gt;data hazard&lt;/a&gt;. In this case, &lt;code&gt;mov %rdi,-0x8(%rbp)&lt;/code&gt;, &lt;code&gt;mov -0x8(%rbp),%rax&lt;/code&gt;, and &lt;code&gt;mov (%rax),%eax&lt;/code&gt; are dependent and can create data hazards. So, the &lt;code&gt;mov&lt;/code&gt;s before &lt;code&gt;lea&lt;/code&gt; will consume roughly 3 cycles. Other &lt;code&gt;mov&lt;/code&gt;s seem to be independent (for &lt;a href="https://en.wikipedia.org/wiki/Hazard_(computer_architecture)#Read_after_write_(RAW)"&gt;RAW&lt;/a&gt; hazards); so let's amortize them as 0 cycles. For &lt;code&gt;lea&lt;/code&gt;, it consumes 1 cycle and cannot be amortized since it depends on its previous &lt;code&gt;mov&lt;/code&gt;. Therefore, this function call requires at least 4 cycles.&lt;/p&gt;

&lt;p&gt;How about without the function call? Well, it is just a code below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The binary of this can be dumped as the following assembly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;addl   $0x1,-0x4(%rbp)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pretty impressive. Only a single instruction without overheads!&lt;/p&gt;

&lt;p&gt;According to &lt;a href="https://www.agner.org/optimize/instruction_tables.pdf"&gt;Agner's Instruction Tables&lt;/a&gt;, each &lt;code&gt;addl&lt;/code&gt; typically consumes 1 cycle.&lt;/p&gt;

&lt;p&gt;From both analyses, we found that the without-function-call method is theoretically faster than the function-call method by 4 times approximately.&lt;/p&gt;

&lt;p&gt;Let's see how this works in action!&lt;/p&gt;

&lt;p&gt;I simply implemented the code below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="cp"&gt;#ifdef USE_FUNC
&lt;/span&gt;    &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cp"&gt;#else
&lt;/span&gt;    &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Also, there are timing codes before and after this for loop. Then, I compiled with &lt;code&gt;-D USE_FUNC&lt;/code&gt; and without where &lt;code&gt;N&lt;/code&gt; is 1000000 and ran both of them. As a result, I got 1.26 times speedup with the without-function-call method. To be precise, the numbers below show each execution time in &lt;strong&gt;microsecond&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Execution Time (in μs)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Function Calls&lt;/td&gt;
&lt;td&gt;~2900&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Without Function Calls&lt;/td&gt;
&lt;td&gt;~2300&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From the number, the next question is &lt;strong&gt;why not 4 times?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The reason is the loop overheads. When I timed only the loop, I got an estimately 2300 ns. The number is super close to the &lt;strong&gt;without-function-call&lt;/strong&gt; method. The reason for this is that &lt;code&gt;x++;&lt;/code&gt; gets amortized by &lt;code&gt;i++&lt;/code&gt;, which is the loop counter.&lt;/p&gt;

&lt;p&gt;It seems like the benchmark is not accurate. We need to make sure that &lt;code&gt;x++;&lt;/code&gt; does not get amortized. One of the ways is to not use loops.&lt;/p&gt;

&lt;p&gt;To be honest, it is hard for me to duplicate &lt;code&gt;x++;&lt;/code&gt; or &lt;code&gt;func(&amp;amp;x);&lt;/code&gt; 1000000 times. So, I will reduce it to 1000 times. The result is that I got 3.35 times speedup with the without-function-call method. The table below shows more precise numbers in &lt;strong&gt;nanosecond&lt;/strong&gt; (it is not μs).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Execution Time (in ns)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Function Calls&lt;/td&gt;
&lt;td&gt;~9400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Without Function Calls&lt;/td&gt;
&lt;td&gt;~2800&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Since I timed it in nanoseconds, there was a timing overhead, which is about 800 ns. Therefore, I got approximately 4.3 times speedup with the without-function-call method, as shown in the table.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Execution Time (in ns)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Function Calls&lt;/td&gt;
&lt;td&gt;~8600&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Without Function Calls&lt;/td&gt;
&lt;td&gt;~2000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now, this is making sense since we expect to see at least 4 times speedup from the without-function-call method according to the overheads.&lt;/p&gt;

&lt;p&gt;Then, let's recap all of these using the question.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Do function calls hinder performance?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, they are. The reason is the overheads of the function calls. However, if the function does a lot of instructions, they can amortize those overheads. Also you can basically reduce these overheads by increasing the optimization level. 🙂&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Erratum
&lt;/h2&gt;

&lt;p&gt;Thanks to Pierre Gradot for pointing out this&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All of these numbers are gathered by &lt;code&gt;g++&lt;/code&gt; with &lt;code&gt;-O0&lt;/code&gt; (without compiler optimization). However, &lt;code&gt;g++&lt;/code&gt; compilers actually optimize the function &lt;code&gt;func&lt;/code&gt; in the example into a single &lt;code&gt;add&lt;/code&gt; instruction, which is the same as the &lt;strong&gt;without-function-call&lt;/strong&gt; method. So, if you run the example with compiler optimization, you will see no difference between with and without function calls.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And this should conclude this blog...&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>performance</category>
      <category>assembly</category>
    </item>
  </channel>
</rss>
