Introduction
Sometimes when writing code we have to decide between using loop or recursion. How do both work under the hood? In terms of performance which one is the best to pick and why?
PS.: CPU architecture -> x86_64 | RAM -> DDR4 16G
1. Study Case đź“ť
Let's implement a function that receives a number and adding it one by one, arrives at the value received.
1.1. Recursion
int sum(int n) {
if (n == 0) {
return 0;
}
return 1 + sum(n - 1);
}
1.2. Loop
int sum(int n) {
int result = 0;
for (int i = 1; i <= n; i++) {
result += 1;
}
return result;
}
1.3. Time each function
#include <stdio.h>
#include <time.h>
// sum functions here
int main() {
int n = 70000;
clock_t start = clock();
sum(n);
clock_t end = clock();
double cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
printf("sum(%d) Time elapsed = %.7f\n", n, cpu_time_used);
return 0;
}
2. Compiling the code
Let's compile but not assemble it. We can do it by using the gcc
compiler with the flag -S. Since we are interested in the recursive and loop functions, we will only examine these assembly instructions:
2.1. Recursion
sum:
.LFB0:
.cfi_startproc
endbr64
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movl %edi, -4(%rbp)
cmpl $0, -4(%rbp)
jne .L2
movl $0, %eax
jmp .L3
.L2:
movl -4(%rbp), %eax
subl $1, %eax
movl %eax, %edi
call sum
addl $1, %eax
.L3:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size sum, .-sum
.section .rodata
2.2. Loop
sum:
.LFB0:
.cfi_startproc
endbr64
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl %edi, -20(%rbp)
movl $0, -8(%rbp)
movl $1, -4(%rbp)
jmp .L2
.L3:
addl $1, -8(%rbp)
addl $1, -4(%rbp)
.L2:
movl -4(%rbp), %eax
cmpl -20(%rbp), %eax
jle .L3
movl -8(%rbp), %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size sum, .-sum
.section .rodata
3. Executing both programs
Execution of sum(70000) using both approaches 5 times each. Elapsed time in seconds:
Loop | Recursion |
---|---|
0.0001380 | 0.0012100 |
0.0001230 | 0.0011630 |
0.0001370 | 0.0011390 |
0.0001710 | 0.0012150 |
0.0001380 | 0.0011810 |
Why is the loop approach 10 times faster?
4. Cause of the performance penalty 📉
At the assembly code, we have to pay attention to some instructions in order to understand the overhead of the recursion approach. These are:
pushq %rbp
movq %rsp, %rbp
-
leave
orpopq %rbp
In short, pushq %rbp
saves the base pointer of the previous function. movq %rsp, %rbp
initializes the base pointer to the current function, subq $16, %rsp
alocates 16 bytes on the top of the stackframe - by subtracting 16 bytes from the initial %rsp
, since it is initialy equal to the %rbp
: the stack grows from high memory addresses to the lower ones. leave
sets %rsp
to %rbp
, then pops the top of the stack, thus restoring the function that called it.
This process is repeated in each function call. It significantly increases the interaction with memory - L caches cannot help that much in this case - the performance penalty comes precisely from that.
5. Comparing the instructions in both cases
5.1. Loop
movl %edi, -20(%rbp)
movl $0, -8(%rbp)
movl $1, -4(%rbp)
These three instructions load into memory the initial values of n
, result
and i
. %edi
is the register used to pass the argument to the function, so the value of n
is saved into the address -20(%rbp)
. i
and result
are initially set to 1 and 0, that is why $1
and $0
are loaded into memory addresses -4(%rbp)
and -8(%rbp)
, respectively.
The loop starts at jmp .L2
. movl -4(%rbp), %eax
loads data into %eax
register so it can be used to compare in cmpl -20(%rbp), %eax
. jle .L3
jumps to .L3
if %eax
is less or equal to -20(%rbp)
, namely i <= n;
.
In L3.
, addl $1, -8(%rbp)
will increase -8(%rbp)
by one - result -, then do the same to i
: addl $1, -4(%rbp)
, i.e., i++
This process is executed until the comparison is false, then popq %rbp
and ret
are executed, respectively poping the top of the stack and returning sum
.
5.2. Recursion
Recursion instructions:
movl %edi, -4(%rbp)
takes the argument - sent via %edi
register - and saves it 4 bytes bellow the value of the pointer saved in %rbp
. cmpl $0, -4(%rbp)
takes the value and compares it to 0, jne .L2
sets that, if the value in -4(%rbp)
is not equal to 0, the code jumps to .L2
block.
At .L2
, movl -4(%rbp), %eax
will load the value of -4(%rbp)
to %eax
, then subl $1, %eax
subtract it by 1 and save it in the register itself. In movl %eax, %edi
, the value of register %eax
is loaded into another register: %edi
. As we saw, this register is responsible for passing arguments to functions. The argument is passed to call sum
, so allocating more addresses onto the top of the stackframe and repeting the whole process of recursion. addl $1, %eax
will increase by 1 and save the value in the register %eax
. When recursion reach the base case, movl $0, %eax
- so placing 0 as return of sum(n - 1)
- will be executed, then jmp.L3
will jump to .L3
and execute leave
and ret
.
6. Segmentation Fault using recursion
Once it is clear how recursion works under the hood, it is much easier to see how a Segmentation Fault can take place. Each process allocates a finite space in memory, since recursions can grown the stackframe indefinitely, therefore Segmentation Fault can occur.
E.g., let's set int n = 1000000;
and see how it performs:
linux@x86_64:~$ ./recursive_sum
Segmentation fault (core dumped)
Once using loop approach there is no indefinite function calls, there is no risk of the Stack Overflow that happens under recursion.
Top comments (2)
It would be interesting to see an extra take on recursion employing tail call optimization.
Thanks for the sugestion. As soon as possible I will include this case.