Introduction
While all code should be efficient, code for library-like components, especially involving loops, should be as efficient as possible since such code is often widely used.
In my A Simple Dynamic Array for C, I included the source code for a function to clean-up a dynamic array:
void array_cleanup( array_t *array, array_free_fn_t free_fn ) {
if ( array == NULL )
return;
if ( free_fn != NULL ) {
char *element = array->elements;
for ( size_t i = 0; i < array->len; ++i ) {
(*free_fn)( element );
element += array->esize;
}
}
free( array->elements );
array_init( array, array->esize );
}
While this code is correct and good enough for pedagogical purposes, it’s not optimal. Before I tell you why, see if you can figure out why. Hint: it has to do with the use of both the array pointer and the function call in the loop.
For those who might not get the reference in this article’s title, it’s a play on a line from The Wizard of Oz. The original line was “Lions and tigers and bears! Oh, my!”
Loop Refresher
In C (and languages inspired by C including C++, C#, Go, and Java), for loop conditions are reevaluated on every loop iteration. For example, in:
for ( int i = 0; i < n; ++i )
the condition i < n is evaluated on every iteration. That means if n is changed within the loop, it could terminate either earlier or later than it was initially supposed to — and this is well-defined behavior.
This is in contrast to some older languages like Fortran and Pascal as well as some newer languages like Rust where the loop’s termination value is evaluated once just prior to the start of the loop. For example, in Pascal:
for i := 1 to n do
is actually treated as if it were this:
limit := n;
for i := 1 to limit do
Note that if the loop condition is a more complicated expression, such as calling a function:
for ( int i = 0; i < f(x); ++i )
then the function will be called on every loop iteration.
Except if the function is marked as
unsequencedin C23 in which case the compiler can hoist its call out of the loop.
(Have you figured out the “why” yet?)
Pointers
Wherever pointers are involved, things get more complicated. From an optimization perspective, pointers are like sand in your gears. Returning to the loop in array_cleanup:
for ( size_t i = 0; i < array->len; ++i ) {
(*free_fn)( element );
element += array->esize;
}
That gets compiled by clang into this x86-64 assembly:
.LBB0_4:
mov rdi, r15
call r14 ; (*free_fn)( element )
add r15, qword ptr [rbx + 8] ; element += array->esize
inc r12 ; ++i
cmp r12, qword ptr [rbx + 16]; i < array->len
jb .LBB0_4
The things to notice are the qword ptr lines which means the code has to dereference array (which means read from memory) twice on every iteration. That’s slow.
The problem is that the compiler can’t know for sure that the array object pointed to by array isn’t modified by the function pointed to by free_fn. For all the compiler knows, the function has access to a global copy of the pointer and could modify the array. Compilers have to play it safe.
If there were no function call in the loop, then the compiler could safely hoist both len and esize outside the loop as if the code were:
size_t const esize = array->esize;
size_t const len = array->len;
for ( size_t i = 0; i < len; ++i ) {
// ... do something other than call a function ...
element += esize;
}
At best, the compiler can put esize and len into registers; but even at worst, if the compiler puts them into local variables, it would still save two pointer dereferences per iteration.
restrict Revisited
Is there any way to tell the compiler that free_fn won’t modify the array? Why yes, there is: restrict:
void array_cleanup( array_t *restrict array,
array_free_fn_t free_fn ) {
// ... otherwise the same as before ...
}
(If you are unfamiliar with restrict, you should read my previous article about it first.)
Even if you are familiar with restrict, you may be surprised to learn that it can help in this case. The canonical use case for restrict is memcpy:
void* memcpy( void *restrict dst, void const *restrict src, size_t n );
where there are two pointers marked restrict which means that they do not point to overlapping regions of memory, i.e., they don’t overlap with each other.
But in the case of array, how can restrict help when there is no other pointer in array_cleanup for array not to overlap with?
It turns out that restrict is more general than you might think. What the restrict in the modified array_cleanup means is that, during its execution, the dynamic array pointed to be array will not be modified by any pointer anywhere in the program, even if free_fn has a copy of such a pointer. Using restrict is a promise you make to the compiler. (If you break that promise, even unintentionally, the result is undefined behavior.)
With restrict, the assembly becomes:
.LBB0_4:
mov rdi, r12
call r14 ; (*free_fn)( element )
add r12, qword ptr [rbx + 8] ; element += array->esize
dec r13 ; --i
jne .LBB0_4 ; i > 0
The compiler has eliminated array->len dereference and is now counting down to zero.
But why didn’t the compiler optimize the remaining dereference (qword ptr) for esize?
First, similar to inline, restrict can be ignored by the compiler. Second, every CPU has only finite resources, most notably registers. In this case, the compiler presumably thought that the overall performance would still be better even if esize were not put into a register. If it were to be put into a register, then the compiler would have to ensure that it’s value is the same after free_fn is called as before — and presumably that was more costly than a dereference.
As I noted in my restrict article, restrict isn’t yet part of standard C++; however, many compilers support __restrict__ as an extension.
Explicit Caching
Of course you can forget about restrict and just use local variables to cache the values as was done initially:
size_t const esize = array->esize;
size_t const len = array->len;
for ( size_t i = 0; i < len; ++i ) {
(*free_fn)( element );
element += esize;
}
Then, despite the function call in the loop, the assembly becomes:
.LBB0_4:
mov rdi, rbx
call r14
add rbx, r13
dec r15
jne .LBB0_4
with no dereferences.
Conclusion
If you want to optimize code using pointers, functions, and loops, consider using either restrict or explicit caching. But you should always check the generated assembly to ensure your tuning is having the desired effect.
Epilogue
If you’ve never heard of the Compiler Explorer, aka, godbolt.org, it’s an extremly useful in-browser tool for compiling C or C++ code with virtually any compiler you’ve ever heard of, at any optimization level, and showing you a mapping from source lines to their corresponding assembly lines for most any CPU architecture. (Why is it called godbolt? Because its creator is Matt Godbolt.)
Top comments (0)