Paul J. Lucas

Posted on Jun 18

Pointers and Tuning and Loops! Oh My!

#c #cpp

Introduction

While all code should be efficient, code for library-like components, especially involving loops, should be as efficient as possible since such code is often widely used.

In my A Simple Dynamic Array for C, I included the source code for a function to clean-up a dynamic array:

void array_cleanup( array_t *array, array_free_fn_t free_fn ) {
  if ( array == NULL )
    return;
  if ( free_fn != NULL ) {
    char *element = array->elements;
    for ( size_t i = 0; i < array->len; ++i ) {
      (*free_fn)( element );
      element += array->esize;
    }
  }
  free( array->elements );
  array_init( array, array->esize );
}

While this code is correct and good enough for pedagogical purposes, it’s not optimal. Before I tell you why, see if you can figure out why. Hint: it has to do with the use of both the array pointer and the function call in the loop.

For those who might not get the reference in this article’s title, it’s a play on a line from The Wizard of Oz. The original line was “Lions and tigers and bears! Oh, my!”

Loop Refresher

In C (and languages inspired by C including C++, C#, Go, and Java), for loop conditions are reevaluated on every loop iteration. For example, in:

for ( int i = 0; i < n; ++i )

the condition i < n is evaluated on every iteration. That means if n is changed within the loop, it could terminate either earlier or later than it was initially supposed to — and this is well-defined behavior.

This is in contrast to some older languages like Fortran and Pascal as well as some newer languages like Rust where the loop’s termination value is evaluated once just prior to the start of the loop. For example, in Pascal:

for i := 1 to n do

is actually treated as if it were this:

limit := n;
for i := 1 to limit do

Note that if the loop condition is a more complicated expression, such as calling a function:

for ( int i = 0; i < f(x); ++i )

then the function will be called on every loop iteration.

Except if the function is marked as unsequenced in C23 in which case the compiler can hoist its call out of the loop.

(Have you figured out the “why” yet?)

Pointers

Wherever pointers are involved, things get more complicated. From an optimization perspective, pointers are like sand in your gears. Returning to the loop in array_cleanup:

for ( size_t i = 0; i < array->len; ++i ) {
  (*free_fn)( element );
  element += array->esize;
}

That gets compiled by clang into this x86-64 assembly:

.LBB0_4:
    mov  rdi, r15
    call r14                       ; (*free_fn)( element )
    add  r15, qword ptr [rbx + 8]  ; element += array->esize
    inc  r12                       ; ++i
    cmp  r12, qword ptr [rbx + 16] ; i < array->len
    jb   .LBB0_4

The things to notice are the qword ptr lines which means the code has to dereference array (which means read from memory) twice on every iteration. That’s slow.

The problem is that the compiler can’t know for sure that the array object pointed to by array isn’t modified by the function pointed to by free_fn. For all the compiler knows, the function has access to a global copy of the pointer and could modify the array. Compilers have to play it safe.

If there were no function call in the loop, then the compiler could safely hoist both len and esize outside the loop as if the code were:

size_t const esize = array->esize;
size_t const len = array->len;
for ( size_t i = 0; i < len; ++i ) {
  // ... do something other than call a function ...
  element += esize;
}

At best, the compiler can put esize and len into registers; but even at worst, if the compiler puts them into local variables, it would still save two pointer dereferences per iteration.

`restrict` Revisited

Is there any way to tell the compiler that free_fn won’t modify the array? Why yes, there is: restrict:

void array_cleanup( array_t *restrict array,
                    array_free_fn_t free_fn ) {
  // ... otherwise the same as before ...
}

(If you are unfamiliar with restrict, you should read my previous article about it first.)

Even if you are familiar with restrict, you may be surprised to learn that it can help in this case. The canonical use case for restrict is memcpy:

void* memcpy( void *restrict dst, void const *restrict src, size_t n );

where there are two pointers marked restrict which means that they do not point to overlapping regions of memory, i.e., they don’t overlap with each other.

But in the case of array, how can restrict help when there is no other pointer in array_cleanup for array not to overlap with?

It turns out that restrict is more general than you might think. What the restrict in the modified array_cleanup means is that, during its execution, the dynamic array pointed to by array will not be modified by any pointer anywhere in the program, even if free_fn has a copy of such a pointer. Using restrict is a promise you make to the compiler. (If you break that promise, even unintentionally, the result is undefined behavior.)

With restrict, the assembly becomes:

.LBB0_4:
    mov  rdi, r12
    call r14                      ; (*free_fn)( element )
    add  r12, qword ptr [rbx + 8] ; element += array->esize
    dec  r13                      ; --i
    jne  .LBB0_4                  ; i > 0

The compiler has eliminated the array->len dereference and is now counting down to zero.

But why didn’t the compiler optimize the remaining dereference (qword ptr) for esize?

First, similar to inline, restrict can be ignored by the compiler. Second, every CPU has only finite resources, most notably registers. In this case, the compiler presumably thought that the overall performance would still be better even if esize were not put into a register. If it were to be put into a register, then the compiler would have to ensure that it’s value is the same after free_fn is called as before — and presumably that was more costly than a dereference.

As I noted in my restrict article, restrict isn’t yet part of standard C++; however, many compilers support __restrict__ as an extension.

Explicit Caching

Of course you can forget about restrict and just use local variables to cache the values as was done initially:

size_t const esize = array->esize;
size_t const len = array->len;
for ( size_t i = 0; i < len; ++i ) {
  (*free_fn)( element );
  element += esize;
}

Then, despite the function call in the loop, the assembly becomes:

.LBB0_4:
    mov  rdi, rbx
    call r14      ; (*free_fn)( element )
    add  rbx, r13 ; element += esize
    dec  r15      ; --i
    jne  .LBB0_4  ; i > 0

with no dereferences.

In this case, the explicit use of local variables gave the compiler a “nudge” in what turns out to be the right direction for optimization. It’s entirely possible that the compiler could have decided to eliminate the local variables and return to doing a dereference if it concluded that was really better. But by using local variables, you’ve at least told the compiler that the dynamic array pointed to by array will not be modified in the loop — you’ve given the compiler more information to make a better informed decision.

Conclusion

If you want to optimize code using pointers, functions, and loops, consider using either restrict or explicit caching. But you should always check the generated assembly to ensure your tuning is having the desired effect.

Epilogue

If you’ve never heard of the Compiler Explorer, aka, godbolt.org, it’s an extremly useful in-browser tool for compiling C or C++ code with virtually any compiler you’ve ever heard of, at any optimization level, and showing you a mapping from source lines to their corresponding assembly lines for most any CPU architecture. (Why is it called godbolt? Because its creator is Matt Godbolt.)

DEV Community

Pointers and Tuning and Loops! Oh My!

Introduction

Loop Refresher

Pointers

`restrict` Revisited

Explicit Caching

Conclusion

Epilogue

Top comments (0)

Introduction

Loop Refresher

Pointers

restrict Revisited

Explicit Caching

Conclusion

Epilogue

`restrict` Revisited