Jobin Johnson

Branchless programming. Does it really matter?

Branchless programming is a technique that eliminates branches (if, switch, and other conditional statements) from a program. It may not seem very relevant these days, with extremely powerful systems and the widespread use of interpreted (especially dynamically typed) languages, but it is critical when writing high-performance, low-latency software.

Modern compilers can identify and optimize your code into branchless machine code to some extent, but they can't detect every pattern.

In this article, I will explain branching in detail: why branching causes performance problems, how to optimize your code to reduce the effect, some good implementations, and my thoughts on branchless coding.

What is Branching?

Branching is when the flow of the program splits into two paths based on a runtime condition check. Consider the scenario of executing a simple if statement.



int main() {
    int a = 5;
    int b = 20;
    if (a < b) {
        a = 10;
    }
    return a;
}



In the above program, until the if statement is evaluated, the program follows a single control path. Once the condition a < b is evaluated, execution splits into two branches based on the result.

The above translates to machine code as follows.



0000000000401110 <main>:
  401110:   55                      push   %rbp
  401111:   48 89 e5                mov    %rsp,%rbp
  401114:   c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
  40111b:   c7 45 f8 05 00 00 00    movl   $0x5,-0x8(%rbp)
  401122:   c7 45 f4 14 00 00 00    movl   $0x14,-0xc(%rbp)
  401129:   8b 45 f8                mov    -0x8(%rbp),%eax
  40112c:   3b 45 f4                cmp    -0xc(%rbp),%eax
  40112f:   0f 8d 07 00 00 00       jge    40113c <main+0x2c>
  401135:   c7 45 f8 0a 00 00 00    movl   $0xa,-0x8(%rbp)
  40113c:   8b 45 f8                mov    -0x8(%rbp),%eax
  40113f:   5d                      pop    %rbp
  401140:   c3                      retq   
  401141:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
  401148:   00 00 00 
  40114b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)



Note: The above output was generated without any compiler optimizations. Most compilers have very good optimizers that generate better code than this in most cases, depending on the target ISA, your compiler, the optimization level, etc.

Yes, it works! What is wrong with it?

The above code works fine, but adding a simple if statement introduced a conditional jump into the machine code. You could argue that this is normal behavior; after all, a language can hardly be called a programming language without the conditionals that make it Turing complete. The problem lies in what that jump does to the CPU pipeline.

Instruction Pipelining

CPUs use a technique called instruction pipelining, a way to implement instruction-level parallelism on a single processor. The CPU fetches upcoming instructions from memory while executing the current one. This speeds things up because every stage of the CPU stays busy regardless of which stage the current instruction is in.
Executing an instruction consists of four stages:

  • Fetch: fetches the next instruction to be executed from memory (usually it is already available in the processor cache)
  • Decode: after the fetch stage completes, the instruction is decoded
  • Execute: the decoder identifies the instruction and issues the control signals to execute it
  • Write Back: after execution, the results are written back

Execution without instruction pipeline

This is the straightforward way to execute instructions. The problem is that while an instruction is being fetched, the Decode, Execute, and Write Back stages sit idle. Instruction pipelining solves this by overlapping the stages so that all of them are working at the same time.

Execution with instruction pipeline

Instruction pipelining and branching
When the CPU executes a branch instruction, the instruction pointer changes, which invalidates the subsequent instructions that have already been fetched. The pipeline has to be flushed and restarted from the new location.

Effect of jump instruction on the instruction pipeline.

As you can see in the figure above, the fetched instructions are invalidated and the pipeline is restarted. In some cases the new location may not even be available in the CPU cache; on a cache miss, the new block has to be brought from memory into the cache, which adds further latency. Modern CPUs use branch prediction techniques to deal with these scenarios, but prediction is still not perfect (the Spectre vulnerability, which once shook the internet, came from this very machinery).

Apart from instruction pipelining, branching also affects other modern CPU capabilities such as out-of-order execution (OoO) and vector processing.
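
To get a feel for the cost, here is a minimal micro-benchmark sketch (mine, not from the original article): it sums the elements above a threshold, first over shuffled data and then over sorted data. The branch inside the loop is hard to predict on shuffled input and trivial to predict on sorted input, so the same code usually runs noticeably slower on the first pass.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

/* Sums every element >= 128; the if inside the loop is the branch
   whose prediction behaviour we want to observe. */
static long long sum_above_threshold(const int *data, int n) {
    long long sum = 0;
    for (int i = 0; i < n; i++) {
        if (data[i] >= 128)
            sum += data[i];
    }
    return sum;
}

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    static int data[N];
    for (int i = 0; i < N; i++)
        data[i] = rand() % 256;              /* effectively random: hard to predict */

    clock_t t0 = clock();
    long long s1 = sum_above_threshold(data, N);
    clock_t t1 = clock();

    qsort(data, N, sizeof data[0], cmp_int); /* sorted: easy to predict */
    clock_t t2 = clock();
    long long s2 = sum_above_threshold(data, N);
    clock_t t3 = clock();

    printf("unsorted: %f s (sum=%lld)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, s1);
    printf("sorted:   %f s (sum=%lld)\n", (double)(t3 - t2) / CLOCKS_PER_SEC, s2);
    return 0;
}

Note that with aggressive optimization levels the compiler may vectorize the loop or turn the branch into a conditional move, which flattens the difference, so it is worth compiling at a low optimization level or inspecting the generated code before drawing conclusions.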

How to optimize?

This is a really tough question. Conditionals are an inevitable part of a program, and not every scenario can be optimized. Only a few kinds of conditionals can be replaced with better patterns, and in some cases replacing a branch actually degrades performance because it introduces additional instructions. Modern compilers recognize some of these patterns and optimize them into branchless code.

Not all use cases have a branchless counterpart, but some can be replaced with a branchless alternative.

Utilizing booleans in mathematical formulas

eg. 1) Maximum of 2 numbers


if(a > b) {
 return a;
} else {
 return b;
}



The above code snippet can be written as follows. In C, a comparison evaluates to 1 or 0, so exactly one of the two terms survives.



return (a > b) * a + (a <= b) * b;



This can be further optimized by using a sign mask method.



int fast_max(int a, int b) {
  int diff = a - b;
  int dsgn = diff >> 31;
  return a - (diff & dsgn);
}



Code snippet from CPUEngine
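
Two caveats on the sign-mask version (my notes, not part of the original snippet): it relies on right-shifting a negative int being an arithmetic, sign-extending shift (implementation-defined in C, although that is what mainstream compilers do) and on a - b not overflowing. A quick self-contained check against a few inputs:

#include <assert.h>

int fast_max(int a, int b) {
  int diff = a - b;
  int dsgn = diff >> 31;     /* 0 when a >= b, -1 (all bits set) when a < b */
  return a - (diff & dsgn);  /* subtracts diff only when a < b, giving b */
}

int main(void) {
  assert(fast_max(5, 20) == 20);
  assert(fast_max(20, 5) == 20);
  assert(fast_max(-7, -3) == -3);
  return 0;
}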

eg. 2) Conditional assignments


x = (a > b) ? C : D;

The above can be written as

x = (a > b) * C + (a <= b) * D;


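When C and D are integers, the multiplications can also be avoided with a mask derived from the comparison. The following is a sketch of the classic bit-twiddling select idiom, not something from the original article (select_int is a hypothetical name):

/* -(a > b) is 0 when the condition is false and all ones when it is true,
   so the AND keeps (c ^ d) or clears it, and the XOR then yields c or d. */
int select_int(int a, int b, int c, int d) {
    return d ^ ((c ^ d) & -(a > b));
}
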
eg. 3) Accessing an array element


int x[] = {3, 4, 5, 6, 7, 8};
int lenX = 6;

/* Wraps the index around the end of the array. */
inline int getItemAtPosition(int i){
    if(i >= lenX)
        return x[i - lenX];
    return x[i];
}



This is a common scenario, and it can be optimized as



inline int getItemAtPosition(int i){
    return x[i % lenX];
}


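One caveat (mine, not the author's): the % operator is not free either. Unless lenX is a compile-time constant, the compiler usually has to emit a division instruction, and as a commenter points out below it may not beat the original branch. When the table length is a power of two, the wrap-around can be done with a cheap mask instead. A sketch, reusing the names from the snippet above but assuming a length of 8 instead of 6:

/* Only valid when lenX is a power of two. */
int x[] = {3, 4, 5, 6, 7, 8, 9, 10};
int lenX = 8;

static inline int getItemAtPosition(int i){
    return x[i & (lenX - 1)];   /* i % 8 computed with a bitwise AND: no division, no branch */
}
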

Bit-level operations

eg. 1) Absolute Value


inline long abs(int a ){
    if(a>=0)
        return a;
    else
        return -a;
}



This can be optimized to:



inline long abs(int a) {
   int mask = a >> 31;
   a ^= mask;
   a -= mask;
   return a;
}


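A few caveats on the sign-mask trick (my notes, not the author's): it assumes a 32-bit int with an arithmetic, sign-extending right shift, and, like the branching version, it cannot produce the absolute value of INT_MIN. A quick self-contained check, using the hypothetical name branchless_abs to avoid clashing with the standard library's abs:

#include <assert.h>

/* mask is 0 for non-negative a and -1 for negative a, so the XOR flips the
   bits of negative values and subtracting the mask adds 1, completing the
   two's-complement negation. */
static inline int branchless_abs(int a) {
    int mask = a >> 31;
    return (a ^ mask) - mask;
}

int main(void) {
    assert(branchless_abs(42) == 42);
    assert(branchless_abs(-42) == 42);
    assert(branchless_abs(0) == 0);
    return 0;
}
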

Shift operations can be tricky, though. Different CPU architectures implement them in different ways, and in some cases a shift operation costs more cycles than expected.

Is it real? Does it really matter?

If you are writing code in an interpreted, dynamically typed language, it doesn't matter much: so many conditionals are already executed for even a simple statement you write (checks for the type system, storage flags, etc.) that removing your own branches won't give a noticeable performance boost. If you are writing low-latency, high-performance software, it can be helpful and can give you a noticeable improvement, especially when the conditionals run inside hot loops.

Modern compilers have built-in optimizers that recognize some patterns and replace them with branchless counterparts, but these are limited, so it's always better to write optimized code.

None of the snippets above gives the best result in every case. Some CPU architectures have special instructions for handling exactly these situations.
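
For instance, on x86-64 an optimizing compiler will often lower a plain ternary like the one below to a conditional move (cmov) instead of a branch. The exact output depends on the compiler, flags, and target, so check the generated assembly (for example with gcc -O2 -S or on Compiler Explorer) rather than assuming.

/* Frequently compiles to a cmov on x86-64 at -O2, avoiding the branch entirely;
   verify the assembly for your own compiler and target. */
int max_ternary(int a, int b) {
    return (a > b) ? a : b;
}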

Cover image: Photo by Jeremy Waterhouse from Pexels


Top comments (4)

Phil Ashby

Good article! I like the explanations of why branchless execution matters, and the mention of branch prediction leading to vulnerabilities such as Spectre, however you make a very bold statement :-)

Modern compilers have built-in optimizers which can recognize some patterns and replace them with branch-less counterparts. But these are limited, so it's always better to write optimized code.

I would be delighted to see evidence of this from performance testing, as my reading in this space tends towards leaving low-level optimisation to the compiler/CPU and concentrating on structural things (such as hoisting complex method calls out of loops)?

Pierre Gradot

I have tried to change the compiler to "x86-64 clang (trunk)" as the results are quite different for some snippets.

This confirms something important: don't assume, measure, try, measure again.

pokemaster7

Example 3 is not correct. The modulo operator is not branchless. The only way to make this branchless is to have the second argument be a power of 2 (in this case, modulos can be calculated by bit-shifting). Instead, do something like:

return x[i - (i >= lenX) * lenX]

Of course, we are assuming i is less than lenX
