Aleksandar Milenkovic

Posted on Mar 27 • Edited on Mar 31

S2S Compilers: Understanding Switch Case Statements

#programming #assembly #c #performance

S2S Compilers: Understanding Switch Case Statements

	This blog is sponsored and supported by Voxgig.

	Motivated and inspired by Decl. In development and soon to be open-sourced.

S2S Compilers: Understanding Switch Case Statements

About Author

My name is Aleksandar Milenkovic. I am a senior software engineer at Voxgig and a self-driven coder, an autodidact, that creates software and improves its quality at any level.

If you have any feedback or questions, feel free to reach out to me via LinkedIn.

Introduction

Source-to-source compilers transpile the source code of a program and generate it into another language (target language). The target is typically a high-level language that is supported by all platforms so that the compatibility is not a problem.

If you are turning your source code into languages such as C or C++, it is required to have great understanding and knowledge of C/C++. Since these languages also have compilers be it GNU Compiler Collection or Clang, we have to do a lot of digging and researching around their features and functionalities. There is a lot of benefit in that once the target codebase grows and developers start reusing the target source as a component rather than referring back to the source language code. For more context, also check out Wikipedia | Source-to-source compiler | Programming language implementations.

In a lot of cases, we start benchmarking a lot of the target source code generated by our S2S compiler and comparing to the plain C/C++. We start finding a lot of improvements we can make for the target source itself. The goal is the quality and performance of the target source.

As the first instalment of this series, we dive into the usage of the switch case, optimizing it, comparing it to if-else, implementing our own switch case with all of the assembly conversions and benchmarks to back it all up. We use GCC (GNU Compiler Collection) and cover some of its commands in detail.

Even if you are not interested in S2S Compilers as a C/C++ developer, this blog still comes in handy with all of the examples and information.

This blog series is named S2S (Source-to-source) Compilers as all of the code and content is motivated by building a S2S Compiler that takes in the source language and translates it into C++. By benchmarking and analyzing the target source (C++), I have realized how many improvements there are to make. More on the motivation in the following section.

Motivation

As a senior software engineer for Voxgig, we have been building SDKs in various programming languages. For our utilities, this brought about a lot of performance concerns as we port from one language to another, we would try to make the two and more implementations as close as possible side by side. That led us into the waters of language-specific implementations where we have pushed to use the most out of each language so as to make it as performant as possible for the developer using the SDK in the respective language. For example, the C++ implementation is going to have more focus on the performance since it is harder to optimize code compared to Python, Ruby, or JavaScript.

At the same point in time while at Voxgig, I have been building a dynamic programming language that transpiles into C++, I have been doing a lot of research and have benchmarked plenty of C++ code in order to hit the maximum runtime performance of the target language (C++).

All of this work was motivated by the dynamic programming language called, Decl; soon to be open-sourced.

To be frank, the reason for C++ as the target language is that it is much easier to build a garbage collector by just implementing reference counter in the wrapper class. The state of dynamically allocated objects is determined by implicit destructors of C++, again, with the reference counting implementation. However, we won't cover any of that in this blog but stay ready for the next instalment :).

All of the knowledge that I have acquired turned out to be very useful for me and, most of all, our team.

If you are interested in building SDKs for one or even more programming languages, please contact Voxgig.

Again, huge thanks to Voxgig for enabling me to put in the time for this research and writing the blog.

Source Code

The source code of this blog is all in one file, at the Github repository where it is also hosted. The code contains the main function running all the benchmarks in order.

Prerequisites

Background

This blog is intended for C/C++ programmers who have the essential knowledge of assembly. However, if you haven't had any experience of assembly before, read on and see how you get on!

GCC Options

You can use the following gcc command to dissect assembly code:

gcc src/main.cpp -g -o output.s -masm=intel -fverbose-asm -S -Werror

As the Using the GNU Compiler Collection Documentation indicates, we use -fverbose-asm to "put extra commentary information in the generated assembly code to make it more readable".

For example,

#include <stdio.h>


int main() {
  int i = 0;


  printf("%d\n", i);
  printf("%d%d\n", i, i);

  return 0;
}

Generates the following assembly output with comments indicating the line number and its content.

main:
.LFB0:
        .file 1 "main.cpp"
        .loc 1 4 12
        .cfi_startproc  
        endbr64 
        push    rbp     #
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        mov     rbp, rsp        #,
        .cfi_def_cfa_register 6
        sub     rsp, 16 #,
# main.cpp:5:   int i = 0;
        .loc 1 5 7
        mov     DWORD PTR -4[rbp], 0    # i,
# main.cpp:8:   printf("%d\n", i);
        .loc 1 8 9
        mov     eax, DWORD PTR -4[rbp]  # tmp84, i
        mov     esi, eax        #, tmp84
        lea     rax, .LC0[rip]  # tmp85,
        mov     rdi, rax        #, tmp85
        mov     eax, 0  #,
        call    printf@PLT      #
# main.cpp:9:   printf("%d%d\n", i, i);
        .loc 1 9 9
        mov     edx, DWORD PTR -4[rbp]  # tmp86, i
        mov     eax, DWORD PTR -4[rbp]  # tmp87, i
        mov     esi, eax        #, tmp87
        lea     rax, .LC1[rip]  # tmp88,
        mov     rdi, rax        #, tmp88
        mov     eax, 0  #,
        call    printf@PLT      #
# main.cpp:11:   return 0;
        .loc 1 11 10
        mov     eax, 0  # _5,
# main.cpp:12: }
        .loc 1 12 1
        leave
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc

We also use -masm=dialect with intel to produce code optimized for the most current Intel processors but in this case we are using Intel ASM Syntax for the assembly generation. I do personally prefer this syntax branch.

For the assembly examples, we use the Compiler Explorer as demangling identifiers for the assembly comes in built-in and makes the code more readable and easier to follow. The Complier Explorer also allows you to follow the code by hovering over a line or scrolling to source, pointing you directly to the assembly conversion. Anyway, see what works best for you!

We will have two snippets where we introduce the C high-level source code and show the generated assembly code so we can make comparisons and draw conclusions - even so subtle in some cases. But don't forget that any assembly instruction can make a difference.

Headers and Macros

For benchmarking, we pretty much use printf and clock() defined as the following macros.

#include <stdio.h>
#include <time.h>

#define start_time clock_t s_t_a_r_t = clock();
#define end_time printf("[CPU Time Used: %f]\n", (double)(clock() -          s_t_a_r_t) / CLOCKS_PER_SEC);

Switch case VS If-else

Let's start by comparing the if-else to the switch statement. Note that in most of the examples we don't use any optimization flags by default since we need the most accurate assembly translation line by line, keeping our code "clean" in that we can also accurately benchmark our implementations.

For example, if we turn on the -O1 flag, the compiler, if finds appropriate, will generate a jump table even in the if-else example as it determines the values are densely packed. For more information, please refer to Using the GNU Compiler Collection | 3.12 Options That Control Optimization.

🚧 Terminology

Beware that the "packed" word is used in relation to jump tables and has nothing to do with C Packed Structures.

You might find the following code examples, for a lack of a better word, dumb since we are just having integers that are assigned based on their value so why can't we just go with:

r = type;

or just use the Using the GNU Compiler Collection | 6.18.11 Case Ranges.

Well, that wouldn't be the point of this blog. The individual r=0, r=1, ..., r=n are supposed to represent our code blocks executed at the point they are reached. Another reason is that we wanted to trick the compiler into generating separate, most simple, labels for each of the conditions.

The whole point is to make assembly output smaller so that we can focus on the important assembly instructions at hand such as CALL, CMP, JMP, etc.

The reason behind naming our parameter type is that in the context of creating a dynamic programming language where the current member value is wrapped in a union, you would want an enum in the switch case to most efficiently determine which member is active to apply certain operations against. Check out Union-like classes.

If-else statement

The following is the if-else chain of statements comparing the unsigned integers 0 through 10.

📘 Note

In the examples like this, we use the volatile keyword in order to prevent the compiler from eliminating our code in some cases where the -O3 or -O2 optimizations are added, which is a collection of optimization flags for the C/C++ code. For more information, see Using the GNU Compiler Collection | 3.12 Options That Control Optimization.

We are not going to go into too much details on these flags in this blog but we will cover them in the next where we show how we can carefully pick and use them according to our needs, ensuring there is more control over our flags rather than using a whole collection of them.

void _if_run(unsigned int type) {

  volatile int r; // NOTE: preventing compiler from optimizing it out

  if(type == 0) {
    r = 0;
  } else if(type == 1) {
    r = 1;
  } else if(type == 2) {
    r = 2;
  } else if(type == 3) {
    r = 3;
  } else if(type == 4) {
    r = 4;
  } else if(type == 5) {
    r = 5;
  } else if(type == 6) {
    r = 6;
  } else if(type == 7) {
    r = 7;
  } else if(type == 8) {
    r = 8;
  } else if(type == 9) {
    r = 9;
  } else if(type == 10) {
    r = 10;
  } else {
    r = 11;
  }

}

To stay consistent with the switch case example, we use the unsigned int that represents the range of 0 to 4,294,967,295. The reason being is the switch case must start from 0 for the jump table to get generated effectively. Since in C, it is not possible to have array subscripts below 0. This also helps with the switch case in order to have only one CMP instruction before the JMP instructions so that it can decide whether to jump straight to the default or go on to the jump table where goto jumps to the corresponding label.

For example, this code takes two CMP instructions.

void _cmp(int t) {
    if(t < 0 || t > 5) {
        return;
    }
}

_cmp:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        cmp     DWORD PTR [rbp-4], 0
        js      .L19
        cmp     DWORD PTR [rbp-4], 5
        jmp     .L16
.L19:
        nop
.L16:
        pop     rbp
        ret

Whereas

void _cmp(unsigned int t) {
    if(t > 5) {
        return;
    }
}

Only one:

_cmp:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], edi
        cmp     DWORD PTR [rbp-4], 5
        pop     rbp
        ret

In the assembly output of the C high-level code, we can see that for each statement the GCC compiler uses CMP, which is of course O(1) in the best case given that we match the first condition. However, if none match - the point to get to else equates to the time complexity of O(n) being the worst case. The average time complexity would really depend on what the code does and how frequently the value to compare proceeds further into the statement, which means more CMPs before the condition is met.

This is where the switch case comes in to save the day with the average time complexity of O(1) but we have to make sure that the switch case is optimized into a jump table, which we cover in the next section.

_if_run:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 0
        jne     .L32
        mov     DWORD PTR [rbp-4], 0
        jmp     .L44
.L32:
        cmp     DWORD PTR [rbp-20], 1
        jne     .L34
        mov     DWORD PTR [rbp-4], 1
        jmp     .L44
.L34:
        cmp     DWORD PTR [rbp-20], 2
        jne     .L35
        mov     DWORD PTR [rbp-4], 2
        jmp     .L44
.L35:
        cmp     DWORD PTR [rbp-20], 3
        jne     .L36
        mov     DWORD PTR [rbp-4], 3
        jmp     .L44
.L36:
        cmp     DWORD PTR [rbp-20], 4
        jne     .L37
        mov     DWORD PTR [rbp-4], 4
        jmp     .L44
.L37:
        cmp     DWORD PTR [rbp-20], 5
        jne     .L38
        mov     DWORD PTR [rbp-4], 5
        jmp     .L44
.L38:
        cmp     DWORD PTR [rbp-20], 6
        jne     .L39
        mov     DWORD PTR [rbp-4], 6
        jmp     .L44
.L39:
        cmp     DWORD PTR [rbp-20], 7
        jne     .L40
        mov     DWORD PTR [rbp-4], 7
        jmp     .L44
.L40:
        cmp     DWORD PTR [rbp-20], 8
        jne     .L41
        mov     DWORD PTR [rbp-4], 8
        jmp     .L44
.L41:
        cmp     DWORD PTR [rbp-20], 9
        jne     .L42
        mov     DWORD PTR [rbp-4], 9
        jmp     .L44
.L42:
        cmp     DWORD PTR [rbp-20], 10
        jne     .L43
        mov     DWORD PTR [rbp-4], 10
        jmp     .L44
.L43:
        mov     DWORD PTR [rbp-4], 11
.L44:
        nop
        pop     rbp
        ret

Switch Case

In case(no pun intended) you are wondering why we are not using the enum keyword and the enumeration values, that is to demonstrate that those enums are essentially unsigned int and are starting from 0. So if you are using enum, the assembly outputs are gonna be exactly the same when it comes to the switch statement. If you are looking for this enum version of the code, please check out _switch_enum_run at the Github repository source code or Default Case: Can we do without it section.

For example,

#include <stdio.h>
#include <assert.h>

enum N {
    N0,
    N1,
    N2,
    N3,
};

int main()
{

    assert(sizeof(unsigned int) == sizeof(N0));
    // Cast to unsigned int
    assert((unsigned int)N0 == 0);
    // No need to cast if not necessary
    assert(N0 == 0);

    return 0;
}

🚧 Note

Please understand that I am not advising casting enums in any shape or form.

If you are using func(enum T), always use one of the enum values directly - not unsigned int unless your function is func(unsigned int). It doesn't necessarily take a toll on the runtime performance but it is a bit unconventional in the high-level code.

The following is a simple switch case of 0 through 10 with the default case.

void _switch_run(unsigned int type) {

  volatile int r; // NOTE: preventing compiler from optimizing it out

  switch(type) {
    case 0: {
      r = 0;
      break;
    }
    case 1: {
      r = 1;
      break;
    }
    case 2: {
      r = 2;
      break;
    }
    case 3: {
      r = 3;
      break;
    }
    case 4: {
      r = 4;
      break;
    }
    case 5: {
      r = 5;
      break;
    }
    case 6: {
      r = 6;
      break;
    }
    case 7: {
      r = 7;
      break;
    }
    case 8: {
      r = 8;
      break;
    }
    case 9: {
      r = 9;
      break;
    }
    case 10: {
      r = 10;
      break;
    }
    default: {
      r = 11;
      break;
    }
  }

}

_switch_run:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 10
        ja      .L2
        mov     eax, DWORD PTR [rbp-20]
        mov     rax, QWORD PTR .L4[0+rax*8]
        jmp     rax
.L4:
        .quad   .L14
        .quad   .L13
        .quad   .L12
        .quad   .L11
        .quad   .L10
        .quad   .L9
        .quad   .L8
        .quad   .L7
        .quad   .L6
        .quad   .L5
        .quad   .L3
.L14:
        mov     DWORD PTR [rbp-4], 0
        jmp     .L15 # break
.L13:
        mov     DWORD PTR [rbp-4], 1
        jmp     .L15
.L12:
        mov     DWORD PTR [rbp-4], 2
        jmp     .L15
.L11:
        mov     DWORD PTR [rbp-4], 3
        jmp     .L15
.L10:
        mov     DWORD PTR [rbp-4], 4
        jmp     .L15
.L9:
        mov     DWORD PTR [rbp-4], 5
        jmp     .L15
.L8:
        mov     DWORD PTR [rbp-4], 6
        jmp     .L15
.L7:
        mov     DWORD PTR [rbp-4], 7
        jmp     .L15
.L6:
        mov     DWORD PTR [rbp-4], 8
        jmp     .L15
.L5:
        mov     DWORD PTR [rbp-4], 9
        jmp     .L15
.L3:
        mov     DWORD PTR [rbp-4], 10
        jmp     .L15
.L2: # Default Case Label
        mov     DWORD PTR [rbp-4], 11
        nop
.L15: # Break Label
        nop
        pop     rbp
        ret

These instruction are pretty straightforward to comprehend. First, we have the CMP instruction that checks if the unsigned integer is in the bounds of 10. As noted above, this is the deciding instruction for whether to jump straight to the default case.

cmp     DWORD PTR [rbp-20], 10
ja      .L2

JA is a jump when the comparison is unsigned. This makes sense given that we are using unsigned int to compare.

Roughly, the C equivalent of this code is the following:

void _switch_run(unsigned int type) {
  if(type > 10) {
    goto L2;
  }
  // ...
}

Finally, we get to the jump table:

        mov     eax, DWORD PTR [rbp-20]
        mov     rax, QWORD PTR .L4[0+rax*8]
        jmp     rax
.L4:
        .quad   .L14
        .quad   .L13
        .quad   .L12
        .quad   .L11
        .quad   .L10
        .quad   .L9
        .quad   .L8
        .quad   .L7
        .quad   .L6
        .quad   .L5
        .quad   .L3

We can see that the compiler has essentially generated the .L4, a static array of labels that represent the code blocks that we defined in our switch case. The cost of this JMP is constant O(1).

In C, this code is roughly equivalent to the following, which is covered thoroughly in the Implementing Switch Case section.

static void* cases[] = {
  &&l00,
  &&l01, 
  &&l02,
  &&l03,
  &&l04,
  &&l05,
  &&l06,
  &&l07,
  &&l08,
  &&l09,
  &&l10,
};


// CMP For the default case: "cmp     DWORD PTR [rbp-20], 10"
if(v > 10) {
  goto _default;
}

goto *cases[v];

volatile int r;

l00: {
   r = 0;
   goto defer; // break
 }
l01: {
   r = 1;
   goto defer; // break
 }
l02: {
   r = 2;
   goto defer; // break
 }
/* ... */
l10: {
     r = 10;
     goto defer;
   }


_default: {
   r = 11;
   goto defer; // break
}

// For us, defer serves as the alternative to where the "break" keyword jumps to as far as the switch case and jump table
defer: {
     return;
   }

Benchmarking the two

Finally, let's put our code to use and benchmark to demonstrate our assembly breakdown at runtime.

We are going to use the value of 10 since that is the last condition of the two statements. That way, we are forcing the worst case of if-else and confirming that the switch case is the constant time complexity: O(1).

int main() {
  volatile unsigned int value = 10; // NOTE: preventing compiler from optimizing it out
  {
    printf("_switch_run\n");
    for(int i = 0; i < 10; i++) {
      start_time;

      for(int i = 0; i < 1000000; i++) {
        _switch_run(value);
      }

      end_time;
    }
    printf("_switch_run\n");
  }

  {
    printf("_if_run\n");
    for(int i = 0; i < 10; i++) {
      start_time;

      for(int i = 0; i < 1000000; i++) {
        _if_run(value);
      }

      end_time;
    }
    printf("_if_run\n");
  }

  return 0;
}

The code produces the following output:

_switch_run
[CPU Time Used: 0.002417]
[CPU Time Used: 0.002171]
[CPU Time Used: 0.002241]
[CPU Time Used: 0.002250]
[CPU Time Used: 0.002212]
[CPU Time Used: 0.002244]
[CPU Time Used: 0.002201]
[CPU Time Used: 0.002209]
[CPU Time Used: 0.002329]
[CPU Time Used: 0.002238]
_switch_run
_if_run
[CPU Time Used: 0.005600]
[CPU Time Used: 0.005801]
[CPU Time Used: 0.005637]
[CPU Time Used: 0.005700]
[CPU Time Used: 0.005804]
[CPU Time Used: 0.005658]
[CPU Time Used: 0.005693]
[CPU Time Used: 0.005757]
[CPU Time Used: 0.005792]
[CPU Time Used: 0.005832]
_if_run

Out-of-Bounds Benchmark

Let's see what's the case if we provide the value that matches none of our conditions.

With the switch case, that would be the default label and with the if-else set: the else statement.

int main() {
  volatile unsigned int value = 11; // NOTE: preventing compiler from optimizing it out
  {
    printf("_switch_run\n");
    for(int i = 0; i < 10; i++) {
      start_time;

      for(int i = 0; i < 1000000; i++) {
        _switch_run(value);
      }

      end_time;
    }
    printf("_switch_run\n");
  }

  {
    printf("_if_run\n");
    for(int i = 0; i < 10; i++) {
      start_time;

      for(int i = 0; i < 1000000; i++) {
        _if_run(value);
      }

      end_time;
    }
    printf("_if_run\n");
  }

  return 0;
}

In this particular scenario, the switch comes out victorious again but even by a larger margin, as we can see from the following results.

_switch_run
[CPU Time Used: 0.001884]
[CPU Time Used: 0.001813]
[CPU Time Used: 0.001726]
[CPU Time Used: 0.002118]
[CPU Time Used: 0.001847]
[CPU Time Used: 0.001746]
[CPU Time Used: 0.001817]
[CPU Time Used: 0.001675]
[CPU Time Used: 0.001713]
[CPU Time Used: 0.001679]
_switch_run
_if_run
[CPU Time Used: 0.005802]
[CPU Time Used: 0.005815]
[CPU Time Used: 0.005619]
[CPU Time Used: 0.005913]
[CPU Time Used: 0.005581]
[CPU Time Used: 0.005722]
[CPU Time Used: 0.005725]
[CPU Time Used: 0.005770]
[CPU Time Used: 0.005779]
[CPU Time Used: 0.005683]
_if_run

This is due to the fact that out of all this code for the _switch_run:

_switch_run:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 10
        ja      .L18
        mov     eax, DWORD PTR [rbp-20]
        mov     rax, QWORD PTR .L20[0+rax*8]
        jmp     rax
.L20:
        .quad   .L30
        .quad   .L29
        .quad   .L28
        .quad   .L27
        .quad   .L26
        .quad   .L25
        .quad   .L24
        .quad   .L23
        .quad   .L22
        .quad   .L21
        .quad   .L19

It is only this portion of the code that actually runs:

_switch_run:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 10
        ja      .L18
.L18:
        mov     DWORD PTR [rbp-4], 11
        nop

Once the CMP matches and JA jumps to .L18, that is the end of our function lifetime. In other words, it doesn't even use the jump table.

Turning a switch case into a jump table

The GCC compiler is not always going to turn a switch case into a jump table. Only when there are multiple cases does the compiler (even consider to) make that optimization.

For example, if we take this code into consideration with its respective assembly output.

void _switch_test(unsigned int type) {
  volatile int r; // NOTE: preventing compiler from optimizing it out

  switch(type) {
    case 0: {
      r = 0;
      break;
    }
    case 1: {
      r = 1;
      break;
    }
    case 2: {
      r = 2;
      break;
    }
    case 3: {
        r = 3;
        break;
    }
    default: {
        r = 11;
        break;
    }
  }
}

This is basically an if-else statement, which we can tell by the CMP and JE(Jump if Equal) instructions. There is an interesting detail that all of these values are compared backwards - from 3 rather than from 0 as opposed to the if-else example.

_switch_test:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 3
        je      .L18
        cmp     DWORD PTR [rbp-20], 3
        ja      .L19
        cmp     DWORD PTR [rbp-20], 2
        je      .L20
        cmp     DWORD PTR [rbp-20], 2
        ja      .L19
        cmp     DWORD PTR [rbp-20], 0
        je      .L21
        cmp     DWORD PTR [rbp-20], 1
        je      .L22
        jmp     .L19
.L21:
        mov     DWORD PTR [rbp-4], 0
        jmp     .L23
.L22:
        mov     DWORD PTR [rbp-4], 1
        jmp     .L23
.L20:
        mov     DWORD PTR [rbp-4], 2
        jmp     .L23
.L18:
        mov     DWORD PTR [rbp-4], 3
        jmp     .L23
.L19:
        mov     DWORD PTR [rbp-4], 11
        nop
.L23:
        nop
        pop     rbp
        ret

Even by enabling the -O3 optimizations, this code is not turned into a jump table.

The GCC compiler decides whether to generate a jump table given there is enough densely packed values.

So for this code, we do get the jump table generated.

void _switch_test(unsigned int type) {
  volatile int r; // NOTE: preventing compiler from optimizing it out

  switch(type) {
    case 0: {
      r = 0;
      break;
    }
    case 1: {
      r = 1;
      break;
    }
    case 2: {
      r = 2;
      break;
    }
    case 3: {
        r = 3;
        break;
    }
    case 4: {
        r = 4;
        break;
    }
    default: {
        r = 11;
        break;
    }
  }
}

There is our jump table by the name of .L20!

_switch_test:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 4
        ja      .L18
        mov     eax, DWORD PTR [rbp-20]
        mov     rax, QWORD PTR .L20[0+rax*8]
        jmp     rax
.L20:
        .quad   .L24
        .quad   .L23
        .quad   .L22
        .quad   .L21
        .quad   .L19
.L24:
        mov     DWORD PTR [rbp-4], 0
        jmp     .L25
.L23:
        mov     DWORD PTR [rbp-4], 1
        jmp     .L25
.L22:
        mov     DWORD PTR [rbp-4], 2
        jmp     .L25
.L21:
        mov     DWORD PTR [rbp-4], 3
        jmp     .L25
.L19:
        mov     DWORD PTR [rbp-4], 4
        jmp     .L25
.L18:
        mov     DWORD PTR [rbp-4], 11
        nop
.L25:
        nop
        pop     rbp
        ret

With all of that being said, there is a way to force generating a jump at all times by specifying --param case-values-threshold=1 as one of the GCC command options.

According to Using the GNU Compiler Collection Documentation:

case-values-threshold
- The smallest number of different values for which it is best to use a jump-table instead of a tree of conditional branches. If the value is 0, use the default for the machine.
-fno-jump-tables
- Do not use jump tables for switch statements even where it would be more effi- cient than other code generation strategies. This option is of use in conjunction with ‘-fpic’ or ‘-fPIC’ for building code that forms part of a dynamic linker and cannot reference the address of a jump table. On some targets, jump tables do not require a GOT and this option is not needed.

📘 Note

The closest "opposite" to case-values-threshold probably is -fno-jump-tables where no matter the number of values/cases, that block is turned into a series of CMP and JMP instructions.

As a side note, there is the option called -fjump-tables that I managed to dig out of Using the GNU Compiler Collection Documentation but I am assuming it is the default option since I couldn't find any description of it and it is the -fno-jump-tables, if specified, that overrides it.

gcc src/main.cpp -Werror -g -o output.s -masm=intel -fverbose-asm -S --param case-values-threshold=1

Even if we have a switch case such as the following.

void _switch_test(unsigned int type) {
  volatile int r; // NOTE: preventing compiler from optimizing it out

  switch(type) {
    case 0: {
      r = 0;
      break;
    }
    case 1: {
      r = 1;
      break;
    }
    default: {
        r = 11;
        break;
    }
  }
}

The output is:

_switch_test:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 1
        ja      .L18
        mov     eax, DWORD PTR [rbp-20]
        mov     rax, QWORD PTR .L20[0+rax*8]
        jmp     rax
.L20:
        .quad   .L21
        .quad   .L19
.L21:
        mov     DWORD PTR [rbp-4], 0
        jmp     .L22
.L19:
        mov     DWORD PTR [rbp-4], 1
        jmp     .L22
.L18:
        mov     DWORD PTR [rbp-4], 11
        nop

From this, we can conclude that using a jump table for as few as 1, 2, or 3 values is redundant and rather impractical at the same time.

Is there a limit for the values of a switch case?

Since a jump table is basically a static array, there should be some limit of how many values we can pack into that jump table.

There is no particular way to explicitly test when the jump table isn't used in case there are thousands upon thousands of cases - unless we want to torture our computer and buy a new one any time soon so we can use a script like this.

if __name__ == '__main__':
  code = '''
void __switch_test(unsigned int type) {
  volatile int r; // NOTE: preventing compiler from optimizing it out

  switch(type) {
  ''';
  for i in range(0, 100000):
      code += 'case ' + str(i) + ':{ \n' + ' r=' + str(i) + ';\n break;\n}\n'

  code += 'default: {\n r = 11;\n break;\n}\n'
  code += '}'
  code += '}'

  print(code)

python3 generate.py > _out.c && gcc _out.c -g -o output.s -masm=intel -fverbose-asm -S

To answer the question in the title, there is no limit for the GCC compiler not to make it a jump table given that all the values are densely packed.

As long as there is memory, the output is going to be a jump table. Please, check out Using the GNU Compiler Collection | 4.12 Statements.

There are some options to consider for restricting the usage of jump tables.

-fno-jump-tables
- Do not use jump tables for switch statements even where it would be more efficient than other code generation strategies. This option is of use in conjunction with ‘-fpic’ or ‘-fPIC’ for building code that forms part of a dynamic linker and cannot reference the address of a jump table. On some targets, jump tables do not require a GOT and this option is not needed.
jump-table-max-growth-ratio-for-size
- The maximum code size growth ratio when expanding into a jump table (in percent). The parameter is used when optimizing for size.
jump-table-max-growth-ratio-for-speed
- The maximum code size growth ratio when expanding into a jump table (in percent). The parameter is used when optimizing for speed.

Same as the compiler not using the jump table in the example above, note that with -fno-jump-tables, the GCC compiler creates a set of CMP and JMP instructions in reverse order to the original values defined.

Breaking switch case: Non-packed values

This is a rather weird use case for the switch statement but worth exploring to see what happens if we get "out of the way of the packed values".

void _switch_unpacked(unsigned int type) {
  volatile int r; // NOTE: preventing compiler from optimizing it out

  switch(type) {
    case 0: {
      r = 0;
      break;
    }
    case 1: {
      r = 1;
      break;
    }
    case 2: {
      r = 2;
      break;
    }
    case 3: {
        r = 3;
        break;
    }
    case 4: {
        r = 4;
        break;
    }
    case 5: {
        r = 5;
        break;
    }
    case 30: {
        r = 30;
        break;
    }
    default: {
        r = 11;
        break;
    }
  }
}

The GCC compiler handles this code in a very interesting way. It fills all the cases in between 4 and 30 with blanks. These all lead to the same .L18 default case jump.

_switch_unpacked:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 30
        ja      .L18
        mov     eax, DWORD PTR [rbp-20]
        mov     rax, QWORD PTR .L20[0+rax*8]
        jmp     rax
.L20:
        .quad   .L26
        .quad   .L25
        .quad   .L24
        .quad   .L23
        .quad   .L22
        .quad   .L21
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L18
        .quad   .L19
.L26:
        mov     DWORD PTR [rbp-4], 0
        jmp     .L27
.L25:
        mov     DWORD PTR [rbp-4], 1
        jmp     .L27
.L24:
        mov     DWORD PTR [rbp-4], 2
        jmp     .L27
.L23:
        mov     DWORD PTR [rbp-4], 3
        jmp     .L27
.L22:
        mov     DWORD PTR [rbp-4], 4
        jmp     .L27
.L21:
        mov     DWORD PTR [rbp-4], 5
        jmp     .L27
.L19:
        mov     DWORD PTR [rbp-4], 30
        jmp     .L27
.L18:
        mov     DWORD PTR [rbp-4], 11
        nop
.L27:
        nop
        pop     rbp
        ret

It is rational to think that if we say 100 instead of 30, the compiler doesn't waste too much space with all the blanks that point to the default case and instead needs to use another CMP instruction. This is exactly the case.

void _switch_unpacked(unsigned int type) {
  volatile int r; // NOTE: preventing compiler from optimizing it out

  switch(type) {
    /* ... */
    case 5: {
        r = 5;
        break;
    }
    case 100: {
        r = 100;
        break;
    }
    default: {
        r = 11;
        break;
    }
  }
}

We have two CMP:

First CMP: The value is above 5 and jump to check (CMP) if it is 100

            cmp     DWORD PTR [rbp-20], 5
            ja      .L18

    # ...
    .L18:
            cmp     DWORD PTR [rbp-20], 100
            je      .L27
            jmp     .L19
    ```



- Second `CMP`: Checks if the value is above `5` - if that's _true(sets the zero flag)_, jump to the `default`. Otherwise, proceed with the jump table where all of the values are _still_ packed.

  -

 ```assembly
            cmp     DWORD PTR [rbp-20], 5
            ja      .L19
            mov     eax, DWORD PTR [rbp-20]
            mov     rax, QWORD PTR .L21[0+rax*8]
            jmp     rax
    .L21:
            .quad   .L26
            .quad   .L25
            .quad   .L24
            .quad   .L23
            .quad   .L22
            .quad   .L20
    # ...
    .L19:
            mov     DWORD PTR [rbp-4], 11
            nop
    ```






```assembly
_switch_unpacked:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 5
        ja      .L18
        cmp     DWORD PTR [rbp-20], 5
        ja      .L19
        mov     eax, DWORD PTR [rbp-20]
        mov     rax, QWORD PTR .L21[0+rax*8]
        jmp     rax
.L21:
        .quad   .L26
        .quad   .L25
        .quad   .L24
        .quad   .L23
        .quad   .L22
        .quad   .L20
.L18:
        cmp     DWORD PTR [rbp-20], 100
        je      .L27
        jmp     .L19
.L26:
        mov     DWORD PTR [rbp-4], 0
        jmp     .L28
.L25:
        mov     DWORD PTR [rbp-4], 1
        jmp     .L28
.L24:
        mov     DWORD PTR [rbp-4], 2
        jmp     .L28
.L23:
        mov     DWORD PTR [rbp-4], 3
        jmp     .L28
.L22:
        mov     DWORD PTR [rbp-4], 4
        jmp     .L28
.L20:
        mov     DWORD PTR [rbp-4], 5
        jmp     .L28
.L27:
        mov     DWORD PTR [rbp-4], 100
        jmp     .L28
.L19:
        mov     DWORD PTR [rbp-4], 11
        nop
.L28:
        nop
        pop     rbp
        ret

These two CMPs may appear to be a little unoptimized since they do pretty much the same thing. So let's go ahead and turn on the -O1 optimizations:

_switch_unpacked:
        cmp     edi, 5
        ja      .L2
        ja      .L3
        mov     edi, edi
        jmp     [QWORD PTR .L5[0+rdi*8]]
.L5:
        .quad   .L10
        .quad   .L9
        .quad   .L8
        .quad   .L7
        .quad   .L6
        .quad   .L4
.L2:
        cmp     edi, 100
        jne     .L3
        mov     DWORD PTR [rsp-4], 100
        ret
.L3:
        mov     DWORD PTR [rsp-4], 11
        ret

Multiple Jump Tables

If we think more and more about the case of "Using the non-packed values", the compiler cannot keep using the CMP instructions in the ranges such as [0, 1, 2, 3, 4, 5, 100, 101, 102, 103, 104, 105]. We can deduce that there are actually two sets of packed values there: 0...5 and 100...105.

However, because they are two entirely separate sets, there must be an additional CMP instruction between them to check the out-of-bounds ranges and whether to jump to the default case.

Let's back up our thinking in the following example.

void _switch_unpacked(unsigned int type) {
  volatile int r; // NOTE: preventing compiler from optimizing it out

  switch(type) {
    case 0: {
      r = 0;
      break;
    }
    case 1: {
      r = 1;
      break;
    }
    case 2: {
      r = 2;
      break;
    }
    case 3: {
        r = 3;
        break;
    }
    case 4: {
        r = 4;
        break;
    }
    case 5: {
        r = 5;
        break;
    }
    case 100: {
        r = 100;
        break;
    }
    case 101: {
        r = 101;
        break;
    }
    case 102: {
        r = 102;
        break;
    }
    case 103: {
        r = 103;
        break;
    }
    case 104: {
        r = 104;
        break;
    }
    case 105: {
        r = 105;
        break;
    }
    default: {
        r = 11;
        break;
    }
  }
}

In this assembly output, there are two jump tables - separated by their respective CMP instructions.

.L29: 0...5 range jump table
.L22: 100...105 range jump table

_switch_unpacked:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 5
        ja      .L18
        jmp     .L37
.L35:
        mov     eax, DWORD PTR [rbp-20]
        sub     eax, 100
        cmp     eax, 5
        ja      .L20
        mov     eax, eax
        mov     rax, QWORD PTR .L22[0+rax*8]
        jmp     rax
.L22:
        .quad   .L27
        .quad   .L26
        .quad   .L25
        .quad   .L24
        .quad   .L23
        .quad   .L21
.L37:
        cmp     DWORD PTR [rbp-20], 5
        ja      .L20
        mov     eax, DWORD PTR [rbp-20]
        mov     rax, QWORD PTR .L29[0+rax*8]
        jmp     rax
.L29:
        .quad   .L34
        .quad   .L33
        .quad   .L32
        .quad   .L31
        .quad   .L30
        .quad   .L28
.L18:
        cmp     DWORD PTR [rbp-20], 105
        ja      .L20
        cmp     DWORD PTR [rbp-20], 100
        jnb     .L35
        jmp     .L20
.L34:
        mov     DWORD PTR [rbp-4], 0
        jmp     .L36
.L33:
        mov     DWORD PTR [rbp-4], 1
        jmp     .L36
.L32:
        mov     DWORD PTR [rbp-4], 2
        jmp     .L36
.L31:
        mov     DWORD PTR [rbp-4], 3
        jmp     .L36
.L30:
        mov     DWORD PTR [rbp-4], 4
        jmp     .L36
.L28:
        mov     DWORD PTR [rbp-4], 5
        jmp     .L36
.L27:
        mov     DWORD PTR [rbp-4], 100
        jmp     .L36
.L26:
        mov     DWORD PTR [rbp-4], 101
        jmp     .L36
.L25:
        mov     DWORD PTR [rbp-4], 102
        jmp     .L36
.L24:
        mov     DWORD PTR [rbp-4], 103
        jmp     .L36
.L23:
        mov     DWORD PTR [rbp-4], 104
        jmp     .L36
.L21:
        mov     DWORD PTR [rbp-4], 105
        jmp     .L36
.L20:
        mov     DWORD PTR [rbp-4], 11
        nop
.L36:
        nop
        pop     rbp
        ret

If-else as a jump table

With no extra optimizations such as -O1, -O2 or -O3, the if-else statement is a set of CMP and JMP instructions checking each value as detailed out above.

Much to our surprise, if the conditional values are compared via the equal-to operator and are densely packed, the -O1 optimizations can do their magic and turn the if-else statements into a jump table.

gcc src/main.cpp -g -o output.s -masm=intel -fverbose-asm -S -O1

void _if_run(unsigned int type) {

  volatile int r; // NOTE: preventing compiler from optimizing it out

  if(type == 0) {
    r = 0;
  } else if(type == 1) {
    r = 1;
  } else if(type == 2) {
    r = 2;
  } else if(type == 3) {
    r = 3;
  } else if(type == 4) {
    r = 4;
  } else if(type == 5) {
    r = 5;
  } else if(type == 6) {
    r = 6;
  } else if(type == 7) {
    r = 7;
  } else if(type == 8) {
    r = 8;
  } else if(type == 9) {
    r = 9;
  } else if(type == 10) {
    r = 10;
  } else {
    r = 11;
  }

}

_if_run:
        cmp     edi, 10
        ja      .L85
        mov     edi, edi
        jmp     [QWORD PTR .L87[0+rdi*8]]
.L87:
        .quad   .L97
        .quad   .L96
        .quad   .L95
        .quad   .L94
        .quad   .L93
        .quad   .L92
        .quad   .L91
        .quad   .L90
        .quad   .L89
        .quad   .L88
        .quad   .L86
.L97:
        mov     DWORD PTR [rsp-4], 0
        ret
.L96:
        mov     DWORD PTR [rsp-4], 1
        ret
.L95:
        mov     DWORD PTR [rsp-4], 2
        ret
.L94:
        mov     DWORD PTR [rsp-4], 3
        ret
.L93:
        mov     DWORD PTR [rsp-4], 4
        ret
.L92:
        mov     DWORD PTR [rsp-4], 5
        ret
.L91:
        mov     DWORD PTR [rsp-4], 6
        ret
.L90:
        mov     DWORD PTR [rsp-4], 7
        ret
.L89:
        mov     DWORD PTR [rsp-4], 8
        ret
.L88:
        mov     DWORD PTR [rsp-4], 9
        ret
.L86:
        mov     DWORD PTR [rsp-4], 10
        ret
.L85:
        mov     DWORD PTR [rsp-4], 11
        ret

In my opinion, this optimization may be overkill if you are an old-school person who looks at their assembly after compiling this and saying, "Wait! I didn't use a switch case. Why is there a jump table?". For that reason, we must always stay up-to-date with the features of GCC.

Implementing Switch Case

In this section, we are gonna cover two possible implementations in C but not use the switch keyword.

For the sake of simplicity, we won't be embedding assembly via 6.17.2 Extended Asm | Assembler Instructions with C Expression Operands. Even though we don't use assembly directly, we can come pretty close to the jump table implementation as the GCC compiler sees it.

The first example is taken from the Wikipedia | Branch table.

"Another simple example, this time demonstrating a jump table rather than a mere branch table. This allows program blocks outside of the currently active procedure/function to be called:"

Jump Table with function pointers

#include <stdio.h>
#include <stdlib.h>

typedef void (*Handler)(void);    /* A pointer to a handler function */

/* The functions */
void func3 (void) { printf( "3\n" ); }
void func2 (void) { printf( "2\n" ); }
void func1 (void) { printf( "1\n" ); }
void func0 (void) { printf( "0\n" ); }

Handler jump_table[4] = { func0, func1, func2, func3 };

int main (int argc, char **argv) {
    int value;

    /* Convert first argument to 0-3 integer (modulus) */
    value = atoi(argv[1]) % 4;

    /* Call appropriate function (func0 thru func3) */
    jump_table[value]();

    return 0;
}

As it implies, this is a jump table rather than a mere branch table. This jump table consists of function pointers instead labels. Labels directly use JMP instructions while the function call like this (since function pointers cannot be inlined) uses the CALL instruction. In the simplest terms, CALL pushes the address of the instruction onto the stack. This value is often referred to as the return address. Once the CALL instruction is executed, the RET instruction pops the return address off the stack. As a result, JMP instructions are faster than CALL since jumps don't involve pushing and popping via the LIFO (Last-In-First-Out) stack. For more information, check out x86 assembly language.

🚧 Note

Another major disadvantage is that, compared to labels, we cannot pass variables from the outer scopes that easily unless we pass them to function arguments as pointers, which generates into more MOV instructions before the function CALL.

It is important to note that this kind of function pointer call cannot be inlined even with the -O1, -O2 or -O3 optimizations.

To prove this point, let's time the following implementation.

Implementation

typedef void (*Handler)(volatile int*);

void func3 (volatile int* r) { *r = 3; }
void func2 (volatile int* r) { *r = 2; }
void func1 (volatile int* r) { *r = 1; }
void func0 (volatile int* r) { *r = 0; }


void func_pointer_jump_table(unsigned int type) {

  // NOTE: we use static here since we don't want this array to be initialized per call. Like this, it is preserved between function invocations.
  static Handler jump_table[4] = { func0, func1, func2, func3 };

  type = type % 4;

  volatile int r;

  jump_table[type](&r);

}

int main() {
  {
    std::cout << "func_pointer_jump_table" << std::endl;
    for(int i = 0; i < 10; i++) {
      start_time;

      for(int i = 0; i < 1000000; i++) {
        func_pointer_jump_table(0);
      }

      end_time;
    }
    std::cout << "func_pointer_jump_table" << std::endl;

  }
  return 0;
}

Benchmark Result 1

func_pointer_jump_table
[CPU Time Used: 0.005723]
[CPU Time Used: 0.005588]
[CPU Time Used: 0.005661]
[CPU Time Used: 0.005637]
[CPU Time Used: 0.005622]
[CPU Time Used: 0.005693]
[CPU Time Used: 0.005615]
[CPU Time Used: 0.005656]
[CPU Time Used: 0.005703]
[CPU Time Used: 0.005668]
func_pointer_jump_table

We can deduce that this performance is almost identical to the worst case of if-else statements. However, it can be a bit improved with the modulo operator replaced with an if statement where we can check the index (unsigned int) value is in bounds.

Benchmark Result 2

void func_pointer_jump_table(unsigned int type) {

  // NOTE: We use static here since we don't want this array to be initialized per call. Like this, it is preserved between function invocations.
  static Handler jump_table[4] = { func0, func1, func2, func3 };

  // type = type % 4; // Replace with the CMP instruction
  if(type > 3) {
    // ... Default case
    return;
  }

  int r;

  jump_table[type](&r);

}

int main() {
  {
    std::cout << "func_pointer_jump_table" << std::endl;
    for(int i = 0; i < 10; i++) {
      start_time;

      for(int i = 0; i < 1000000; i++) {
        func_pointer_jump_table(0);
      }

      end_time;
    }
    std::cout << "func_pointer_jump_table" << std::endl;

  }
  return 0;
}

func_pointer_jump_table
[CPU Time Used: 0.005522]
[CPU Time Used: 0.005371]
[CPU Time Used: 0.005440]
[CPU Time Used: 0.005628]
[CPU Time Used: 0.005685]
[CPU Time Used: 0.005464]
[CPU Time Used: 0.005547]
[CPU Time Used: 0.005558]
[CPU Time Used: 0.005492]
[CPU Time Used: 0.005472]
func_pointer_jump_table

Given the benchmark results, we are slightly faster than the modulo operator.

In assembly,

type = type % 4;

translates into

and     DWORD PTR [rbp-20], 3

if(type > 3) {
  // ... Default case
  return;
}

translates into

cmp     DWORD PTR [rbp-20], 3
ja      .L48

In the end, this is a good example to showcase the concept on a higher-level but definitely not the the main choice when it comes to implementing our own switch case.

Covered in the next section, the closest implementation to the switch case is using Using the GNU Compiler Collection | 6.18.3 Labels as Values.

Computed GOTO: Goto Jump Table with Labels

According to Using the GNU Compiler Collection | 6.18.3 Labels as Values, we have the luxury of actually making labels part of the data structures as long as they are the void* pointers. We can reference them with the unary operator &&.

Let's look at the following implementation:

void _switch_go(unsigned int v) {

  // As mentioned before, we keep the array preserved between function invocations.
  static void* cases[] = { // Jump Table
    &&l00,
    &&l01, 
    &&l02,
    &&l03,
    &&l04,
    &&l05,
    &&l06,
    &&l07,
    &&l08,
    &&l09,
    &&l10,
  };


  // CMP For the default case - also implemented by the switch case
  if(v > 10) {
    goto _default;
  }

  // Jumps to the entry
  goto *cases[v];

  volatile int r;

  l00: {
       r = 0;
       goto defer; // break
     }
  l01: {
       r = 1;
       goto defer; // break
     }
  l02: {
       r = 2;
       goto defer; // break
     }
  l03: {
         r = 3;
         goto defer; // break
       }
  l04: {
         r = 4;
         goto defer; // break
       }
  l05: {
         r = 5;
         goto defer;
       }
  l06: {
         r = 6;
         goto defer;
       }
  l07: {
         r = 7;
         goto defer;
       }
  l08: {
         r = 8;
         goto defer;
       }
  l09: {
         r = 9;
         goto defer;
       }
  l10: {
         r = 10;
         goto defer;
       }


  _default: {
       r = 11;
       goto defer; // break
  }

  // For us, defer serves as the alternative to where the "break" keyword jumps to as far as the switch case and jump table
  defer: {
         return;
       }
}

The assembly of the code above is:

cases.1:
        .quad   .L49
        .quad   .L51
        .quad   .L52
        .quad   .L53
        .quad   .L54
        .quad   .L55
        .quad   .L56
        .quad   .L57
        .quad   .L58
        .quad   .L59
        .quad   .L60
_switch_go:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 10
        ja      .L63
        mov     eax, DWORD PTR [rbp-20]
        mov     rax, QWORD PTR cases.1[0+rax*8]
        nop
        jmp     rax
.L49:
        mov     DWORD PTR [rbp-4], 0
        jmp     .L50
.L51:
        mov     DWORD PTR [rbp-4], 1
        jmp     .L50
.L52:
        mov     DWORD PTR [rbp-4], 2
        jmp     .L50
.L53:
        mov     DWORD PTR [rbp-4], 3
        jmp     .L50
.L54:
        mov     DWORD PTR [rbp-4], 4
        jmp     .L50
.L55:
        mov     DWORD PTR [rbp-4], 5
        jmp     .L50
.L56:
        mov     DWORD PTR [rbp-4], 6
        jmp     .L50
.L57:
        mov     DWORD PTR [rbp-4], 7
        jmp     .L50
.L58:
        mov     DWORD PTR [rbp-4], 8
        jmp     .L50
.L59:
        mov     DWORD PTR [rbp-4], 9
        jmp     .L50
.L60:
        mov     DWORD PTR [rbp-4], 10
        jmp     .L50
.L63:
        nop
        mov     DWORD PTR [rbp-4], 11
        nop
.L50:
        nop
        pop     rbp
        ret

If we compare the switch case assembly from the example above, they are pretty identical in essence:

_switch_run:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 10
        ja      .L2
        mov     eax, DWORD PTR [rbp-20]
        mov     rax, QWORD PTR .L4[0+rax*8]
        jmp     rax
.L4:
        .quad   .L14
        .quad   .L13
        .quad   .L12
        .quad   .L11
        .quad   .L10
        .quad   .L9
        .quad   .L8
        .quad   .L7
        .quad   .L6
        .quad   .L5
        .quad   .L3
.L14:
        mov     DWORD PTR [rbp-4], 0
        jmp     .L15 # break
.L13:
        mov     DWORD PTR [rbp-4], 1
        jmp     .L15
.L12:
        mov     DWORD PTR [rbp-4], 2
        jmp     .L15
.L11:
        mov     DWORD PTR [rbp-4], 3
        jmp     .L15
.L10:
        mov     DWORD PTR [rbp-4], 4
        jmp     .L15
.L9:
        mov     DWORD PTR [rbp-4], 5
        jmp     .L15
.L8:
        mov     DWORD PTR [rbp-4], 6
        jmp     .L15
.L7:
        mov     DWORD PTR [rbp-4], 7
        jmp     .L15
.L6:
        mov     DWORD PTR [rbp-4], 8
        jmp     .L15
.L5:
        mov     DWORD PTR [rbp-4], 9
        jmp     .L15
.L3:
        mov     DWORD PTR [rbp-4], 10
        jmp     .L15
.L2: # Default Case Label
        mov     DWORD PTR [rbp-4], 11
        nop
.L15: # Break Label
        nop
        pop     rbp
        ret

After all, the _switch_go function becomes difficult to maintain once we start having nested switch cases. This is when the programmer needs to be extremely cautious with the goto statements. If you insist on it, you can always write additional unit tests!

No optimization flags

_switch_go
[CPU Time Used: 0.002202]
[CPU Time Used: 0.002200]
[CPU Time Used: 0.002201]
[CPU Time Used: 0.002200]
[CPU Time Used: 0.002362]
[CPU Time Used: 0.002200]
[CPU Time Used: 0.002203]
[CPU Time Used: 0.002283]
[CPU Time Used: 0.002202]
[CPU Time Used: 0.002202]
_switch_go

With `-O2` optimizations

_switch_go
[CPU Time Used: 0.001712]
[CPU Time Used: 0.001755]
[CPU Time Used: 0.001662]
[CPU Time Used: 0.001691]
[CPU Time Used: 0.001976]
[CPU Time Used: 0.001709]
[CPU Time Used: 0.001746]
[CPU Time Used: 0.001774]
[CPU Time Used: 0.001760]
[CPU Time Used: 0.001691]
_switch_go

Switch Case with `-O1` optimizations

If we inline the _switch_run and optimize it with -O1, we are able to spot the significance.

void _switch_run(unsigned int); // Declaration for when the function cannot be inlined
inline void _switch_run(unsigned int type) {
    /* ... */
}

_switch_run
[CPU Time Used: 0.000762]
[CPU Time Used: 0.000784]
[CPU Time Used: 0.000733]
[CPU Time Used: 0.000734]
[CPU Time Used: 0.000734]
[CPU Time Used: 0.000734]
[CPU Time Used: 0.000734]
[CPU Time Used: 0.000772]
[CPU Time Used: 0.000733]
[CPU Time Used: 0.000734]
_switch_run

No Inline

Without inline, we get the performance similar to the _switch_go.

_switch_run
[CPU Time Used: 0.001754]
[CPU Time Used: 0.001754]
[CPU Time Used: 0.001702]
[CPU Time Used: 0.001661]
[CPU Time Used: 0.001987]
[CPU Time Used: 0.001691]
[CPU Time Used: 0.001739]
[CPU Time Used: 0.001660]
[CPU Time Used: 0.001660]
[CPU Time Used: 0.001770]
_switch_run

This, however, depends if the compiler is able to inline a function.

The fact of the matter is it is possible for the compiler to inline the switch case function as opposed to the computed goto, which is covered in the next section.

Inlining Computed GOTO

🚧 Optimization Problems

The most critical downside of this implementation is that it is hard for the GCC compiler to optimize _switch_go.

To attempt to inline the Computed GOTO, we can use static inline in the function definition.

static inline void _switch_go(unsigned int v) {
  /* ... */
}

After benchmarking this code with the -O1 to try to inline it, the performance is, more or less, the same as the function without the inline.

int main() {
  volatile unsigned int value = 10;
  {
    printf("_switch_go\n");
    for(int i = 0; i < 10; i++) {
      start_time;

      for(int i = 0; i < 1000000; i++) {
        _switch_go(value);
      }

      end_time;
    }
    printf("_switch_go\n");
  }
}

_switch_go
[CPU Time Used: 0.001755]
[CPU Time Used: 0.001804]
[CPU Time Used: 0.001731]
[CPU Time Used: 0.001754]
[CPU Time Used: 0.001712]
[CPU Time Used: 0.001721]
[CPU Time Used: 0.001756]
[CPU Time Used: 0.001711]
[CPU Time Used: 0.001755]
[CPU Time Used: 0.001727]
_switch_go

.L112:
        mov     edi, DWORD PTR [rsp+28]
        call    _switch_go
        sub     ebx, 1
        jne     .L112
        call    clock

The bottom line is it cannot be inlined.

According to 6.18.3 Labels as Values,

The &&foo expressions for the same label might have different values if the containing function is inlined or cloned. If a program relies on them being always the same, `attribute((noinline,noclone)) should be used to prevent inlining and cloning. If &&foo` is used in a static variable initializer, inlining and cloning is forbidden.

We ca verify this by the following code:

static inline void _switch_go (unsigned int) __attribute__((always_inline));

error: function 'void _switch_go(unsigned int)' can never be copied because it saves address of local label in a static variable

error: function 'void _switch_go(unsigned int)' can never be inlined because it contains a computed goto

Therefore, we have to consider other approaches such as 6.18.2 Locally Declared Labels and 6.18.1 Statements and Declarations in Expressions in case we need to get the value out of the nested block scope.

Switch Go Macro

Let's wrap the _switch_go inside of the macro with local labels.

#define __switch_go(type) { \
    __label__ l00, l01, l02, _default, defer; \
    static void* jtable[] = {&&l00, &&l01, &&l02}; \
    if((unsigned int)type > 2) goto _default; \
    goto* jtable[type]; \
    volatile int r; \
    l00: { \
        r = 0; \
        goto defer; \
    } \
    l01: { \
        r = 1; \
        goto defer; \
    } \
    l02: { \
        r = 2; \
        goto defer; \
    } \
    _default: { \
        r = 11; \
        goto defer; \
    } \
    defer: { \
    } \
    }

int main() {
  volatile unsigned int value3 = 2;
  {
    printf("__switch_go\n");
    for(int i = 0; i < 10; i++) {
      start_time;

      for(int i = 0; i < 1000000; i++) {
        __switch_go(value3);
      }

      end_time; 
    }
    printf("__switch_go\n");
  }
}

After benchmarking it, there is a significant speed up.

__switch_go
[CPU Time Used: 0.000851]
[CPU Time Used: 0.000759]
[CPU Time Used: 0.000760]
[CPU Time Used: 0.000714]
[CPU Time Used: 0.000795]
[CPU Time Used: 0.000771]
[CPU Time Used: 0.000789]
[CPU Time Used: 0.000781]
[CPU Time Used: 0.000733]
[CPU Time Used: 0.000771]
__switch_go

Getting the value out of the statement expression while using local labels is a bit tricky but possible:

#define switch_go(type) ({ \
    __label__ l01, l02, l03, _default, defer; \
    static void* jtable[] = { &&l01, &&l02, &&l03 }; \
    volatile int r; \
    if((unsigned int)type > 2) goto _default; \
    goto *jtable[type]; \
    l01: {\
       r = 1; \
       goto defer; \
    } \
    l02: {\
      r = 2; \
      goto defer; \
    } \
    l03: {\
     r = 3; \
     goto defer; \
    } \
    _default: { \
      r = 11; \
      goto defer; \
    } \
    defer: { \
    } \
    r; \
})

Computed GOTO: C++ Statement Expression

According to 6.18.1 Statements and Declarations in Expressions,

In G++, the result value of a statement expression undergoes array and function pointer decay, and is returned by value to the enclosing expression. For instance, if A is a class, then

        A a;

        ({a;}).Foo ()

constructs a temporary A object to hold the result of the statement expression, and that is used to invoke Foo. Therefore the this pointer observed by Foo is not the address of a.

class Foo {
    public:
        Foo() = default;


        Foo(const Foo&) {
            std::cout << "Foo()::COPY" << std::endl;
        }

        Foo(Foo&&) {
            std::cout << "Foo()::MOVE" << std::endl;
        }

        Foo&operator=(const Foo&) {
            std::cout << "Foo::operator=()::COPY" << std::endl;
            return *this;
        }
        Foo&operator=(Foo&&) {
            std::cout << "Foo::operator=()::MOVE" << std::endl;
            return *this;
        }


};

#define switch_go(type) ({ \
    __label__ l01, l02, l03, _default, defer; \
    static void* jtable[] = { &&l01, &&l02, &&l03 }; \
    Foo r; \
    if((unsigned int)type > 2) goto _default; \
    goto *jtable[type]; \
    l01: {\
       r = Foo(); \
       goto defer; \
    } \
    l02: {\
      r = Foo(); \
      goto defer; \
    } \
    l03: {\
     r = Foo(); \
     goto defer; \
    } \
    _default: { \
      r = Foo(); \
      goto defer; \
    } \
    defer: { \
    } \
    r; \
})

// Output:
// ---switch Go---
// Foo::operator=()::MOVE
// Foo()::COPY
// ---switch Go---

Local Label Concern

Local Labels, as the name suggests, are local. Meaning that a jump table is generated per individual block - each time we use our macro. The code above generates one jump table since we are in the loop.

The following isn't ideal unless we have a particular use case for it. For example, using it in extensive loops or function calls is a great trade-off.

int main() {
  __switch_go(value);
  __switch_go(value);
  __switch_go(value);
  return 0;
}

We get three jump tables.

jtable.1:
        .quad   .L124
        .quad   .L125
        .quad   .L126
jtable.2:
        .quad   .L120
        .quad   .L121
        .quad   .L122
jtable.3:
        .quad   .L116
        .quad   .L117
        .quad   .L118

Inlining Switch Case

A similar thing occurs when inlining the switch case. When the compiler finds inappropriate to inline the switch case (and, thus, creating another jump table), it replaces it with the CALL instruction.

🚧 Note

In the case of the __switch_go macro, we are forcing the creation of a jump table each time since we are explicitly using local labels whereas with the switch case inlining - the compiler makes a better decision and doesn't waste the data segment memory with jump tables at all times. This is particularly significant if the jump table is very large.

However, as mentioned if we use the __switch_go only when we need to - it is a great trade-off.

Consider the following code.

int main() {
  {
    printf("_switch_run\n");
    for(int i = 0; i < 10; i++) {
      start_time;

      for(int i = 0; i < 1000000; i++) {
        _switch_run(value);
      }

      end_time;
    }

    _switch_run(value);
    _switch_run(value);
    _switch_run(value);

    printf("_switch_run\n");
  }
  return 0;
}

The _switch_run within the loop generates:

.L125:
        mov     eax, DWORD PTR [rsp+4]
        cmp     eax, 10
        ja      .L111
        jmp     [QWORD PTR .L113[0+rax*8]]
.L113:
        .quad   .L123
        .quad   .L122
        .quad   .L121
        .quad   .L120
        .quad   .L119
        .quad   .L118
        .quad   .L117
        .quad   .L116
        .quad   .L115
        .quad   .L114
        .quad   .L112

Whereas for the three _swith_run functions outside of the loop, we have

_switch_run:
        cmp     edi, 10
        ja      .L36
        mov     edi, edi
        jmp     [QWORD PTR .L38[0+rdi*8]]
.L38:
        .quad   .L48
        .quad   .L47
        .quad   .L46
        .quad   .L45
        .quad   .L44
        .quad   .L43
        .quad   .L42
        .quad   .L41
        .quad   .L40
        .quad   .L39
        .quad   .L37
.L143:
        mov     edi, DWORD PTR [rsp+28]
        call    _switch_run
        mov     edi, DWORD PTR [rsp+28]
        call    _switch_run
        mov     edi, DWORD PTR [rsp+28]
        call    _switch_run
        mov     edi, OFFSET FLAT:.LC3
        call    puts
        mov     edi, OFFSET FLAT:.LC4
        call    puts
        mov     ebp, 10

Customizing _switch_go

🚧 Note

This section is purely experimental and substandard.

You are free to skip it and read on if you are not interested.

What if we can find a way to optimize this implementation even more?

The only instruction that comes to mind is CMP before the assembly gets to the jump table. This has previously been mentioned in the Switch Case in the beginning of this blog.

cmp     DWORD PTR [rbp-20], 10
ja      .L17

Roughly being equivalent to:

if(type > 10) {
    goto L17;
}

Now, if we just remove that if statement - we remove the CMP. This can result in faster code.

void _switch_go(unsigned int v) {

  // As mentioned before, we keep the array preserved between function invocations.
  static void* cases[] = {
    &&l00,
    &&l01,
    &&l02,
    &&l03,
    &&l04,
    &&l05,
    &&l06,
    &&l07,
    &&l08,
    &&l09,
    &&l10,
  };


  // CMP For the default case - also implemented by the switch case
  /*
  if(v > 10) {
    goto _default;
  }
  */

  goto *cases[v];

  volatile int r;
  // ...
}

The output code now contains no CMP instructions.

_switch_go:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        mov     eax, DWORD PTR [rbp-20]
        mov     rax, QWORD PTR cases.1[0+rax*8]
        nop
        jmp     rax

Compared to:

_switch_go:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 10
        ja      .L64
        mov     eax, DWORD PTR [rbp-20]
        mov     rax, QWORD PTR cases.1[0+rax*8]
        nop
        jmp     rax

Benchmark

The benchmark, without the CMP instruction, gives us about 22% faster code with no optimizations.

_switch_go
[CPU Time Used: 0.001753]
[CPU Time Used: 0.001714]
[CPU Time Used: 0.001712]
[CPU Time Used: 0.001713]
[CPU Time Used: 0.001713]
[CPU Time Used: 0.001755]
[CPU Time Used: 0.001713]
[CPU Time Used: 0.001712]
[CPU Time Used: 0.001747]
[CPU Time Used: 0.001711]
_switch_go

🚧 Warning

Removing the if statement and CMP is more performant but the cost on the programmer's side of things is that the programmer has to make their code very strict with covering all the possible values and cases since this eliminates the default case. Note that if the subscript is out of bounds - the programmer is going to end up with Segmentation fault (core dumped) at runtime. All in all, this can lead to premature optimization.

Here is the exact log from valgrind, showing that the program cannot jump to the invalid address.
valgrind -s ./out.out
_switch_go
==49387== Jump to the invalid address stated on the next line
==49387==    at 0x0: ???
==49387==    by 0x1097BC: main (in /src/out.out)
==49387==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==49387== 
==49387== 
==49387== Process terminating with default action of signal 11 (SIGSEGV)
==49387==  Bad permissions for mapped region at address 0x0
==49387==    at 0x0: ???
==49387==    by 0x1097BC: main (in /src/out.out)

Default Case: Can we do without it?

🚧 Note

This section is purely experimental and substandard.

You are free to skip it and read on if you are not interested.

In the Customizing _switch_go section, we talk about the CMP instruction that determines whether the value is out of bounds and jumps to the default case. To optimize even more, we removed that instruction by implementing a switch case of our own using labels but let's attempt to do the same for the switch statement. If you are still confused, please make sure to read Implementing Switch Case.

Our strategy here is to cover all the enumerations values so that the default case does not exist and, perhaps, no CMP instruction.

enum N {
    N0,
    N1,
    N2,
    N3,
    N4,
    N5,
    N6,
    N7
};

void _switch_default_case(enum N type) {
    volatile int r;

    switch(type) {
        case N0: {
            r = 0;
            break;
        }
        case N1: {
            r = 1;
            break;
        }
        case N2: {
            r = 2;
            break;
        }
        case N3: {
            r = 3;
            break;
        }
        case N4: {
            r = 4;
            break;
        }
        case N5: case N6: case N7: {
            r = 1000;
            break;
        }
    }
}

Even with practically covering all of the enumerations, we still end up with the CMP.

cmp     DWORD PTR [rbp-20], 7
ja      .L22

Without this CMP, the code is too obscure (prone to security breach) where someone can technically use unsigned int to get the subscript out of bounds. This results in Segmentation fault (core dumped) at runtime. This is explained in great detail in the Computed GOTO: Goto Jump Table with Labels and Customizing _switch_go.

_switch_default_case:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        cmp     DWORD PTR [rbp-20], 7
        ja      .L22
        mov     eax, DWORD PTR [rbp-20]
        mov     rax, QWORD PTR .L16[0+rax*8]
        jmp     rax
.L16:
        .quad   .L21
        .quad   .L20
        .quad   .L19
        .quad   .L18
        .quad   .L17
        .quad   .L15
        .quad   .L15
        .quad   .L15
.L21:
        mov     DWORD PTR [rbp-4], 0
        jmp     .L14
.L20:
        mov     DWORD PTR [rbp-4], 1
        jmp     .L14
.L19:
        mov     DWORD PTR [rbp-4], 2
        jmp     .L14
.L18:
        mov     DWORD PTR [rbp-4], 3
        jmp     .L14
.L17:
        mov     DWORD PTR [rbp-4], 4
        jmp     .L14
.L15:
        mov     DWORD PTR [rbp-4], 1000
        nop
.L14:
.L22:
        nop
        pop     rbp
        ret

Also note that for:

case N5: case N6: case N7: {
    r = 1000;
    break;
}

The GCC compiler recognizes the same label (.L15) and uses it to fill the three entries in the jump table:

.L16:
        .quad   .L21
        .quad   .L20
        .quad   .L19
        .quad   .L18
        .quad   .L17
        .quad   .L15
        .quad   .L15
        .quad   .L15

Conclusion

Although there is a lot to unpack in this blog, the best approach is to use the switch statement, as the GCC compiler has a hard time optimizing the high-level C implementations as discussed above. If you're determined to optimize further and want to remove any redundant ASM instructions that may impact your code, consider using 6.17.2 Extended Asm | Assembler Instructions with C Expression Operands, which allows you to craft your assembly from the ground up - untouched by the optimizations such as -O1, -O2, or -O3.

If you are interested, we can try to do that as a follow-up blog. But keep in mind that your codebase then becomes a nightmare to maintain, especially with nested switch statements.

Thank you for reading! Feel free to leave any feedback in the comments!

References

Built for developers, by developers.

Whether you're building a simple prototype or a business-critical product, Heroku's fully-managed platform gives you the simplest path to delivering apps quickly — using the tools and languages you already love!

Learn More

S2S Compilers: Understanding Switch Case Statements

Table Of Contents

About Author

Introduction

Motivation

Source Code

Prerequisites

Background

GCC Options

Headers and Macros

Switch case VS If-else

If-else statement

Switch Case

Benchmarking the two

Out-of-Bounds Benchmark

Turning a switch case into a jump table

Is there a limit for the values of a switch case?

Breaking switch case: Non-packed values

Multiple Jump Tables

If-else as a jump table

Implementing Switch Case

Jump Table with function pointers

Implementation

Benchmark Result 1

Benchmark Result 2

Computed GOTO: Goto Jump Table with Labels

No optimization flags

With -O2 optimizations

Switch Case with -O1 optimizations

No Inline

Inlining Computed GOTO

Switch Go Macro

Computed GOTO: C++ Statement Expression

Local Label Concern

Inlining Switch Case

Customizing _switch_go

Benchmark

Default Case: Can we do without it?

Conclusion

References

Built for developers, by developers.

A Workflow Copilot. Tailored to You.

Okay

With `-O2` optimizations

Switch Case with `-O1` optimizations