During the last couple of months I learned a lot about what it takes to optimize code for speed. However, I noticed that most of this knowledge was fairly abstract and it was time to put it to the test. Therefore, I picked a simple and well-defined function that I wanted to optimize to the best of my current ability. This exercise turned out to be very interesting, which is why I'm writing my first post about it.

The following topics are explored:

- How to process multiple bytes at the same time using larger integer types and SIMD instructions?
- How making assumptions about the input can help writing faster code?
- How giving the compiler more information can help it produce more efficient code?
- How the CPU branch prediction can become a bottleneck?
- How to deal with the case when a single simple branch becomes a bottleneck?
- How to combine different algorithms to be efficient in more cases?

# Task

The task is to **find the sorted indices of all non-zero bytes in a buffer**.

More specifically, the function has the following signature:

```
uint32_t find_non_zero_indices(uint8_t *in_begin,
uint8_t *in_end,
uint32_t *out_begin);
```

The first two parameters specify the input byte buffer. The third parameter points to the position where the output should be stored. The function assumes that the output buffer is large enough. The returned value indicates how many non-zero bytes have been found.

For example, when the input is the buffer `0 0 1 0 1 0 1 1 0`

(whereby `0`

is 0x00 and `1`

is `0x01`

), the output buffer should contain `2 4 6 7`

.

I want to optimize single thread performance. All memory allocation is done outside of this function.

# Input Data

In my tests I am using a byte array with 10,000,000 elements. The array contains randomly distributed zeros and ones. The proportion will vary between different tests. I'll start optimizing the case when there are very few non-zeros. Later other distributions will be checked. However, I limit myself to uniform distribution of specific amounts of non-zeros. I'm aware that other distributions might need different algorithms.

# Baseline

Before doing any optimization, you should always measure how fast the simplest implementation is. Otherwise you won't know if it actually became faster.

```
uint32_t find_non_zero_indices__baseline(uint8_t * in_begin,
uint8_t *in_end,
uint32_t * out_begin)
{
uint32_t *out_current = out_begin;
for (uint8_t *current = in_begin; current != in_end; current++) {
if (*current != 0) {
uint32_t index = current - in_begin;
*out_current++ = index;
}
}
return out_current - out_begin;
}
```

This code simply iterates over the input buffer. Whenever a non-zero element is found, its index is written into the output buffer. Note, the result of subtracting two pointers is the number of elements of the specific type between the pointers.

This code takes about 6 ms to iterate over all bytes to figure out that there are only zeros. That is 0.6 ns per element in average, or about 2 CPU cycles on a 3 GHz processor. Now, let's see how we can be faster when there are mostly zeros.

# Grouping Elements

The first optimization will make use of a simple trick. Instead of comparing every byte to zero individually, we can easily compare 2, 4 or 8 bytes to zero at the same time. This dramatically cuts down the number of comparisons in the benchmark.

To simplify our life a bit, we assume that the length of the input array is divisible by the size of the integer type we use. Also, we assume that the code is running on a little-endian system. Otherwise the order of the bytes might have to be changed.

## Groups of Two

In the simplest scenario we use `uint16_t`

instead of `uint8_t`

to compare two bytes to zero at the same time.

```
uint32_t find_non_zero_indices__grouped_2(
uint8_t * in_begin,
uint8_t *in_end,
uint32_t * out_begin)
{
assert((in_end - in_begin) % 2 == 0);
uint32_t *out_current = out_begin;
for (uint8_t *current = in_begin; current != in_end; current += 2) {
uint16_t group = *(uint16_t *)current;
if (group != 0) {
uint32_t index = current - in_begin;
if (current[0]) {
*out_current++ = index;
}
if (current[1]) {
*out_current++ = index + 1;
}
}
}
return out_current - out_begin;
}
```

Now, when a group is zero, we still have to figure out which of the two bytes is actually zero (or both). However, when we assume that the number of non-zeros is small, this optimization is well worth it. By just using groups of two, the execution time halved right away.

## Larger Groups

On a 64 bit CPU, we can easily extend the same concept to groups of four and eight using `uint32_t`

and `uint64_t`

respectively.

```
static uint32_t find_non_zero_indices__grouped_4(
uint8_t * in_begin,
uint8_t *in_end,
uint32_t * out_begin)
{
assert((in_end - in_begin) % 4 == 0);
uint32_t *out_current = out_begin;
for (uint8_t *current = in_begin; current != in_end; current += 4) {
uint32_t group = *(uint32_t *)current;
if (group != 0) {
for (uint32_t i = 0; i < 4; i++) {
if (current[i] != 0) {
*out_current++ = (current - in_begin) + i;
}
}
}
}
return out_current - out_begin;
}
static uint32_t find_non_zero_indices__grouped_8(
uint8_t * in_begin,
uint8_t *in_end,
uint32_t * out_begin)
{
assert((in_end - in_begin) % 8 == 0);
uint32_t *out_current = out_begin;
for (uint8_t *current = in_begin; current != in_end; current += 8) {
uint64_t group = *(uint64_t *)current;
if (group != 0) {
for (uint32_t i = 0; i < 8; i++) {
if (current[i] != 0) {
*out_current++ = (current - in_begin) + i;
}
}
}
}
return out_current - out_begin;
}
```

In the execution time almost halves every time. That also indicates that we are currently CPU and not memory bound. If the performance of the baseline version was bound by the memory bandwidth, we would not be able to see this speedup. This is great, because it gives us more room for improvement.

The next chapter will show how even larger groups can be used by making use of SIMD instructions.

# Larger Groups with SIMD

Single Instruction Multiple Data (SIMD) is a concept in which the same operation is performed on more than one piece of data at the same time. For example, many modern CPUs have special hardware allowing them to do 4 additions at the same time in a single thread.

Compilers can generate these instructions when the correct flags are passed to it. However, to make use of these instructions, the compiler also has to understand when "vectorization" can be performed. Optimizing compilers are able to detect it in many cases, but for this exercise I don't want to rely on the compiler for this. Instead I'm explicitly using SIMD instructions in the C++ code.

For this and the following sections I assume that the code is running on a relatively modern x86-64 CPU. Otherwise, the instructions might not exist.

## Using SIMD Instructions

To tell the CPU that some operations should be performed on multiple elements, we have to use specific types that can hold multiple values of the same type. For example, the `__m128i`

type can hold two `uint64_t`

, or four `uint32_t`

, all the way down to 16 `uint8_t`

. Types like `__m128i`

and `__m256i`

are just normal structs. Their main purpose is to help the developer enforce type constraints and alignment requirements.

Depending on the CPU, the largest SIMD type you can use is 64, 128, 256 or even 512 bits large. The largest type I can use has 256 bits.

Here is a simple example showing the basic usage of SIMD instructions. There are many libraries that simplify working with SIMD a bit. However, for this exercise, I won't use those libraries.

```
#include <emmintrin.h>
/* Initialize two 128 bit registers with four 32 bit integers each. */
__m128i a = _mm_set_epi32(1, 2, 3, 4);
__m128i b = _mm_set_epi32(30, 40, 50, 60);
/* Do four additions with a single instruction. */
__m128i c = _mm_add_epi32(a, b);
/* Store the result in a normal C array. */
/* Note: This only works when pointer is 16 byte aligned. */
int32_t result[4];
_mm_store_si128((__m128i *)result, c);
/* Note: The order of elements is reversed.
* That's just something to keep in mind when using _mm_set_XXX.
* result[0] = 64
* result[1] = 53
* result[2] = 42
* result[3] = 31 */
```

A full list of instructions can be found in the Intel Intrinsics Guide.

## Groups of 32

I'll skip the implementation using groups of 16 and jump directly to groups of 32. To operate on 32 byte at the same time, we have to use `__m256i`

. Now, the task is to figure out if there is a non-zero in a `__m256i`

. I found the following approach to work well.

```
__m256i group = ...;
/* Initialize with zeros to compare against. */
__m256i zero_bytes = _mm256_set1_epi8(0);
/* Create a mask that contains 0xFF for every byte
* that is zero and 0x00 for the others. */
__m256i is_zero_byte_mask = _mm256_cmpeq_epi8(group, zero_bytes);
/* Take one bit of every byte to create bit mask. */
uint32_t is_zero_mask = _mm256_movemask_epi8(is_zero_byte_mask);
/* Invert the bit mask. Now zero means there was a zero
* and one means there was a non-zero. */
uint32_t is_non_zero_mask = ~is_zero_mask;
/* This comparison now checks all 32 bytes at once. */
if (is_non_zero_mask != 0) {
...
}
```

Here is how the content of the variables might look like. The first three rows are in hex while the last two rows are binary.

```
group 00 00 01 00 01 01 00 01 ...
zero_bytes 00 00 00 00 00 00 00 00 ...
is_zero_byte_mask FF FF 00 FF 00 00 FF 00 ...
is_zero_mask 1 1 0 1 0 0 1 0 ...
is_non_zero_mask 0 0 1 0 1 1 0 1 ...
```

By putting everything together, we end up with this algorithm. I kept the inner loop simple for now. It will be optimized later.

```
uint32_t find_non_zero_indices__grouped_32(
uint8_t * in_begin,
uint8_t *in_end,
uint32_t * out_begin)
{
assert((in_end - in_begin) % 32 == 0);
uint32_t *out_current = out_begin;
__m256i zeros = _mm256_set1_epi8(0);
for (uint8_t *current = in_begin; current != in_end; current += 32) {
__m256i group = _mm256_loadu_si256((__m256i *)current);
__m256i is_zero_byte_mask = _mm256_cmpeq_epi8(group, zeros);
uint32_t is_zero_mask = _mm256_movemask_epi8(is_zero_byte_mask);
uint32_t is_non_zero_mask = ~is_zero_mask;
if (is_non_zero_mask != 0) {
for (uint32_t i = 0; i < 32; i++) {
if (current[i] != 0) {
*out_current++ = (current - in_begin) + i;
}
}
}
}
return out_current - out_begin;
}
```

The algorithms that use grouping are much faster compared to the baseline as one would expect.

One thing all algorithms have in common is that they get slower the more non-zeros there are. Why is that? As me, you might be thinking that it is obvious, because the algorithm has to do more when there are more non-zeros. As you'll see in the next section that is not the reason, not at all.

# Other Data Distributions

In the previous sections we developed an algorithm that was very fast when the array consists mostly of zeros. In this chapter we will take at look how it performs when the percentage of ones is larger.

When running the benchmark on input arrays with more ones, we get the following picture.

Four aspects of this graph are interesting.

- The execution time is the slowest when there are 5,000,000 non-zeros.
- The max of the Groups of 2 algorithm is at another point than for the other algorithms.
- When almost every byte is a non-zero, the running time of the baseline algorithm is pretty much exactly the same on the left and right side of the graph.
- The algorithms that employ grouping are faster than the baseline algorithm, even though their code suggests that they are doing more work.

**1. Peak at 5,000,000.**

What we see here is how bad the performance can get, when the CPU is not able to use branch prediction effectively. At 5,000,000 non-zeros, every byte has a 50% change of being a non-zero, there is no way to predict this. On the other hand, when there are more or less than 5,000,000 non-zeros, the probability that a good branch predictor predicts correctly goes up; resulting in better overall running time. We will deal with this problem in a later section.

What I find most fascinating about this graph is just how clearly one can see the branch predictor getting worse at first and then getting better again.

**2. Different peak of Groups of 2 algorithm.**

This is most probably caused by the fact that this algorithm has another branch that is checked very often. The worst time for the branch that checks whether two consecutive bytes are zero is when 1/3 of all bytes are non-zero (I think?). Mixing this with the bad predictability of the other branches at 5,000,000 non-zeros, we get a worst case at about 4,000,000 non-zeros. Feel free to calculate that number exactly.

**3. Same running time at start and end.**

From the grouping algorithms we know already that the baseline algorithm is bound by the CPU and not by the memory read bandwidth. Otherwise, we would not be able to become faster when processing elements faster.

The fact that the running time is still the same on the right side shows us, that the write bandwidth is not limiting us either. Otherwise writing the entire output array with 40 MB (`10,000,000 * sizeof(uint32_t)`

) would have to result in a slowdown.

**4. Baseline is still slower at end.**

Let's take the Groups of 4 as an example. Why is it faster than the baseline algorithm? It has the same inner loop and an additional check every 4 elements. So in theory it should do more work?

I think it is faster because we actually gave the compiler more information. The inner loop has a constant number of iterations. This allows the compiler to unroll it more easily, producing better and faster code.

We can see that this is indeed the case by running the benchmark with less compiler optimizations (this is with `-O1`

instead of `-O3`

). In this case the algorithms doing more work are actually slower.

# Branch Prediction

In the last section, we have seen the importance of branch prediction and what impact it can have on performance. One approach to help the CPU to be faster is to have fewer or no branches.

## Branch Removal

Compilers try to remove branches in many cases. Sometimes they convert branches into arithmetic expressions. For example clang does the following transformation to get rid of a branch.

```
int original(int a) {
return (a >= 0) ? a : a + 1;
}
int generated(int a) {
// Instead of doing a branch, add the most significant bit to a.
// When a is negative, this bit will be 1, otherwise it is 0.
return a + ((unsigned int)a >> 31);
}
```

Another method is to use special CPU instructions which do a conditional assignment without a branch. However, the compiler cannot find every opportunity to remove branches, because it must not change the semantic of the code. Fortunately, as programmers we know the problem we are solving better and can sometimes change the semantics without affecting the final output.

## Branchless Algorithm

Okay, we cannot remove the outer loop in our algorithms which necessarily contains a branch. Fortunately, that branch is not an issue for us, because it will be predicted incorrectly only once at the end.

I'm not aware of a generic approach to writing branchless algorithms. It just takes some time to move instructions around a bit. That way I removed the inner branch from the baseline algorithm.

```
uint32_t find_non_zero_indices__branch_free(
uint8_t * in_begin,
uint8_t *in_end,
uint32_t * out_begin)
{
uint32_t *out_current = out_begin;
uint32_t amount = in_end - in_begin;
for (uint32_t index = 0; index < amount; index++) {
*out_current = index;
bool is_non_zero = in_begin[index] != 0;
out_current += is_non_zero;
}
return out_current - out_begin;
}
```

To make this algorithm work, we have to assume that the output buffer is at least one element larger than the total number of non-zeros. This algorithm writes every index to the output array, but only increments the `out_current`

variable when the value was a non-zero.

There are three main things to note in this graph:

- The branchfree algorithm has the same running time regardless of the input.
- It is much faster when the branch prediction fails for the other algorithms.
- It is slower in the extreme cases. Especially it is much worse when the number of non-zeros is small, which is the case I care about the most.

# Combining Algorithms

We have used two optimization approaches by now: grouping and branch removal. Both have strong and weak characteristics. Naturally, we'd like to have an algorithm that combines the strong properties of both.

The following algorithm combines the 32 byte grouping with the branchfree algorithm from before.

```
uint32_t find_non_zero_indices__grouped_branch_free(
uint8_t * in_begin,
uint8_t *in_end,
uint32_t * out_begin)
{
assert((in_end - in_begin) % 32 == 0);
uint32_t *out_current = out_begin;
__m256i zeros = _mm256_set1_epi8(0);
for (uint8_t *current = in_begin; current != in_end; current += 32) {
/* Check if an entire group contains a non-zero. */
__m256i group = _mm256_loadu_si256((__m256i *)current);
__m256i is_zero_byte_mask = _mm256_cmpeq_epi8(group, zeros);
uint32_t is_zero_mask = _mm256_movemask_epi8(is_zero_byte_mask);
uint32_t is_non_zero_mask = ~is_zero_mask;
if (is_non_zero_mask != 0) {
/* If there is a non-zero, use the branch free algorithm. */
uint32_t start_index = current - in_begin;
uint32_t end_index = start_index + 32;
for (uint32_t index = start_index; index < end_index; index++) {
*out_current = index;
bool is_non_zero = in_begin[index] != 0;
out_current += is_non_zero;
}
}
}
return out_current - out_begin;
}
```

The resulting running times are as expected. When there are mostly zeros, it is as fast as the Groups of 32 algorithm. When there are more ones, the performance is similar to the branch free algorithm. It is just a little bit slower because now it is really doing more work for every 32 byte block.

# Improving the Branchless Algorithm

This is the old branchless algorithm again:

```
uint32_t start_index = current - in_begin;
uint32_t end_index = start_index + 32;
for (uint32_t index = start_index; index < end_index; index++) {
*out_current = index; /* <- memory write */
bool is_non_zero = in_begin[index] != 0; /* <- memory read */
out_current += is_non_zero;
}
```

As you can see there are two memory accesses for every element in the byte array. We cannot really get rid of the first one, since this is the heart of the branchless algorithm. However, the second access can be avoided since the information is in a CPU register already. It is stored in the `is_non_zero_mask`

that was computed before. We can just access the bit for the current index instead of doing a memory access.

```
uint32_t start_index = current - in_begin;
for (uint32_t offset = 0; offset < 32; offset++) {
*out_current = start_index + offset;
out_current += (is_non_zero_mask & (1 << offset)) >> offset;
}
```

This can further be improved with the following modification.

```
uint32_t index = current - in_begin;
for (uint32_t i = 0; i < 32; i++) {
*out_current = index;
out_current += is_non_zero_mask & 1;
index++;
is_non_zero_mask >>= 1;
}
```

Another benefit of this change is that the loop is quite easy to unroll for the compiler now.

# Mostly Ones

The first optimization we did was for the case when there are almost only zeros. Now it is time to optimize the case when there are almost only ones. We can't really hope for the same performance, because the output array still has to be filled. We can do better than the current version though. The main idea is to take a special code path when the mask contains ones only. In this case, no further comparisons have to be performed.

```
if (is_non_zero_mask != 0xFFFFFFFF) {
uint32_t index = current - in_begin;
for (uint32_t i = 0; i < 32; i++) {
*out_current = index;
out_current += is_non_zero_mask & 1;
index++;
is_non_zero_mask >>= 1;
}
}
else {
uint32_t index_start = current - in_begin;
for (uint32_t i = 0; i < 32; i++) {
out_current[i] = index_start + i;
}
out_current += 32;
}
```

We could also use SIMD instructions to optimize this new loop, but the compiler is probably smart enough to it for us. Even if not, this loop is very predictable, so the CPU should be able to run it quickly anyway.

Putting everything together, and changing the mask handling a little bit, we get his:

```
uint32_t find_non_zero_indices__mostly_ones(
uint8_t * in_begin,
uint8_t *in_end,
uint32_t * out_begin)
{
assert((in_end - in_begin) % 32 == 0);
uint32_t *out_current = out_begin;
__m256i zeros = _mm256_set1_epi8(0);
for (uint8_t *current = in_begin; current != in_end; current += 32) {
__m256i group = _mm256_loadu_si256((__m256i *)current);
__m256i is_zero_byte_mask = _mm256_cmpeq_epi8(group, zeros);
uint32_t is_zero_mask = _mm256_movemask_epi8(is_zero_byte_mask);
if (is_zero_mask != 0xFFFFFFFF) {
if (is_zero_mask != 0x00000000) {
/* Group has zeros and ones, use branchless algorithm. */
uint32_t is_non_zero_mask = ~is_zero_mask;
uint32_t index = current - in_begin;
for (uint32_t i = 0; i < 32; i++) {
*out_current = index;
out_current += is_non_zero_mask & 1;
index++;
is_non_zero_mask >>= 1;
}
}
else {
/* All ones, set output directly. */
uint32_t index_start = current - in_begin;
for (uint32_t i = 0; i < 32; i++) {
out_current[i] = index_start + i;
}
out_current += 32;
}
}
}
return out_current - out_begin;
}
```

As you can see, the performance improved a bit when there are many ones. However, we do pay for that with an additional branch for every 32 elements, so the algorithm became a bit slower for the other cases.

# Counting Bits

We can start to see a more general pattern now. First we create a mask and then we pick different algorithms depending on it. So far we use three different methods to handle groups of 32 byte under different circumstances: only zeros, only ones and when there is a mix of both.

We already optimized the extreme cases, the normal case still iterates over all elements every time, though. Next we want to optimize the algorithm for the case when there is only one or two non-zeros in a group. For that we first have to know how many non-zeros are actually in the mask. Fortunately, there is handy intrinsic called `_mm_popcnt_u32`

which counts the ones in a 32 bit integer. Since our mask is inverted currently, we'll actually count the number of zeros instead.

```
uint32_t zero_amount = _mm_popcnt_u32(is_zero_mask);
```

To optimize the cases with one or two bits in a group, special cpu instructions can be used again. How to use these instructions depends on which platform you are on and which compiler you use. Sometimes I had to use `_BitScanForward/_BitScanReverse`

and sometimes `__builtin_ctz/__builtin_clz`

. They give us the highest and lowest set bit in a 32 bit integer. This way we do not need the loop at all.

```
/* Group has zeros and ones. */
uint32_t zero_amount = _mm_popcnt_u32(is_zero_mask);
uint32_t is_non_zero_mask = ~is_zero_mask;
uint32_t index_start = current - in_begin;
if (zero_amount == 31) {
/* Only a single non-zero. */
uint32_t index_start = current - in_begin;
uint32_t index_offset = find_lowest_set_bit_index(
is_non_zero_mask);
*out_current = index_start + index_offset;
out_current++;
}
else if (zero_amount == 30) {
/* Exactly two non-zeros. */
uint32_t index_start = current - in_begin;
uint32_t index_offset1 = find_lowest_set_bit_index(
is_non_zero_mask);
uint32_t index_offset2 = find_highest_set_bit_index(
is_non_zero_mask);
out_current[0] = index_start + index_offset1;
out_current[1] = index_start + index_offset2;
out_current += 2;
}
else {
/* More than two non-zeros. */
uint32_t index = index_start;
for (uint32_t i = 0; i < 32; i++) {
*out_current = index;
out_current += is_non_zero_mask & 1;
index++;
is_non_zero_mask >>= 1;
}
}
```

As you can see there is a big win when the number of non-zeros is small. However, we pay for that with a little bit slower code when there are many non-zeros.

# Future Improvements

One interesting improvement that can be made is to iterate over all ones in a bitset directly. This can be very efficient when there are e.g. 5 non-zeros in every 32 byte block. However, iterating over the set bits is more expensive when the number of non-zeros is large. In that case, the branchless algorithm or even the baseline algorithm might be best again. I tested this approach with different thresholds.

To pick the right threshold, further investigation is necessary. Furthermore, other data distributions (other than uniform distribution) should be analysed.

# The End

That is it for now. I hope you found some aspects of that exercise interesting. Did anyone do a similar exercise before?

Constructive feedback on the style of this post is welcome as well because this is my first.

Posted on by:

## Discussion