It all started when I was watching a video about optimization and came across a familiar textbook example demonstrating the importance of branch prediction.
The idea is simple: when a function contains a branch and the branch predictor guesses wrong, many CPU cycles are wasted. The optimized, branchless version always computes both results but avoids branch misprediction penalties.
I had some time to kill, so I decided to test how this actually works in practice. I used Google Benchmark and wrote identical tests for the two functions: BM_GetProduct1 benchmarks the branching version, BM_GetProduct2 the branchless (the "optimized") one.
Unexpected Results
When I ran the benchmark, the results surprised me.
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
BM_GetProduct1/16 9.12 ns 9.11 ns 62193800
BM_GetProduct1/64 36.6 ns 36.5 ns 19171034
BM_GetProduct1/512 418 ns 417 ns 1682832
BM_GetProduct1/4096 3605 ns 3598 ns 196197
BM_GetProduct1/32768 104858 ns 104677 ns 6754
BM_GetProduct1/65536 211673 ns 211332 ns 3209
BM_GetProduct2/16 9.22 ns 9.21 ns 75412476
BM_GetProduct2/64 44.7 ns 44.7 ns 15766219
BM_GetProduct2/512 437 ns 437 ns 1613444
BM_GetProduct2/4096 3592 ns 3587 ns 195040
BM_GetProduct2/32768 28929 ns 28882 ns 24380
BM_GetProduct2/65536 57684 ns 57610 ns 12128
The supposedly slow version matches or even beats the "optimized" one for arrays of up to 4096 elements, and only falls behind on the largest inputs! But why?
Let's dig in!
Here's the branching version:
double get_product1(std::span<const double> input, double threshold) {
  double p = 1.0;
  for (auto i : input) {
    if (i < threshold) {
      p *= 2 * i;
    } else {
      p *= 1.5 * i;
    }
  }
  return p;
}
The branchless version always computes both results (which is its drawback), but it doesn't break the CPU pipeline — there's no conditional jump for the predictor to guess:
double get_product2(std::span<const double> input, double threshold) {
  double p = 1.0;
  for (auto i : input) {
    std::array arr = {1.5 * i, 2.0 * i};
    p *= arr[i < threshold];
  }
  return p;
}
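As an aside, the same selection can be written purely arithmetically instead of indexing a tiny array. The sketch below is mine, not code from the video; it relies on the comparison result converting to 0 or 1:

// A hypothetical arithmetic variant of the same branchless idea:
// the comparison result (0 or 1) selects the multiplier, so there is
// still no data-dependent jump for the predictor to guess.
double get_product2_alt(std::span<const double> input, double threshold) {
  double p = 1.0;
  for (auto i : input) {
    // factor is 2.0 when i < threshold (the bool converts to 1.0), 1.5 otherwise
    const double factor = 1.5 + 0.5 * (i < threshold);
    p *= factor * i;
  }
  return p;
}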
The benchmarks look identical for both functions, so I'll show just one:
static void BM_GetProduct1(benchmark::State &state) {
  const auto size = state.range(0);
  auto data = generateData(size);
  double p = 0.0;
  for (auto _ : state) {
    benchmark::ClobberMemory();
    benchmark::DoNotOptimize(p += get_product1(data, 0.0));
  }
}

BENCHMARK(BM_GetProduct1)->Range(16, 16 * 4096);
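The second benchmark is registered the same way; for completeness, here's the remaining boilerplate (the BENCHMARK_MAIN part is my assumption about how the file ends, not shown in the original listing):

// Registration for the second benchmark plus the standard
// Google Benchmark entry point.
BENCHMARK(BM_GetProduct2)->Range(16, 16 * 4096);
BENCHMARK_MAIN();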
The data is generated randomly, so roughly half the values are positive and half are negative:
std::vector<double> generateData(int size) {
  std::random_device rd;
  std::mt19937_64 gen(rd());
  std::uniform_real_distribution<double> dist(-1000.0, 1000.0);
  std::vector<double> data;
  data.resize(size);
  for (auto &d : data) {
    d = dist(gen);
  }
  return data;
}
So get_product1 receives random data, which means the branch predictor should only guess correctly about 50% of the time. But why does the branching version win on small data sizes and then suddenly fall behind?
One might conclude that it depends on the vector length... but both versions receive the same data... Maybe there's some overhead that doesn't scale linearly?
The Sorted Data Experiment
Let's try sorting the array and running the test again!
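The sorted run only needs one extra line in the setup. Here's a minimal sketch of what I mean (the BM_GetProduct1Sorted name is just for illustration; the results below keep the original benchmark names):

// Sketch of the sorted-data variant: identical to the benchmark above,
// except the data is sorted once before the timed loop (needs <algorithm>).
static void BM_GetProduct1Sorted(benchmark::State &state) {
  const auto size = state.range(0);
  auto data = generateData(size);
  std::sort(data.begin(), data.end());  // values below the threshold now come first
  double p = 0.0;
  for (auto _ : state) {
    benchmark::ClobberMemory();
    benchmark::DoNotOptimize(p += get_product1(data, 0.0));
  }
}
BENCHMARK(BM_GetProduct1Sorted)->Range(16, 16 * 4096);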
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
BM_GetProduct1/16 8.96 ns 8.94 ns 80669588
BM_GetProduct1/64 35.8 ns 35.7 ns 19709667
BM_GetProduct1/512 422 ns 422 ns 1388010
BM_GetProduct1/4096 3563 ns 3559 ns 196867
BM_GetProduct1/32768 28644 ns 28614 ns 24503
BM_GetProduct1/65536 57497 ns 57432 ns 12093
BM_GetProduct2/16 9.04 ns 9.03 ns 76644990
BM_GetProduct2/64 44.4 ns 44.3 ns 15828673
BM_GetProduct2/512 433 ns 433 ns 1613180
BM_GetProduct2/4096 3578 ns 3574 ns 196333
BM_GetProduct2/32768 28688 ns 28650 ns 24450
BM_GetProduct2/65536 57435 ns 57345 ns 12213
The branching version now performs as well as, or even better than, the "optimized" version. With sorted data, the function first processes all values below the threshold (the if branch), then all values above it (the else branch), so the branch predictor mispredicts only once, at the transition point.
But why does get_product1 still win on unsorted random data when the array size is less than 4K?
The Real Culprit
Let's recall how the branch predictor works. It has a history buffer where it remembers previous branch outcomes and identifies patterns. For example, "always true" or a pattern like "true-true-false". But in our case, there are no patterns because the data is random!
Or is it?
The answer is that Google Benchmark runs millions of iterations with the same data, so the branch predictor simply learns the sequence! When we start a new iteration, it "recognizes" the sequence and starts choosing the correct branch based on its "experience". But if the sequence is too long, it can't memorize it, and the data becomes random again from its perspective.
The Fixed Benchmark
To verify this hypothesis, let's ensure that short sequences don't repeat across iterations:
static void BM_GetProduct1(benchmark::State &state) {
  // max_data_size is the largest benchmarked size, defined elsewhere.
  auto data = generateData(max_data_size);
  const auto size = state.range(0);
  // Slice the big buffer into non-overlapping chunks of the requested size.
  std::vector<std::span<double>> sets;
  for (int i = 0; i + size <= max_data_size; i += size) {
    sets.push_back(std::span(&data[i], size));
  }
  double p = 0.0;
  std::size_t i = 0;
  for (auto _ : state) {
    benchmark::ClobberMemory();
    // Rotate through the chunks so the predictor never sees the same
    // short sequence of branches twice in a row.
    benchmark::DoNotOptimize(p += get_product1(sets[i], 0.0));
    if (++i >= sets.size()) {
      i = 0;
    }
  }
}
And here are the results:
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
BM_GetProduct1/16 58.1 ns 57.9 ns 11410584
BM_GetProduct1/64 214 ns 214 ns 3277196
BM_GetProduct1/512 1649 ns 1646 ns 419308
BM_GetProduct1/4096 13133 ns 13118 ns 52947
BM_GetProduct1/32768 104506 ns 104381 ns 6656
BM_GetProduct1/65536 208442 ns 208213 ns 3307
BM_GetProduct2/16 11.0 ns 11.0 ns 63463067
BM_GetProduct2/64 46.4 ns 46.3 ns 15161975
BM_GetProduct2/512 438 ns 438 ns 1608315
BM_GetProduct2/4096 3589 ns 3584 ns 195693
BM_GetProduct2/32768 28825 ns 28785 ns 24373
BM_GetProduct2/65536 57613 ns 57544 ns 12100
Now that's a different story! I almost concluded that branching is better for small data! The branchless version is definitively faster across all data sizes, as long as the data is truly random.
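Another way to break the repetition, by the way, is to regenerate the data inside the loop with the timer paused. I'll sketch it here for reference only: PauseTiming/ResumeTiming have noticeable per-iteration overhead of their own, which would distort the measurements for the smallest arrays, so the slicing approach above is preferable.

// Alternative sketch: fresh random data on every iteration, generated with
// the timer paused. Pause/Resume add overhead per iteration, so this skews
// results for very small sizes.
static void BM_GetProduct1Regen(benchmark::State &state) {
  const auto size = state.range(0);
  double p = 0.0;
  for (auto _ : state) {
    state.PauseTiming();
    auto data = generateData(size);  // nothing for the predictor to memorize
    state.ResumeTiming();
    benchmark::DoNotOptimize(p += get_product1(data, 0.0));
  }
}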
Conclusion
Here's a comparison of the BM_GetProduct1 timings from the original and the fixed benchmark, showing how misleading an incorrect benchmark can be:
| Size | Apparent time (original benchmark) | Actual time (fixed benchmark) | Error |
|---|---|---|---|
| 16 | 9.12 ns | 58.1 ns | 6.4x |
| 64 | 36.6 ns | 214 ns | 5.8x |
| 512 | 418 ns | 1649 ns | 3.9x |
| 4096 | 3605 ns | 13133 ns | 3.6x |
| 32768 | 104858 ns | 104506 ns | ~1x |
| 65536 | 211673 ns | 208442 ns | ~1x |
The takeaway: be careful with benchmark results — sometimes your benchmark measures itself, not your code.