It all started when I was watching a video about optimization and came across a familiar textbook example demonstrating the importance of branch prediction.
The idea is simple: when a function contains a branch and the branch predictor guesses wrong, many CPU cycles are wasted. The optimized, branchless version always computes both results but avoids branch misprediction penalties.
I had some time to kill, so I decided to test how this actually works in practice. I used Google Benchmark and wrote identical tests for the two functions: BM_GetProduct1 benchmarks the branching version, BM_GetProduct2 the branchless (the "optimized") one.
Unexpected Results
When I ran the benchmark, the results surprised me.
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
BM_GetProduct1/16 9.12 ns 9.11 ns 62193800
BM_GetProduct1/64 36.6 ns 36.5 ns 19171034
BM_GetProduct1/512 418 ns 417 ns 1682832
BM_GetProduct1/4096 3605 ns 3598 ns 196197
BM_GetProduct1/32768 104858 ns 104677 ns 6754
BM_GetProduct1/65536 211673 ns 211332 ns 3209
BM_GetProduct2/16 9.22 ns 9.21 ns 75412476
BM_GetProduct2/64 44.7 ns 44.7 ns 15766219
BM_GetProduct2/512 437 ns 437 ns 1613444
BM_GetProduct2/4096 3592 ns 3587 ns 195040
BM_GetProduct2/32768 28929 ns 28882 ns 24380
BM_GetProduct2/65536 57684 ns 57610 ns 12128
The supposedly slow version matches or even beats the "optimized" one for arrays of up to 4096 elements, and only falls behind on the largest inputs! But why?
Let's dig in!
Here's the branching version:
double get_product1(std::span<const double> input, double threshold) {
  double p = 1.0;
  for (auto i : input) {
    if (i < threshold) {
      p *= 2 * i;
    } else {
      p *= 1.5 * i;
    }
  }
  return p;
}
The branchless version always computes both results (which is its drawback), but it doesn't break the CPU pipeline — there's no conditional jump for the predictor to guess:
double get_product2(std::span<const double> input, double threshold) {
  double p = 1.0;
  for (auto i : input) {
    std::array arr = {1.5 * i, 2.0 * i};
    p *= arr[i < threshold];
  }
  return p;
}
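As an aside, the same selection can be written purely arithmetically instead of indexing a tiny array. The sketch below is mine, not code from the video; it relies on the comparison result converting to 0 or 1:

// A hypothetical arithmetic variant of the same branchless idea:
// the comparison result (0 or 1) selects the multiplier, so there is
// still no data-dependent jump for the predictor to guess.
double get_product2_alt(std::span<const double> input, double threshold) {
  double p = 1.0;
  for (auto i : input) {
    // factor is 2.0 when i < threshold (the bool converts to 1.0), 1.5 otherwise
    const double factor = 1.5 + 0.5 * (i < threshold);
    p *= factor * i;
  }
  return p;
}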
The benchmarks look identical for both functions, so I'll show just one:
static void BM_GetProduct1(benchmark::State &state) {
  const auto size = state.range(0);
  auto data = generateData(size);
  double p = 0.0;
  for (auto _ : state) {
    benchmark::ClobberMemory();
    benchmark::DoNotOptimize(p += get_product1(data, 0.0));
  }
}

BENCHMARK(BM_GetProduct1)->Range(16, 16 * 4096);
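The second benchmark is registered the same way; for completeness, here's the remaining boilerplate (the BENCHMARK_MAIN part is my assumption about how the file ends, not shown in the original listing):

// Registration for the second benchmark plus the standard
// Google Benchmark entry point.
BENCHMARK(BM_GetProduct2)->Range(16, 16 * 4096);
BENCHMARK_MAIN();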
The data is generated randomly, so roughly half the values are positive and half are negative:
std::vector<double> generateData(int size) {
  std::random_device rd;
  std::mt19937_64 gen(rd());
  std::uniform_real_distribution<double> dist(-1000.0, 1000.0);
  std::vector<double> data;
  data.resize(size);
  for (auto &d : data) {
    d = dist(gen);
  }
  return data;
}
So get_product1 receives random data, which means the branch predictor should only guess correctly about 50% of the time. But why does the branching version win on small data sizes and then suddenly fall behind?
One might conclude that it depends on the vector length... but both versions receive the same data... Maybe there's some overhead that doesn't scale linearly?
The Sorted Data Experiment
Let's try sorting the array and running the test again!
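The sorted run only needs one extra line in the setup. Here's a minimal sketch of what I mean (the BM_GetProduct1Sorted name is just for illustration; the results below keep the original benchmark names):

// Sketch of the sorted-data variant: identical to the benchmark above,
// except the data is sorted once before the timed loop (needs <algorithm>).
static void BM_GetProduct1Sorted(benchmark::State &state) {
  const auto size = state.range(0);
  auto data = generateData(size);
  std::sort(data.begin(), data.end());  // values below the threshold now come first
  double p = 0.0;
  for (auto _ : state) {
    benchmark::ClobberMemory();
    benchmark::DoNotOptimize(p += get_product1(data, 0.0));
  }
}
BENCHMARK(BM_GetProduct1Sorted)->Range(16, 16 * 4096);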
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
BM_GetProduct1/16 8.96 ns 8.94 ns 80669588
BM_GetProduct1/64 35.8 ns 35.7 ns 19709667
BM_GetProduct1/512 422 ns 422 ns 1388010
BM_GetProduct1/4096 3563 ns 3559 ns 196867
BM_GetProduct1/32768 28644 ns 28614 ns 24503
BM_GetProduct1/65536 57497 ns 57432 ns 12093
BM_GetProduct2/16 9.04 ns 9.03 ns 76644990
BM_GetProduct2/64 44.4 ns 44.3 ns 15828673
BM_GetProduct2/512 433 ns 433 ns 1613180
BM_GetProduct2/4096 3578 ns 3574 ns 196333
BM_GetProduct2/32768 28688 ns 28650 ns 24450
BM_GetProduct2/65536 57435 ns 57345 ns 12213
The branching version now performs as well as, or even better than, the "optimized" version. With sorted data, the function first processes all values below the threshold (the if branch), then all values above it (the else branch), so the branch predictor mispredicts only once, at the transition point.
But why does get_product1 still win on unsorted random data when the array size is less than 4K?
The Real Culprit
Let's recall how the branch predictor works. It has a history buffer where it remembers previous branch outcomes and identifies patterns. For example, "always true" or a pattern like "true-true-false". But in our case, there are no patterns because the data is random!
Or is it?
The answer is that Google Benchmark runs millions of iterations with the same data, so the branch predictor simply learns the sequence! When we start a new iteration, it "recognizes" the sequence and starts choosing the correct branch based on its "experience". But if the sequence is too long, it can't memorize it, and the data becomes random again from its perspective.
The Fixed Benchmark
To verify this hypothesis, let's ensure that short sequences don't repeat across iterations:
static void BM_GetProduct1(benchmark::State &state) {
  // max_data_size is the largest benchmarked size, defined elsewhere.
  auto data = generateData(max_data_size);
  const auto size = state.range(0);
  // Slice the big buffer into non-overlapping chunks of the requested size.
  std::vector<std::span<double>> sets;
  for (int i = 0; i + size <= max_data_size; i += size) {
    sets.push_back(std::span(&data[i], size));
  }
  double p = 0.0;
  std::size_t i = 0;
  for (auto _ : state) {
    benchmark::ClobberMemory();
    // Rotate through the chunks so the predictor never sees the same
    // short sequence of branches twice in a row.
    benchmark::DoNotOptimize(p += get_product1(sets[i], 0.0));
    if (++i >= sets.size()) {
      i = 0;
    }
  }
}
And here are the results:
---------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------
BM_GetProduct1/16 58.1 ns 57.9 ns 11410584
BM_GetProduct1/64 214 ns 214 ns 3277196
BM_GetProduct1/512 1649 ns 1646 ns 419308
BM_GetProduct1/4096 13133 ns 13118 ns 52947
BM_GetProduct1/32768 104506 ns 104381 ns 6656
BM_GetProduct1/65536 208442 ns 208213 ns 3307
BM_GetProduct2/16 11.0 ns 11.0 ns 63463067
BM_GetProduct2/64 46.4 ns 46.3 ns 15161975
BM_GetProduct2/512 438 ns 438 ns 1608315
BM_GetProduct2/4096 3589 ns 3584 ns 195693
BM_GetProduct2/32768 28825 ns 28785 ns 24373
BM_GetProduct2/65536 57613 ns 57544 ns 12100
Now that's a different story! I almost concluded that branching is better for small data! The branchless version is definitively faster across all data sizes, as long as the data is truly random.
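Another way to break the repetition, by the way, is to regenerate the data inside the loop with the timer paused. I'll sketch it here for reference only: PauseTiming/ResumeTiming have noticeable per-iteration overhead of their own, which would distort the measurements for the smallest arrays, so the slicing approach above is preferable.

// Alternative sketch: fresh random data on every iteration, generated with
// the timer paused. Pause/Resume add overhead per iteration, so this skews
// results for very small sizes.
static void BM_GetProduct1Regen(benchmark::State &state) {
  const auto size = state.range(0);
  double p = 0.0;
  for (auto _ : state) {
    state.PauseTiming();
    auto data = generateData(size);  // nothing for the predictor to memorize
    state.ResumeTiming();
    benchmark::DoNotOptimize(p += get_product1(data, 0.0));
  }
}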
Conclusion
Here's a comparison of the BM_GetProduct1 timings from the original and the fixed benchmark, showing how misleading an incorrect benchmark can be:
| Size | Apparent time (original benchmark) | Actual time (fixed benchmark) | Error |
|---|---|---|---|
| 16 | 9.12 ns | 58.1 ns | 6.4x |
| 64 | 36.6 ns | 214 ns | 5.8x |
| 512 | 418 ns | 1649 ns | 3.9x |
| 4096 | 3605 ns | 13133 ns | 3.6x |
| 32768 | 104858 ns | 104506 ns | ~1x |
| 65536 | 211673 ns | 208442 ns | ~1x |
The takeaway: be careful with benchmark results — sometimes your benchmark measures itself, not your code.