The "5%" in the title is the headline that gets people interested in branch prediction. It is also a workload-averaged number that papers over the only interesting part of the topic, which is that the misses live exactly where you can't see them. The textbook claim — pick a percentage in the 90s — is dominated by the predictable code on hot paths. The unpredictable code is where you actually pay.
So I want to make a more useful claim. The interesting question about modern branch predictors isn't accuracy. It's where the inaccuracy lives, what the speculation costs when it's wrong, and what it leaks when it's right.
What a predictor actually does
A modern CPU is a pipeline. An instruction enters the front end, gets decoded, gets renamed, finds its operands, executes, and retires. While one instruction is in execute, the next is decoding. Skylake has 14 pipeline stages; Apple's Firestorm core is wider but shallower in cycles. Either way, the front end has to keep feeding instructions even when it doesn't yet know which way a branch will go. The predictor's job is to put a plausible answer where the truth doesn't yet exist.
When the predictor is wrong, the speculative work gets thrown away. On Skylake, the misprediction penalty is roughly 15 to 20 cycles. On the Apple M1, the penalty is closer to 13 cycles. At 5 GHz, 15 cycles is 3 nanoseconds. That doesn't matter once. It matters when the branch runs a billion times a second: at a 5% miss rate that's 50 million flushes, roughly 750 million dead cycles, about 15% of that core's capacity spent cleaning up after wrong guesses.
The state of the art for the actual prediction is TAGE (TAgged GEometric history length), an architecture by André Seznec at INRIA/IRISA. TAGE keeps several tagged tables, each indexed by a different length of recent branch history, with the lengths growing geometrically: 4, 8, 16, 64 branches back, and so on. A short, tight loop is captured by a short-history table. A long, irregular pattern requires the deep ones. On the SPEC 2000 integer suite at a 4 KB hardware budget, a basic TAGE hit a 4.6% misprediction rate, a 26% improvement over a gshare baseline. Production silicon today uses larger budgets, and a recent reverse-engineering paper on Apple Firestorm and Qualcomm Oryon found six pattern-history tables in the predictor on each.
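To make the structure concrete, here is a toy sketch of the idea in C++. This is not Seznec's algorithm and not any shipping design: real TAGE entries carry usefulness counters and an alternate prediction, the history is folded more carefully, and the table sizes, history lengths, and hash below are made-up illustrative choices. What it does show is the core move: hash the branch address with geometrically longer slices of global history, and let the longest-history table with a matching tag override everything shorter.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <cstdio>
#include <vector>

struct TaggedEntry { uint16_t tag = 0; int8_t ctr = 0; };   // ctr >= 0 means "predict taken"

class TinyTage {
    static constexpr int kTables = 4;
    static constexpr std::array<int, kTables> kHistLen = {4, 8, 16, 64};  // illustrative lengths
    static constexpr int kEntries = 1024;

    std::vector<int8_t> bimodal_ = std::vector<int8_t>(kEntries, 0);      // base predictor
    std::array<std::vector<TaggedEntry>, kTables> tables_;
    uint64_t ghist_ = 0;                                                  // global history, newest outcome in bit 0

    // Hash the branch PC with the low `len` bits of global history.
    static uint32_t fold(uint64_t pc, uint64_t hist, int len) {
        uint64_t h = (len >= 64) ? hist : (hist & ((1ULL << len) - 1));
        h ^= pc ^ (pc >> 13) ^ (h >> 17) ^ (uint64_t)len * 0x9E3779B97F4A7C15ULL;
        return (uint32_t)(h ^ (h >> 32));
    }

public:
    TinyTage() { for (auto& t : tables_) t.resize(kEntries); }

    bool predict(uint64_t pc) const {
        // The longest-history table whose tag matches provides the prediction.
        for (int i = kTables - 1; i >= 0; --i) {
            uint32_t h = fold(pc, ghist_, kHistLen[i]);
            const TaggedEntry& e = tables_[i][h % kEntries];
            if (e.tag == (uint16_t)(h >> 16)) return e.ctr >= 0;
        }
        return bimodal_[pc % kEntries] >= 0;   // nothing matched: fall back to the base table
    }

    void update(uint64_t pc, bool taken) {
        bool miss = predict(pc) != taken;
        int provider = -1;
        for (int i = kTables - 1; i >= 0; --i) {
            uint32_t h = fold(pc, ghist_, kHistLen[i]);
            TaggedEntry& e = tables_[i][h % kEntries];
            if (e.tag == (uint16_t)(h >> 16)) {
                provider = i;
                e.ctr = (int8_t)std::clamp(e.ctr + (taken ? 1 : -1), -4, 3);
                break;
            }
        }
        if (provider < 0) {
            int8_t& c = bimodal_[pc % kEntries];
            c = (int8_t)std::clamp(c + (taken ? 1 : -1), -4, 3);
        }
        // On a misprediction, seed an entry in the next-longer table so a
        // deeper history pattern gets a chance to take over.
        if (miss && provider < kTables - 1) {
            uint32_t h = fold(pc, ghist_, kHistLen[provider + 1]);
            tables_[provider + 1][h % kEntries] = {(uint16_t)(h >> 16), (int8_t)(taken ? 0 : -1)};
        }
        ghist_ = (ghist_ << 1) | (taken ? 1u : 0u);
    }
};

int main() {
    TinyTage bp;
    int correct = 0, total = 100000;
    for (int i = 0; i < total; ++i) {
        bool taken = (i % 7) != 0;                // a repeating pattern with period 7
        correct += (bp.predict(0x400123) == taken);
        bp.update(0x400123, taken);
    }
    std::printf("accuracy: %.1f%%\n", 100.0 * correct / total);
}
```

The payoff of the geometric layout is that the cheap, short-history entries soak up the boring loops, and only the branches that genuinely need deep context earn space in the expensive long-history tables.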
So "95% accurate" is a workload-averaged headline. The interesting question is: which 5%?
Where the misses live
Predictable branches are predictable. A loop that runs 10,000 times. A jump table dispatched on a single tag byte. The hot path of a long-running web server processing similar requests. None of these are where you spend any cycles missing.
Unpredictable branches are where the cost piles up. Indirect calls in dynamic dispatch — virtual method calls, function-pointer tables, the kind of thing every interpreter and JIT-compiled language hits. Pointer chasing through linked structures, where the next branch depends on the result of a memory load that hasn't finished. Data-dependent comparisons over input that genuinely has no pattern. The famous Stack Overflow question from 2012 — "Why is processing a sorted array faster than processing an unsorted array?" — shows a roughly 6× speedup for the same code over the same data, because sorting reduces a fundamentally random branch to two long stretches the predictor can settle into. The unsorted version hits the predictor with coin flips.
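A minimal reproduction in the spirit of that question, in C++. The threshold, array size, and iteration count are arbitrary, and one hedge applies: at high optimization levels the compiler may replace the branch with a conditional move or vectorize the loop, which flattens the difference, so the effect shows up most clearly when the branch actually survives into the binary.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Sum the values at or above a threshold; the branch is the whole story.
static long long sum_above_threshold(const std::vector<int>& data) {
    long long sum = 0;
    for (int v : data)
        if (v >= 128)          // ~50/50 on random bytes: a coin flip for the predictor
            sum += v;
    return sum;
}

int main() {
    std::vector<int> data(32768);
    std::mt19937 rng(42);
    for (int& v : data) v = (int)(rng() % 256);

    for (int pass = 0; pass < 2; ++pass) {
        if (pass == 1) std::sort(data.begin(), data.end());   // second pass: sorted input
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int i = 0; i < 10000; ++i) sum += sum_above_threshold(data);
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%-8s %.3f s  (sum=%lld)\n", pass ? "sorted" : "unsorted",
                    std::chrono::duration<double>(t1 - t0).count(), sum);
    }
    return 0;
}
```

Same instructions, same data, same sums. The only thing sorting changes is the order in which the predictor sees the outcomes.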
The aggregate misprediction rate on a real workload averages all of these together. The number you read in a microarchitecture survey is dominated by the hot, predictable paths. The number you feel as a slow application is dominated by the cold ones. They're different numbers. The one in the slide deck is not the one paying your latency budget.
If you want to see this directly on Linux, run perf stat -e branches,branch-misses against your binary. The aggregate ratio tells you whether you have a problem. Breaking it down by function (perf record -e branch-misses followed by perf report) tells you where it lives.
The Spectre detour
In January 2018, Project Zero disclosed a class of attacks that turned branch prediction from a performance feature into a side channel. The trick is short to describe and not short to fix: train the predictor on legitimate inputs so it expects a particular branch, then supply an out-of-bounds input. The predictor sends the CPU down the legitimate path speculatively. The speculative path reads memory it shouldn't. The result gets discarded; the cache state from the speculative read does not. Time the cache, recover the bytes.
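The victim side of the bounds-check-bypass variant is famously mundane. The sketch below uses illustrative names and sizes, but the shape is the well-known Spectre v1 pattern: a correct bounds check, then a load whose address depends on the byte behind it.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative names and sizes. The stride of 512 just puts each possible
// secret byte value on its own cache line so a later timing probe over
// array2 can tell the values apart.
uint8_t array1[16];
uint8_t array2[256 * 512];
std::size_t array1_size = 16;

uint8_t victim(std::size_t x) {
    if (x < array1_size)                  // the branch the attacker trains
        return array2[array1[x] * 512];   // speculative, secret-dependent load
    return 0;
}
```

Nothing in that function is a bug by pre-2018 rules; the bounds check is correct. The problem is the window between predicting the check and resolving it.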
Mitigations cost real performance, asymmetrically. AMD's Zen-class chips generally lose under 10% on Spectre v2 mitigations; one pass of networking benchmarks on a Ryzen 9 5950X clocked around 5.3% loss. Intel has had a worse time of it, with the i9-12900K losing 26.7% in the same networking suite on default mitigation settings.
The original 2018 family did not stay alone. August 2023 brought Downfall on Intel (formally Gather Data Sampling, affecting Skylake through Rocket Lake) and Inception on AMD's Zen 1 through Zen 4 (CVE-2023-20569). June 2024 added TikTag against ARM's Memory Tagging Extension from researchers at Samsung Research and Seoul National University. July 2024, Indirector on Intel. The list keeps growing because the underlying engine (speculation across permission boundaries) is load-bearing for performance.
The contrarian close
The interesting question is not "can predictors get more accurate." Diminishing returns on the workloads we already have are real. The interesting question is whether speculation across security boundaries was ever a good idea and what writing software without it would cost. The answer some people have given for the secret-handling part of their stack is: write that part without data-dependent branches. Constant-time comparisons in cryptography. Branchless selects built from masks. cmov instead of if. Compilers will sometimes generate that automatically; it isn't a guarantee you can rely on without checking the disassembly.
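Here is a hedged sketch of both idioms; the function names are mine, not from any particular library. The comparison accumulates differences with OR instead of returning at the first mismatch, and the select builds a mask instead of branching. Nothing in the C++ standard obliges the compiler to keep either one branch-free, which is why the disassembly check above is not optional.

```cpp
#include <cstddef>
#include <cstdint>

// Returns 0 iff the two buffers are equal. Runtime does not depend on where
// (or whether) they differ: there is no early exit and no data-dependent branch.
int ct_compare(const uint8_t* a, const uint8_t* b, std::size_t n) {
    uint8_t diff = 0;
    for (std::size_t i = 0; i < n; ++i)
        diff |= (uint8_t)(a[i] ^ b[i]);   // accumulate differences, never bail out early
    return diff;                          // 0 on equal, nonzero otherwise
}

// Branchless select: returns x when pick is 1, y when pick is 0.
uint32_t ct_select(uint32_t pick, uint32_t x, uint32_t y) {
    uint32_t mask = 0u - (pick & 1u);     // all-ones or all-zeros, no branch
    return (x & mask) | (y & ~mask);
}
```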
That's not free either. Branchless code does the work on both arms of every conditional. It bloats hot loops. On workloads where the predictor is well-trained, branched code is faster. That's the trade.
Branch prediction is a wager that the past predicts the future. It usually does. Most of the time you pay nothing for being wrong. The rest of the time you pay a pipeline flush. And occasionally — when the speculative reads happen to cross a privilege boundary you forgot was there — you pay your kernel.
What the textbook 5% averages over is exactly that distribution: thousands of small wins, a handful of expensive losses, and a much smaller handful of catastrophic ones. The interesting work in 2026 is no longer making the predictor better. It is deciding which code can afford to let the predictor try, and which code has to be written as if the predictor were the threat. Most of us write the first kind. The rest of us — kernel authors, crypto library maintainers, anyone whose buffer holds someone else's secret — are writing the second.