C++ and Microarchitecture Nuances

Sami Al-Jamal — Wed, 17 Jun 2026 23:38:18 +0000

C++ source code is written in order. That does not mean the processor executes it in order.

This is the first correction. It is also the one many performance discussions manage to avoid.

Modern high-performance cores use out-of-order execution. They accept a sequential instruction stream, break it into internal operations, rename registers, place work into scheduling structures, execute ready operations early, and then retire the results in program order. The machine preserves the visible behavior of sequential execution. Internally, it is not taking attendance line by line.

For ordinary software, this is mostly invisible. For C++ intended to run in tens of nanoseconds, it is not invisible. At that scale, performance is not just about the number of instructions. It is about whether those instructions can be scheduled in parallel or whether the program quietly built a dependency chain and then acted surprised.

The processor is a dependency scheduler

Out-of-order execution exists because in-order pipelines waste time. If an older instruction stalls, an in-order processor must often wait even if later instructions are independent and ready. That is a poor use of hardware. The chip has execution units available. The instruction stream has more work.

Dynamic scheduling fixes part of this problem. The processor tracks which operations have their inputs ready. When an operation is ready and an execution unit is available, it can issue. Older operations may still be waiting. Later operations may run first. The final architectural state is still committed in order, so the program behaves correctly.

Tomasulo’s algorithm is the classic model for this idea. It used reservation stations and register renaming to allow instructions to execute when their operands became available rather than strictly when they appeared in the original program (Tomasulo, 1967). Later superscalar processors extended the same general approach with speculation and reorder buffers, but the central idea stayed the same: execute according to readiness, not textual order (Hennessy & Patterson, 2019).

That is the point relevant to C++: the processor is not primarily reading the program as a list. It is resolving a graph.

Nodes are operations. Edges are dependencies.

No edge, possible overlap.
Real edge, forced order.
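
The graph view can be made concrete with a toy model. The sketch below is not how any real core works: Op, its latency field, and critical_path_cycles are hypothetical names, and the model assumes unlimited execution units so that only dependencies limit progress.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one operation: its latency in cycles and the
// indices of the earlier operations whose results it needs.
struct Op {
    int latency;
    std::vector<std::size_t> deps;
};

// Length of the longest dependency path, in cycles. With unlimited
// execution units, this is the earliest cycle by which all work can finish.
int critical_path_cycles(const std::vector<Op>& ops) {
    std::vector<int> finish(ops.size(), 0);
    int longest = 0;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        int ready = 0;  // cycle at which every input is available
        for (std::size_t d : ops[i].deps) {
            assert(d < i);  // an operation may only depend on earlier ones
            ready = std::max(ready, finish[d]);
        }
        finish[i] = ready + ops[i].latency;
        longest = std::max(longest, finish[i]);
    }
    return longest;
}
```

Four one-cycle operations in a chain give a critical path of 4 cycles. The same four operations with no edges give 1. The instruction count is identical in both cases.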

True dependencies are the hard limit

Out-of-order execution cannot violate true data dependencies.

If instruction B needs a value produced by instruction A, B waits for A. No scheduler can change that. No amount of confidence in “modern CPUs are smart” changes that either.

Consider this reduction:

std::uint64_t s = 0;

for (std::size_t i = 0; i < n; ++i) {
    s += data[i];
}

The loop looks harmless. It also creates a loop-carried dependency. Each update to s depends on the previous value of s.

Conceptually:

s1 = s0 + data[0]
s2 = s1 + data[1]
s3 = s2 + data[2]
s4 = s3 + data[3]

The processor may overlap some surrounding work, and it may have several loads in flight, but the additions themselves form a chain. The next addition needs the previous result.

A better version exposes independent chains:

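std::uint64_t s0 = 0;
std::uint64_t s1 = 0;
std::uint64_t s2 = 0;
std::uint64_t s3 = 0;

std::size_t i = 0;

// Main loop: four independent accumulators per iteration.
for (; i + 4 <= n; i += 4) {
    s0 += data[i + 0];
    s1 += data[i + 1];
    s2 += data[i + 2];
    s3 += data[i + 3];
}

// Tail loop: the 0 to 3 elements left when n is not a multiple of 4.
for (; i < n; ++i) {
    s0 += data[i];
}

std::uint64_t s = s0 + s1 + s2 + s3;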

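Now there are four shorter dependency chains instead of one long chain. The out-of-order core has more ready work. This does not make the program faster by aesthetic force. When it is faster, it is faster because the dependency graph improved. For unsigned integer sums, an optimizing compiler may already unroll or vectorize the one-accumulator loop on its own, so the gain is real only if the emitted code actually changed.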

This is the useful mental model. Optimizing for out-of-order execution means reducing critical-path length and increasing available independent work.

The processor cannot schedule independence that the program does not expose.
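
The same idea matters more for floating-point sums. Under default floating-point semantics a compiler is generally not allowed to reorder the additions, because floating-point addition is not associative, so it usually cannot create the extra accumulators for you. The sketch below shows both forms; sum_one_chain and sum_four_chains are illustrative names, not part of any library.

```cpp
#include <cassert>
#include <cstddef>

// One accumulator: every addition waits for the previous one.
double sum_one_chain(const double* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        s += data[i];
    }
    return s;
}

// Four accumulators: four independent chains, plus a tail loop for the
// elements left over when n is not a multiple of 4.
double sum_four_chains(const double* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i + 0];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; ++i) {
        s0 += data[i];
    }
    // The additions are grouped differently than in sum_one_chain, so the
    // result can differ in the last bits for general floating-point input.
    return (s0 + s1) + (s2 + s3);
}
```

The two functions agree exactly on inputs such as small whole numbers, where no rounding occurs. On general input they may differ slightly, which is the price of choosing the association order yourself.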

False dependencies are bookkeeping problems

Not all apparent dependencies are real.

A read-after-write dependency is real. If a later operation needs the result of an earlier operation, it must wait.

A write-after-read or write-after-write dependency can be false. These often come from reusing the same architectural register name, not from the actual values depending on each other.

Out-of-order processors handle this with register renaming. The instruction set exposes a limited set of architectural registers. Internally, the processor maps those architectural registers onto a larger pool of physical registers. This lets the machine separate unrelated values that happen to use the same architectural name.

For example, source code may produce machine instructions that appear to reuse a register. Internally, the processor can assign different physical registers to different live values. The name is reused. The storage is not necessarily reused.

That matters because it prevents false dependencies from blocking execution. The processor can see that two writes to the same architectural register do not need to serialize if no actual value relationship exists between them.

This is also why reading assembly is useful but incomplete. Assembly shows the architectural instruction stream. It does not show the physical register mappings, scheduling queues, issue timing, or reorder-buffer behavior. Assembly is closer to the machine than C++. It is still not the machine.
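
Renaming is easier to see in a toy model. The sketch below assumes a made-up machine with four architectural registers and an unlimited supply of physical registers; Instr, kArchRegs, and rename_registers are hypothetical names. Real hardware recycles physical registers from a finite free list, which this model ignores.

```cpp
#include <cassert>
#include <vector>

constexpr int kArchRegs = 4;  // assumed size of the architectural register file

// dst = src1 op src2. Before renaming the numbers name architectural
// registers; after renaming they name physical registers.
struct Instr {
    int dst;
    int src1;
    int src2;
};

std::vector<Instr> rename_registers(const std::vector<Instr>& program) {
    int map[kArchRegs];  // architectural register -> current physical register
    for (int r = 0; r < kArchRegs; ++r) {
        map[r] = r;
    }
    int next_physical = kArchRegs;  // toy free list that never runs out

    std::vector<Instr> renamed;
    renamed.reserve(program.size());
    for (const Instr& in : program) {
        assert(in.dst >= 0 && in.dst < kArchRegs);
        assert(in.src1 >= 0 && in.src1 < kArchRegs);
        assert(in.src2 >= 0 && in.src2 < kArchRegs);
        // Sources read the current mapping, so true dependencies survive.
        const int s1 = map[in.src1];
        const int s2 = map[in.src2];
        // Every write gets a fresh physical register, so a later write to
        // the same architectural name no longer collides with earlier uses.
        const int d = next_physical++;
        map[in.dst] = d;
        renamed.push_back(Instr{d, s1, s2});
    }
    return renamed;
}
```

Take the program r1 = r2 + r3; r0 = r1 + r1; r1 = r2 + r2. The second instruction still reads the physical register written by the first, which is the real dependency. The third instruction writes a different physical register than the first, so the reuse of r1 no longer forces an order.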

The reorder buffer keeps the lie consistent

If instructions can execute out of order, the processor needs a way to preserve precise program behavior. This is the job of the reorder buffer, or ROB.

Instructions may finish execution out of order. They do not usually retire out of order. The ROB holds results until each instruction is safe to commit in the original program order. If speculation was wrong, the processor can discard the speculative work and restore a correct state.

This separation matters:

execute: may happen out of order
retire: happens in program order

That is how the processor gets performance without giving up the language’s sequential behavior.
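
The split between finishing and retiring can be sketched in a few lines. This is a simplified model, not a description of any specific ROB: retire_cycles is a hypothetical helper, and it assumes any number of instructions may retire in the same cycle.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Input: the cycle at which each instruction finishes executing, listed in
// program order. Output: the cycle at which each instruction retires.
// An instruction retires only once it and every older instruction are done.
std::vector<int> retire_cycles(const std::vector<int>& finish_cycle) {
    std::vector<int> retire(finish_cycle.size());
    int oldest_done = 0;  // cycle by which all older instructions are done
    for (std::size_t i = 0; i < finish_cycle.size(); ++i) {
        assert(finish_cycle[i] >= 0);
        oldest_done = std::max(oldest_done, finish_cycle[i]);
        retire[i] = oldest_done;
    }
    return retire;
}
```

With finish cycles 10, 2, 3, 12, 4 the retire cycles are 10, 10, 10, 12, 12. The second and third instructions finished early, but their results stay in the buffer until the slow first instruction is ready to commit.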

For a C++ programmer, this explains a common confusion. The processor may execute later independent operations before earlier blocked operations, but within a single thread the program still appears to obey the abstract machine rules, subject to the usual caveat that undefined behavior releases the compiler from any obligation to preserve the behavior you expected. That caveat is not a footnote. It is where many “low-level tricks” go to die.

Out-of-order execution rewards boring independence

The easiest way to help out-of-order execution is to provide independent operations.

This often means splitting state.

Bad:

state = mix(state, x0);
state = mix(state, x1);
state = mix(state, x2);
state = mix(state, x3);

Every line depends on the previous value of state. The processor sees a chain.

Better, if the algorithm allows it:

s0 = mix(s0, x0);
s1 = mix(s1, x1);
s2 = mix(s2, x2);
s3 = mix(s3, x3);

state = combine(s0, s1, s2, s3);

Now the processor sees several independent chains. The final combination still has to happen, but much of the work can be scheduled earlier.

This pattern appears everywhere: checksums, reductions, counters, hash-like computations, parsing loops, pricing loops, and numeric kernels. The exact transformation depends on the algorithm. The principle does not.

Out-of-order execution does not reward clever-looking code. It rewards code that gives the scheduler options.
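
The "if the algorithm allows it" condition can be stated precisely for reductions. The sketch below is one possible shape, with reduce_four_lanes as a hypothetical name. It assumes op is associative and commutative and that identity is its neutral element; under those assumptions the split produces the same result as the serial loop. For an operation without those properties, such as a typical hash mixing step, the split computes a different function.

```cpp
#include <cassert>
#include <cstddef>

// Reduce data[0..n) with op using four independent partial results.
// Assumption: op is associative and commutative, identity is neutral.
template <class T, class Op>
T reduce_four_lanes(const T* data, std::size_t n, T identity, Op op) {
    assert(data != nullptr || n == 0);
    T s0 = identity, s1 = identity, s2 = identity, s3 = identity;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 = op(s0, data[i + 0]);  // lane 0 never reads lanes 1 to 3
        s1 = op(s1, data[i + 1]);
        s2 = op(s2, data[i + 2]);
        s3 = op(s3, data[i + 3]);
    }
    for (; i < n; ++i) {
        s0 = op(s0, data[i]);  // leftover elements
    }
    return op(op(s0, s1), op(s2, s3));  // combine step
}
```

Minimum, maximum, bitwise XOR, and unsigned integer addition all satisfy the assumption. A call such as reduce_four_lanes(values, count, ~0u, min_of_two) with an unsigned array and a two-argument minimum function returns the smallest element.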

Pointer chasing defeats the scheduler

A dependent load chain is especially restrictive.

Node* p = head;

while (p != nullptr) {
    sum += p->value;
    p = p->next;
}

The next address depends on the current load. The processor cannot issue the load of p->next->next until it has loaded p->next. The chain is serial.

This is not mainly a lecture about cache locality. Cache locality matters, but the out-of-order point is narrower: the address of the next load does not exist until the current load completes. The scheduler cannot issue a load whose address has not been computed, so each miss is paid in sequence instead of being overlapped with the others.

That is why pointer-heavy structures often behave poorly in nanosecond-scale code. The processor has resources, but the program gives it one dependent step at a time. A large out-of-order window helps only when there is other independent work nearby. If the whole loop is a dependent chain, the window fills with waiting.

The code is then not CPU-bound in the useful sense. It is dependency-bound.
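
For contrast, here is the same sum over a linked list and over contiguous storage. This is a sketch with illustrative names (sum_list, sum_array), and it deliberately says nothing about cache behavior. The difference it shows is where the next address comes from.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Node {
    std::uint64_t value;
    Node* next;
};

// The next address is the result of a load. Iteration k + 1 cannot start
// its loads until iteration k has loaded p->next.
std::uint64_t sum_list(const Node* head) {
    std::uint64_t sum = 0;
    for (const Node* p = head; p != nullptr; p = p->next) {
        sum += p->value;
    }
    return sum;
}

// The next address is base + i * sizeof(element). It depends only on the
// loop counter, so loads for many iterations can be in flight at once.
std::uint64_t sum_array(const std::vector<std::uint64_t>& values) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        sum += values[i];
    }
    return sum;
}
```

Both functions still carry a dependency through sum. Only the list version also carries one through the address, and that is the chain the scheduler cannot get ahead of.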

Function calls and abstraction can hide independence

Out-of-order execution operates on the instruction stream produced by the compiler. It does not see source-level intent. If useful independence is hidden behind function calls, aliasing, virtual dispatch, or opaque control flow, the compiler may fail to expose it in the emitted code.

Inlining can matter because it gives the optimizer more context. With more context, the compiler may remove redundant work, keep values in registers, reorder independent operations, or unroll a loop enough to expose multiple independent chains. Pikus emphasizes this interaction between C++ structure, compiler optimization, and CPU utilization: high-performance C++ depends on giving the compiler and hardware enough information to use available resources effectively (Pikus, 2021).

This does not mean every function call is bad. That would be a convenient rule, and therefore suspicious. The actual question is whether the final instruction stream exposes enough independent work for the core to schedule.

The source is not the object of measurement. The emitted machine code is.
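
Aliasing is one case where this is easy to demonstrate. In the sketch below, scale_may_alias and scale_local_copy are hypothetical names. In the first function the compiler has to allow for the possibility that scale points into out, so a store to out[i] may change the factor. In the second, the factor is read once into a local, which removes that question for the factor. The in and out arrays may still overlap, so this does not settle every aliasing concern.

```cpp
#include <cassert>
#include <cstddef>

// The factor is re-read through the pointer on every iteration, because a
// store to out[i] might have modified *scale.
void scale_may_alias(float* out, const float* in, const float* scale,
                     std::size_t n) {
    assert(out != nullptr && in != nullptr && scale != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * *scale;
    }
}

// The factor is read once. The loop body no longer depends on memory that
// the loop itself may write.
void scale_local_copy(float* out, const float* in, const float* scale,
                      std::size_t n) {
    assert(out != nullptr && in != nullptr && scale != nullptr);
    const float k = *scale;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * k;
    }
}
```

The two functions produce the same output only when scale does not point into out. If it does, the first version sees the updated factor and the second does not, so the rewrite is a change in meaning that the programmer has to be entitled to make. Whether it changes the emitted code is something to confirm in the assembly.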

Latency and throughput are different questions

Out-of-order execution also forces a distinction between latency and throughput.

An operation’s latency is how long its result takes to become available to a dependent operation.

An operation’s throughput is how many such operations can be started per cycle when enough independent work exists.

This distinction matters constantly.

If a multiply has a latency of several cycles but the processor can start one multiply every cycle, then one dependency chain pays the full latency repeatedly. Four independent chains can approach throughput limits instead. Same instruction. Different dependency graph. Different performance.

This is why the earlier accumulator example matters. It does not make addition itself faster. It changes the loop from latency-limited toward throughput-limited.

Low-latency C++ often consists of this kind of work: find the critical dependency chain, shorten it, split it, or move independent work across it. Not glamorous. Usually effective. A rare combination.
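
The arithmetic behind that distinction fits in one function. This is a back-of-envelope model, not a simulator: estimate_cycles and its parameters are hypothetical, and it ignores pipeline fill, loads, and every other resource in the core.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Rough cycle estimate for `ops` identical operations split evenly across
// `chains` independent dependency chains.
//   latency:         cycles until a result is usable by a dependent op
//   issue_per_cycle: how many such operations can start per cycle
std::uint64_t estimate_cycles(std::uint64_t ops, std::uint64_t latency,
                              std::uint64_t issue_per_cycle,
                              std::uint64_t chains) {
    assert(chains > 0 && issue_per_cycle > 0);
    // Each chain is serial: its operations pay the full latency one by one.
    const std::uint64_t ops_per_chain = (ops + chains - 1) / chains;
    const std::uint64_t latency_bound = ops_per_chain * latency;
    // No arrangement can start more operations per cycle than the core allows.
    const std::uint64_t throughput_bound =
        (ops + issue_per_cycle - 1) / issue_per_cycle;
    return std::max(latency_bound, throughput_bound);
}
```

For 1000 multiplies with a 4-cycle latency and one issue per cycle, the model gives 4000 cycles for one chain and 1000 cycles for four chains. Eight chains also give 1000, because the throughput limit has taken over. Splitting further buys nothing once the chain count covers the latency.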

What to measure

A benchmark that reports only wall-clock time is often not enough. It may show that one version is faster, but not why.

For out-of-order behavior, useful questions include:

Is the loop limited by one dependency chain?
Did unrolling expose independent operations?
Did the compiler actually emit multiple accumulators?
Are instructions waiting on operands or execution resources?
Did a small source change alter the generated dependency graph?

Tools such as Compiler Explorer, objdump, LLVM-MCA, perf, and Intel VTune help answer these questions. They do not replace thinking. They reduce the amount of fiction in the thinking.

A simple timing result can say “faster.” The assembly and counters explain whether the speed came from less work, better scheduling, fewer stalls, or a shorter critical path.

For nanosecond-scale code, that difference matters.
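
Timing is still where most investigations start. A minimal harness might look like the sketch below, where average_ns is a hypothetical helper and the callable is assumed to return an unsigned integer. It reports an average over many calls, because a single call at this scale is smaller than the cost of reading the clock.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Average wall-clock nanoseconds per call of f over `iters` calls.
// Assumption: f() returns a value convertible to std::uint64_t.
template <class F>
double average_ns(F&& f, std::uint64_t iters) {
    assert(iters > 0);
    using clock = std::chrono::steady_clock;
    // Writing each result to a volatile keeps the results observable.
    // It does not stop the compiler from hoisting a pure, loop-invariant
    // f() out of the loop, so the emitted code still needs checking.
    volatile std::uint64_t sink = 0;
    const auto start = clock::now();
    for (std::uint64_t i = 0; i < iters; ++i) {
        sink = sink + static_cast<std::uint64_t>(f());
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iters);
}
```

A number from this harness is a symptom, not a diagnosis. It is worth something only next to the disassembly of the loop that was actually timed.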

The practical conclusion

Out-of-order execution is not a general blessing applied to slow code. It is a specific hardware strategy for finding ready work inside an instruction stream.

It can hide stalls when independent work exists.
It can remove false register dependencies through renaming.
It can execute operations before older stalled instructions.
It can preserve sequential behavior through in-order retirement.
It cannot violate true dependencies.
It cannot issue future work whose inputs are unknown.
It cannot rescue a program that presents one long chain and no alternatives.

The C++ programmer writing low-latency code should therefore ask a different question. Not “how many lines did I write?” Not even only “how many instructions did the compiler emit?”

The better question is: what dependency graph did I give the processor?

That graph is what the out-of-order core actually schedules. The source code is merely how the problem was submitted.

References

Hennessy, J. L., & Patterson, D. A. (2019). Computer Architecture: A Quantitative Approach (6th ed.). Morgan Kaufmann.

Kessler, R. E. (1999). The Alpha 21264 microprocessor. IEEE Micro, 19(2), 24–36.

Pikus, F. G. (2021). The Art of Writing Efficient Programs: An Advanced Programmer’s Guide to Efficient Hardware Utilization and Compiler Optimizations Using C++ Examples. Packt Publishing.

Smith, J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE, 83(12), 1609–1624.

Tomasulo, R. M. (1967). An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, 11(1), 25–33.

DEV Community: Sami Al-Jamal