Cache Deep Dive V — Write Policies, Store Buffer, and Memory Barriers

#architecture #computerscience #performance #tutorial

The previous installments focused on reads — how address translation affects memory access latency. This part turns to writes. A store instruction may seem simple, but its path from the pipeline's retire stage to the final write into an L1d cache line passes through a structure critical to multicore program correctness: the Store Buffer. Understanding how the Store Buffer works is the hardware prerequisite for understanding memory fences (barriers) and memory ordering models.

Write Policies

Caches are meant to be transparent to the programmer — modifying a cache line should be equivalent to modifying the corresponding data in main memory. There are two basic strategies for implementing this transparency: write-through and write-back.

Write-through: A cache line is written to main memory immediately after being modified. This is simple to implement, and cache and main memory are always consistent. But the performance penalty is severe: if a program repeatedly modifies the same cache line, every single modification must travel across the bus to main memory. Today, write-through is largely confined to special cases (e.g., memory regions that certain embedded systems configure as write-through).

Write-back: Nearly all modern general-purpose processors use this strategy for their data caches. A write only modifies the cache; the cache line is marked dirty and the operation is complete. Dirty data is written back to main memory only when the line is evicted. Some designs can also write back dirty lines proactively and clear the dirty flag during idle periods, without waiting for eviction.

Write-back reduces write bandwidth pressure, but it introduces a critical problem: if the target cache line for a store instruction is not currently in L1d — for instance, the line is held in Modified state by core B and requires an RFO to complete — the CPU will be stalled for hundreds of cycles waiting for the cache line to arrive. Write-back alone does not solve this problem; the Store Buffer does.
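The difference in memory traffic between the two policies can be sketched with a toy model. This is a minimal illustration under strong simplifying assumptions, not a model of any real cache: the single-line `ToyCache`, the `Policy` enum, and `count_memory_writes` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstdint>

// Toy single-line cache; every name here is hypothetical, for illustration only.
enum class Policy { WriteThrough, WriteBack };

struct ToyCache {
    Policy policy;
    bool valid = false;
    bool dirty = false;
    std::uint64_t tag = 0;     // which memory line is currently cached
    int memory_writes = 0;     // number of writes that reached "main memory"

    explicit ToyCache(Policy p) : policy(p) {}

    // Store to the line identified by `line`.
    void store(std::uint64_t line) {
        if (!valid || tag != line) {
            evict();           // make room for the new line
            valid = true;
            tag = line;
        }
        if (policy == Policy::WriteThrough) {
            ++memory_writes;   // every store goes to memory immediately
        } else {
            dirty = true;      // write-back: only mark the line dirty
        }
    }

    // Evict the current line, writing it back only if it is dirty.
    void evict() {
        if (valid && dirty) {
            ++memory_writes;
            dirty = false;
        }
        valid = false;
    }
};

// Perform `n` stores to the same line, then evict; return the memory write count.
int count_memory_writes(Policy p, int n) {
    assert(n >= 0);
    ToyCache cache(p);
    for (int i = 0; i < n; ++i) cache.store(0x40);
    cache.evict();
    return cache.memory_writes;
}
```

In this model, 1000 stores to one line cost 1000 memory writes under write-through and a single write-back on eviction under write-back.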

Store Buffer

Suppose a CPU had to wait for the target cache line to be present in L1d and for the write to complete on every store instruction — the pipeline would be severely blocked. If the target cache line is not in L1d (requiring a fetch from L2, L3, or even main memory, or waiting for a MESI RFO broadcast), the intervening hundreds of cycles of latency would leave the pipeline entirely idle. To solve this problem, modern CPUs insert a Store Buffer between the pipeline execution units and L1d.

When a store instruction executes, its address and data are not written directly into L1d. Instead, they are placed in a Store Buffer entry. From the pipeline's perspective, the store is then complete: it can retire, and subsequent instructions can continue to issue. After the store has retired, the hardware commits the buffered data to L1d asynchronously, once the target cache line is held in L1d in a writable MESI state (E or M).

The Store Buffer also provides a Store-to-Load Forwarding mechanism: if a load instruction's target address matches an outstanding store in the Store Buffer that has not yet been written back to L1d, the hardware forwards the latest value directly from the Store Buffer to the load instruction, without waiting for the L1d write to complete. Semantically, this value is equivalent to "having already been written to the cache."
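The interaction between buffering and forwarding can be sketched as a small software model. This is only an illustration of the idea, assuming a FIFO buffer and a map standing in for L1d; `ToyCore`, `remote_view`, and `local_sees_store_before_remote` are hypothetical names, and real hardware matches addresses at byte granularity with many more corner cases.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

// Toy model of one core's store path; names and sizes are hypothetical.
struct ToyCore {
    struct Entry { std::uint64_t addr; int value; };

    std::deque<Entry> store_buffer;               // oldest entry at the front
    std::unordered_map<std::uint64_t, int> l1d;   // stands in for L1d contents
    std::size_t capacity;

    explicit ToyCore(std::size_t cap) : capacity(cap) { assert(cap > 0); }

    // Returns false when the buffer is full, i.e. the pipeline would stall.
    bool store(std::uint64_t addr, int value) {
        if (store_buffer.size() == capacity) return false;
        store_buffer.push_back({addr, value});
        return true;
    }

    // Local load: search the buffer youngest-first (store-to-load forwarding),
    // then fall back to the cache. Unwritten locations read as 0.
    int load(std::uint64_t addr) const {
        for (auto it = store_buffer.rbegin(); it != store_buffer.rend(); ++it)
            if (it->addr == addr) return it->value;
        return remote_view(addr);
    }

    // What another core could observe: only data that has reached L1d.
    int remote_view(std::uint64_t addr) const {
        auto hit = l1d.find(addr);
        return hit == l1d.end() ? 0 : hit->second;
    }

    // Commit the oldest pending store to L1d.
    void drain_one() {
        if (store_buffer.empty()) return;
        l1d[store_buffer.front().addr] = store_buffer.front().value;
        store_buffer.pop_front();
    }
};

// Usage: the local core sees its own store at once; a remote observer does not.
bool local_sees_store_before_remote() {
    ToyCore core(4);
    core.store(0x100, 42);
    bool local_ok      = core.load(0x100) == 42;        // forwarded from the buffer
    bool remote_before = core.remote_view(0x100) == 0;  // not yet in L1d
    core.drain_one();
    bool remote_after  = core.remote_view(0x100) == 42; // visible after draining
    return local_ok && remote_before && remote_after;
}
```

The gap between `load` and `remote_view` in this model is the same visibility gap that makes memory barriers necessary on real hardware.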

Store Buffer capacity is finite. In modern high-performance processors, the store buffer typically ranges from several dozen to over a hundred entries; exact numbers vary by microarchitecture generation (for example, 56 entries on Intel's Skylake-generation cores with larger buffers in later generations, roughly 44–48 entries on AMD's early Zen cores, and, as estimated through reverse engineering, above 80 entries on Apple's M series). This means that if a large number of store instructions pile up in a short period, the Store Buffer can fill. Once it is full, subsequent stores cannot obtain an entry: the pipeline backs up behind them, out-of-order execution loses its ability to hide latency, and the CPU can only wait for the Store Buffer to free up. This is the causal chain behind L1d "seemingly disappearing" under heavy multi-threaded writes and saturated memory bandwidth: the downstream memory bus is congested → L1d cannot obtain the lines it needs to commit writes → the Store Buffer cannot drain → the pipeline stalls. The bottleneck is not the speed of the cache itself, but the overall throughput of the store path.

Differences in how platforms tolerate the same bursty write traffic partly explain why the same production code can feel different in "writability" between x86 servers and Apple Silicon: a larger Store Buffer can absorb more burst writes during traffic spikes, deferring the critical threshold at which the pipeline stalls.

Memory Ordering and Barriers

The Store Buffer introduces a fundamental side effect: a store instruction becomes visible to external cores (other CPU cores) later than it becomes visible to the local core (the core that executed the store). The reason is that store-to-load forwarding allows the local core to "see" data before it has actually been written to L1d, whereas other cores cannot observe it until the data actually enters L1d and is propagated via the cache coherence protocol.

Consider the following scenario, with x and y both initially 0. Core 0 executes

x = 1;
y = 2;

Core 1 simultaneously executes

if (y == 2) {
    assert(x == 1);  // Will this assertion succeed?
}

At the hardware level, under x86's TSO (Total Store Order) model, the assertion always succeeds: TSO guarantees that stores from the same core become visible to other cores in program order, and that Core 1's two loads are not reordered with each other. (This assumes the compiler emits the accesses in program order; in C or C++ source, x and y would need to be atomics for that to hold.) Under weakly-ordered models such as ARM, however, Core 1 may observe y = 2 while still reading the old value of x. The reason is not that the Store Buffer necessarily reorders outgoing writes, but that ARM does not require stores to different addresses to become visible to other cores in program order, nor does it require Core 1's two independent loads to be performed in program order. Even though Core 0's pipeline retired the two stores sequentially, the order is only guaranteed if a barrier (or a release store) separates the two stores on Core 0 and a matching barrier (or an acquire load) orders the two loads on Core 1.
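The scenario above can be written portably with C++ atomics. The following is a minimal sketch, assuming a release store on the writer and an acquire load on the reader are the intended fix; `payload`, `ready`, and `run_message_passing` are hypothetical names, and the reader spins until the flag is set rather than testing it once.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message-passing sketch: `payload` plays the role of x, `ready` the role of y.
int run_message_passing() {
    int payload = 0;               // plain data, published via the flag
    std::atomic<int> ready{0};     // flag written last by the producer
    int seen = -1;

    std::thread producer([&] {
        payload = 1;                                  // x = 1
        ready.store(2, std::memory_order_release);    // y = 2, release store
    });
    std::thread consumer([&] {
        while (ready.load(std::memory_order_acquire) != 2) {
            std::this_thread::yield();                // wait until y == 2
        }
        seen = payload;       // ordered after the acquire load
        assert(seen == 1);    // holds on x86 and on ARM alike
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Because the acquire load that reads 2 synchronizes with the release store, the write to `payload` is guaranteed to be visible to the consumer on any conforming platform; the compiler emits whatever barrier instructions the target needs.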

What eliminates this kind of uncertainty is the memory barrier (or fence). x86 provides three main barrier instructions:

SFENCE (Store Fence): Orders stores against stores. Every store preceding the SFENCE becomes globally visible before any store following it becomes globally visible. It does not constrain loads. Because TSO already keeps ordinary write-back stores in program order, SFENCE matters mainly for weakly-ordered stores: non-temporal stores and stores to WC memory. In practice it also drains the write-combining buffers, making previously written data observable to devices (detailed below).
MFENCE (Memory Fence): The strongest of the three fence instructions. All loads and stores preceding the MFENCE become globally visible before any load or store following it. x86's TSO guarantees LoadLoad, LoadStore, and StoreStore ordering by default, but allows StoreLoad reordering: a store can linger in the Store Buffer while a subsequent load to a different address executes first. MFENCE is the dedicated fence for restoring StoreLoad ordering; lock-prefixed instructions (and XCHG with a memory operand) have the same ordering effect.
LFENCE (Load Fence): Serializes the instruction stream locally. LFENCE does not execute until all earlier instructions have completed locally, and no later instruction begins execution until LFENCE completes. It does not drain the Store Buffer and does not control cache writeback. Since TSO already orders ordinary loads, its practical use is to stop execution, including speculative execution, from crossing the boundary: for example, around RDTSC, or after a bounds check as a Spectre v1 mitigation.

x86's TSO model provides programs with a relatively strong default guarantee: stores from the same core appear in program order to other cores (StoreStore). However, TSO allows a later load on the same core to be performed before earlier stores have drained. This StoreLoad reordering is the main gap that separates TSO from sequential consistency: while a store sits in the Store Buffer, a subsequent load to a different address can execute first, reading the "old" value from L1d or from another cache. MFENCE is designed to close this gap.
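StoreLoad reordering is usually demonstrated with the "store buffering" litmus test: two threads each store to one variable and then load the other. The sketch below is an illustration, not a rigorous test harness; `count_both_zero` is a hypothetical name, and only `memory_order_relaxed` and `memory_order_seq_cst` are meant to be passed in.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the store-buffering litmus test `runs` times and counts how often both
// threads read 0. With memory_order_seq_cst that outcome is forbidden; with
// memory_order_relaxed it is allowed and may occasionally be observed.
int count_both_zero(std::memory_order order, int runs) {
    assert(order == std::memory_order_relaxed ||
           order == std::memory_order_seq_cst);
    int hits = 0;
    for (int i = 0; i < runs; ++i) {
        std::atomic<int> x{0}, y{0};
        int r0 = -1, r1 = -1;

        std::thread t0([&] {
            x.store(1, order);     // may still sit in the Store Buffer ...
            r0 = y.load(order);    // ... while this load is performed
        });
        std::thread t1([&] {
            y.store(1, order);
            r1 = x.load(order);
        });
        t0.join();
        t1.join();

        if (r0 == 0 && r1 == 0) ++hits;   // both loads overtook both stores
    }
    return hits;
}
```

With `std::memory_order_seq_cst` the function must return 0, because the compiler inserts whatever full barrier the target requires. With `std::memory_order_relaxed` a non-zero count is permitted even on x86, although thread start-up overhead makes the window narrow in a simple harness like this one.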

ARM's weak ordering model is different: apart from dependency ordering, ARM does not guarantee ordering among stores, among loads, or between loads and stores to different addresses by default. ARMv8 introduced dedicated acquire/release instructions (LDAR, Load-Acquire; STLR, Store-Release) that act as one-way barriers: LDAR ensures that all subsequent loads and stores are not reordered before it; STLR ensures that all prior loads and stores are not reordered after it. For scenarios requiring both directions, ARM's DMB (Data Memory Barrier) instruction in its full form (e.g., DMB ISH) is the rough counterpart of MFENCE. The cost of ARM's weak ordering model is this: lock-free code that runs correctly on x86 may, once ported to ARM, exhibit heisenbugs due to missing barriers, that is, ordering violations that occur with extremely low probability and are exceptionally difficult to reproduce.

Atomic Operations and Caches

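C++'s std::atomic operations take a memory ordering argument. The standard defines six values; the four covered here are the ones used most often (memory_order_acq_rel combines acquire and release for RMW operations, and memory_order_consume is rarely used in practice). Each maps to different hardware behavior: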

memory_order_relaxed: Provides no ordering constraints with respect to other memory locations. The sole guarantee is atomicity: loads and stores are never torn, RMW (Read-Modify-Write) operations are indivisible, and all threads agree on the order of modifications to that one variable. On x86, relaxed atomic loads and stores compile to ordinary mov instructions (naturally aligned accesses are already atomic), and only RMW operations require the lock prefix (e.g., lock cmpxchg). On ARM, relaxed loads and stores compile to ordinary ldr/str, and RMW is implemented through exclusive load/store loops (LDXR/STXR on AArch64, LDREX/STREX on 32-bit ARM) or, from ARMv8.1 onward, single LSE instructions such as LDADD and CAS, in either case without additional barriers.

memory_order_acquire: Acquire semantics. All subsequent loads and stores cannot be reordered before this load. Ensures that after reading a flag, all data protected by that flag has been "acquired" with correct visibility. On x86, TSO already provides LoadLoad and LoadStore ordering, so an acquire load compiles to an ordinary mov, producing no additional barrier instruction. On ARM, an acquire load compiles to the LDAR instruction, which implies a one-way acquire barrier.

memory_order_release: Release semantics. All prior loads and stores cannot be reordered after this store. Ensures that all data preparation is complete before setting the flag, allowing an acquire on another core to observe the complete state. On x86, TSO already provides LoadStore and StoreStore ordering, so a release store compiles to an ordinary mov. On ARM, a release store compiles to the STLR instruction.

memory_order_seq_cst: Sequential consistency. Provides a single total order over all seq_cst operations, and every thread observes that order identically. This is the most expensive ordering constraint in terms of performance. On x86, seq_cst loads can rely on TSO and compile to an ordinary mov; seq_cst stores need a full barrier to close the StoreLoad gap, which compilers typically obtain with xchg (implicitly locked) or with mov followed by mfence; seq_cst RMW operations use lock-prefixed instructions. The exact implementation varies by compiler version and context; one should not assume that a specific instruction sequence will always be generated. On AArch64, seq_cst loads and stores typically compile to LDAR and STLR; on ARMv7, which lacks these instructions, DMB barriers are inserted around the access.

With these mappings understood, a key conclusion emerges: on x86, acquire loads and release stores are free at the instruction level (they still restrict compiler reordering); among plain loads and stores, only seq_cst stores incur the cost of a full barrier, while RMW operations pay for the lock prefix regardless of the ordering requested. This fact leads many performance-sensitive concurrent codebases to compress the paths requiring total-order synchronization to the absolute minimum, for instance reserving seq_cst for the few flags that genuinely need it, while implementing the bulk of internal data exchange with acquire/release. On ARM, acquire/release instructions are cheaper than a full DMB barrier but are not free in the way x86's are, so ARM programmers must choose ordering constraints with greater deliberation.
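As a small illustration of choosing the weakest sufficient ordering, a shared event counter needs only atomicity, not ordering. The sketch below assumes a counter that is read only after all workers have been joined; `count_with_relaxed` is a hypothetical name.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Several threads bump one counter with relaxed RMW operations.
// Atomicity alone guarantees that no increment is lost; the final read is
// safe because thread::join() already synchronizes with each worker.
long count_with_relaxed(int threads, int increments_per_thread) {
    assert(threads > 0 && increments_per_thread >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;

    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                // On x86 this is typically a lock-prefixed add or xadd.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();

    return counter.load(std::memory_order_relaxed);
}
```

Four threads performing 10000 increments each always yield 40000. Requesting seq_cst here would not change the result; it would only ask for ordering guarantees that this code never relies on.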

The lock prefix (used on x86 atomic RMW instructions such as lock add, lock cmpxchg) does not, in modern cache coherence systems, typically lock the entire bus. Instead, the core first acquires exclusive ownership of the target cache line (M or E state under MESI) and then completes an indivisible read-modify-write sequence on that line. Only in rare cases, such as accesses to non-cacheable memory or operands that straddle two cache lines (split locks), does it degrade into a genuine bus lock. A lock-prefixed instruction also acts as a full memory barrier: the core's Store Buffer is drained before the operation completes, and later loads cannot be performed ahead of it, so the atomic RMW establishes full StoreLoad ordering both before and after itself.
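From C++, lock cmpxchg is usually reached through compare_exchange operations. The following is a minimal sketch of a compare-and-swap retry loop that implements an atomic maximum; `atomic_fetch_max` and `parallel_max` are hypothetical helpers written for this illustration, and the choice of acq_rel on success is an assumption rather than a requirement.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Atomically raise `target` to at least `candidate`.
// Returns the value observed just before the update, or the current value
// if no update was needed.
int atomic_fetch_max(std::atomic<int>& target, int candidate) {
    int current = target.load(std::memory_order_relaxed);
    while (current < candidate &&
           !target.compare_exchange_weak(current, candidate,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
        // On failure, `current` now holds the latest value; the loop retries.
    }
    return current;
}

// Each thread t proposes the values t*per_thread ... t*per_thread+per_thread-1.
int parallel_max(int threads, int per_thread) {
    assert(threads > 0 && per_thread > 0);
    std::atomic<int> best{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&best, t, per_thread] {
            for (int i = 0; i < per_thread; ++i)
                atomic_fetch_max(best, t * per_thread + i);
        });
    }
    for (auto& th : pool) th.join();
    return best.load();
}
```

Every failed compare-exchange means another core modified the line in between, so under heavy contention the cache line bounces between cores and the loop's cost is dominated by ownership transfers rather than by the instruction itself.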

WC and UC: Write-Combining and Uncacheable

Beyond the Store Buffer, the CPU has two additional specialized write channels.

Write-Combining (WC) targets device memory. Consider game rendering as an example: the CPU must write 100 MB of pixel data to the GPU's framebuffer. If every 4-byte pixel write were sent as a separate write transaction over the PCIe bus, the available bus bandwidth would be consumed by fragmented, tiny writes. The WC buffer coalesces several consecutive small writes into a larger transfer block, typically 64 bytes (one cache line in size), and sends it over the bus to the device in one burst once the buffer is full; a partially filled buffer may be flushed earlier, for example by a fence or when the buffer is needed for another address. Intel processors have only a small number of WC buffers (on the order of 4 to 10, depending on the microarchitecture), each 64 bytes in size. Non-temporal stores (_mm_stream_si128, etc.) direct data into WC buffers rather than polluting the L1–L3 caches. Note that a non-temporal store does not equate to directly writing to DRAM: for WB (write-back) memory types, the CPU may still aggregate the data through write-combining buffers before writing it out. The core objective is to avoid filling the cache hierarchy with this data, not to guarantee an immediate commit to main memory.

Uncacheable (UC) is a memory type that the OS assigns to specific physical pages. Accesses to such pages completely bypass the L1, L2, and L3 caches: reads fetch the value from the device every time, and writes are placed directly onto the bus each time. UC accesses are neither combined nor reordered, so a write to a device register is issued toward the device in program order without being held in any cache level (the write may still take time to traverse the interconnect). The OS uses page table attributes together with PAT (Page Attribute Table) and MTRRs (Memory Type Range Registers) to mark specific physical ranges as UC or WC.

A point that is easy to confuse: SFENCE also covers WC buffers. It orders earlier stores, including data still sitting in WC buffers, ahead of any later store, and in practice it forces the WC buffers to be flushed out onto the bus toward the device. Therefore, in GPU drivers or RDMA stacks, the standard pattern is to write the command or descriptor data, execute SFENCE, and only then write the doorbell register, which guarantees that the device observes a complete write sequence. MFENCE includes all the functionality of SFENCE, so if MFENCE has already been used, an additional SFENCE is unnecessary.
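The non-temporal store plus SFENCE pattern looks roughly like this in user code. This is a sketch under stated assumptions: the destination is ordinary write-back memory rather than a real device mapping, the destination is 16-byte aligned, and the size is a multiple of 16; `stream_copy` and `stream_copy_roundtrip` are hypothetical names. On targets without SSE2 the sketch falls back to memcpy, which uses no non-temporal stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>   // _mm_loadu_si128, _mm_stream_si128, _mm_sfence
#endif

// Copy `bytes` from src to dst using non-temporal stores where available.
// Assumes dst is 16-byte aligned and bytes is a multiple of 16.
void stream_copy(void* dst, const void* src, std::size_t bytes) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(bytes % 16 == 0);
#if defined(__SSE2__)
    auto* d = static_cast<__m128i*>(dst);
    auto* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0; i < bytes / 16; ++i) {
        __m128i v = _mm_loadu_si128(s + i);   // source may be unaligned
        _mm_stream_si128(d + i, v);           // non-temporal store, bypasses caches
    }
    _mm_sfence();   // order the streamed data before any later "ready" store
#else
    std::memcpy(dst, src, bytes);             // portable fallback
#endif
}

// Usage: copy a small buffer and check that the contents arrived intact.
bool stream_copy_roundtrip() {
    alignas(64) unsigned char src[256];
    alignas(64) unsigned char dst[256] = {};
    for (int i = 0; i < 256; ++i) src[i] = static_cast<unsigned char>(i);
    stream_copy(dst, src, sizeof dst);
    return std::memcmp(dst, src, sizeof dst) == 0;
}
```

In a driver, the store that follows the `_mm_sfence()` would be the doorbell or "descriptor ready" write; without the fence, that write could become visible while part of the streamed data is still sitting in a WC buffer.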

Returning to the Memory Wall thesis raised in Part I — the latency hiding of loads depends on OOO and MLP, and the latency hiding of stores depends on the Store Buffer. The essence of both is the same: finite-capacity hardware queues that decouple memory accesses lasting hundreds of cycles from the front-end pipeline. The moment those queues are exhausted, the CPU once again crashes into the Memory Wall.

The next part will enter the battlefield that data reaches after the Store Buffer drains — multicore cache coherence: how the MESI protocol maintains cache line state synchronization across multiple cores, the cost of RFO completing a closed-loop handshake among all cores, and the diagnosis and repair of false sharing in practice.