Cache Deep Dive III — Replacement Policies, Prefetch, and Single-Thread Memory Access

#architecture #computerscience #performance #systems

The previous article discussed the static structure of caches. This part moves into dynamic aspects: when a program continuously issues read requests, how the cache decides which lines to retain and which to evict; how hardware prefetchers pull data into the cache before the program even issues a request; and the performance characteristics of two extreme access patterns under single-threaded execution — sequential and random access.

Replacement and Placement Policies

Cache replacement and placement policies are primarily managed by hardware logic; the programmer has no direct control in the vast majority of cases. Modern ISAs do provide instructions that can influence cache behavior — x86's PREFETCH instruction, CLFLUSH / CLWB flush instructions, and non-temporal stores (MOVNTI and the like) that bypass the cache on writes — but these are, in essence, merely hints to the hardware: the architecture does not guarantee that a prefetch will actually occur or that a flush will actually complete; at the microarchitecture level, however, modern processors typically convert such instructions into real prefetch or flush requests. Whether actual benefit materializes depends on the prefetch distance, current bandwidth pressure, and cache state. What truly decides which line gets evicted and where data gets placed is always the combinational logic in hardware. If these decisions were delegated to the OS, every replacement would require triggering an interrupt, trapping into kernel mode, and executing hundreds of software instructions to select a victim line — an overhead far greater than any performance gain it could bring. Hardware cache replacement is done by pure digital logic in less than a single clock cycle.

When a program needs data from level k+1, the cache first checks the current level. A hit is a cache hit, saving the latency of accessing the lower level; a miss is a cache miss. Since level k is necessarily smaller than level k+1, and the amount of memory a program uses often exceeds the cache capacity, a full cache must evict an existing line to make room for a new one. The decision of which block to evict is controlled by the replacement policy. The simplest policy is random replacement — picking a line to sacrifice at random. A more sophisticated one is LRU — evicting the line that was Least Recently Used. LRU is non-trivial to implement in hardware, as it requires maintaining the access order of all lines within a set.

In real chips, LRU approximations or alternatives are widely used. Both academic research and reverse engineering generally conclude that modern Intel last-level caches (since Haswell) use a replacement mechanism conceptually similar to RRIP (Re-Reference Interval Prediction), rather than strict LRU: each cache line is tagged with a Re-Reference Prediction Value (RRPV), cleared to zero on access, and when eviction is needed, the line with the highest current RRPV is selected. When all RRPVs saturate, a global aging event is triggered. DRRIP (Dynamic RRIP) further adaptively switches between SRRIP (biased toward protecting newly inserted lines) and BRRIP (biased toward quickly evicting new lines). Compared to strict LRU, such mechanisms do not unconditionally promote every accessed line to the "most recently used" top — thereby avoiding cases where a single incidental access evicts hot data, and performing better under mixed access patterns.

Note that the complex policies described above appear primarily in larger-capacity LLCs. Caches closer to the core (L1, L2) emphasize access latency more — the few hundred picoseconds of additional delay that an extra replacement state machine might introduce are already unacceptable on the L1 path — so various LRU approximations (such as Tree-PLRU, NRU) and even random replacement are common in L1/L2. The complexity of replacement policies increases outward along the cache hierarchy, inversely related to latency tolerance.

Beyond the replacement policy, the hardware must also decide where a new piece of data should be placed — that is, the placement policy. The placement policy determines the type of a miss. If data is being accessed for the first time and is not in the cache, that is a cold miss, which is unavoidable. If there is still available space in the cache, but mapping-rule constraints cause certain addresses to be repeatedly mapped to the same location while other locations sit empty, that is a conflict miss. If the entire working set is too large and exceeds the cache capacity, that is a capacity miss.

From a measurement perspective, the three types of misses can be distinguished by shrinking/enlarging the working set and cross-referencing against the miss counts from perf stat: seeing misses even on a very small dataset (far smaller than the cache) → dominated by cold misses; miss rate varying with data distribution on a medium dataset → conflict misses; miss rate asymptotically approaching a fixed value on a large dataset → capacity misses.

Hardware Prefetchers

Since program execution inevitably requires data and instructions, can they be fetched asynchronously in advance to reduce or even eliminate misses? This is prefetching.

Modern CPU hardware prefetchers employ multiple strategies. The basic behavioral pattern of a prefetcher is: upon detecting a sequence of consecutive accesses, preemptively pull the next expected address into the cache before it is actually accessed. Implementations from different vendors each have their own emphasis.

Intel's typical configuration includes:

L1 Data Prefetcher: monitors L1d access patterns, and upon detecting two consecutive cache-line loads (within the same 4 KB page), prefetches the next cache line into L1d.
L2 Streamer (Spatial Prefetcher): monitors L1 miss requests, and upon detecting a sequence of misses at consecutive addresses, prefetches several subsequent cache lines into L2 along the same direction, typically with a prefetch depth of 2–4 lines.
L2 Adjacent Cache Line Prefetcher: within a 128-byte-aligned pair, when the L2 receives a miss request for one half, it simultaneously pulls the other half (the adjacent 64-byte line) into L2 as well. This prefetcher does not rely on pattern detection; it is purely spatial.
Next Page Prefetcher: when the access sequence approaches a 4 KB page boundary, if the next page is already registered in the TLB, the prefetcher can continue prefetching across the page boundary — crossing the page boundary without stalling, provided no page fault is triggered.

AMD Zen series (Zen 4/5) corresponding implementations:

L1/L2 Stream Detector: similar to Intel's L1 Data Prefetcher + L2 Streamer, detects sequential accesses and prefetches subsequent lines.
L2 Up/Down Prefetcher: a bidirectional prefetcher — prefetches not only in the direction of access, but also backward (fetching the previous cache line relative to the current line), more friendly to scenarios requiring bidirectional traversal (such as prefix and suffix scans).

Prefetcher aggressiveness is tunable in specific BIOS or MSR settings, but is generally not directly controlled by the application.

Software prefetch allows the programmer to explicitly specify a prefetch address in code. GCC and Clang provide __builtin_prefetch:

for (int i = 0; i < n; ++i) {
    __builtin_prefetch(&data[i + 16]);  // prefetch 16 steps ahead
    process(data[i]);
}

This built-in function accepts three arguments: the target address, the read/write type (0 for read-only, 1 for read-write), and a temporal-locality hint (0 for use-once-and-discard, 3 for retain-as-long-as-possible). On x86, it maps to the PREFETCH instruction; the compiler intrinsic is _mm_prefetch. If the prefetch distance is too short — the data arrives while the CPU is still in the middle of prior computation — the prefetch is meaningless, merely occupying a cache line ahead of time. If the distance is too long — the data arrives but is evicted by subsequent accesses before use — the bandwidth is wasted. Effective use of software prefetching therefore requires measured parameter tuning; improper use not only brings no benefit but actually evicts useful data by consuming additional bandwidth.

Single-Thread Sequential Access

The theoretical latencies given in the previous article — roughly 12 cycles for L2, roughly 200 cycles for main memory — rarely appear in their full magnitude during real sequential access. The reason is that hardware prefetchers hide most of the latency through overlapping.

Consider the following scenario: a single thread sequentially traversing a uint64_t array. When the total number of elements n is small and the entire array fits in L1d, the vast majority of accesses are L1 hits, with latency around 4–5 cycles. When n exceeds the L1d capacity but remains within the L2 range, the theoretical L2 hit penalty is 12–14 cycles, yet the measured effective cost typically lands around 8 cycles — the prefetcher has already moved subsequent data from L2 into L1d before the L1 miss occurs. When n grows large enough that main memory becomes the source, the single-access latency to DRAM is still about 200 cycles; but in a sequential streaming access, the prefetcher and MLP overlap multiple requests, bringing the amortized cost per element down to single-digit cycles. These values depend on microarchitecture and prefetcher configuration, but the direction is consistent across all modern CPUs: in sequential access, effective throughput far exceeds what single-access latency alone would suggest (these orders of magnitude can be reproduced on a target machine using Intel MLC or a comparable benchmark).

The fundamental reason prefetchers can so effectively mask main-memory latency is that sequential access provides them with the most ideal input — the increment in access addresses is fixed and predictable. This allows the prefetcher's pattern detection to lock onto the direction and stride after the first or second access, and subsequent prefetches can pull multiple lines at a time, forming a pipeline of inflowing data.

The Impact of Stride

When the traversal stride increases, prefetcher efficiency drops. If each element is 64 bytes (exactly one cache line), each access crosses one line; if 128 bytes, it crosses two lines, and the prefetcher's pipeline speed must double just to keep up with downstream consumption. At a stride of 256 bytes, the cache's "effective capacity" is diluted — although the hardware still pulls every full cache line, only a small fraction of each line is actually used, and the remaining bytes are wasted.

The most common engineering case is traversing an array of large structs while accessing only one field at a time:

struct Entity {
    int hot_field;
    char padding[60];
};
Entity entities[100000];
for (auto& e : entities) sum += e.hot_field;

Each loop iteration loads a 64-byte cache line but uses only 4 bytes (hot_field being int), for an effective utilization of 4/64 ≈ 6%. The prefetcher's bandwidth is filled with a large amount of useless data, and actual throughput drops to 6% of the theoretical bandwidth.

In such scenarios, separating the hot fields into their own independent array can dramatically improve cache efficiency:

struct Entities {
    std::vector<int> hot_fields;
    std::vector<char[60]> paddings;
};

This is the transition from AoS (Array of Structures) to SoA (Structure of Arrays). Hot fields are now laid out contiguously; a single cache line can hold 16 ints (64 B / 4 B), and each line the prefetcher brings back feeds 16 loop iterations. When a program accesses only a small subset of fields while ignoring the rest, SoA-style traversal typically significantly outperforms AoS. However, if all fields of the same element need to be accessed together (e.g., the x, y, z coordinates of a Particle struct), AoS instead makes fuller use of all the data pulled back per line by the prefetcher. This design principle, fundamental to Data-Oriented Design, is a basic strategy for cache optimization: arrange data by access pattern, not by conceptual model.

Single-Thread Random Access

The discussion above is entirely based on sequential access — scenarios where prefetchers can function effectively. When the memory access pattern becomes completely random, the situation is reversed.

Under sequential access, the amortized cost per element can be as low as a few cycles; under random access, because prefetchers and MLP struggle to be effective, the program begins to be directly exposed to the hundreds of cycles of main-memory latency. Main memory itself has not become slower. The problem lies with the hardware prefetcher — unable to recognize a random pattern, it still issues prefetch requests according to its own policy, but the prefetch addresses bear no relation to the data the program actually needs. The useless data pulled back by the prefetcher not only occupies memory bus bandwidth but also evicts useful hot data from the cache. Prefetching transforms from a means of reducing latency into a burden on the system.

The performance curves of sequential and random access exhibit fundamentally different shapes. Using a pointer-chasing benchmark to plot a latency-vs-working-set-size curve: sequential access appears as a staircase — clear latency steps at the capacity boundaries of each cache level (L1/L2/L3), with the prefetcher pressing latency down to near the theoretical hit latency at the edge of each step. Random access, by contrast, is a smoothly rising ramp — the larger the working set, the more the cache hit rate continuously declines, and latency slides gradually from L1 to L2, to L3, and finally to the DRAM plateau. On either side of the LLC capacity boundary, random-access latency transitions almost continuously rather than jumping — because the probability of a miss rises smoothly with the working set size, rather than flipping abruptly at a capacity threshold.

The reason pointer chasing is the standard method for measuring random-access latency is that it constructs a dependency chain that is fundamentally impossible for any prefetcher to predict: allocate an array with elements randomly permuted, each element storing a pointer to the next element. The CPU cannot know the address of the next load until the current load completes — this is the most stringent form of data dependency. Pointer chasing does not merely suffer from a low cache hit rate; it simultaneously destroys the entire foundation on which prefetchers, MLP, and out-of-order execution rely to hide latency: the prefetcher is defeated because it cannot predict the next address, MLP is paralyzed because addresses are serially dependent, and all instructions in the ROB that indirectly depend on the load result stall and wait.

Random access is equally devastating to the TLB — when the number of pages in the working set exceeds the number of TLB entries, every new random jump may land on an uncached page, triggering a full page-table walk. Detailed analysis of this topic, however, is deferred to Part IV.

From an engineering perspective, the irrecoverability of random-access performance means that for data structures based primarily on pointer traversal (linked lists, hash tables, B-trees, skip lists), even with arbitrarily large caches, as long as the working set exceeds the cache, performance degrades irreversibly. This is the fundamental dividing line between "cache-friendly data structures" and "memory-intensive data structures." A contiguous array is one of the data layouts most easily exploited by caches and prefetchers — not only because of sequential memory access, but more importantly because its memory layout is entirely transparent to the prefetcher. Any data structure built around indirect pointer access inherently surrenders a portion of its performance back to the memory wall. That said, in real systems, cache-optimized index structures (such as B+-trees aligning internal nodes to cache-line size, or Adaptive Radix Tree with path compression) do exploit intra-cache-line locality as much as possible — but their core access paths still involve at least one level of pointer-level indirection.

The next part will focus on the TLB, the cost of page-table walks, and how huge pages alleviate both of these problems.