Cache Deep Dive I — The Memory Wall and Locality

#architecture #computerscience #performance #systems

Introduction

A core conviction behind this series: engineers who truly understand the underlying systems possess an intuition in performance engineering that is difficult to replace. When code exhibits unexpected latency, they can trace the problem down through the cache hierarchy, pipeline state, coherence protocol, and even kernel scheduling paths to its physical root cause. This ability is not built on algorithm textbooks — it rests on low-level foundations that are easily overlooked: CPU caches, pipelines, NUMA, Linux kernel memory management, and more. None of these pieces are "difficult" in isolation, but once they combine into a complete picture, they fundamentally change how one reads code.

This series assumes readers have a basic understanding of computer architecture and operating systems. If not, reading CSAPP first would be advisable.

The Memory Wall

Since the 1980s, processor performance has grown at roughly 60% per year, while DRAM access latency has improved by only about 7% annually. By the late 1990s, this divergence was stark enough that Wulf and McKee coined the term "Memory Wall" in 1995 — the processor's computational capacity is constrained by the speed at which data reaches the registers from memory. Nearly three decades later, DRAM's absolute access latency still hovers at the 60–80 ns range; the core problem has not been eliminated by process advances.

Typical access latencies across the storage hierarchy:

Storage Level	Latency (Typical)	Zen 5 (~5 GHz)	Golden Cove (~5.5 GHz)
Register	≤1 cycle	—	—
TLB	≤1 cycle	—	—
L1 Cache	~4–5 cycles, ~1 ns	4 cycles	5 cycles
L2 Cache	~12–14 cycles, ~3 ns	14 cycles	13 cycles
L3 Cache	~40–50 cycles, ~10 ns	50 cycles	44 cycles
Main Memory	~150–250 cycles, ~60–80 ns	~200 cycles	~200 cycles
NVMe SSD	~15,000 ns	—	—
HDD	~5,000,000 ns	—	—

For context, a lightweight syscall costs tens to hundreds of nanoseconds — meaning a single main-memory access is already comparable to a system call. With KPTI (Kernel Page Table Isolation, the Meltdown mitigation) enabled, additional page-table switching and TLB-related overhead narrow the gap further. It is also worth noting that Apple's M3 series uses 16 KB pages, giving TLB coverage inherently superior to 4 KB pages, and its system call mechanism differs from x86, with syscall latencies in the 15–30 ns range — platform differences mean the "memory wall" does not feel the same everywhere.

The latency hierarchy stems from physical mechanisms. L1–L3 caches are built from SRAM, requiring six transistors per bit — fast but large in area and high in cost. DRAM needs only one transistor and one capacitor per bit, offering high density at low cost, but every access must go through a full timing sequence of row activate, column strobe, and precharge — these tens of nanoseconds of overhead are the hard floor of DRAM physics. For NVMe SSDs, the NAND media read itself contributes relatively little to latency; most comes from PCIe bus transfer and NVMe protocol stack processing. HDDs involve mechanical seek delays from the read head, belonging to an entirely different physical regime than electronic latencies.

These latencies are not always fully exposed on every memory access. Modern CPUs rely on two complementary mechanisms to hide them.

The first is out-of-order execution (OOO): when a load instruction waits for memory, the CPU picks later instructions from the reorder buffer (ROB) that do not depend on that load's result and continues executing, overlapping computation with memory access in time. The ROB typically holds hundreds of instructions — valuable buffering within a 200-cycle DRAM latency window, but nowhere near enough to fill the entire wait.

The second, equally important but often overlooked, is Memory-Level Parallelism (MLP): the CPU's memory subsystem allows multiple outstanding memory requests to be in flight on the bus simultaneously. Hardware MSHRs (Miss Status Holding Registers, Intel terminology) or LFBs (Line Fill Buffers) track each outstanding cache miss — each core typically has 10–12 such tracking slots. If a program has two independent load instructions — for example, traversing two unrelated linked lists — the CPU can issue both to the memory subsystem concurrently, with the second request not waiting for the first to complete. Two 200-cycle requests are overlapped, yielding an effective latency of roughly 200 cycles rather than 400. Modern server CPUs survive on the memory wall precisely through this MLP + OOO collaboration: OOO finds parallelizable memory operations in the instruction stream, and MLP enables them to actually execute in parallel on the bus.

Both defenses share the same blind spot: data dependency chains. In a = p->next; b = a->next;, the address of the second load depends on the result of the first — the second cannot be issued until the first returns. MLP drops to zero at this point, and all instructions in the ROB that indirectly depend on these addresses stall, gradually exhausting the OOO window. This is the fatal weakness of dependency-chain-intensive operations such as pointer chasing, hash table probing, and B-tree traversal: the program hits DRAM's 200-cycle hard wall while hundreds of execution units inside the CPU sit idle. The fundamental motivation for cache design lies exactly here — by keeping "faster, smaller copies" across multiple storage levels, each step in the dependency chain lands in L1/L2's single-digit cycle latency rather than DRAM.

These latency numbers can be observed directly on a target machine using perf stat or Intel MLC (Memory Latency Checker). By constructing a pointer-chasing benchmark with a singly linked list, using RDTSC for timestamping and LFENCE to eliminate instruction-reordering bias, one can precisely measure the access latency of each cache level. The methodology will be explained further when discussing random access patterns later in this series.

The Memory Hierarchy

The central idea of the memory hierarchy: each level (level k) serves as a smaller, faster cache for level k+1. Data is transferred between two levels in fixed-size units called blocks. For example, if level k has 4 blocks and level k+1 has 16 blocks, data moves back and forth between these layers. The block size between any pair of adjacent levels is fixed — between main memory and cache, the block corresponds to a cache line — but different level pairs may use different block sizes. In modern CPUs, the block size across L1, L2, L3 caches and main memory is almost uniformly 64 bytes.

Anchoring this in real CPUs, the cache parameters of major microarchitectures (production models as of 2024):

Microarchitecture	L1d (per core)	L2 (per core/cluster)	L3 / LLC (shared)
AMD Zen 5	48 KB / 12-way	1 MB / 16-way	32 MB / 16-way (per CCD)
Intel Golden Cove	48 KB / 12-way	2 MB / 16-way (P-core)	36 MB (i9-14900K)
Apple M3 P-core	128 KB / 16-way	32 MB (per P-cluster, shared)	48 MB (SLC)

That total cache capacity is roughly 1/1000 of main memory is no coincidence: SRAM is about 5–10× larger in area than DRAM and roughly 100× more expensive per bit. Equipping a processor with gigabytes of SRAM would push die area and power consumption far beyond manufacturing feasibility. This economic constraint fundamentally determines the capacity ratios across cache levels.

The Locality Principle

Caches work not because program memory access is uniform and random — quite the opposite: typical programs access memory in a highly non-uniform fashion. This non-uniformity manifests in two dimensions.

Temporal locality: an address, once accessed, is very likely to be accessed again in the near future. Typical sources include loop variables, frequently-called function stack frames, and short-lived counters or status flags.

Spatial locality: after accessing an address, nearby addresses are very likely to be accessed soon afterwards. Sources include sequential array traversal, contiguously-allocated struct fields, and the sequential execution of instruction streams.

These two forms of locality are the prerequisite for caches to function at all: if programs truly accessed memory purely at random, no cache hierarchy design could prevent hit rates from approaching zero. Data from standard benchmarks (e.g., SPEC CPU 2017) shows that well-written programs typically achieve L1 data cache hit rates above 90%, with L2 hit rates exceeding 70% on the subset that misses L1. This means the vast majority of instructions never face DRAM's full latency.

The quantitative tool that unifies temporal and spatial locality is reuse distance (also called LRU stack distance): between two consecutive accesses to a given address, how many other distinct addresses does the program access? If the reuse distance is smaller than the cache capacity — more precisely, smaller than the number of sets in the cache — the access is a hit; otherwise, a miss. Analyzing a program's reuse distance distribution allows one to predict cache behavior without running benchmarks. This concept is the core working model of cache simulators such as Valgrind's Cachegrind. In real hardware, set-associative structures also introduce conflict misses, so this relationship holds only as an approximate analytical tool.

Sequential access and random access form two extremes within this framework. Sequential access exhibits short reuse distances and strong spatial locality, allowing prefetchers to stay ahead; random access often has reuse distances exceeding the effective range of any cache level and prefetcher. Detailed analysis of these two access patterns appears in Parts III and IV of this series.

Virtual Addresses and Cache Addressing

A cache line identifies its corresponding main-memory block via an address tag. This address can be virtual or physical. The choice between the two involves fundamental design trade-offs.

VIVT (Virtually Indexed, Virtually Tagged): Cache indexing and tagging both use virtual addresses. The advantage is that the cache lookup can complete without waiting for TLB translation, minimizing latency. However, the synonym problem arises: when the same physical page is mapped to multiple virtual addresses (common in shared-memory scenarios), multiple copies of the same data may simultaneously exist in the cache, with a modification to one semantically invalidating the others. Additionally, different physical addresses with identical virtual addresses (the same VA in different processes) pollute each other, forcing a cache flush on every context switch. As a result, VIVT is almost never used in modern general-purpose processors, appearing only in a few special-purpose tiny caches (such as the internal structures of some TLB implementations).

PIPT (Physically Indexed, Physically Tagged): Both indexing and tagging use physical addresses. Data correctness is guaranteed, but every cache access must first translate the virtual address to a physical address via the TLB — effectively adding TLB latency on top of cache latency. This is unacceptable for L1d, which has a total latency target of only 4–5 cycles and cannot afford an additional TLB translation stage. PIPT is therefore mainly used for L2 and lower-level caches, where latency budgets are more generous.

VIPT (Virtually Indexed, Physically Tagged): The compromise for L1d. The cache uses the low-order bits of the virtual address as the set index while simultaneously sending the virtual address to the TLB for translation; after locating the target set, the physical address from the TLB is compared against the physical tags of each way in the set. The key insight: the low-order bits of the virtual address (the page offset) are identical to the low-order bits of the physical address — address translation only modifies the high-order bits. Therefore, as long as all the cache index bits fall within the page-offset range (i.e., bits that do not change during translation), cache indexing can proceed in parallel with TLB translation. Address bits beyond the page offset would be ambiguous until TLB translation completes, causing the aliasing problem.

This constraint directly limits the maximum capacity of a VIPT L1d cache: with 4 KB pages, the page offset is 12 bits (bits 0–11). If the cache's set index bits fall within bits 0–11, the total cache capacity must not exceed associativity × page size. For an 8-way set-associative cache, maximum capacity = 8 × 4 KB = 32 KB. This explains why x86 processors long had L1d caches of 32 KB 8-way — not a coincidence, but a VIPT addressing constraint. Starting with Zen 4, AMD enlarged L1d to 48 KB 12-way (12 × 4 KB = 48 KB), while Apple's M3 L1d reaches 128 KB 16-way — the latter, aided by 16 KB pages, supports a VIPT ceiling of 16 × 16 KB = 256 KB, far exceeding the constraint under 4 KB pages.

Virtually all modern high-performance general-purpose processors use VIPT for L1d and L1i, with PIPT from L2 downwards. The fundamental reason for choosing VIPT is performance: it allows cache indexing and TLB translation to proceed in parallel within the same pipeline stage, saving a pipeline cycle that is decisive for the L1 critical latency path.

The aliasing problem introduced by VIPT (when different virtual addresses map to the same physical address, identical physical tags but different virtual indices cause the same physical line to appear in multiple sets) must be handled by the OS during page allocation. Traditional Unix used page coloring to ensure that the low-order bits of virtual addresses for shared pages match, thereby avoiding aliasing. Modern operating systems increasingly rely on hardware cache design and page-mapping constraints to avoid aliasing-related correctness issues.

Measurement Primer

The following two commands form the measurement foundation for all subsequent performance discussions:

perf stat -e L1-dcache-load-misses,L1-dcache-loads,LLC-load-misses,LLC-loads ./program

This reports the miss counts and total accesses for the L1 data cache and the Last Level Cache (LLC, typically L3). High L1 miss rates usually point to data layout problems (see Part III); high LLC miss rates usually point to working sets exceeding cache capacity (see Part IV).

perf stat -e cycles,instructions,cache-misses,cache-references ./program
IPC=$(instructions / cycles)

The ratio of instructions to cycles is IPC (Instructions Per Cycle). When IPC is significantly below the microarchitecture's theoretical peak (e.g., Zen 5's 8-wide issue width corresponds to a typical value of about 4–6), and cache-misses is simultaneously high, the problem usually points to cache efficiency.

To precisely measure the latency of each cache level, the core method is to construct a pointer-chasing benchmark: allocate an array internally linked as a singly linked list in random order, traverse the list, and measure the average time per step. By controlling the list length to be less than L1d capacity (measuring L1 latency), greater than L1d but less than L2 (measuring L2 latency), greater than L2 but less than L3 (measuring L3 latency), and greater than LLC (measuring main memory latency), one can precisely fit the access latency for each level.

This part begins at the top of the picture: why the memory wall exists, how the memory hierarchy is designed, why locality enables caching to work, and how virtual and physical addresses interact during cache lookup. The next part dives into the cache internals: the hardware trade-offs of the three organization schemes, why cache lines are 64 bytes, the cache topology and inclusion policies of modern CPUs, and the specific mechanism by which an address is partitioned into tag, set index, and block offset.