Tyler Tan

Posted on Jun 8

Cache Deep Dive II — Cache Organization and CPU Topology

#architecture #computerscience #performance #systems

Part I discussed the physical roots of the memory wall, the design principles of the memory hierarchy, and the interaction between virtual and physical addresses during cache lookup. This part delves into cache internals: how addresses are partitioned into tag, set index, and block offset; the hardware trade-offs among the three organization schemes; the rationale behind the 64-byte cache line; the actual cache topologies of modern CPUs; and the inclusion policies between cache levels.

Address Partitioning

When a 64-bit address is sent to the L1 data cache, the hardware partitions it into three fields:

|←──── tag ─────→|←── set index ──→|← block offset →|
      T bits            S bits          O bits

With a cache line size of 2^O bytes, the low O bits are the block offset, ensuring every byte within the line can be addressed. For a 64-byte cache line, O = 6. The middle S bits form the set index — the cache has 2^S sets, and the address maps to exactly one of them. The remaining high T bits constitute the tag, stored alongside the data in the cache line's metadata and used during set-internal comparison to confirm a match. The set index itself is not stored — all lines in the same set share the same index bits.

Taking AMD Zen 5's L1d as an example for bit-width calculation: 48 KB / 12-way / 64 B = 64 sets, so S = 6 (2^6 = 64). O = 6 (2^6 = 64 B). Logically, T = 64 − S − O = 52 bits. However, the actual storage width of the tag is determined by the number of implemented physical address bits. On x86-64 that width is implementation-specific and reported by CPUID; the architectural maximum is 52 bits, and many parts implement fewer. (LAM and 5-level paging concern the virtual address format, not the physical width.) Taking 48 to 52 physical bits as an illustrative range and subtracting the 12 bits for S + O, the tag width participating in comparison is approximately 36–40 bits. This calculation is also shaped by the VIPT design: as described in Part I, L1d is VIPT, so the low 12 bits of the virtual address (the page offset) directly serve as the set index and block offset, while tag comparison uses the high-order physical address bits output by the TLB. For an alias-free VIPT cache, S + O ≤ 12 (the page offset width under 4 KB pages), a constraint that ensures the set index and block offset are identical between virtual and physical addresses.
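The field extraction described above can be sketched in a few lines of C++. This is only an illustration: the names `CacheGeometry`, `AddressFields`, `split_address`, and `kZen5L1d` are made up for this example, and the geometry (64 sets, 64-byte lines) is the Zen 5 L1d figure used in the text.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper types, for illustration only.
struct CacheGeometry {
    unsigned offset_bits;  // O: log2(line size in bytes)
    unsigned set_bits;     // S: log2(number of sets)
};

struct AddressFields {
    std::uint64_t tag;
    std::uint64_t set_index;
    std::uint64_t block_offset;
};

// Split an address into (tag, set index, block offset).
constexpr AddressFields split_address(std::uint64_t addr, CacheGeometry g) {
    const std::uint64_t offset_mask = (std::uint64_t{1} << g.offset_bits) - 1;
    const std::uint64_t set_mask = (std::uint64_t{1} << g.set_bits) - 1;
    return AddressFields{
        addr >> (g.offset_bits + g.set_bits),  // remaining high bits
        (addr >> g.offset_bits) & set_mask,    // middle S bits
        addr & offset_mask,                    // low O bits
    };
}

// Geometry from the text: 64-byte lines (O = 6), 64 sets (S = 6).
constexpr CacheGeometry kZen5L1d{6, 6};
```

For example, `split_address(0x12345678, kZen5L1d)` gives tag 0x12345, set index 0x19, and block offset 0x38. The sketch keeps all high bits in the tag; real hardware would store only the implemented physical address bits.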

For programmers, the most important corollary of address partitioning is this: if the access stride is an integer multiple of 2^S × cache line size, every request maps to the same set. For example, in a 64-set cache with 64-byte lines (S = 6, O = 6), a stride of 64 × 64 = 4096 bytes (exactly 4 KB, one page) forces all requests into the same set. If that cache is 8-way associative (32 KB total), the first 8 accesses fill the set, and from the 9th onward each new line evicts an earlier one. This is the hardware root of conflict misses: a program can suffer a high miss rate while most of the cache sits unused, purely because of the addressing rules. Notably, a stride of 4096 not only triggers set conflicts but also places every access on a different 4 KB page, so each one needs its own TLB entry, a topic that will resurface in the discussion of the TLB.
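To make the stride effect concrete, the following sketch counts how many distinct sets a strided access sequence touches. The function name `distinct_sets_touched` is invented for this example, and it models only the index computation, not a real cache.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Count how many distinct sets are touched by `count` accesses that start at
// `base` and advance by `stride` bytes, for a cache with 2^set_bits sets and
// 2^offset_bits-byte lines. Illustrative helper, not a real API.
std::size_t distinct_sets_touched(std::uint64_t base, std::uint64_t stride,
                                  std::size_t count, unsigned set_bits,
                                  unsigned offset_bits) {
    const std::uint64_t set_mask = (std::uint64_t{1} << set_bits) - 1;
    std::set<std::uint64_t> sets;
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint64_t addr = base + i * stride;
        sets.insert((addr >> offset_bits) & set_mask);
    }
    return sets.size();
}
```

With S = 6 and O = 6, 64 accesses at a stride of 4096 bytes touch a single set, while the same 64 accesses at a stride of 64 bytes, or at a padded stride of 4160 bytes (4096 + 64), spread over all 64 sets. Padding a row or struct by one cache line is a common way to break this kind of alignment.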

Three Cache Organization Schemes

The most intuitive way to implement a cache is to allow every cache line to store any block from main memory. This is called a fully associative cache. Its advantage is maximum cache utilization: new data can be placed in any empty slot. However, its lookup cost is prohibitive. For a 4 MB L2 cache with 64-byte lines, there are 65,536 lines. On every memory access, the processor must compare the target address's tag against the tags of every single line, which means 65,536 parallel comparators firing per access; that is infeasible in power and timing. Fully associative designs are only viable for very small structures, such as the TLBs in some Intel CPUs. For L1i, L1d, and larger caches, other approaches are required.

The other extreme is to map each main-memory address to a unique, fixed location in the cache — a direct-mapped cache. On access, the processor extracts several bits from the address to compute the target slot and compares against only that one slot's tag. A single comparator and multiplexer suffice, making it extremely fast. But the drawback is obvious: if a program repeatedly accesses multiple addresses that map to the same slot, conflict misses occur — multiple addresses fight for one slot while others sit idle. Real programs rarely exhibit uniform access patterns, causing direct-mapped cache utilization to drop sharply. A classic degenerate scenario: a program alternates between two addresses spaced exactly one cache capacity apart — each access evicts the other, yielding a zero hit rate.

Set-associative caches combine the strengths of both. The cache is partitioned into sets, each containing a fixed number of cache lines (the associativity, or number of "ways"). On access, the address first identifies the set, then all tags within that set are compared in parallel. Within a set, the behavior is fully associative; across sets, it is direct-mapped. This design mitigates conflict misses while preserving lookup speed. Virtually all contemporary CPU caches use set-associative designs.
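The difference between direct-mapped and set-associative behavior can be reproduced with a toy model. The sketch below tracks tags only and assumes LRU replacement, which the text has not specified; `ToyCache` and `ping_pong_hits` are hypothetical names used only here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <vector>

// Toy set-associative cache model with LRU replacement (an assumption; real
// hardware typically uses cheaper approximations). Tracks tags only, no data.
class ToyCache {
public:
    ToyCache(std::size_t num_sets, std::size_t ways, std::size_t line_bytes)
        : num_sets_(num_sets), ways_(ways), line_bytes_(line_bytes),
          sets_(num_sets) {}

    // Returns true on a hit, false on a miss (the line is then installed).
    bool access(std::uint64_t addr) {
        const std::uint64_t line = addr / line_bytes_;
        const std::uint64_t tag = line / num_sets_;
        std::list<std::uint64_t>& set = sets_[line % num_sets_];
        for (auto it = set.begin(); it != set.end(); ++it) {
            if (*it == tag) {
                set.splice(set.begin(), set, it);  // move to MRU position
                return true;
            }
        }
        if (set.size() == ways_) set.pop_back();  // evict the LRU line
        set.push_front(tag);
        return false;
    }

private:
    std::size_t num_sets_, ways_, line_bytes_;
    std::vector<std::list<std::uint64_t>> sets_;
};

// Alternate between two addresses `rounds` times and count the hits.
std::size_t ping_pong_hits(ToyCache& cache, std::uint64_t a, std::uint64_t b,
                           std::size_t rounds) {
    std::size_t hits = 0;
    for (std::size_t i = 0; i < rounds; ++i) {
        hits += cache.access(a) ? 1 : 0;
        hits += cache.access(b) ? 1 : 0;
    }
    return hits;
}
```

With a 32 KB direct-mapped cache (`ToyCache(512, 1, 64)`), alternating between addresses 0 and 32768 for 100 rounds yields 0 hits, the degenerate case described above. The same capacity organized as 2-way (`ToyCache(256, 2, 64)`) misses only on the first round and hits the remaining 198 of 200 accesses.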

The fundamental formula:

Cache capacity = number of sets × associativity (ways) × cache line size
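Rearranged, the formula gives the number of sets (and therefore S) for any cache. A minimal compile-time check, assuming 64-byte lines throughout and using the Zen 5 figures quoted in this article; `num_sets` is an illustrative name:

```cpp
#include <cassert>
#include <cstddef>

// sets = capacity / (ways * line size); illustrative helper name.
constexpr std::size_t num_sets(std::size_t capacity_bytes, std::size_t ways,
                               std::size_t line_bytes) {
    return capacity_bytes / (ways * line_bytes);
}

constexpr std::size_t KiB = 1024;
constexpr std::size_t MiB = 1024 * KiB;

// Zen 5 figures quoted in this article, assuming 64-byte lines.
static_assert(num_sets(48 * KiB, 12, 64) == 64, "L1d: 64 sets, S = 6");
static_assert(num_sets(32 * KiB, 8, 64) == 64, "L1i: 64 sets, S = 6");
static_assert(num_sets(1 * MiB, 16, 64) == 1024, "L2: 1024 sets, S = 10");
static_assert(num_sets(32 * MiB, 16, 64) == 32768, "L3: 32768 sets, S = 15");
```

Real L3 caches are sliced and hashed, so the L3 line is best read as an aggregate set count, not a literal index width.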

Cache associativities are not arbitrary — they represent trade-offs under physical design constraints for target workloads:

Cache Level	Zen 5	Raptor Cove (P-core)	Apple M3 (P-core)
L1d	48 KB / 12-way	48 KB / 12-way	128 KB / 16-way
L1i	32 KB / 8-way	32 KB / 8-way	192 KB / 16-way
L2	1 MB / 16-way	2 MB / 16-way	32 MB / 20-way (per cluster)
L3	32 MB / 16-way (per CCD)	36 MB / 12-way	48 MB SLC

L1d and L1i associativities and capacities are constrained by VIPT (see Part I): each way can be at most one page in size, so under 4 KB pages, 12-way 48 KB and 8-way 32 KB (4 KB per way in both cases) are natural choices within that constraint. For L2, around 16 ways has become the common balance point among access latency, power, and conflict rate in current high-performance processors: too few ways raises the conflict miss rate, while too many ways lengthens the tag-comparison timing path, requiring either a frequency reduction or additional pipeline stages to accommodate the comparison logic. Apple's M3 reportedly reaches 20-way in some P-cluster L2s, which may be partly enabled by its lower clock frequency target (~4 GHz vs. x86's ~5.5 GHz), leaving more timing margin per cycle for parallel tag comparison.
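The VIPT constraint reduces to "one way must fit in one page", which is easy to check mechanically. A small sketch with illustrative helper names (`vipt_max_bytes`, `fits_vipt`), using only the page sizes and cache shapes mentioned in this article:

```cpp
#include <cassert>
#include <cstddef>

// Largest alias-free VIPT capacity: at most one page per way.
constexpr std::size_t vipt_max_bytes(std::size_t page_bytes, std::size_t ways) {
    return page_bytes * ways;
}

// True if set index + block offset fit inside the page offset,
// i.e. the size of a single way does not exceed the page size.
constexpr bool fits_vipt(std::size_t capacity_bytes, std::size_t ways,
                         std::size_t page_bytes) {
    return capacity_bytes / ways <= page_bytes;
}

constexpr std::size_t KiB = 1024;
static_assert(fits_vipt(48 * KiB, 12, 4 * KiB), "12-way 48 KB: 4 KB per way");
static_assert(fits_vipt(32 * KiB, 8, 4 * KiB), "8-way 32 KB: 4 KB per way");
static_assert(!fits_vipt(64 * KiB, 8, 4 * KiB), "8-way 64 KB needs 8 KB ways");
static_assert(vipt_max_bytes(16 * KiB, 16) == 256 * KiB, "16 KB pages, 16-way");
```

This is why growing an x86 L1 under 4 KB pages has meant adding ways instead of sets: the only other options are larger pages or extra hardware to resolve aliases.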

Why 64-Byte Cache Lines

The cache line is not only the fundamental unit of data transfer, but also the minimum granularity at which cache coherence protocols maintain ownership — MESI and similar protocols track state, broadcast invalidations, and transfer ownership at the cache-line level. The false sharing problem discussed later is, at its core, multiple cores contending for ownership of the same cache line while operating on logically unrelated variables. Recognizing that "64 bytes is the common granularity for both data movement and ownership tracking" is prerequisite to understanding many multi-core performance problems.

Cache line size is determined by three factors.

First, tag overhead: every line must store a tag and status bits (valid, dirty, MESI state), so smaller lines mean proportionally more metadata. For a 4 MB cache with 32-byte lines (~40-bit tag, ~5-bit status), there are 131,072 lines and the metadata comes to ≈ 720 KB (≈ 18% of the data capacity). With 64-byte lines there are 65,536 lines and the metadata drops to ≈ 360 KB (≈ 9%).
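The tag-overhead arithmetic can be written out explicitly. The sketch below uses the article's rough widths (40-bit tag, 5 status bits) as assumptions; `metadata_bytes` is an illustrative name.

```cpp
#include <cassert>
#include <cstdint>

// Metadata bytes for a cache, given per-line tag and status widths in bits.
// The 40-bit tag and 5-bit status widths are the article's rough estimates.
constexpr std::uint64_t metadata_bytes(std::uint64_t capacity_bytes,
                                       std::uint64_t line_bytes,
                                       std::uint64_t tag_bits,
                                       std::uint64_t status_bits) {
    const std::uint64_t lines = capacity_bytes / line_bytes;
    return lines * (tag_bits + status_bits) / 8;
}

constexpr std::uint64_t kCapacity = 4ull * 1024 * 1024;  // 4 MB of data

// 32-byte lines: 131,072 lines * 45 bits = 737,280 bytes (720 KiB, ~17.6%).
static_assert(metadata_bytes(kCapacity, 32, 40, 5) == 737280, "");
// 64-byte lines: 65,536 lines * 45 bits = 368,640 bytes (360 KiB, ~8.8%).
static_assert(metadata_bytes(kCapacity, 64, 40, 5) == 368640, "");
```

Doubling the line size halves the number of lines and therefore halves the metadata, which is the whole of the first argument for larger lines.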

Second, spatial locality: larger cache lines pull in more nearby data on a single miss, indirectly improving hit rates.

Third, DRAM physical transfer characteristics: DDR SDRAM transfers data in bursts on consecutive clock edges. Once a row is activated, multiple columns can be read from it sequentially without additional activate overhead. 64 bytes corresponds exactly to one standard burst: on DDR4, burst length 8 × 64-bit channel = 8 × 8 B = 64 B; on DDR5, burst length 16 × 32-bit subchannel = 16 × 4 B = 64 B.

Under these three constraints, 64 bytes became the de facto standard on x86. Historically, the Intel Pentium (1993) used 32-byte cache lines; the Pentium 4 (2000) mixed 64-byte and 128-byte lines in some caches; from Core 2 (2006) onward, all caches unified at 64 bytes. Note that 64 bytes refers only to the data portion. Counting tag, valid bit, dirty bit, and MESI state bits (about 45 bits under the estimate above), each cache line actually occupies about 70 bytes, roughly 9% metadata overhead, and more once ECC bits are included. A manufacturer-labeled 32 MB L3 cache therefore requires roughly 35 MB of SRAM on the silicon.

C++17 provides std::hardware_destructive_interference_size and std::hardware_constructive_interference_size (declared in <new>). Their values are implementation-defined; on mainstream x86-64 toolchains both are typically 64, matching the cache line size. alignas(std::hardware_destructive_interference_size) forces two variables that may be concurrently written by different cores onto separate cache lines, avoiding false sharing (detailed in Part VI).
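A minimal sketch of that usage follows. Not every standard library ships the constant, so the example checks the feature-test macro and otherwise falls back to 64, which is an assumption that fits typical x86-64 parts. `kDestructive` and `PaddedCounters` are names invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Use the standard constant when the library provides it; otherwise fall
// back to 64, an assumption that holds for typical x86-64 processors.
#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kDestructive = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kDestructive = 64;
#endif

// Two counters intended to be written by different threads. Without the
// alignas, both would likely share one cache line and bounce between cores.
struct PaddedCounters {
    alignas(kDestructive) std::atomic<long> produced{0};
    alignas(kDestructive) std::atomic<long> consumed{0};
};

static_assert(alignof(PaddedCounters) == kDestructive, "struct is line-aligned");
static_assert(sizeof(PaddedCounters) == 2 * kDestructive, "one line per counter");
```

The cost is memory: each counter now occupies a full line. That trade is usually worthwhile only for data that is genuinely written concurrently by different cores.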

Modern CPU Cache Topology

CPU cores are not directly connected to main memory; all cacheable reads and writes pass through the cache hierarchy. At the first level, caches are divided into data caches and instruction caches. Intel adopted this split design starting with the Pentium in 1993 and has maintained it ever since. Splitting L1 into L1i and L1d gives the core a modified Harvard organization: instruction fetch and data access proceed in parallel, avoiding bandwidth contention on a single interface. L2 and L3 caches are generally unified: instructions and data share the same storage, which improves space utilization. When the workload is instruction-heavy, more space goes to instructions; when data-heavy, more goes to data.

The above is a general description. Specific topologies differ significantly across vendors, with direct performance implications.

AMD: CCDs and Chiplet

Since Zen 2, AMD has employed a chiplet architecture, dividing a single physical package into one I/O Die (IOD) and multiple Core Complex Dies (CCDs). From Zen 3 onward, each CCD contains 8 cores sharing one L3 cache (32 MB for both Zen 4 and Zen 5). Each Zen 5 core has private L1i (32 KB) and L1d (48 KB), plus private L2 (1 MB). When a core accesses an address residing in its local CCD's L3, latency is roughly 50 cycles; if the line is held in a different CCD, the request must be routed through the IOD's Infinity Fabric to the target CCD, raising latency to approximately 100 cycles or more (often considerably more in published core-to-core measurements).

The direct implication for programmers: on dual-CCD consumer processors (e.g., Ryzen 9 7950X, two CCDs with 16 cores total), if a thread frequently migrates between CCDs, its hot cache lines in the old core's private L1/L2 must be transferred via the coherence protocol across the IOD, with each migration incurring the cost of inter-core RFO handshakes plus the physical trace delay between CCDs. On EPYC server platforms, a single package may contain up to 12 or 16 CCDs, making cross-CCD latency non-uniformity even more pronounced. This is an on-package Non-Uniform Cache Access effect, distinct from traditional NUMA defined by memory controller distance, but with similar performance impact.
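One common mitigation is to restrict a thread to the cores of a single CCD so the scheduler cannot migrate it across the IOD. The sketch below is Linux-specific and uses `sched_setaffinity`; the helper name `pin_current_thread` is invented here, and which logical CPU numbers belong to which CCD is platform-specific and must be looked up on the actual machine (for instance from `lscpu` or the sysfs cache topology).

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for sched_setaffinity and the CPU_* macros on glibc
#endif
#include <sched.h>

#include <cassert>
#include <vector>

// Restrict the calling thread to the given logical CPUs (Linux only).
// The mapping from CPU numbers to CCDs is an assumption the caller supplies.
bool pin_current_thread(const std::vector<int>& cpus) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu : cpus) {
        if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
        CPU_SET(cpu, &set);
    }
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

If, hypothetically, logical CPUs 0 through 7 mapped to the first CCD, `pin_current_thread({0, 1, 2, 3, 4, 5, 6, 7})` would keep the thread's working set within that CCD's L3. Pinning trades scheduling flexibility for locality, so it tends to pay off mainly for long-running, cache-sensitive threads.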

Intel: Ring and Mesh

Intel client processors (e.g., Core i9-14900K) use a ring bus connecting all cores, L3 slices, GPU, and memory controller. L3 is evenly divided into slices, with each core accessing any slice via the ring. Physical addresses are hashed across the slices, so consecutive cache lines land on different ring stops. Each ring hop is estimated at about 4–5 cycles, giving a worst-case ring latency of roughly 20–30 cycles on an 8–12 node ring. Individual slices are not equidistant from a given core, but because the hash spreads accesses over all slices, every core sees approximately the same average L3 latency, in contrast to AMD's CCD architecture.

Server-class Xeon Scalable processors (e.g., Sapphire Rapids) employ a 2D mesh interconnect, with latency growing linearly with the number of mesh hops. CHAs (Caching & Home Agents) are distributed across mesh nodes, each responsible for directory tracking of a portion of the address space. A core accessing memory managed by its local CHA experiences lower latency; accessing a region managed by a remote CHA requires multiple mesh hops, with latency reaching 2–3× that of local access.

Apple: P-Clusters and SLC

Apple's M series adopts a cache hierarchy distinct from x86. Taking M3 as an example: P-cores have 128 KB L1d and 192 KB L1i (both listed as 16-way in the table above). L2 configurations across different M-series SKUs vary significantly, typically with clusters of P-cores sharing large L2 caches (Apple has not published precise official specifications; publicly available data largely comes from reverse-engineering analysis). E-cores have smaller L1 caches (commonly reported as 64 KB L1d / 128 KB L1i). All CPU clusters and the GPU share a System Level Cache (SLC), reported as 8 MB on the base M3 and up to 48 MB on Pro/Max variants. The SLC is part of the unified memory architecture: DRAM (LPDDR5) is packaged alongside the chip, and CPU and GPU access the same physical memory pool through the SLC, eliminating the need for dedicated video memory.

Apple's L1i and L1d capacities far exceed contemporary x86 (128 KB L1d / 192 KB L1i vs. x86's 48 KB / 32 KB). This is enabled by the 16 KB default page size, which relaxes the VIPT capacity constraint (16 KB × 16-way = 256 KB ceiling) and explains why Apple can invest far more SRAM budget at the L1 level than x86. Additionally, Apple's very wide decode design (at least 8 instructions per cycle on recent P-cores) demands extremely high instruction supply bandwidth: L1i output bandwidth must be sufficient to keep the decoders fed. Public reverse engineering of the M1 has not identified a micro-op cache, which is consistent with fixed-length AArch64 instructions being cheap to decode, so the large L1i carries most of the burden of sustaining the frontend.

Uncacheable Regions

Certain memory regions are not cached, such as MMIO (Memory-Mapped I/O). The OS marks these physical pages as UC (Uncacheable) via page table attributes and hardware mechanisms such as PAT (Page Attribute Table) / MTRR (Memory Type Range Register). Reads and writes to such addresses fully bypass L1–L3 caches and go directly onto the bus to the device. Meanwhile, the ISA provides instructions that allow programmers to bypass the cache: for large volumes of "write-once, discard" data (such as streaming writes to a GPU framebuffer), non-temporal stores (x86 MOVNTI or MOVNTDQ, exposed through compiler intrinsics such as _mm_stream_si32 and _mm_stream_si128 respectively) write toward memory without allocating lines in the cache, avoiding cache pollution. These instructions direct data into write-combining buffers (WC buffers), which collect a full 64-byte line before issuing a single burst onto the bus, rather than sending one transaction per store. Detailed discussion of WC and UC mechanisms appears in Part V.
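As a rough sketch of what a streaming write looks like in practice, the function below fills a buffer with non-temporal 16-byte stores on x86-64 and falls back to `memset` elsewhere. The name `stream_fill` and the macro `HAVE_SSE2_STREAM` are invented for this example; it assumes the destination is 16-byte aligned and the size is a multiple of 16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_set1_epi8, _mm_sfence
#define HAVE_SSE2_STREAM 1
#endif

// Fill `bytes` bytes at `dst` with `value`, using non-temporal stores where
// available. Assumes dst is 16-byte aligned and bytes is a multiple of 16.
void stream_fill(void* dst, std::uint8_t value, std::size_t bytes) {
#ifdef HAVE_SSE2_STREAM
    const __m128i v = _mm_set1_epi8(static_cast<char>(value));
    __m128i* p = static_cast<__m128i*>(dst);
    for (std::size_t i = 0; i < bytes / 16; ++i) {
        _mm_stream_si128(p + i, v);  // MOVNTDQ: does not allocate a cache line
    }
    // NT stores are weakly ordered; fence before other agents rely on them.
    _mm_sfence();
#else
    std::memset(dst, value, bytes);  // portable fallback: ordinary cached stores
#endif
}
```

Four consecutive 16-byte stores cover one 64-byte line, which is what lets the write-combining buffer emit a single full-line transaction. Non-temporal stores tend to help only when the data will not be read again soon; for buffers that fit in cache and are reused, ordinary stores are usually faster.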

Inclusive, Exclusive, and Non-Inclusive

The inclusion relationship between cache levels is an important microarchitectural choice that directly determines effective cache capacity and coherence protocol overhead.

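Inclusive: every line in L1 must also exist in L2; likewise for L2–L3. Evicting a clean line from the inner level requires no data transfer because the outer level already holds a copy, and the outer level's tags can answer coherence queries on behalf of the inner levels, but the duplicated lines waste capacity.
Exclusive: a line in L1 does not exist in L2 or L3; each line lives in exactly one cache level. Lines evicted from an inner level are moved into the next level (which acts as a victim cache), wasting no capacity but lengthening the eviction path.
Non-inclusive: the inclusion relationship is neither guaranteed nor forbidden. An outer level may or may not hold a line that is present in an inner level.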

Between L1 and L2, the policy varies by vendor: Intel's recent cores are generally described as non-inclusive (L1 and L2 store data independently without mandatory duplication), while AMD documents Zen's L2 as inclusive of L1. Between L2 and L3, there are two camps.

Intel Core client processors have seen significant changes in L2–L3 inclusion across microarchitecture generations. From Nehalem (2008) through the client Skylake family, Intel used a strictly inclusive LLC, motivated not by capacity management but by the snoop filter: when Core A needs to know whether a cache line at a given address is held by other cores, full-die broadcast (querying every core individually) would cause interconnect traffic to grow linearly with core count. An inclusive L3 provides a shortcut: since every line in L2 must have a copy in L3, simply checking L3's tag array answers "which core holds this address." L3's tag array thus doubles as a snoop filter, absorbing coherence queries that would otherwise be broadcast. The cost is capacity loss: L3 effective capacity ≈ nominal capacity − Σ(all core L2 capacities).
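The capacity cost of inclusion grows quickly as private L2s get larger. The sketch below applies the formula above to two hypothetical configurations, chosen only to show the trend; `inclusive_unique_l3_kib` is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>

// Unique (non-duplicated) L3 capacity under a strictly inclusive policy,
// following the formula in the text: nominal L3 minus the sum of all L2s.
// Returns 0 if the L2s together are at least as large as the L3.
constexpr std::size_t inclusive_unique_l3_kib(std::size_t l3_kib,
                                              std::size_t l2_kib_per_core,
                                              std::size_t cores) {
    const std::size_t total_l2 = l2_kib_per_core * cores;
    return total_l2 >= l3_kib ? 0 : l3_kib - total_l2;
}

// Hypothetical configurations, not measurements of any specific product.
// Small L2s: 4 cores x 256 KiB under an 8 MiB L3 -> 7 MiB unique (12.5% lost).
static_assert(inclusive_unique_l3_kib(8 * 1024, 256, 4) == 7 * 1024, "");
// Large L2s: 8 cores x 2 MiB under a 36 MiB L3 -> 20 MiB unique (~44% lost).
static_assert(inclusive_unique_l3_kib(36 * 1024, 2 * 1024, 8) == 20 * 1024, "");
```

With 256 KB L2s the duplication is a modest tax; with multi-megabyte L2s a strictly inclusive L3 would give up a large share of its nominal capacity, which is one way to read the move away from inclusion.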

Starting with the server Skylake-SP (2017), Intel transitioned to a non-inclusive LLC, and client parts followed in later generations. Contemporary Golden Cove and Raptor Cove no longer require L2 lines to keep copies in L3, instead relying on distributed directory information and LLC metadata to independently track the ownership of each cache line. This shift eliminates the duplicate storage overhead of L2 data in L3, making L3's nominal capacity its effective capacity, but introduces the SRAM overhead of the directory itself and additional lookup latency.

AMD's Zen architecture does not make L3 inclusive of L2: the L3 is filled mainly by lines evicted from the L2s, so it behaves as a mostly exclusive victim cache. There is no requirement that "L2 contents must be backed in L3"; the full 32 MB of L3 can hold data that is not present in any L2. Snooping functionality is achieved through independent probe filters or directory tracking, without relying on inclusion. This choice gives AMD higher effective utilization of the labeled L3 capacity. For memory-intensive workloads with large working sets and low data reuse, a non-inclusive design is superior.

For Apple's M series, the inclusion semantics between the SLC and the cluster L2s have not been disclosed, so any characterization of the SLC as inclusive or exclusive of L2 remains speculative.

Subject to correct memory model enforcement, the CPU enjoys considerable freedom in cache management. Take x86 TSO (Total Store Order): as long as Core 0's sequence of writes to A then B is observed by all other cores as A changing before B, any optimization is permitted above that TSO baseline — for instance, opportunistically writing back dirty cache lines to main memory during idle bus cycles and clearing their dirty bits. Such operations are fully transparent to the programmer as long as the memory model is not violated.

This part analyzed the internal organization of caches. The next part moves into dynamic behavior: the hardware implementation of cache replacement policies, the classification and behavior of hardware prefetchers, and the performance characteristics of sequential versus random access under single-thread conditions, as shaped by prefetchers and TLBs.
