DEV Community: Tyler Tan

From 178ms to 1ms: When Store-to-Load Forwarding Stalls Your For Loop

Tyler Tan — Thu, 11 Jun 2026 17:18:02 +0000

Take a look at this C++ snippet. Spend half a minute reading it, then take a guess — how long do you think it takes to run?

constexpr int N  = 1'000'000;
constexpr int NG = 262144;

auto* buf = new uint64_t[NG];
std::memset(buf, 0xfe, NG * 8);          // fill with 0xfe, meaning "empty slot"

for (int i = 0; i < N; ++i) {
  int g       = (i >> 7) & (NG - 1);     // first 128 iterations g=0, next 128 g=1, and so on
  uint64_t w  = buf[g];                  // read 64-bit word

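  uint64_t fb = w & 0x8080808080808080ULL;   // high bit of each byte: set while that slot still holds 0xfe
  // SWAR test for bytes equal to 0xfe; its result is not used in this loop
  uint64_t e  = ( (w ^ 0xfefefefefefefefeULL) - 0x0101010101010101ULL )
              & ~(w ^ 0xfefefefefefefefeULL) & 0x8080808080808080ULL;
  int pos = fb ? (std::countr_zero(fb) / 8) : 0;   // index of the first empty byte, or 0 if none is left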

  ((uint8_t*)buf)[g * 8 + pos] = (i & 0x7f);  // write 1 byte
}

The program looks a bit odd, but it has a real-world use case (revealed at the end). What it does is actually straightforward:

There's a uint64_t array (262144 elements), all initialized to 0xfe
The loop runs 1 million times. The traversal pattern is simple: the first 128 iterations read/write buf[0], the next 128 read/write buf[1], then 128 on buf[2]... advancing every 128 iterations. From a cache perspective, this is very friendly — repeatedly hitting the same cache line
Each iteration: read a 64-bit value, do some bitwise operations, then modify one byte and write it back

The loop itself advances i from 0 to N-1 sequentially, and g increments accordingly — one group at a time before switching to the next. The prefetcher has an easy time predicting this. From temporal locality, spatial locality, to hardware prefetching, everything looks almost flawless.

With that in mind, let's estimate: each iteration does one mov load, a few ALU bitwise ops, one mov store. The bitwise ops barely consume extra cycles under superscalar scheduling. Assuming all stores hit L1 cache, each iteration should take about 5–8 cycles. 1 million iterations × 6 cycles ÷ 2.5 GHz ≈ 2.4 ms.

Let's actually run it:

Time: 178 ms

178 milliseconds. That's an average of ~178 ns per iteration. On the M5 chip I used (running at 2.5 GHz), that works out to about 445 clock cycles per iteration.
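For readers who want to reproduce the measurement, here is a minimal, self-contained sketch of a harness around the loop above. The names run_byte_store, time_ms, cycles_per_iter, and ctz64 are illustrative additions, not part of the original program; ctz64 stands in for std::countr_zero so the sketch also builds without C++20, and the unused e computation is left out. The byte indexing assumes a little-endian machine, as the original snippet does.

```cpp
#include <chrono>
#include <cstdint>
#include <cstring>
#include <vector>

// Portable stand-in for std::countr_zero (C++20); x must be nonzero.
inline int ctz64(uint64_t x) {
    int n = 0;
    while ((x & 1) == 0) { x >>= 1; ++n; }
    return n;
}

// Runs the byte-store loop for n iterations over ng groups (ng must be a
// power of two) and returns the final buffer so the work cannot be discarded.
inline std::vector<uint64_t> run_byte_store(int n, int ng) {
    std::vector<uint64_t> buf(ng);
    std::memset(buf.data(), 0xfe, buf.size() * sizeof(uint64_t));
    auto* bytes = reinterpret_cast<uint8_t*>(buf.data());
    for (int i = 0; i < n; ++i) {
        int g = (i >> 7) & (ng - 1);
        uint64_t w = buf[g];                         // 8-byte load
        uint64_t fb = w & 0x8080808080808080ULL;
        int pos = fb ? ctz64(fb) / 8 : 0;
        bytes[g * 8 + pos] = static_cast<uint8_t>(i & 0x7f);  // 1-byte store
    }
    return buf;
}

// Wall-clock milliseconds for one call of f().
template <class F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Converts a measured time into cycles per iteration at an assumed clock rate.
inline double cycles_per_iter(double ms, double iters, double ghz) {
    return ms * 1e-3 * ghz * 1e9 / iters;
}
```

Timing a call such as run_byte_store(1000000, 262144) with time_ms and feeding the result to cycles_per_iter reproduces the arithmetic above: 178 ms over 1 million iterations at an assumed 2.5 GHz is 445 cycles per iteration. Actual timings will differ by machine and compiler flags.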

Thinking...

If each iteration only does one load, a few bitwise ops, and one store, why on earth would it cost 445 cycles?

False sharing? No — this is a single-threaded program; the cache coherence protocol isn't involved at all. False sharing doesn't apply here.

Cache misses? No — 128 consecutive iterations all operate on the same slot, repeatedly hitting L1d. The cache hit rate couldn't be higher.

Branch prediction? No — the loop body has no unpredictable if statements. The iteration count is known at compile time. The branch predictor is under zero pressure.

TLB miss then? No — the entire array is only 2 MB, easily fitting in L2, with TLB coverage to spare.

Keep thinking...

After ruling out each candidate one by one, an inconspicuous detail suddenly stands out — every iteration reads a full uint64_t, but modifies only one byte, then writes that changed byte back.

Read → modify 1 byte → write back → next iteration reads the same 64-bit word.

Wait.

When the next iteration executes w = buf[g], where is the byte we just wrote? It's still in the Store Buffer.

Why is there a Store Buffer? Because store instructions are too slow. A store needs to write data into L1d, but the target cache line might not even be in L1d — it could be held by another core, requiring an RFO (Request For Ownership) handshake, costing hundreds of cycles round-trip. Without a Store Buffer, the pipeline would stall — the store can't complete, and everything behind it grinds to a halt. The Store Buffer exists precisely to decouple the pipeline from the slow process of writing back to L1d: the store drops the data in and calls it done, the pipeline marches on, and the buffer drains asynchronously in the background.

Think of the Store Buffer as a "staging area" for L1d — data written into it sits there temporarily, not yet committed. Meanwhile, it has a bonus ability called Store-to-Load Forwarding: if a subsequent load happens to hit a store that hasn't drained from the buffer yet, the CPU forwards the latest value directly to the load, bypassing L1d entirely.

Sounds perfect, right? We just wrote a byte, and the next read gets it forwarded directly — no waiting at all.

But the problem is exactly in that "forwarding" step.

Our store wrote only 1 byte. The next load wants to read 8 bytes (the full 64-bit word). The store buffer only has the new value for that one modified byte — the other 7 bytes are still in L1d. The CPU can't simply forward 1 byte from the store buffer to satisfy an 8-byte load — "your data is incomplete; where do I get the missing 7 bytes?"

What the CPU does here is called Partial Store-to-Load Forwarding. Instead of a simple forward, it:

Stalls, waiting for that byte store to drain into L1d
Reads back the full 64-bit value from L1d (1 byte newly written, 7 bytes original)
Hands the read-back value to the load instruction

On Intel/AMD, a load that cannot be forwarded typically costs on the order of ten or more extra cycles compared with a plain L1d hit. On Apple Silicon, although the store buffer is deeper, partial forwarding is just as slow. Every iteration waits for this "1 byte write → 8 byte read" merge to complete, and over 1 million iterations that adds up to 178 ms.
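The forwarding condition can be sketched as a simple containment check. This is a deliberately simplified model: real cores apply additional, microarchitecture-specific rules (alignment, multiple overlapping stores), and the Access and Forwarding types and the classify function are hypothetical names used only for illustration.

```cpp
#include <cstdint>

struct Access {
    uint64_t addr;   // starting byte address
    unsigned size;   // width in bytes
};

enum class Forwarding { None, Full, Partial };

// Classifies how a load relates to one older store that is still buffered.
// Full:    every byte the load needs is supplied by the store.
// Partial: the ranges overlap, but the store cannot supply all bytes,
//          which is the slow case described above.
// None:    no overlap, so the load is served from L1d as usual.
inline Forwarding classify(Access store, Access load) {
    const uint64_t s_end = store.addr + store.size;
    const uint64_t l_end = load.addr + load.size;
    const bool overlap = store.addr < l_end && load.addr < s_end;
    if (!overlap) return Forwarding::None;
    const bool covered = store.addr <= load.addr && l_end <= s_end;
    return covered ? Forwarding::Full : Forwarding::Partial;
}
```

In the demo loop, the store is 1 byte wide and the following load is 8 bytes wide at the same base address, so this model classifies it as Partial; an 8-byte store followed by an 8-byte load of the same word would be Full.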

Attempt 1: Full-Word Write

Since byte→word forwarding is slow, let's try a different approach — don't write 1 byte; instead, construct the full 64-bit value and write it back in one shot (full-word RMW):

// Before (byte store):
((uint8_t*)buf)[g * 8 + pos] = (i & 0x7f);

// After (full-word store):
uint64_t new_w = (w & ~(0xFFULL << (pos * 8)))
               | ((uint64_t)(i & 0x7f) << (pos * 8));
buf[g] = new_w;

Result:

Slow (byte store) : 178 ms
Full-word RMW     :  75 ms    ← 2.4× faster

Some improvement, but still not fast enough — because the next iteration still reads and writes the same array element. Even though word→word forwarding is faster than byte→word, the CPU is still doing read-modify-write on the same address over and over:

store → store buffer → load hits store buffer (forwarding) → modify the read value → another store enters the store buffer

This dependency chain serializes the iterations, rendering out-of-order execution almost useless. Under data dependency, the CPU can only wait.
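As a sanity check that the full-word variant changes only the store width and not the result, the loop can be written as a pure value computation. This is a sketch with illustrative names (run_full_word, ctz64); ctz64 again stands in for std::countr_zero.

```cpp
#include <cstdint>
#include <vector>

// Portable stand-in for std::countr_zero (C++20); x must be nonzero.
inline int ctz64(uint64_t x) {
    int n = 0;
    while ((x & 1) == 0) { x >>= 1; ++n; }
    return n;
}

// Full-word read-modify-write version of the loop: the selected byte is
// replaced inside the 64-bit value and the whole word is stored back.
inline std::vector<uint64_t> run_full_word(int n, int ng) {
    std::vector<uint64_t> buf(ng, 0xfefefefefefefefeULL);
    for (int i = 0; i < n; ++i) {
        int g = (i >> 7) & (ng - 1);
        uint64_t w = buf[g];
        uint64_t fb = w & 0x8080808080808080ULL;
        int pos = fb ? ctz64(fb) / 8 : 0;
        int shift = pos * 8;
        buf[g] = (w & ~(0xFFULL << shift))
               | (static_cast<uint64_t>(i & 0x7f) << shift);
    }
    return buf;
}
```

After a group has received its 128 writes, its word is 0x070605040302017F: the first eight iterations fill bytes 0 through 7, and every later iteration overwrites byte 0, ending with 127. On a little-endian machine the byte-store version leaves the same bytes in memory.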

The Real Fix: Breaking the Dependency Chain

The core problem is simple: each iteration reads what the previous iteration just wrote. If we make consecutive iterations access different array elements, the dependency chain disappears.

The fix — when computing the index g, don't preserve low-bit sequentiality after hashing. Add a bit mixer to the hash value, scattering consecutive inputs across the full space:

// Before (sequential index):
int g = (i >> 7) & (NG - 1);

// After (hash mixer scatters input):
auto mix = [](uint64_t h) {
    h ^= h >> 33; h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33; h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33; return h;
};
int g = mix(i) & (NG - 1);

The mixer is MurmurHash3's 64-bit finalizer (fmix64): three XOR-shifts and two multiplications, a handful of cycles of pure ALU work with no memory access. Because mix(i) and mix(i+1) are effectively uncorrelated, consecutive iterations almost certainly access different array elements.
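One way to see the scattering effect without timing anything is to count how often two consecutive iterations land in the same group. The sketch below does that for both index functions; adjacent_repeats, seq_index, and hashed_index are names introduced here for illustration.

```cpp
#include <cstdint>

constexpr uint64_t NG = 262144;   // number of groups, a power of two

// MurmurHash3 64-bit finalizer, as in the article.
inline uint64_t mix(uint64_t h) {
    h ^= h >> 33; h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33; h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33; return h;
}

inline uint64_t seq_index(uint64_t i)    { return (i >> 7) & (NG - 1); }
inline uint64_t hashed_index(uint64_t i) { return mix(i) & (NG - 1); }

// Counts the iterations i in [0, n-1) whose group equals that of iteration i+1,
// i.e. how often a load immediately follows a store to the same word.
template <class IndexFn>
int64_t adjacent_repeats(int64_t n, IndexFn index) {
    int64_t repeats = 0;
    for (int64_t i = 0; i + 1 < n; ++i)
        if (index(static_cast<uint64_t>(i)) == index(static_cast<uint64_t>(i + 1)))
            ++repeats;
    return repeats;
}
```

For 1 million iterations the sequential index repeats its group on 992,187 of the 999,999 adjacent pairs (every pair except the 7,812 that straddle a multiple of 128). With the mixer, a repeat should happen only by chance, roughly once per NG pairs.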

Now the result:

Slow (sequential) : 178 ms
Fast (hashed)     :   1 ms   ← 178× faster

1ms. The bottleneck drops from 445 cycles/iteration to about 3 cycles.

The explanation is simple — once the mixer scatters the writes:

Iteration i writes to buf[A], data enters the store buffer
Iteration i+1 reads buf[B] (A ≠ B), hitting L1d directly — because buf[B] is not on the store buffer's forwarding path, and the last time it was modified was many iterations ago; the store buffer has long since drained it into L1d
An L1d hit costs only 3–4 cycles, an order of magnitude cheaper than the dozen-plus cycles of byte→word forwarding

Note: the mixer is a bijection — it won't map two different keys to the same hash. It merely permutes the input space without introducing any collisions. Zero impact on correctness, all benefit for performance.
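The bijection claim can be checked by writing the inverse. Each XOR-shift by 33 is its own inverse on 64-bit values, and each multiplication by an odd constant is undone by multiplying with its modular inverse. The sketch below computes those inverses at run time with a Newton iteration rather than hard-coding them; unmix and inverse_odd are illustrative names, not part of MurmurHash3.

```cpp
#include <cstdint>

// MurmurHash3 64-bit finalizer, as in the article.
inline uint64_t mix(uint64_t h) {
    h ^= h >> 33; h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33; h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33; return h;
}

// Multiplicative inverse of an odd constant modulo 2^64.
inline uint64_t inverse_odd(uint64_t a) {
    uint64_t x = a;                  // correct to 3 bits: a * a == 1 (mod 8) for odd a
    for (int k = 0; k < 5; ++k)      // each step doubles the number of correct bits
        x *= 2 - a * x;
    return x;
}

// Undoes mix() step by step, in reverse order.
inline uint64_t unmix(uint64_t h) {
    h ^= h >> 33; h *= inverse_odd(0xc4ceb9fe1a85ec53ULL);
    h ^= h >> 33; h *= inverse_odd(0xff51afd7ed558ccdULL);
    h ^= h >> 33; return h;
}
```

Because unmix(mix(x)) returns x for every 64-bit x, no two keys share a full 64-bit hash. Collisions can still occur after masking with NG - 1, as with any hash table index.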

Real-World Application: Swiss Table Performance Optimization

This is not a contrived toy example. This benchmark captures a real bottleneck I encountered while implementing Swiss Table (the underlying algorithm of Go's built-in map).

The test scenario: sequentially inserting 1 million integer key-value pairs.

The initial version used std::hash<int>, which in both libstdc++ and libc++ is the identity function (hash(x) = x). Since consecutive keys produce consecutive hashes, the low 7 bits (h₂) cycle while the high bits (h₁) stay unchanged. This causes every 128 consecutive keys to land in the same "control word group", which is exactly the situation in the demo where g stays fixed.

Result:

Inserting 1 million entries: 438 ms
std::unordered_map doing the same: 12 ms

After adding a mixer:

Inserting 1 million entries: 10 ms (beats std)
Every benchmark except dense iteration improved significantly (dense iteration slowed from 5 ms to 19 ms):

Scenario	Before	After	std baseline
Sequential insert	438ms	10ms	12ms
Random insert	226ms	9ms	35ms
Mixed operations	834ms	53ms	70ms
Dense iteration	5ms	19ms	12ms

Not a better algorithm. Not a redesigned storage layout. Just one thing — a 5-line bit mixer tacked onto the hash, completely eliminating the RAW store-to-load forwarding bottleneck triggered by consecutive keys.

Conclusion

Decades ago, people said "optimization is just finding a faster algorithm." But on modern superscalar, deeply pipelined CPUs with store buffers and out-of-order execution, what the hardware actually goes through when executing your program often matters more than big-O complexity.

What does a byte store look like under the hood? Does a load go through the store buffer or straight to L1d? Will consecutive iterations hitting the same address stall the out-of-order window? If you can answer these questions, a few lines of code can unlock hundredfold performance gains.

And that's exactly what the "Foundations of Low-Level Systems" series aims to achieve.

For a deeper dive into how the Store Buffer works, see the companion article Cache Deep Dive V — Write Policies, Store Buffer, and Memory Barriers, which explains from the hardware level how store buffers decouple from the pipeline, the forwarding paths of store-to-load forwarding, and the differences in memory ordering models across ISAs.

Cache Deep Dive V — Write Policies, Store Buffer, and Memory Barriers

Tyler Tan — Thu, 11 Jun 2026 05:08:35 +0000

The previous installments focused on reads — how address translation affects memory access latency. This part turns to writes. A store instruction may seem simple, but its path from the pipeline's retire stage to the final write into an L1d cache line passes through a structure critical to multicore program correctness: the Store Buffer. Understanding how the Store Buffer works is the hardware prerequisite for understanding memory fences (barriers) and memory ordering models.

Write Policies

Caches are meant to be transparent to the programmer — modifying a cache line should be equivalent to modifying the corresponding data in main memory. There are two basic strategies for implementing this transparency: write-through and write-back.

Write-through: A cache line is immediately written to main memory after being modified. Simple to implement — cache and main memory are always consistent. But the performance penalty is severe: if a program repeatedly modifies the same cache line, every single modification must travel across the bus to main memory. Today, write-through is only used in a handful of hardware edge cases (e.g., memory-mapped registers in certain embedded systems).

Write-back: Nearly all modern general-purpose processors use this strategy for ordinary memory. A write only modifies the cache; the cache line is marked dirty and the operation is complete. Dirty data is written back to main memory only when the line is evicted. Additionally, the CPU can proactively write back dirty lines and clear the dirty flag during bus-idle periods, without waiting for eviction.

Write-back reduces write bandwidth pressure, but it introduces a critical problem: if the target cache line for a store instruction is not currently in L1d — for instance, the line is held in Modified state by core B and requires an RFO to complete — the CPU will be stalled for hundreds of cycles waiting for the cache line to arrive. Write-back alone does not solve this problem; the Store Buffer does.

Store Buffer

Suppose a CPU had to wait for the target cache line to be present in L1d and for the write to complete on every store instruction — the pipeline would be severely blocked. If the target cache line is not in L1d (requiring a fetch from L2, L3, or even main memory, or waiting for a MESI RFO broadcast), the intervening hundreds of cycles of latency would leave the pipeline entirely idle. To solve this problem, modern CPUs insert a Store Buffer between the pipeline execution units and L1d.

When a store instruction executes, its address and data are not written directly into L1d. Instead, they are held in a Store Buffer entry. From the pipeline's perspective the store is complete once it retires, and subsequent instructions can continue to issue. After retirement, the hardware commits the buffered data to L1d asynchronously, once the target cache line is held in L1d in an appropriate MESI state (E or M).

The Store Buffer also provides a Store-to-Load Forwarding mechanism: if a load instruction's target address matches an outstanding store in the Store Buffer that has not yet been written back to L1d, the hardware forwards the latest value directly from the Store Buffer to the load instruction, without waiting for the L1d write to complete. Semantically, this value is equivalent to "having already been written to the cache."
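The queue-then-drain behavior and the forwarding lookup can be captured in a toy software model. This is a sketch for building intuition only: StoreBufferModel and its methods are hypothetical names, addresses are matched at whole-word granularity, and real hardware does all of this in parallel associative logic rather than a loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

class StoreBufferModel {
public:
    // A store only queues an entry; "memory" (standing in for L1d) is untouched.
    void store(uint64_t addr, uint64_t value) { pending_.push_back({addr, value}); }

    // A load by the local core checks pending stores youngest-first
    // (store-to-load forwarding), then falls back to memory.
    uint64_t load(uint64_t addr) const {
        for (auto it = pending_.rbegin(); it != pending_.rend(); ++it)
            if (it->addr == addr) return it->value;
        return load_from_memory(addr);
    }

    // What another core could observe: only data that has already drained.
    uint64_t load_from_memory(uint64_t addr) const {
        auto m = memory_.find(addr);
        return m == memory_.end() ? 0 : m->second;
    }

    // Commits the oldest pending store, preserving FIFO order.
    bool drain_one() {
        if (pending_.empty()) return false;
        memory_[pending_.front().addr] = pending_.front().value;
        pending_.pop_front();
        return true;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    struct Entry { uint64_t addr; uint64_t value; };
    std::deque<Entry> pending_;
    std::unordered_map<uint64_t, uint64_t> memory_;
};
```

After store(0x40, 7), load(0x40) already returns 7 while load_from_memory(0x40) still returns 0. That gap between local and external visibility is what the memory ordering discussion later in this article builds on.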

Store Buffer capacity is finite. In modern high-performance processors, the store buffer typically ranges from several dozen to over a hundred entries; exact numbers vary by microarchitecture generation (published figures put Intel Skylake-class cores at 56 entries with later generations larger, early AMD Zen cores at roughly 44–48 entries, and Apple M series, estimated through reverse engineering, above 80 entries). This means that if a large number of store instructions pile up in the pipeline in a short period, the Store Buffer will become full. Once it is full, new stores cannot be allocated an entry: the pipeline stalls, out-of-order execution capability collapses, and the CPU can only wait for the Store Buffer to free up. This is the internal causal chain behind L1d "seemingly disappearing" under heavy multi-threaded writes and saturated bus bandwidth: the downstream DDR bus is congested → L1d cannot commit writes → the Store Buffer cannot drain → pipeline stalls. The bottleneck is not the speed of the cache itself, but the overall throughput of the store path.

Differences in how platforms tolerate the same bursty write traffic partly explain why the same production code can feel different in "writability" between x86 servers and Apple Silicon: a larger Store Buffer can absorb more burst writes during traffic spikes, deferring the critical threshold at which the pipeline stalls.

Memory Ordering and Barriers

The Store Buffer introduces a fundamental side effect: a store instruction becomes visible to external cores (other CPU cores) later than it becomes visible to the local core (the core that executed the store). The reason is that store-to-load forwarding allows the local core to "see" data before it has actually been written to L1d, whereas other cores cannot observe it until the data actually enters L1d and is propagated via the cache coherence protocol.

Consider the following scenario: Core 0 executes

x = 1;
y = 2;

Core 1 simultaneously executes

if (y == 2) {
    assert(x == 1);  // Will this assertion succeed?
}

Under x86's TSO (Total Store Order) model, the assertion will always succeed, because TSO guarantees that stores from the same core appear in program order to other cores and that Core 1's two loads are not reordered. Under weakly-ordered models such as ARM, however, Core 1 may observe y = 2 while still seeing the old value of x. The reason is not that the Store Buffer necessarily reorders outgoing writes, but that ARM does not require stores to different addresses to become visible to other cores in program order: even though Core 0's pipeline retired the two stores sequentially, their visibility propagation across the cache coherence network does not follow a unified global order. The guarantee is restored only if a barrier is placed between the two stores on Core 0 and Core 1's two loads are ordered as well.

What eliminates this kind of uncertainty is the memory barrier (or fence). x86 provides three main barrier instructions:

SFENCE (Store Fence): Guarantees that every store preceding the SFENCE in program order becomes globally visible before any store that follows it; later stores cannot pass it. SFENCE places no constraint on load instructions. For ordinary write-back memory, TSO already keeps stores in order, so SFENCE matters mainly for weakly ordered stores: under the WC memory type and for non-temporal stores it also forces the write-combining buffers to be flushed, making previously written data observable to devices (detailed below).
MFENCE (Memory Fence): The strongest barrier. Ensures that all load and store instructions preceding the MFENCE have become globally visible before any subsequent load or store executes. x86's TSO guarantees LoadLoad, LoadStore, and StoreStore ordering by default, but allows StoreLoad reordering: a store can linger in the Store Buffer while a subsequent unrelated load executes first. MFENCE is the dedicated fence instruction for restoring StoreLoad ordering; lock-prefixed instructions and XCHG with a memory operand have the same effect.
LFENCE (Load Fence): Serializes the instruction stream — all instructions before LFENCE must complete local execution (retire), and all instructions after it only begin issue afterward. LFENCE does not drain the Store Buffer, does not control cache writeback, and is used solely to prevent speculative execution of instructions from crossing the boundary (e.g., the serialization requirement around RDTSC). In Spectre mitigations, LFENCE is used to prevent speculative execution pollution from conditional branches.

x86's TSO model provides programs with a relatively strong default guarantee: stores from the same core appear in program order to other cores (StoreStore). However, TSO allows a subsequent load on the same core to execute before prior un-drained stores. This is StoreLoad reordering, the sole gap that separates TSO from sequential consistency. The cause is the Store Buffer: while a store sits in the queue, a completely unrelated subsequent load can issue first and read the "old" value from L1d or another cache. MFENCE is specifically designed to fill this gap.
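The StoreLoad gap is usually demonstrated with the store-buffering litmus test: two threads each store to one variable and then load the other. The sketch below expresses it with std::atomic; count_both_zero is an illustrative name. With seq_cst the outcome "both threads read 0" is forbidden by the C++ memory model, while with release/acquire it is permitted (though thread start-up skew means it may rarely or never show up in a run like this).

```cpp
#include <atomic>
#include <thread>

// Runs the store-buffering litmus test `trials` times and returns how many
// runs ended with both threads reading 0.
// seq_cst == true  -> stores and loads use memory_order_seq_cst
// seq_cst == false -> stores use release, loads use acquire
inline int count_both_zero(int trials, bool seq_cst) {
    const auto st = seq_cst ? std::memory_order_seq_cst : std::memory_order_release;
    const auto ld = seq_cst ? std::memory_order_seq_cst : std::memory_order_acquire;
    int both_zero = 0;
    for (int t = 0; t < trials; ++t) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread a([&] { x.store(1, st); r1 = y.load(ld); });
        std::thread b([&] { y.store(1, st); r2 = x.load(ld); });
        a.join();
        b.join();
        if (r1 == 0 && r2 == 0) ++both_zero;   // only possible via StoreLoad reordering
    }
    return both_zero;
}
```

On x86, compilers typically implement the seq_cst store here with an xchg or a mov followed by mfence, which is the instruction-level counterpart of the gap-filling described above.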

ARM's weak ordering model is different: on ARM, ordering among stores, among loads, and even between loads and stores is not guaranteed by default for accesses to different addresses. ARMv8 introduced dedicated acquire/release instructions (LDAR, Load-Acquire; STLR, Store-Release) to provide one-way barriers: LDAR ensures that all subsequent loads and stores are not reordered before it; STLR ensures that all prior loads and stores are not reordered after it. For scenarios requiring both directions, ARM's DMB (Data Memory Barrier) instruction plays roughly the role that MFENCE plays on x86. The cost of ARM's weak ordering model is this: lock-free code that runs correctly on x86 may, after being ported to ARM, exhibit heisenbugs (ordering violations that are extremely low-probability and exceptionally difficult to reproduce) due to missing barriers.

Atomic Operations and Caches

C++'s std::memory_order defines six ordering tags (relaxed, consume, acquire, release, acq_rel, seq_cst). The four discussed here each map to different hardware behavior:

memory_order_relaxed: Provides no ordering constraints. The sole guarantee is atomicity: loads and stores are not torn, and RMW (Read-Modify-Write) operations on the variable are indivisible with respect to concurrent accesses from other cores. On x86, relaxed atomic loads and stores compile to ordinary mov instructions (a naturally aligned mov is already atomic), and only RMW operations require the lock prefix (e.g., lock cmpxchg). On ARM, relaxed loads and stores compile to ordinary ldr/str, and RMW is implemented through exclusive load/store loops (LDXR/STXR on AArch64, LDREX/STREX on 32-bit ARM) or, where the ARMv8.1 LSE extension is available, single instructions such as LDADD, without additional barriers.

memory_order_acquire: Acquire semantics. All subsequent loads and stores cannot be reordered before this load. Ensures that after reading a flag, all data protected by that flag has been "acquired" with correct visibility. On x86, TSO already provides LoadLoad and LoadStore ordering, so an acquire load compiles to an ordinary mov, producing no additional barrier instruction. On ARM, an acquire load compiles to the LDAR instruction, which implies a one-way acquire barrier.

memory_order_release: Release semantics. All prior loads and stores cannot be reordered after this store. Ensures that all data preparation is complete before setting the flag, allowing an acquire on another core to observe the complete state. On x86, TSO already provides LoadStore and StoreStore ordering, so a release store compiles to an ordinary mov. On ARM, a release store compiles to the STLR instruction.

memory_order_seq_cst: Sequential consistency. Provides a single, globally-unique total order: all cores observe the order of seq_cst operations identically. This is the most expensive ordering constraint in terms of performance. On x86, seq_cst can often exploit TSO's natural ordering guarantees and does not necessarily correspond to an explicit MFENCE; seq_cst atomic RMW operations typically leverage lock-prefixed instructions to establish this total order. The exact implementation varies by compiler version and context; one should not assume that a specific instruction sequence will always be generated. On AArch64, seq_cst loads and stores typically compile to LDAR and STLR, whose mutual ordering is strong enough for sequential consistency; on 32-bit ARMv7, which lacks these instructions, DMB barriers are inserted around the access.

With these mappings understood, a key conclusion emerges: on x86, correctly using acquire/release is entirely free at the instruction level; only seq_cst stores incur the cost of a full barrier. This fact leads many performance-sensitive concurrent codebases to compress the paths requiring total-order synchronization to the absolute minimum, for instance using seq_cst only on flag sets or reads, while implementing large amounts of internal data exchange using acquire/release. On ARM, acquire/release is cheaper than a full DMB barrier but not free, since LDAR/STLR constrain reordering that plain ldr/str would allow, so ARM programmers must choose ordering constraints with greater deliberation.
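The flag-publication pattern these orderings are designed for can be sketched as follows. It is the x/y scenario from the barrier section rewritten with a release store and an acquire load, which makes the read of the payload well-defined on both x86 and ARM; publish_and_read is an illustrative name.

```cpp
#include <atomic>
#include <thread>

// The producer writes a plain payload and then sets a flag with release;
// the consumer spins on the flag with acquire and then reads the payload.
// Returns the value the consumer observed.
inline int publish_and_read(int payload) {
    int data = 0;                         // plain, non-atomic payload
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread producer([&] {
        data = payload;                                   // prepare the data
        ready.store(true, std::memory_order_release);     // publish it
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire))    // wait for the flag
            std::this_thread::yield();
        seen = data;                                      // guaranteed to see payload
    });

    producer.join();
    consumer.join();
    return seen;
}
```

If both orderings were weakened to memory_order_relaxed, the read of data would become a data race: the code would usually still appear to work on x86, and could fail intermittently on ARM.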

The lock prefix (used on x86 atomic RMW instructions such as lock add, lock cmpxchg) does not, in modern cache coherence systems, typically lock the entire bus. Instead, it first acquires exclusive ownership of the target cache line (M or E state under MESI) and then completes an indivisible read-modify-write sequence on that cache line. Only in a very small number of non-cacheable accesses does it degrade into a genuine bus lock. The lock prefix also implicitly includes a full memory barrier (the draining effect of MFENCE), so a lock-prefixed instruction also tells the hardware: before execution, drain all Store Buffers, ensuring that the globally-visible atomic RMW establishes full StoreLoad ordering both before and after itself.

WC and UC: Write-Combining and Uncacheable

Beyond the Store Buffer, the CPU has two additional specialized write channels.

Write-Combining (WC) targets device memory. Consider game rendering as an example: the CPU must write 100 MB of pixel data to the GPU's framebuffer. If every 4-byte pixel write were sent as a separate write transaction over the PCIe bus, the available bus bandwidth would be consumed by fragmented, tiny writes. The WC buffer coalesces several consecutive small writes into a larger transfer block — typically 64 bytes (one cache line in size) — and sends it over the bus to the device in one burst once the buffer is full. Intel processors typically have 4–6 WC buffers, each 64 bytes in size. Non-temporal stores (_mm_stream_si128, etc.) are precisely how data is directed into WC buffers rather than polluting the L1–L3 caches. Note that a non-temporal store does not equate to directly writing to DRAM — for WB (write-back) memory types, the CPU may still aggregate the data through write-combining buffers before writing it out; the core objective is to avoid filling cache hierarchies with this data, not to guarantee an immediate commit to main memory.

Uncacheable (UC) is a marking that the OS sets on specific physical pages. Accesses to such pages completely bypass the L1, L2, and L3 caches — reads fetch the value from the device bus every time, and writes are placed directly onto the bus each time. UC ensures that the CPU's writes to device registers are immediately observable by the device after the instruction retires, without being held up by any intermediate cache level. The OS uses page table attributes and mechanisms such as PAT (Page Attribute Table) / MTRR (Memory Type Range Register) to mark specific physical pages as UC or WC.

A point that is easy to confuse: in the x86 reference manuals, SFENCE also applies to weakly ordered stores. It orders earlier WC and non-temporal stores ahead of later stores, which forces data sitting in WC buffers to actually be issued onto the bus or to the device. Therefore, in GPU drivers or RDMA stacks, following a sequence of command/descriptor writes to a device with an SFENCE is the standard way to guarantee that the device observes a complete write sequence. The effect of MFENCE includes all the functionality of SFENCE, so if MFENCE has already been used, an additional SFENCE is unnecessary.
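A minimal sketch of the non-temporal store plus SFENCE pattern, using the SSE2 intrinsics _mm_stream_si128 and _mm_sfence. The function name stream_fill and the macro STREAM_FILL_HAS_SSE2 are illustrative; the destination here is ordinary write-back memory rather than a real device mapping, and on targets without SSE2 the sketch falls back to plain stores so that it still compiles.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   // _mm_set1_epi32, _mm_stream_si128, _mm_sfence
#define STREAM_FILL_HAS_SSE2 1
#endif

// Fills `count` 32-bit words at dst with `value`.
// Preconditions: dst is 16-byte aligned and count is a multiple of 4.
inline void stream_fill(uint32_t* dst, std::size_t count, uint32_t value) {
#ifdef STREAM_FILL_HAS_SSE2
    const __m128i v = _mm_set1_epi32(static_cast<int>(value));
    for (std::size_t i = 0; i < count; i += 4) {
        // Non-temporal store: goes through write-combining buffers
        // instead of allocating the line in the cache hierarchy.
        _mm_stream_si128(reinterpret_cast<__m128i*>(dst + i), v);
    }
    // Order the weakly ordered streaming stores ahead of later stores.
    _mm_sfence();
#else
    for (std::size_t i = 0; i < count; ++i) dst[i] = value;   // portable fallback
#endif
}
```

Streaming stores tend to pay off only for large buffers that will not be read again soon; for small or immediately reused data, ordinary cached stores are usually the better choice.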

Returning to the Memory Wall thesis raised in Part I — the latency hiding of loads depends on OOO and MLP, and the latency hiding of stores depends on the Store Buffer. The essence of both is the same: finite-capacity hardware queues that decouple memory accesses lasting hundreds of cycles from the front-end pipeline. The moment those queues are exhausted, the CPU once again crashes into the Memory Wall.

The next part will enter the battlefield that data reaches after the Store Buffer drains — multicore cache coherence: how the MESI protocol maintains cache line state synchronization across multiple cores, the cost of RFO completing a closed-loop handshake among all cores, and the diagnosis and repair of false sharing in practice.

Cache Deep Dive IV — TLB, Huge Pages, and Memory-Level Parallelism

Tyler Tan — Wed, 10 Jun 2026 06:55:27 +0000

Earlier parts examined the performance characteristics of sequential and random access under single-threaded execution, and noted in passing the destructive effect of random access on the TLB. This part devotes full attention to the TLB: what it is, why a TLB miss is more severe than a cache miss, why a page table walk constitutes one of the longest dependency chains a CPU can encounter, how huge pages fundamentally alter TLB reach, and where memory-level parallelism falters in the face of TLB misses.

Page Boundaries: Where the Prefetcher Halts

Part III, in its discussion of prefetchers, noted a hard constraint: a prefetcher must not cross page boundaries on its own authority. The operating system manages virtual memory in units of pages (typically 4 KB, i.e., 64 cache lines of 64 bytes). When a program reaches the end of one page and is about to step into the next, the prefetcher cannot proceed. One reason is that pages adjacent in virtual memory need not be adjacent in physical memory, so the physical line following the end of a page may belong to unrelated data. Another is that the next page may not reside in physical memory (it may have been swapped out to disk), or it may be an entirely invalid virtual address; resolving either case takes a page fault handled by the OS, which a speculative prefetch must never raise. The prefetcher therefore neither can nor is permitted to autonomously cross page boundaries without a valid translation from the TLB. Hence a performance brake appears every 4 KB: even when traversing an array sequentially, after every 64 cache line accesses the prefetch pipeline must pause and await confirmation of an address translation.

This is not to say that modern CPU prefetchers are completely unable to cross pages. Intel's Next Page Prefetcher and AMD's equivalent mechanism can consult the TLB when approaching a page boundary — if the address mapping for the next page is already registered in the TLB, the prefetcher receives clearance to continue prefetching across the boundary. But if the next page is not in the TLB (as happens the first time that page is traversed), the prefetcher must wait in place for the TLB fill to complete. This brings us to the central question of this section: what is the cost of a TLB miss itself?

TLB: The Cache for Address Translation

Every time the CPU issues a virtual address to access memory, that address must first be translated into a physical address. The mapping table from virtual to physical addresses — the page table — resides in main memory and is maintained by the OS. If every memory access required walking the page table in main memory, program performance would be unacceptable — a single virtual address translation would itself entail several main memory accesses. The TLB (Translation Lookaside Buffer) is designed precisely for this purpose: it is a dedicated cache for address translation, storing recently used mappings from virtual page numbers to physical page numbers.

The TLB's hierarchical structure is conceptually similar to that of data caches, but with far smaller capacity:

Microarchitecture	L1 D-TLB	L1 I-TLB	L2 TLB
AMD Zen 5	72 entries (fully associative)	64 entries (fully associative)	2,048 entries (8-way)
Intel Golden Cove (P-core)	96 entries	128 entries	4,096 entries (12-way)
Apple M3 (P-core)	~128 entries	~128 entries	~4,096 entries

L1 TLBs are typically fully associative: unlike L1 data caches, TLB entry counts are tiny, and a fully associative lookup across a few dozen entries is entirely feasible. The reach of the L1 TLB, calculated with 4 KB pages, is roughly 256–512 KB (64–128 entries × 4 KB), which falls neatly between the capacities of the L1d and L2 caches. The L2 TLB's reach at 4 KB works out to roughly 8–16 MB (2,048–4,096 entries × 4 KB), the same order of magnitude as a typical L3 cache.

The effective capacity of a TLB is usually measured not in entry count, but in TLB Reach:

TLB Reach = TLB entry count × page size

It represents "the largest working set a program can access while bypassing page table walks by hitting the TLB directly." For a 4,096-entry L2 TLB paired with 4 KB pages, TLB Reach = 16 MB. This means that once a program's random-access working set exceeds about 16 MB, the L2 TLB begins to suffer sustained capacity misses, and accesses to pages whose translations have been evicted incur the cost of a page table walk. This is a recurring bottleneck metric in discussions of NUMA, in-memory databases, and large-memory systems. The essence of huge page optimization is not to make an individual address translation faster, but to expand the reach of the address translation cache: with 2 MB pages, the same 4,096 entries cover 8 GB instead of 16 MB.
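The reach arithmetic is easy to check in code. The sketch below assumes, for simplicity, that every TLB entry can hold a page of the given size, which is not true of every microarchitecture; tlb_reach and the unit constants are illustrative names.

```cpp
#include <cstdint>

constexpr uint64_t KiB = 1024;
constexpr uint64_t MiB = 1024 * KiB;
constexpr uint64_t GiB = 1024 * MiB;

// TLB Reach = TLB entry count × page size, in bytes.
constexpr uint64_t tlb_reach(uint64_t entries, uint64_t page_bytes) {
    return entries * page_bytes;
}

// 4,096 entries with 4 KB pages cover 16 MB ...
static_assert(tlb_reach(4096, 4 * KiB) == 16 * MiB, "4 KB pages");
// ... and the same 4,096 entries with 2 MB huge pages cover 8 GB.
static_assert(tlb_reach(4096, 2 * MiB) == 8 * GiB, "2 MB pages");
```

The 512-fold ratio between the two results is the ratio of the page sizes, which is why huge pages help most when the working set sits between those two reach figures.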

It is worth emphasizing that the TLB also follows the Harvard architecture: at the L1 level it is split into the D-TLB (data side) and I-TLB (instruction side), each independent. Instruction fetch and data read use separate L1 TLBs, which prevents data-intensive loops from contending with scattered, large-volume code for TLB entries. The L2 TLB, however, is commonly unified, with instruction and data mappings sharing the same structure, although some designs keep separate second-level instruction and data TLBs.

The Page Table Walk: The True Cost of a TLB Miss

A TLB miss by itself is not an immediate disaster — if the target page table data already resides in a processor-internal cache, refilling a TLB entry costs only a dozen or so cycles. But if the relevant page table levels are absent from every cache, the hardware page table walker must traverse the full page table hierarchy.

Consider x86-64's 4-level page table:

CR3 → PML4 → PDPT → PD → PT → physical page

When a TLB miss occurs and every page table level misses every cache, the hardware must complete 4 serial memory reads: read the PML4 entry → read the PDPT entry → read the PD entry → read the PT entry. Notice that the address at each step depends on the result of the previous step — this is what makes the page table walk unique among memory operations: it is a built-in dependency chain. Parts I and III, when discussing MLP, pointed out that independent memory accesses can overlap on the bus; but a single page table walk is intrinsically non-parallelizable — until the PML4 entry is read, the hardware does not know where the PDPT entry is located. In the worst case, if every level of the page table misses the TLB, the PWC, and all data caches, a 4-level page table walk requires 4 serial DRAM reads, with a total latency potentially approaching 800 cycles; and if the target data likewise resides in DRAM, an additional main memory access is layered on top. It is worth stressing that in real program execution, page table data has extremely strong spatial locality (a single 4 KB page holds many consecutive page table entries), and page table data at every level frequently resides in the L2/L3 caches — thus the actual cost of most TLB misses is far lower than this worst-case figure.

Modern CPUs insert an extra buffer between the L2 TLB and main memory: the Page Walk Cache (PWC; Intel calls it the Paging-Structure Cache). The PWC caches the entries at intermediate levels of the page table (e.g., the PML4 entry that points to a PDPT, the PDPT entry that points to a PD, etc.), so that only the levels below the deepest PWC hit need to be walked. For example, if the PD entry that points to the PT is already in the PWC, the walker skips the PML4, PDPT, and PD reads and fetches only the final PT entry, cutting four serial memory reads down to one. The presence of the PWC is the reason that most conventional programs' TLB misses do not cost 800+ cycles each time.

Huge Pages

All of the preceding analysis assumes the default page size of 4 KB. Enlarging the page expands the address range covered by each TLB entry, thereby multiplying the TLB's effective capacity by the corresponding factor. This is the fundamental motivation for huge pages.

Linux/x86-64 supports two huge page sizes:

2 MB huge pages: the page offset is 21 bits; a single TLB entry covers 512 of the 4 KB pages. With a 4,096-entry L2 TLB, reach goes from 16 MB to 8 GB (2 MB × 4,096).
1 GB huge pages: the page offset is 30 bits; a single TLB entry covers 262,144 of the 4 KB pages. With a 4,096-entry L2 TLB, reach extends to 4 TB (1 GB × 4,096). But the granularity is very coarse, making this useful only in very few scenarios with enormous contiguous allocations (e.g., virtual machine memory passthrough, giant pre-allocated buffer pools in in-memory databases).

With huge pages enabled, the page table hierarchy is also shortened: a 2 MB page table has only 3 levels (PML4→PDPT→PD, mapping directly to a physical page — no PT required), and a 1 GB page table has only 2 levels (PML4→PDPT). This simultaneously reduces the depth of the page table walk and lowers the worst-case cost of a TLB miss.

There are three paths to using huge pages in a program:

Explicit reservation: reserve a pool of huge pages ahead of time (for example through /proc/sys/vm/nr_hugepages), then call mmap with the MAP_HUGETLB flag to map a virtual address region directly onto that pool. The libhugetlbfs library provides wrappers to simplify usage.
Transparent Huge Pages (THP): the Linux kernel, via the khugepaged background thread, periodically scans the address spaces of processes and, upon detecting large stretches of contiguous 4 KB pages, attempts to collapse them into 2 MB huge pages. Applications benefit without any code changes. THP is recommended for the vast majority of server workloads.
Active hinting: via madvise with MADV_HUGEPAGE, the program explicitly tells the kernel that a particular address region is a good candidate for huge pages, giving THP a priority hint for collapsing. Allocators such as jemalloc and tcmalloc can automatically issue this hint for the heap regions they manage.

Common commands for verifying that huge pages are in effect:

grep Huge /proc/meminfo          # view global huge page statistics
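grep KernelPageSize /proc/<pid>/smaps   # per-mapping page size for a specific process (shows explicit huge pages)
grep AnonHugePages /proc/<pid>/smaps    # per-mapping amount of memory currently backed by THP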
perf stat -e dTLB-loads,dTLB-load-misses ./program  # compare TLB miss rate before and after enabling

Side Effects of Huge Pages

Huge pages are not without cost. There are three known side effects that should be evaluated before deciding whether to enable them.

First, amplified copy-on-write on fork. A child process created by fork shares the same set of memory pages with the parent (COW semantics). When either process subsequently writes to a shared page, the OS must make a physical copy of that page. With 4 KB pages, the copy is barely perceptible; with 2 MB or 1 GB huge pages, a single COW fault copies the entire huge page. Services such as Redis, which rely on fork for persistence, have long recommended disabling THP, and amplified COW latency is one of the primary reasons. In a latency-sensitive online service where every millisecond counts, this effect can cause client-perceived connection timeouts.

Second, internal fragmentation. The fragmentation issue with THP is not that a single object directly occupies 2 MB — malloc(10) does not immediately receive a 2 MB page — but rather that fine-grained allocations steadily degrade the effective utilization within a collapsed huge page: a single 2 MB huge page may contain hundreds of small objects and large unused holes, driving up overall memory consumption.

Third, the compaction cost of THP. The khugepaged kernel thread periodically scans the address spaces of processes, looking for sequences of contiguous 4 KB pages that are suitable to collapse into a 2 MB huge page. On systems under high load and with large memory, the CPU consumption and memory bandwidth occupation of khugepaged can themselves become a bottleneck — when its scanning speed cannot keep up with the allocation rate, vast numbers of 4 KB pages linger in an intermediate "waiting to be collapsed" state, neither enjoying the TLB benefits of huge pages nor sparing the page table walk from the extra steps caused by page table fragmentation.

Given these side effects, choosing a huge page strategy requires identifying the characteristics of the workload: for applications with continuous allocation, stable memory footprints, and insensitivity to fork latency (e.g., pre-allocated regions in in-memory databases), explicit huge pages are the optimal choice; for workloads with fluctuating memory footprints, fork requirements, or extremely fine-grained allocation, THP's passive collapsing is more appropriate — but CPU usage of khugepaged should be monitored.

I-TLB: The Instruction Side Is Not Immune

The preceding discussion has focused on the data-side TLB (D-TLB). The I-TLB obeys the same physical principles: instruction addresses emitted by the fetch unit likewise require TLB translation. When a program's code footprint exceeds the I-TLB reach and the execution path is scattered across a large number of pages, the instruction fetch stage will suffer stalls just as severe as those on the data side.

Typical warning signs include: a C++ program that instantiates template functions across numerous headers, or one that relies heavily on virtual functions and indirect calls such that the compiler cannot lay out the code paths compactly. When each function call lands in a different 4 KB page, the instruction fetch pipeline must first wait for the I-TLB translation to complete. The instruction side likewise has prefetch mechanisms (Instruction Prefetcher / Next-Line Prefetcher), which on simple paths of sequential fetch can trigger I-TLB fills ahead of time; but their predictive horizon is far narrower than the data-side prefetcher. Faced with numerous indirect branches (virtual function calls, function pointers, jump tables of switch statements) and scattered code layouts, I-TLB misses remain difficult to hide effectively. perf stat -e iTLB-load-misses can quantitatively detect the problem.

Compiler optimizations offer indirect help for the I-TLB: -Os or -Oz (optimization levels that prioritize code size) not only shrink the overall size of the .text section, but also reduce the number of pages spanned by the code, indirectly lowering the I-TLB miss rate. [[likely]] and [[unlikely]] hint the compiler to move cold path code into separate sections so that hot path code is concentrated in a small number of pages — a topic that will be examined in detail when Part VII discusses code layout.

MLP Revisited: Why TLB Misses Are Lethal

Part I, when first introducing memory-level parallelism (MLP), stressed that it allows the CPU to have multiple independent memory requests in flight on the bus simultaneously, thereby partially overlapping the latencies of multiple accesses. Part III showed how random access undermines MLP: dependency chains prevent independent memory-accessing instructions from being issued in parallel.

A TLB miss inflicts a third, and the most thorough, form of destruction on MLP. The interior of a single page table walk is a dependency chain — each read of the 4-level page table depends on the result of the previous step, and the finite state machine within the hardware must complete them serially. In modern Intel/AMD processors, the hardware page table walker can handle multiple mutually independent TLB misses concurrently (e.g., two unrelated load instructions each triggering their own page table walks) — this is known as page-walk-level parallelism; but the progress of any single walk is not accelerated by this. More critically, when a load instruction is stalled because its page has not yet been cached by the TLB, every instruction in the ROB that directly or indirectly depends on that load's result is frozen as well — meaning it is not only that one instruction waiting on the TLB, but an entire dependency subtree locked in place.

This explains why random-access performance degradation under small pages (4 KB) is superlinear rather than linear: as the dataset grows, the cache hit rate descends along the random-access ramp curve, and at the same time a growing share of accesses land on pages the TLB does not cover, each of which adds a page table walk that can cost hundreds of cycles. The two penalties compound. This accelerated degradation only eases when the working set shrinks to within L2 TLB reach, or when huge pages expand the TLB's effective capacity.

The relief huge pages provide for this problem is twofold: they reduce the total number of pages requiring TLB translation, and they shorten the page table walk depth after each TLB miss. For big-data applications with working sets reaching tens or even hundreds of gigabytes — such as in-memory analytics engines, graph processing frameworks, and real-time bidding systems — the two levels of TLB together total only a few thousand entries, which against millions of 4 KB pages is little more than a drop in the bucket. After switching to 2 MB huge pages, the page count drops to 1/512 of the original, and the TLB reach stretches from ~16 MB to ~8 GB, providing the only sustainable address translation path for these applications.

The next part will discuss write policies: how the cache manages dirty data, how the Store Buffer buffers writes inside the pipeline, and how the hardware nature of memory barriers derives from the existence of the write buffer.

Cache Deep Dive III — Replacement Policies, Prefetch, and Single-Thread Memory Access

Tyler Tan — Tue, 09 Jun 2026 05:01:34 +0000

The previous article discussed the static structure of caches. This part moves into dynamic aspects: when a program continuously issues read requests, how the cache decides which lines to retain and which to evict; how hardware prefetchers pull data into the cache before the program even issues a request; and the performance characteristics of two extreme access patterns under single-threaded execution — sequential and random access.

Replacement and Placement Policies

Cache replacement and placement policies are primarily managed by hardware logic; the programmer has no direct control in the vast majority of cases. Modern ISAs do provide instructions that can influence cache behavior — x86's PREFETCH instruction, CLFLUSH / CLWB flush instructions, and non-temporal stores (MOVNTI and the like) that bypass the cache on writes — but these are, in essence, merely hints to the hardware: the architecture does not guarantee that a prefetch will actually occur or that a flush will actually complete; at the microarchitecture level, however, modern processors typically convert such instructions into real prefetch or flush requests. Whether actual benefit materializes depends on the prefetch distance, current bandwidth pressure, and cache state. What truly decides which line gets evicted and where data gets placed is always the combinational logic in hardware. If these decisions were delegated to the OS, every replacement would require triggering an interrupt, trapping into kernel mode, and executing hundreds of software instructions to select a victim line — an overhead far greater than any performance gain it could bring. Hardware cache replacement is done by pure digital logic in less than a single clock cycle.

When a program requests data, the cache at level k is checked first. A hit is a cache hit, saving the latency of accessing level k+1; a miss is a cache miss, and the request is forwarded to level k+1. Since level k is necessarily smaller than level k+1, and the amount of memory a program uses often exceeds the cache capacity, a full cache must evict an existing line to make room for a new one. The decision of which block to evict is controlled by the replacement policy. The simplest policy is random replacement: pick a victim line at random. A more sophisticated one is LRU, which evicts the line that was Least Recently Used. LRU is non-trivial to implement in hardware, as it requires maintaining the access order of all lines within a set.

In real chips, LRU approximations or alternatives are widely used. Both academic research and reverse engineering generally conclude that modern Intel last-level caches (since Haswell) use a replacement mechanism conceptually similar to RRIP (Re-Reference Interval Prediction), rather than strict LRU: each cache line is tagged with a Re-Reference Prediction Value (RRPV), which is cleared to zero on a hit; when eviction is needed, a line whose RRPV has reached the maximum is selected. If no line in the set is at the maximum, the RRPVs of all lines in the set are incremented (aged) until one is. DRRIP (Dynamic RRIP) further adaptively switches between SRRIP (biased toward protecting newly inserted lines) and BRRIP (biased toward quickly evicting new lines). Compared to strict LRU, such mechanisms do not unconditionally place every newly filled line at the "most recently used" position, thereby avoiding cases where a burst of one-off accesses evicts hot data, and performing better under mixed access patterns.

Note that the complex policies described above appear primarily in larger-capacity LLCs. Caches closer to the core (L1, L2) emphasize access latency more — the few hundred picoseconds of additional delay that an extra replacement state machine might introduce are already unacceptable on the L1 path — so various LRU approximations (such as Tree-PLRU, NRU) and even random replacement are common in L1/L2. The complexity of replacement policies increases outward along the cache hierarchy, inversely related to latency tolerance.

Beyond the replacement policy, the hardware must also decide where a new piece of data should be placed — that is, the placement policy. The placement policy determines the type of a miss. If data is being accessed for the first time and is not in the cache, that is a cold miss, which is unavoidable. If there is still available space in the cache, but mapping-rule constraints cause certain addresses to be repeatedly mapped to the same location while other locations sit empty, that is a conflict miss. If the entire working set is too large and exceeds the cache capacity, that is a capacity miss.

From a measurement perspective, the three types of misses can be distinguished by shrinking/enlarging the working set and cross-referencing against the miss counts from perf stat: seeing misses even on a very small dataset (far smaller than the cache) → dominated by cold misses; miss rate varying with data distribution on a medium dataset → conflict misses; miss rate asymptotically approaching a fixed value on a large dataset → capacity misses.

Hardware Prefetchers

Since program execution inevitably requires data and instructions, can they be fetched asynchronously in advance to reduce or even eliminate misses? This is prefetching.

Modern CPU hardware prefetchers employ multiple strategies. The basic behavioral pattern of a prefetcher is: upon detecting a sequence of consecutive accesses, preemptively pull the next expected address into the cache before it is actually accessed. Implementations from different vendors each have their own emphasis.

Intel's typical configuration includes:

L1 Data Prefetcher: monitors L1d access patterns, and upon detecting two consecutive cache-line loads (within the same 4 KB page), prefetches the next cache line into L1d.
L2 Streamer: monitors L1 miss requests, and upon detecting a sequence of misses at consecutive addresses, prefetches several subsequent cache lines into L2 along the same direction, typically with a prefetch depth of 2–4 lines.
L2 Adjacent Cache Line Prefetcher (the one Intel's documentation calls the Spatial Prefetcher): within a 128-byte-aligned pair, when the L2 receives a miss request for one half, it simultaneously pulls the other half (the adjacent 64-byte line) into L2 as well. This prefetcher does not rely on pattern detection; it is purely spatial.
Next Page Prefetcher: when the access sequence approaches a 4 KB page boundary, if the next page is already registered in the TLB, the prefetcher can continue prefetching across the page boundary — crossing the page boundary without stalling, provided no page fault is triggered.

The corresponding implementations in the AMD Zen series (Zen 4/5):

L1/L2 Stream Detector: similar to Intel's L1 Data Prefetcher + L2 Streamer, detects sequential accesses and prefetches subsequent lines.
L2 Up/Down Prefetcher: a bidirectional prefetcher — prefetches not only in the direction of access, but also backward (fetching the previous cache line relative to the current line), more friendly to scenarios requiring bidirectional traversal (such as prefix and suffix scans).

Prefetcher aggressiveness is tunable in specific BIOS or MSR settings, but is generally not directly controlled by the application.

Software prefetch allows the programmer to explicitly specify a prefetch address in code. GCC and Clang provide __builtin_prefetch:

for (int i = 0; i < n; ++i) {
    if (i + 16 < n)                         // stay inside the array near the end
        __builtin_prefetch(&data[i + 16]);  // prefetch 16 elements ahead
    process(data[i]);
}

This built-in function accepts three arguments: the target address, the access type (0 to prepare for a read, 1 to prepare for a write), and a temporal-locality hint (0 for use-once-and-discard, 3 for retain-as-long-as-possible). On x86, it maps to the PREFETCHh family of instructions (PREFETCHT0/T1/T2/NTA, or PREFETCHW for writes); the corresponding Intel intrinsic is _mm_prefetch. If the prefetch distance is too short, the data has not yet arrived when the load that needs it executes, so the load still stalls for most of the miss latency and the prefetch accomplishes little. If the distance is too long, the data arrives but is evicted by subsequent accesses before use, and the bandwidth is wasted. Effective use of software prefetching therefore requires measured parameter tuning; improper use not only brings no benefit but can hurt, consuming additional bandwidth and evicting useful data.

Single-Thread Sequential Access

The theoretical latencies given in the previous article — roughly 12 cycles for L2, roughly 200 cycles for main memory — rarely appear in their full magnitude during real sequential access. The reason is that hardware prefetchers hide most of the latency through overlapping.

Consider the following scenario: a single thread sequentially traversing a uint64_t array. When the total number of elements n is small and the entire array fits in L1d, the vast majority of accesses are L1 hits, with latency around 4–5 cycles. When n exceeds the L1d capacity but remains within the L2 range, the theoretical L2 hit penalty is 12–14 cycles, yet the measured effective cost typically lands around 8 cycles — the prefetcher has already moved subsequent data from L2 into L1d before the L1 miss occurs. When n grows large enough that main memory becomes the source, the single-access latency to DRAM is still about 200 cycles; but in a sequential streaming access, the prefetcher and MLP overlap multiple requests, bringing the amortized cost per element down to single-digit cycles. These values depend on microarchitecture and prefetcher configuration, but the direction is consistent across all modern CPUs: in sequential access, effective throughput far exceeds what single-access latency alone would suggest (these orders of magnitude can be reproduced on a target machine using Intel MLC or a comparable benchmark).

The fundamental reason prefetchers can so effectively mask main-memory latency is that sequential access provides them with the most ideal input — the increment in access addresses is fixed and predictable. This allows the prefetcher's pattern detection to lock onto the direction and stride after the first or second access, and subsequent prefetches can pull multiple lines at a time, forming a pipeline of inflowing data.

The Impact of Stride

When the traversal stride increases, prefetcher efficiency drops. If each element is 64 bytes (exactly one cache line), each access crosses one line; if 128 bytes, it crosses two lines, and the prefetcher's pipeline speed must double just to keep up with downstream consumption. At a stride of 256 bytes, the cache's "effective capacity" is diluted — although the hardware still pulls every full cache line, only a small fraction of each line is actually used, and the remaining bytes are wasted.

The most common engineering case is traversing an array of large structs while accessing only one field at a time:

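struct Entity {
    int  hot_field;     // the only field the loop reads
    char padding[60];   // cold data; sizeof(Entity) == 64, one cache line
};
static Entity entities[100000];   // static storage: 6.4 MB is too large for the stack
long long sum = 0;
for (const auto& e : entities) sum += e.hot_field;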

Each loop iteration loads a 64-byte cache line but uses only 4 bytes (hot_field being int), for an effective utilization of 4/64 ≈ 6%. Memory bandwidth is filled mostly with bytes the loop never reads, so useful throughput is only about 6% of the bandwidth actually consumed.

In such scenarios, separating the hot fields into their own independent array can dramatically improve cache efficiency:

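#include <array>
#include <vector>

struct Entities {
    std::vector<int> hot_fields;                   // hot data, packed contiguously
    std::vector<std::array<char, 60>> paddings;    // cold data (a raw char[60] cannot be a vector element)
};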

This is the transition from AoS (Array of Structures) to SoA (Structure of Arrays). Hot fields are now laid out contiguously; a single cache line can hold 16 ints (64 B / 4 B), and each line the prefetcher brings back feeds 16 loop iterations. When a program accesses only a small subset of fields while ignoring the rest, SoA-style traversal typically significantly outperforms AoS. However, if all fields of the same element need to be accessed together (e.g., the x, y, z coordinates of a Particle struct), AoS instead makes fuller use of all the data pulled back per line by the prefetcher. This design principle, fundamental to Data-Oriented Design, is a basic strategy for cache optimization: arrange data by access pattern, not by conceptual model.

Single-Thread Random Access

The discussion above is entirely based on sequential access — scenarios where prefetchers can function effectively. When the memory access pattern becomes completely random, the situation is reversed.

Under sequential access, the amortized cost per element can be as low as a few cycles; under random access, because prefetchers and MLP struggle to be effective, the program begins to be directly exposed to the hundreds of cycles of main-memory latency. Main memory itself has not become slower. The problem lies with the hardware prefetcher — unable to recognize a random pattern, it still issues prefetch requests according to its own policy, but the prefetch addresses bear no relation to the data the program actually needs. The useless data pulled back by the prefetcher not only occupies memory bus bandwidth but also evicts useful hot data from the cache. Prefetching transforms from a means of reducing latency into a burden on the system.

The performance curves of sequential and random access exhibit fundamentally different shapes. Using a pointer-chasing benchmark to plot a latency-vs-working-set-size curve: sequential access appears as a staircase — clear latency steps at the capacity boundaries of each cache level (L1/L2/L3), with the prefetcher pressing latency down to near the theoretical hit latency at the edge of each step. Random access, by contrast, is a smoothly rising ramp — the larger the working set, the more the cache hit rate continuously declines, and latency slides gradually from L1 to L2, to L3, and finally to the DRAM plateau. On either side of the LLC capacity boundary, random-access latency transitions almost continuously rather than jumping — because the probability of a miss rises smoothly with the working set size, rather than flipping abruptly at a capacity threshold.

The reason pointer chasing is the standard method for measuring random-access latency is that it constructs a dependency chain that is fundamentally impossible for any prefetcher to predict: allocate an array with elements randomly permuted, each element storing a pointer to the next element. The CPU cannot know the address of the next load until the current load completes — this is the most stringent form of data dependency. Pointer chasing does not merely suffer from a low cache hit rate; it simultaneously destroys the entire foundation on which prefetchers, MLP, and out-of-order execution rely to hide latency: the prefetcher is defeated because it cannot predict the next address, MLP is paralyzed because addresses are serially dependent, and all instructions in the ROB that indirectly depend on the load result stall and wait.

Random access is equally devastating to the TLB — when the number of pages in the working set exceeds the number of TLB entries, every new random jump may land on an uncached page, triggering a full page-table walk. Detailed analysis of this topic, however, is deferred to Part IV.

From an engineering perspective, the irrecoverability of random-access performance means that for data structures based primarily on pointer traversal (linked lists, chained hash tables, B-trees, skip lists), however large the cache is, once the working set exceeds it, performance degrades in a way that hardware alone cannot recover. This is the fundamental dividing line between "cache-friendly data structures" and "memory-intensive data structures." A contiguous array is one of the data layouts most easily exploited by caches and prefetchers, not only because of sequential memory access, but more importantly because its memory layout is entirely transparent to the prefetcher. Any data structure built around indirect pointer access inherently surrenders a portion of its performance back to the memory wall. That said, in real systems, cache-optimized index structures (such as B+-trees aligning internal nodes to cache-line size, or Adaptive Radix Tree with path compression) do exploit intra-cache-line locality as much as possible, but their core access paths still involve at least one level of pointer indirection.

The next part will focus on the TLB, the cost of page-table walks, and how huge pages alleviate both of these problems.

Cache Deep Dive II — Cache Organization and CPU Topology

Tyler Tan — Mon, 08 Jun 2026 12:57:34 +0000

Part I discussed the physical roots of the memory wall, the design principles of the memory hierarchy, and the interaction between virtual and physical addresses during cache lookup. This part delves into cache internals: how addresses are partitioned into tag, set index, and block offset; the hardware trade-offs among the three organization schemes; the rationale behind the 64-byte cache line; the actual cache topologies of modern CPUs; and the inclusion policies between cache levels.

Address Partitioning

When a 64-bit address is sent to the L1 data cache, the hardware partitions it into three fields:

|←──── tag ─────→|←── set index ──→|← block offset →|
      T bits            S bits          O bits

With a cache line size of 2^O bytes, the low O bits are the block offset, ensuring every byte within the line can be addressed. For a 64-byte cache line, O = 6. The middle S bits form the set index — the cache has 2^S sets, and the address maps to exactly one of them. The remaining high T bits constitute the tag, stored alongside the data in the cache line's metadata and used during set-internal comparison to confirm a match. The set index itself is not stored — all lines in the same set share the same index bits.

Taking AMD Zen 5's L1d as an example for bit-width calculation: 48 KB / 12-way / 64 B = 64 sets, so S = 6 (2^6 = 64). O = 6 (2^6 = 64 B). Logically, T = 64 − S − O = 52 bits. However, the actual storage width of the tag is determined by the number of implemented physical address bits. On x86-64 this width is implementation-specific (reported by CPUID as MAXPHYADDR) and architecturally capped at 52 bits; it is a property of the physical address space, separate from virtual-address features such as LAM or 5-level paging. For a physical address width of 48 to 52 bits, subtracting the 12 bits for S + O leaves a tag of approximately 36–40 bits participating in comparison. This calculation is also influenced by VIPT design: as described in Part I, L1d is VIPT, so the low 12 bits of the virtual address (page offset) directly serve as the set index and block offset, while tag comparison uses the high-order physical address bits output by the TLB. Under VIPT, S + O ≤ 12 (the page offset bit width), a constraint that ensures the set index and block offset are identical between virtual and physical addresses.

For programmers, the most important corollary of address partitioning is this: if the access stride happens to be an integer multiple of 2^S × cache line size, every request maps to the same set. For example, with a 64-set, 64-byte-line cache (S = 6, O = 6), a stride of 64 × 64 = 4096 bytes (exactly 4 KB, one page) forces all requests into the same set. If that cache is 8-way associative, the first 8 accesses fill the set, and from the 9th onward, each access triggers an eviction. This is the hardware root of conflict misses: a programmer may see high miss rates even though the cache has plenty of empty capacity elsewhere, purely due to addressing rules. Notably, a stride of 4096 not only triggers cache set conflicts but also places every access on a different 4 KB page, a topic that will resurface in the discussion of TLB.

Three Cache Organization Schemes

The most intuitive way to implement a cache is to allow every cache line to store any block from main memory. This is called a fully associative cache. Its advantage is maximum cache utilization — new data can be placed in any empty slot. However, its lookup cost is prohibitive. For a 4 MB L2 cache with 64-byte lines, there are 65,536 lines. On every memory access, the processor must compare the target address's tag against the tags of every single line — 65,536 comparisons per cycle — which is infeasible in power and timing. Fully associative designs are only viable for extremely small caches, such as the TLBs in some Intel CPUs. For L1i, L1d, and larger caches, other approaches are required.

The other extreme is to map each main-memory address to a unique, fixed location in the cache — a direct-mapped cache. On access, the processor extracts several bits from the address to compute the target slot and compares against only that one slot's tag. A single comparator and multiplexer suffice, making it extremely fast. But the drawback is obvious: if a program repeatedly accesses multiple addresses that map to the same slot, conflict misses occur — multiple addresses fight for one slot while others sit idle. Real programs rarely exhibit uniform access patterns, causing direct-mapped cache utilization to drop sharply. A classic degenerate scenario: a program alternates between two addresses spaced exactly one cache capacity apart — each access evicts the other, yielding a zero hit rate.

Set-associative caches combine the strengths of both. The cache is partitioned into sets, each containing a fixed number of cache lines (the associativity, or number of "ways"). On access, the address first identifies the set, then all tags within that set are compared in parallel. Within a set, the behavior is fully associative; across sets, it is direct-mapped. This design mitigates conflict misses while preserving lookup speed. Virtually all contemporary CPU caches use set-associative designs.
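The degenerate ping-pong scenario, and how a second way removes it, can be reproduced with a toy model. This is a minimal sketch of an LRU set-associative cache that tracks tags only; ToyCache and ping_pong_hits are hypothetical names, and real hardware replacement policies are more elaborate than true LRU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kEmptyTag = ~std::uint64_t{0};  // marks an unused way

// Toy LRU set-associative cache: stores tags only, no data.
class ToyCache {
public:
    ToyCache(unsigned sets, unsigned ways, unsigned line_bytes)
        : sets_(sets), ways_(ways), line_bytes_(line_bytes),
          tags_(static_cast<std::size_t>(sets) * ways, kEmptyTag),
          stamps_(static_cast<std::size_t>(sets) * ways, 0) {
        assert(sets > 0 && ways > 0 && line_bytes > 0);
    }

    // Returns true on a hit; on a miss, fills the least recently used way.
    bool access(std::uint64_t addr) {
        const std::uint64_t block = addr / line_bytes_;
        const std::uint64_t tag = block / sets_;
        const std::size_t base = static_cast<std::size_t>(block % sets_) * ways_;
        ++clock_;
        std::size_t victim = base;
        for (std::size_t w = base; w < base + ways_; ++w) {
            if (tags_[w] == tag) { stamps_[w] = clock_; return true; }
            if (stamps_[w] < stamps_[victim]) victim = w;  // oldest (or empty) way
        }
        tags_[victim] = tag;
        stamps_[victim] = clock_;
        return false;
    }

private:
    unsigned sets_, ways_, line_bytes_;
    std::vector<std::uint64_t> tags_;
    std::vector<std::uint64_t> stamps_;
    std::uint64_t clock_ = 0;
};

// Alternate between two addresses exactly one cache capacity apart.
inline unsigned ping_pong_hits(ToyCache cache, std::uint64_t capacity, unsigned rounds) {
    unsigned hits = 0;
    for (unsigned i = 0; i < rounds; ++i) {
        hits += cache.access(0);
        hits += cache.access(capacity);
    }
    return hits;
}
```

For a 32 KB cache with 64-byte lines, ping_pong_hits(ToyCache(512, 1, 64), 32768, 100) models the direct-mapped case and never hits, while ping_pong_hits(ToyCache(256, 2, 64), 32768, 100) models a 2-way cache of the same capacity and misses only on the first two accesses.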

The fundamental formula:

Cache capacity = number of sets × associativity (ways) × cache line size
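Rearranged, the formula gives the number of sets for any published capacity and associativity. A small sketch, assuming 64-byte lines unless stated otherwise (num_sets is a hypothetical helper):

```cpp
#include <cassert>
#include <cstdint>

// number of sets = capacity / (ways × line size)
constexpr std::uint64_t num_sets(std::uint64_t capacity_bytes, unsigned ways,
                                 unsigned line_bytes = 64) {
    assert(ways > 0 && line_bytes > 0);
    return capacity_bytes / (static_cast<std::uint64_t>(ways) * line_bytes);
}

static_assert(num_sets(48 * 1024, 12) == 64);      // 48 KB, 12-way L1d
static_assert(num_sets(32 * 1024, 8) == 64);       // 32 KB, 8-way L1
static_assert(num_sets(1024 * 1024, 16) == 1024);  // 1 MB, 16-way L2
```

Both common L1 configurations end up with 64 sets, which is exactly what a 6-bit set index plus a 6-bit line offset can address inside a 4 KB page.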

Cache associativities are not arbitrary — they represent trade-offs under physical design constraints for target workloads:

Cache Level	Zen 5	Golden Cove (P-core)	Apple M3 (P-core)
L1d	48 KB / 12-way	48 KB / 12-way	128 KB / 16-way
L1i	32 KB / 8-way	32 KB / 8-way	192 KB / 16-way
L2	1 MB / 16-way	2 MB / 16-way	32 MB / 20-way (per cluster)
L3	32 MB / 16-way (per CCD)	36 MB / 12-way	48 MB SLC

L1d and L1i associativities and capacities are constrained by VIPT (see Part I): under 4 KB pages, 12-way 48 KB or 8-way 32 KB are natural choices within that constraint. For L2, around 16 ways has become the common balance point among access latency, power, and conflict rate in current high-performance processors — too few ways raise conflict miss rates, while too many ways lengthen the tag-comparison timing path, requiring either frequency reduction or additional pipeline stages to accommodate the comparison logic. Apple's M3 achieves 20-way in some P-cluster L2s, partly enabled by its lower clock frequency target (~4 GHz vs. x86's ~5.5 GHz), offering more physical timing margin per cycle for parallel tag comparison.

Why 64-Byte Cache Lines

The cache line is not only the fundamental unit of data transfer, but also the minimum granularity at which cache coherence protocols maintain ownership — MESI and similar protocols track state, broadcast invalidations, and transfer ownership at the cache-line level. The false sharing problem discussed later is, at its core, multiple cores contending for ownership of the same cache line while operating on logically unrelated variables. Recognizing that "64 bytes is the common granularity for both data movement and ownership tracking" is prerequisite to understanding many multi-core performance problems.

Cache line size is determined by three factors.

First, tag overhead: every line must store a tag and status bits (valid, dirty, MESI state). Smaller lines mean higher tag overhead. For a 4 MB cache: with 32-byte lines (~40-bit tag, ~5-bit status), there are 131,072 lines, tag overhead ≈ 750 KB (≈ 18%). With 64-byte lines, 65,536 lines, tag overhead ≈ 370 KB (≈ 9%).
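The arithmetic behind those two estimates is short enough to write down. The sketch below assumes the same round figures as the text (about 45 metadata bits per line); tag_overhead and TagOverhead are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct TagOverhead {
    std::uint64_t lines;           // number of cache lines
    std::uint64_t metadata_bytes;  // total tag + status storage
    double fraction;               // metadata relative to data capacity
};

constexpr TagOverhead tag_overhead(std::uint64_t capacity_bytes, unsigned line_bytes,
                                   unsigned metadata_bits_per_line) {
    assert(line_bytes > 0);
    const std::uint64_t lines = capacity_bytes / line_bytes;
    const std::uint64_t bytes = lines * metadata_bits_per_line / 8;
    return {lines, bytes,
            static_cast<double>(bytes) / static_cast<double>(capacity_bytes)};
}
```

For a 4 MB cache with 45 metadata bits per line, 32-byte lines give 131,072 lines and 737,280 bytes of metadata (about 17.6%), while 64-byte lines give 65,536 lines and 368,640 bytes (about 8.8%).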

Second, spatial locality: larger cache lines pull in more nearby data on a single miss, indirectly improving hit rates.

Third, DRAM physical transfer characteristics: DDR SDRAM transfers data in bursts on consecutive clock edges. Once a row is activated, multiple columns can be read from it sequentially without additional activate overhead. 64 bytes corresponds exactly to one burst: on DDR4, burst length 8 × 64-bit data bus = 8 × 8 B = 64 B; on DDR5, burst length 16 × 32-bit subchannel = 16 × 4 B = 64 B.

Under these three constraints, 64 bytes became the industry standard. Historically, the Intel Pentium (1993) used 32-byte cache lines; the Pentium 4 (2000) mixed 64-byte and 128-byte lines in some caches; from Core 2 (2006) onward, all caches unified at 64 bytes. Note that 64 bytes refers only to the data portion. Counting tag, valid bit, dirty bit, and MESI state bits, each cache line actually occupies about 72 bytes — roughly 12% metadata overhead. A manufacturer-labeled 32 MB L3 cache actually requires about 36 MB of SRAM transistors etched on the silicon.

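C++17 provides std::hardware_destructive_interference_size and std::hardware_constructive_interference_size (declared in <new>). These are implementation-defined constants that approximate the cache-line granularity: typically 64 on x86-64 toolchains, though the value can differ by target and not every standard library ships them. alignas(std::hardware_destructive_interference_size) forces two variables that may be concurrently written by different cores onto separate cache lines, avoiding false sharing (detailed in Part VI).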
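A minimal layout sketch of that idea follows. It uses the standard constant when the library advertises it through the feature-test macro and otherwise falls back to 64, which is an assumption about the target; Counters, kCacheLine, and byte_distance are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>  // std::hardware_destructive_interference_size, where available

#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte lines
#endif

// Two counters meant to be written by different threads.
// Each one starts on its own cache-line-sized boundary.
struct Counters {
    alignas(kCacheLine) std::atomic<std::uint64_t> produced{0};
    alignas(kCacheLine) std::atomic<std::uint64_t> consumed{0};
};

static_assert(alignof(Counters) == kCacheLine);
static_assert(sizeof(Counters) == 2 * kCacheLine);

// Distance in bytes between the two fields of one object.
inline std::size_t byte_distance(const Counters& c) {
    assert(&c.produced != &c.consumed);
    return static_cast<std::size_t>(reinterpret_cast<const char*>(&c.consumed) -
                                    reinterpret_cast<const char*>(&c.produced));
}
```

The cost of the padding is memory: each counter occupies a full line, so this layout suits a handful of hot variables rather than large arrays.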

Modern CPU Cache Topology

CPU cores are not directly connected to main memory; all reads and writes must pass through the cache hierarchy. Caches are first divided into data caches and instruction caches — Intel adopted this split design starting with the Pentium in 1993 and has maintained it ever since. The L1 cache is divided into L1i and L1d, implementing a Harvard architecture: instruction fetch and data read can proceed in parallel, avoiding bandwidth contention on a single interface. L2 and L3 caches are generally unified — instructions and data share the same storage, achieving higher space utilization: when the workload is instruction-heavy, more space goes to instructions; when data-heavy, more goes to data.

The above is a general description. Specific topologies differ significantly across vendors, with direct performance implications.

AMD: CCDs and Chiplet

Since Zen 2, AMD has employed a chiplet architecture, dividing a single physical package into one I/O Die (IOD) and multiple Core Complex Dies (CCDs). From Zen 3 onward, each standard CCD contains 8 cores sharing one L3 cache (32 MB for both Zen 4 and Zen 5); Zen 2 split each CCD into two 4-core complexes with 16 MB of L3 each. On Zen 5, each core has private L1i (32 KB) and L1d (48 KB; Zen 4 has 32 KB), plus private L2 (1 MB). When a core accesses an address residing in its local CCD's L3, latency is roughly 50 cycles; if the address resides in a different CCD's L3, the request must be routed through the IOD's Infinity Fabric to the target CCD, raising latency to approximately 100 cycles or more.

The direct implication for programmers: on dual-CCD consumer processors (e.g., Ryzen 9 7950X, two CCDs with 16 cores total), if a thread frequently migrates between CCDs, its hot cache lines in private L1/L2 must be transferred via the coherence protocol across the IOD, with each migration incurring the cost of inter-core RFO handshakes plus the physical trace delay across CCDs. On EPYC server platforms, a single package may contain up to 12 or 16 CCDs, making cross-CCD latency non-uniformity even more pronounced. This is a Non-Uniform Cache Access effect inside a single package, distinct from traditional NUMA defined by memory controller distance, but with similar performance impact.

Intel: Ring and Mesh

Intel client processors (e.g., Core i9-14900K) use a ring bus connecting all cores, L3 slices, GPU, and memory controller. L3 is evenly divided into slices, with each core accessing any slice via the ring. Each ring hop takes about 4–5 cycles, giving a worst-case latency of roughly 20–30 cycles on an 8–12 node ring. Individual slices sit at different hop distances from a given core, but because physical addresses are hashed across all slices, every core sees approximately the same average L3 latency, in contrast to AMD's CCD architecture.

Server-class Xeon Scalable processors (e.g., Sapphire Rapids) employ a 2D mesh interconnect, with latency growing linearly with the number of mesh hops. CHAs (Caching & Home Agents) are distributed across mesh nodes, each responsible for directory tracking of a portion of the address space. A core accessing memory managed by its local CHA experiences lower latency; accessing a region managed by a remote CHA requires multiple mesh hops, with latency reaching 2–3× that of local access.

Apple: P-Clusters and SLC

Apple's M series adopts a cache hierarchy distinct from x86. Taking M3 as an example: P-cores have 128 KB L1d and 192 KB L1i (both 16-way). L2 configurations across different M-series SKUs vary significantly, typically with clusters of P-cores sharing large L2 caches (Apple has not published precise official specifications; publicly available data largely comes from reverse-engineering analysis). E-cores have smaller L1 caches that are still large by x86 standards (reported as 64 KB L1d / 128 KB L1i). All CPU clusters and GPU share a System Level Cache (SLC): 8 MB on the base M3, up to 48 MB on Pro/Max variants. The SLC is part of the unified memory architecture: DRAM (LPDDR5) is packaged alongside the chip, and CPU and GPU access the same physical memory pool through the SLC, eliminating the need for dedicated video memory.

Apple's L1i and L1d capacities far exceed contemporary x86 — 128 KB L1d / 192 KB L1i vs. x86's 48 KB / 32 KB — enabled by the 16 KB default page size, which lifts the VIPT capacity constraint (16 KB × 16-way = 256 KB ceiling), and explains why Apple can invest far more SRAM budget at the L1 level than x86. Additionally, Apple's ultra-wide decode design (M3 is 8-wide issue) demands extremely high instruction supply bandwidth — L1i output bandwidth must be sufficient to keep the decoders fed — and the combination of large L1i and a micro-op cache (estimated at roughly 4K–6K uops from M1 reverse engineering) collectively sustains the frontend.

Uncacheable Regions

Certain memory regions are not cached, such as MMIO (Memory-Mapped I/O). The OS marks these physical pages as UC (Uncacheable) via page table attributes and hardware mechanisms such as PAT (Page Attribute Table) / MTRR (Memory Type Range Register). Reads and writes to such addresses fully bypass L1–L3 caches and go directly onto the bus to the device. Meanwhile, the ISA provides instructions that allow programmers to bypass the cache: for large volumes of "write-once, discard" data (such as streaming writes to a GPU framebuffer), non-temporal stores (x86 MOVNTI and MOVNTDQ, exposed as the compiler intrinsics _mm_stream_si32 and _mm_stream_si128 respectively) write to memory without allocating the line in the cache, avoiding cache pollution. These instructions direct data into write-combining buffers (WC buffers), which batch 64 bytes before issuing a single burst onto the bus, rather than sending one transaction per byte. Detailed discussion of WC and UC mechanisms appears in Part V.
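As an illustration of the non-temporal path, here is a minimal sketch of a streaming fill built on _mm_stream_si128 (MOVNTDQ). It assumes an SSE2-capable x86 target and falls back to memset elsewhere; stream_fill is a hypothetical helper, and whether streaming stores actually beat regular stores depends on buffer size and platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_set1_epi8, _mm_stream_si128, _mm_sfence
#endif

// Fill `bytes` bytes at `dst` with `value` using non-temporal stores.
// Precondition: dst is 16-byte aligned and bytes is a multiple of 16.
inline void stream_fill(void* dst, std::uint8_t value, std::size_t bytes) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(bytes % 16 == 0);
#if defined(__SSE2__)
    const __m128i v = _mm_set1_epi8(static_cast<char>(value));
    auto* p = static_cast<__m128i*>(dst);
    for (std::size_t i = 0; i < bytes / 16; ++i)
        _mm_stream_si128(p + i, v);  // goes through WC buffers, not the cache
    _mm_sfence();  // order the streamed data before any later stores
#else
    std::memset(dst, value, bytes);  // portable fallback on non-x86 targets
#endif
}
```

The trailing _mm_sfence matters when another thread or device will consume the buffer: non-temporal stores are weakly ordered relative to ordinary stores.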

Inclusive, Exclusive, and Non-Inclusive

The inclusion relationship between cache levels is an important microarchitectural choice that directly determines effective cache capacity and coherence protocol overhead.

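Inclusive: every line in L1 must also exist in L2; likewise for L2–L3. Evicting a clean line from an inner level needs no data transfer because the outer level already holds a copy, and the outer level's tags can answer coherence queries, but the duplicated lines waste capacity.
Exclusive: a line in L1 does not exist in L2 or L3; each line lives in exactly one cache level. A line evicted from an inner level is moved into the next level, which acts as a victim cache, so no capacity is wasted, but the eviction path is longer.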
Non-inclusive: the inclusion relationship is neither guaranteed nor denied. A lower level may or may not have the line.

Between L1 and L2, policies vary by vendor: Intel's L2 is non-inclusive of L1, storing data independently without mandatory duplication, while AMD's Zen L2 is inclusive of L1. Between L2 and L3, there are two camps.

Intel Core client processors have seen significant changes in L2–L3 inclusion across microarchitecture generations. Early Nehalem through Broadwell (2008–2015) used strict inclusive LLC, motivated not by capacity management but by the snoop filter: when Core A needs to know whether a cache line at a given address is held by other cores, full-die broadcast (querying every core individually) would cause interconnect traffic to grow linearly with core count. An inclusive L3 provides a shortcut — since every line in L2 must have a copy in L3, simply checking L3's tag array answers "which core holds this address." L3's tag array doubles as a snoop filter, suppressing coherence query broadcast traffic within L3. The cost is capacity loss: L3 effective capacity = nominal capacity − Σ(all core L2 capacities).

Starting with the Skylake-SP server parts (2017), Intel gradually transitioned to non-inclusive or weakly-inclusive LLC. Contemporary Golden Cove and Raptor Cove no longer require L2 lines to keep copies in L3, instead relying on distributed directory information and LLC metadata to independently track the ownership of each cache line. This shift eliminates the duplicate storage overhead of L2 data in L3, making L3's nominal capacity its effective capacity, but introduces the SRAM overhead of the directory itself and additional lookup latency.

AMD's Zen architecture is non-inclusive between L2 and L3. There is no requirement that "L2 contents must be backed in L3"; the full 32 MB of L3 is used for independent data. Snooping functionality is achieved through independent probe filters or directory tracking, without relying on inclusion. This choice gives AMD higher effective utilization of the labeled L3 capacity — for memory-intensive workloads with large working sets and low data reuse, non-inclusive is superior.

Apple's M-series SLC appears to have some inclusion-like properties (in certain versions it seems to hold subsets of L2 contents), but Apple has not disclosed the exact inclusion semantics between SLC and L2.

Subject to correct memory model enforcement, the CPU enjoys considerable freedom in cache management. Take x86 TSO (Total Store Order): as long as Core 0's sequence of writes to A then B is observed by all other cores as A changing before B, any optimization is permitted above that TSO baseline — for instance, opportunistically writing back dirty cache lines to main memory during idle bus cycles and clearing their dirty bits. Such operations are fully transparent to the programmer as long as the memory model is not violated.

This part analyzed the internal organization of caches. The next part moves into dynamic behavior: the hardware implementation of cache replacement policies, the classification and behavior of hardware prefetchers, and the performance characteristics of sequential versus random access under single-thread conditions, as shaped by prefetchers and TLBs.

Cache Deep Dive I — The Memory Wall and Locality

Tyler Tan — Mon, 08 Jun 2026 12:57:06 +0000

Introduction

A core conviction behind this series: engineers who truly understand the underlying systems possess an intuition in performance engineering that is difficult to replace. When code exhibits unexpected latency, they can trace the problem down through the cache hierarchy, pipeline state, coherence protocol, and even kernel scheduling paths to its physical root cause. This ability is not built on algorithm textbooks — it rests on low-level foundations that are easily overlooked: CPU caches, pipelines, NUMA, Linux kernel memory management, and more. None of these pieces are "difficult" in isolation, but once they combine into a complete picture, they fundamentally change how one reads code.

This series assumes readers have a basic understanding of computer architecture and operating systems. If not, reading CSAPP first would be advisable.

The Memory Wall

From the 1980s onward, processor performance grew at roughly 60% per year, while DRAM access latency improved by only about 7% annually. By the mid-1990s this divergence was stark enough that Wulf and McKee coined the term "Memory Wall" in 1995: the processor's computational capacity is constrained by the speed at which data reaches the registers from memory. Three decades later, DRAM's absolute access latency still hovers in the 60–80 ns range; the core problem has not been eliminated by process advances.

Typical access latencies across the storage hierarchy:

Storage Level	Latency (Typical)	Zen 5 (~5 GHz)	Golden Cove (~5.5 GHz)
Register	≤1 cycle	—	—
TLB	≤1 cycle	—	—
L1 Cache	~4–5 cycles, ~1 ns	4 cycles	5 cycles
L2 Cache	~12–14 cycles, ~3 ns	14 cycles	13 cycles
L3 Cache	~40–50 cycles, ~10 ns	50 cycles	44 cycles
Main Memory	~150–250 cycles, ~60–80 ns	~200 cycles	~200 cycles
NVMe SSD	~15,000 ns	—	—
HDD	~5,000,000 ns	—	—

For context, a lightweight syscall costs tens to hundreds of nanoseconds, meaning a single main-memory access is already comparable to a system call. With KPTI (Kernel Page Table Isolation, the Meltdown mitigation) enabled, additional page-table switching and TLB-related overhead push syscall cost higher still. It is also worth noting that Apple's M3 series uses 16 KB pages, giving TLB coverage inherently superior to 4 KB pages, and its system call mechanism differs from x86, with syscall latencies in the 15–30 ns range. Platform differences mean the "memory wall" does not feel the same everywhere.

The latency hierarchy stems from physical mechanisms. L1–L3 caches are built from SRAM, requiring six transistors per bit: fast, but large in area and high in cost. DRAM needs only one transistor and one capacitor per bit, offering high density at low cost, but an access to a row that is not already open must go through a full timing sequence of precharge, row activate, and column strobe; these tens of nanoseconds of overhead are the hard floor of DRAM physics. For NVMe SSDs, the NAND page read itself accounts for most of the latency, with PCIe bus transfer and NVMe protocol stack processing adding further microseconds on top. HDDs involve mechanical seek delays from the read head, belonging to an entirely different physical regime than electronic latencies.

These latencies are not always fully exposed on every memory access. Modern CPUs rely on two complementary mechanisms to hide them.

The first is out-of-order execution (OOO): when a load instruction waits for memory, the CPU picks later instructions from the reorder buffer (ROB) that do not depend on that load's result and continues executing, overlapping computation with memory access in time. The ROB typically holds hundreds of instructions — valuable buffering within a 200-cycle DRAM latency window, but nowhere near enough to fill the entire wait.

The second, equally important but often overlooked, is Memory-Level Parallelism (MLP): the CPU's memory subsystem allows multiple outstanding memory requests to be in flight simultaneously. Hardware MSHRs (Miss Status Holding Registers; Intel calls its L1d equivalents Line Fill Buffers, LFBs) track each outstanding cache miss, and each core typically has 10–12 such tracking slots at the L1d level. If a program has two independent load instructions (for example, traversing two unrelated linked lists), the CPU can issue both to the memory subsystem concurrently, with the second request not waiting for the first to complete. Two 200-cycle requests are overlapped, yielding an effective latency of roughly 200 cycles rather than 400. Modern server CPUs cope with the memory wall precisely through this MLP + OOO collaboration: OOO finds parallelizable memory operations in the instruction stream, and MLP enables them to actually proceed in parallel in the memory system.

Both defenses share the same blind spot: data dependency chains. In a = p->next; b = a->next;, the address of the second load depends on the result of the first, so the second cannot be issued until the first returns. MLP collapses to a single outstanding request at this point, and all instructions in the ROB that indirectly depend on these addresses stall, gradually exhausting the OOO window. This is the fatal weakness of dependency-chain-intensive operations such as pointer chasing, hash table probing, and B-tree traversal: the program hits DRAM's 200-cycle hard wall while the CPU's execution units sit idle. The fundamental motivation for cache design lies exactly here: by keeping "faster, smaller copies" across multiple storage levels, each step in the dependency chain lands in L1/L2's latency of a few to about a dozen cycles rather than DRAM's.
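The difference between one dependency chain and two independent ones shows up directly in the loop structure. The sketch below is illustrative only: Node, make_list, walk_one, and walk_two are hypothetical names, and the lists are laid out sequentially to keep the example short, so a real measurement would need randomized node placement and working sets larger than the caches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Node {
    Node* next;
    std::uint64_t value;
};

// Build an n-node list inside a vector (the vector owns the nodes).
inline std::vector<Node> make_list(std::size_t n, std::uint64_t first_value) {
    assert(n > 0);
    std::vector<Node> nodes(n);
    for (std::size_t i = 0; i < n; ++i) {
        nodes[i].value = first_value + i;
        nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : nullptr;
    }
    return nodes;
}

// One chain: the address of every load depends on the previous load.
inline std::uint64_t walk_one(const Node* p) {
    std::uint64_t sum = 0;
    for (; p != nullptr; p = p->next) sum += p->value;
    return sum;
}

// Two chains advanced together: the loads on `a` and `b` are independent,
// so the core can keep both misses in flight at the same time.
inline std::uint64_t walk_two(const Node* a, const Node* b) {
    std::uint64_t sum = 0;
    while (a != nullptr && b != nullptr) {
        sum += a->value + b->value;
        a = a->next;
        b = b->next;
    }
    return sum + walk_one(a) + walk_one(b);  // finish the longer list
}
```

Both functions compute the same sums; what changes is how many misses can overlap. walk_one exposes at most one outstanding miss per step, while walk_two gives the out-of-order engine two unrelated addresses per iteration.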

These latency numbers can be observed directly on a target machine: Intel MLC (Memory Latency Checker) reports load latencies, and perf stat reports the corresponding hit and miss counts. By constructing a pointer-chasing benchmark with a singly linked list, using RDTSC for timestamping and LFENCE to eliminate instruction-reordering bias, one can measure the access latency of each cache level. The methodology will be explained further when discussing random access patterns later in this series.

The Memory Hierarchy

The central idea of the memory hierarchy: each level (level k) serves as a smaller, faster cache for level k+1. Data is transferred between two levels in fixed-size units called blocks. For example, if level k has room for 4 blocks and level k+1 holds 16 blocks, then at any moment level k contains copies of at most 4 of those 16 blocks; a block is copied up when it is first needed and evicted later to make room for another. The block size between any pair of adjacent levels is fixed (between main memory and cache, the block corresponds to a cache line), but different level pairs may use different block sizes. In modern CPUs, the block size across L1, L2, L3 caches and main memory is almost uniformly 64 bytes.

Anchoring this in real CPUs, the cache parameters of major microarchitectures (production models as of 2024):

Microarchitecture	L1d (per core)	L2 (per core/cluster)	L3 / LLC (shared)
AMD Zen 5	48 KB / 12-way	1 MB / 16-way	32 MB / 16-way (per CCD)
Intel Golden Cove	48 KB / 12-way	2 MB / 16-way (P-core)	36 MB (i9-14900K)
Apple M3 P-core	128 KB / 16-way	32 MB (per P-cluster, shared)	48 MB (SLC)

That total cache capacity is roughly 1/1000 of main memory is no coincidence: SRAM is about 5–10× larger in area than DRAM and roughly 100× more expensive per bit. Equipping a processor with gigabytes of SRAM would push die area and power consumption far beyond manufacturing feasibility. This economic constraint fundamentally determines the capacity ratios across cache levels.

The Locality Principle

Caches work not because program memory access is uniform and random — quite the opposite: typical programs access memory in a highly non-uniform fashion. This non-uniformity manifests in two dimensions.

Temporal locality: an address, once accessed, is very likely to be accessed again in the near future. Typical sources include loop variables, frequently-called function stack frames, and short-lived counters or status flags.

Spatial locality: after accessing an address, nearby addresses are very likely to be accessed soon afterwards. Sources include sequential array traversal, contiguously-allocated struct fields, and the sequential execution of instruction streams.

These two forms of locality are the prerequisite for caches to function at all: if programs truly accessed memory purely at random, no cache hierarchy design could prevent hit rates from approaching zero. Data from standard benchmarks (e.g., SPEC CPU 2017) shows that well-written programs typically achieve L1 data cache hit rates above 90%, with L2 hit rates exceeding 70% on the subset that misses L1. This means the vast majority of instructions never face DRAM's full latency.

The quantitative tool that unifies temporal and spatial locality is reuse distance (also called LRU stack distance): between two consecutive accesses to a given address, how many other distinct addresses does the program access? If the reuse distance, counted in cache lines, is smaller than the number of lines the cache can hold, the access hits in a fully associative LRU cache; otherwise, it misses. Analyzing a program's reuse distance distribution allows one to predict cache behavior without running benchmarks. This concept underlies analytical cache models and is a useful companion to cache simulators such as Valgrind's Cachegrind. In real hardware, set-associative structures also introduce conflict misses, so this relationship holds only as an approximate analytical tool.
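Reuse distance is easy to compute for a small trace. The sketch below is a deliberately simple version that scans an LRU stack linearly per access (production tools use tree-based structures); reuse_distances and lru_hits are hypothetical names, and addresses are assumed to already be line numbers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// For each access, the number of distinct other lines touched since the
// previous access to the same line, or -1 for a first (cold) access.
inline std::vector<long> reuse_distances(const std::vector<std::uint64_t>& trace) {
    std::vector<std::uint64_t> stack;  // least recent at front, most recent at back
    std::vector<long> out;
    out.reserve(trace.size());
    for (std::uint64_t line : trace) {
        auto it = std::find(stack.begin(), stack.end(), line);
        if (it == stack.end()) {
            out.push_back(-1);
        } else {
            out.push_back(static_cast<long>(stack.end() - it) - 1);
            stack.erase(it);
        }
        stack.push_back(line);
    }
    return out;
}

// Hits in a fully associative LRU cache holding `lines` cache lines:
// an access hits exactly when its reuse distance is below `lines`.
inline std::size_t lru_hits(const std::vector<std::uint64_t>& trace, long lines) {
    assert(lines > 0);
    std::size_t hits = 0;
    for (long d : reuse_distances(trace))
        if (d >= 0 && d < lines) ++hits;
    return hits;
}
```

For the trace A B C A B B the distances are -1, -1, -1, 2, 2, 0: a 2-line cache hits only on the final access, while a 3-line cache hits on all three reuses.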

Sequential access and random access form two extremes within this framework. Sequential access exhibits short reuse distances and strong spatial locality, allowing prefetchers to stay ahead; random access often has reuse distances exceeding the effective range of any cache level and prefetcher. Detailed analysis of these two access patterns appears in Parts III and IV of this series.

Virtual Addresses and Cache Addressing

A cache line identifies its corresponding main-memory block via an address tag. This address can be virtual or physical. The choice between the two involves fundamental design trade-offs.

VIVT (Virtually Indexed, Virtually Tagged): Cache indexing and tagging both use virtual addresses. The advantage is that the cache lookup can complete without waiting for TLB translation, minimizing latency. However, the synonym problem arises: when the same physical page is mapped to multiple virtual addresses (common in shared-memory scenarios), multiple copies of the same data may simultaneously exist in the cache, and a modification to one leaves the others stale. There is also the homonym problem: the same virtual address in different processes refers to different physical addresses, so one process could hit on another's lines, forcing a cache flush on every context switch unless lines are tagged with an address-space identifier. As a result, VIVT is almost never used in modern general-purpose processors, appearing only in a few special-purpose tiny caches (such as the internal structures of some TLB implementations).

PIPT (Physically Indexed, Physically Tagged): Both indexing and tagging use physical addresses. Data correctness is guaranteed, but every cache access must first translate the virtual address to a physical address via the TLB — effectively adding TLB latency on top of cache latency. This is unacceptable for L1d, which has a total latency target of only 4–5 cycles and cannot afford an additional TLB translation stage. PIPT is therefore mainly used for L2 and lower-level caches, where latency budgets are more generous.

VIPT (Virtually Indexed, Physically Tagged): The compromise for L1d. The cache uses the low-order bits of the virtual address as the set index while simultaneously sending the virtual address to the TLB for translation; after locating the target set, the physical address from the TLB is compared against the physical tags of each way in the set. The key insight: the low-order bits of the virtual address (the page offset) are identical to the low-order bits of the physical address — address translation only modifies the high-order bits. Therefore, as long as all the cache index bits fall within the page-offset range (i.e., bits that do not change during translation), cache indexing can proceed in parallel with TLB translation. Address bits beyond the page offset would be ambiguous until TLB translation completes, causing the aliasing problem.

This constraint directly limits the maximum capacity of a VIPT L1d cache: with 4 KB pages, the page offset is 12 bits (bits 0–11). If the cache's set index and line offset bits must all fall within bits 0–11, each way can cover at most one page, so the total cache capacity must not exceed associativity × page size. For an 8-way set-associative cache, maximum capacity = 8 × 4 KB = 32 KB. This explains why x86 processors long had L1d caches of 32 KB 8-way: not a coincidence, but a VIPT addressing constraint. Starting with Zen 5, AMD enlarged L1d to 48 KB 12-way (12 × 4 KB = 48 KB), while Apple's M3 L1d reaches 128 KB 16-way; the latter, aided by 16 KB pages, supports a VIPT ceiling of 16 × 16 KB = 256 KB, far exceeding the constraint under 4 KB pages.
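The ceiling is a one-line calculation, shown here as a compile-time check. vipt_max_bytes and index_fits_in_page_offset are hypothetical helper names; the configurations are the ones quoted in the text.

```cpp
#include <cassert>
#include <cstdint>

// Largest alias-free VIPT capacity: each way may span at most one page.
constexpr std::uint64_t vipt_max_bytes(unsigned ways, std::uint64_t page_bytes) {
    return ways * page_bytes;
}

// True when index + offset bits stay inside the page offset,
// i.e. when one way (capacity / ways) is no larger than a page.
constexpr bool index_fits_in_page_offset(std::uint64_t capacity_bytes, unsigned ways,
                                         std::uint64_t page_bytes) {
    assert(ways > 0);
    return capacity_bytes / ways <= page_bytes;
}

static_assert(vipt_max_bytes(8, 4096) == 32 * 1024);      // 8-way, 4 KB pages
static_assert(vipt_max_bytes(12, 4096) == 48 * 1024);     // 12-way, 4 KB pages
static_assert(vipt_max_bytes(16, 16384) == 256 * 1024);   // 16-way, 16 KB pages
static_assert(index_fits_in_page_offset(128 * 1024, 16, 16384));
static_assert(!index_fits_in_page_offset(64 * 1024, 8, 4096));
```

The last line shows why simply doubling a 32 KB 8-way cache to 64 KB does not work under 4 KB pages: each way would span 8 KB, pushing one index bit above the page offset.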

Virtually all modern high-performance general-purpose processors use VIPT for L1d and L1i, with PIPT from L2 downwards. The fundamental reason for choosing VIPT is performance: it allows cache indexing and TLB translation to proceed in parallel within the same pipeline stage, saving a pipeline cycle that is decisive for the L1 critical latency path.

The aliasing problem introduced by VIPT (when different virtual addresses map to the same physical address, identical physical tags but different virtual indices cause the same physical line to appear in multiple sets) must be handled by the OS during page allocation. Traditional Unix used page coloring to ensure that the low-order bits of virtual addresses for shared pages match, thereby avoiding aliasing. Modern operating systems increasingly rely on hardware cache design and page-mapping constraints to avoid aliasing-related correctness issues.

Measurement Primer

The following two commands form the measurement foundation for all subsequent performance discussions:

perf stat -e L1-dcache-load-misses,L1-dcache-loads,LLC-load-misses,LLC-loads ./program

This reports the miss counts and total accesses for the L1 data cache and the Last Level Cache (LLC, typically L3). High L1 miss rates usually point to data layout problems (see Part III); high LLC miss rates usually point to working sets exceeding cache capacity (see Part IV).

perf stat -e cycles,instructions,cache-misses,cache-references ./program
# IPC = instructions / cycles (perf stat also prints it as "insn per cycle")

The ratio of instructions to cycles is IPC (Instructions Per Cycle). When IPC is significantly below the microarchitecture's theoretical peak (e.g., Zen 5's 8-wide issue width corresponds to a typical value of about 4–6), and cache-misses is simultaneously high, the problem usually points to cache efficiency.

To precisely measure the latency of each cache level, the core method is to construct a pointer-chasing benchmark: allocate an array internally linked as a singly linked list in random order, traverse the list, and measure the average time per step. By controlling the list length to be less than L1d capacity (measuring L1 latency), greater than L1d but less than L2 (measuring L2 latency), greater than L2 but less than L3 (measuring L3 latency), and greater than LLC (measuring main memory latency), one can precisely fit the access latency for each level.
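A minimal sketch of such a benchmark follows. It uses Sattolo's algorithm so the random permutation forms a single cycle that visits every slot, and std::chrono instead of RDTSC for portability; make_cycle, chase, and ns_per_step are hypothetical names. Each slot here is only 8 bytes, so eight slots share a cache line; a careful measurement would pad each node to 64 bytes and repeat runs to reduce noise.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Random permutation that is one single cycle of length n (Sattolo's algorithm).
inline std::vector<std::size_t> make_cycle(std::size_t n, std::uint64_t seed) {
    assert(n > 0);
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::mt19937_64 rng(seed);
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }
    return next;
}

// Follow the chain: every load address depends on the previous load's result.
inline std::size_t chase(const std::vector<std::size_t>& next, std::size_t steps) {
    std::size_t idx = 0;
    for (std::size_t s = 0; s < steps; ++s) idx = next[idx];
    return idx;
}

// Average wall-clock nanoseconds per dependent load.
inline double ns_per_step(const std::vector<std::size_t>& next, std::size_t steps) {
    assert(steps > 0);
    const auto t0 = std::chrono::steady_clock::now();
    volatile std::size_t sink = chase(next, steps);  // keep the result observable
    const auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() /
           static_cast<double>(steps);
}
```

Sweeping n so that n × sizeof(std::size_t) moves from below L1d capacity to well beyond the LLC should show the per-step time rising in plateaus, one per cache level.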

This part begins at the top of the picture: why the memory wall exists, how the memory hierarchy is designed, why locality enables caching to work, and how virtual and physical addresses interact during cache lookup. The next part dives into the cache internals: the hardware trade-offs of the three organization schemes, why cache lines are 64 bytes, the cache topology and inclusion policies of modern CPUs, and the specific mechanism by which an address is partitioned into tag, set index, and block offset.

Building an Interpreter from Scratch: What 1600 Lines of Modern C++ Can Do

Tyler Tan — Fri, 05 Jun 2026 08:55:01 +0000

When you type python3 main.py and hit enter, what actually happens? How does text sitting on your hard drive end up executing on your CPU?

The answer is a program called an interpreter. Unlike a compiler, which translates source code into a standalone executable before running it, an interpreter reads your code directly, understands what it means, and executes it on the spot. Python, Ruby, JavaScript, Lua — the languages you use every day all run on interpreters under the hood.

We built one. LoxInterp is a complete interpreter written in C++23. It has full lexical scoping, closures, class inheritance, constructors, super, 39 token types, 13 AST node types — roughly 1600 lines of source, 1200 lines of tests, zero external dependencies.

The project is based on Robert Nystrom's classic tutorial Crafting Interpreters. The original uses Java; we hand-rolled a C++23 version. Open source at https://github.com/Tenaryo/LoxInterp.

What It Looks Like

Let's see it in action first. Here's a snippet of Lox — classes, inheritance, super, instances, all in one shot:

class Animal {
    init(name) { this.name = name; }
    speak()  { print this.name; }
}
class Dog < Animal {
    speak() {
        super.speak();
        print "woof!";
    }
}
var d = Dog("Buddy");
d.speak();

Save it as demo.lox and run it with a single command:

$ ./build/LoxInterp run demo.lox
Buddy
woof!

These dozen lines of Lox go through a full pipeline from raw text to CPU execution. Let's unwrap that pipeline layer by layer.

The Architecture: Simpler Than You Think

An interpreter's skeleton has just four stages:

Source Code
    │
    ▼
┌──────────┐      ┌──────────┐      ┌──────────────┐      ┌─────────────┐
│ Scanner  │ ───▶ │  Parser  │ ───▶ │   Resolver   │ ───▶ │ Interpreter │
│  Lexer   │      │  Parser  │      │ Binder+Check │      │   Runtime   │
└──────────┘      └──────────┘      └──────────────┘      └─────────────┘
 Token Stream         AST              Annotated AST          Output

The run command in main.cpp is nothing more than these four steps called in sequence, maybe a dozen lines total:

auto tokens = scanner.scan_tokens();          // 1. Text → Token stream
auto statements = parser.parse_statements();  // 2. Tokens → AST
resolver.resolve(statements);                  // 3. Binding + semantic checks
interpret(statements);                         // 4. Walk the tree and execute

One stage feeds into the next. Let's start at the front door — the Scanner.

Scanner: Text → Token Stream

The Scanner's job is brutally mechanical. It doesn't care about logic, doesn't care about structure, doesn't even know whether 1 + 1 is valid syntax. Its only job is to recognize what's in this blob of characters and slap labels on it.

A Token is just four fields:

struct Token {
    TokenType type;        // LEFT_PAREN / NUMBER / STRING / IF / ...
    std::string lexeme;    // Raw text: "(" / "42" / "\"hello\"" / "if"
    TokenLiteral literal;  // Parsed value: null / 42.0 / "hello" / null
    int line;              // Line number, for error reporting
};

Feed it class A < B { fun f(){} } and it spits out 13 flat tokens:

CLASS, IDENTIFIER(A), LESS, IDENTIFIER(B), LEFT_BRACE,
FUN, IDENTIFIER(f), LEFT_PAREN, RIGHT_PAREN,
LEFT_BRACE, RIGHT_BRACE, RIGHT_BRACE, EOF

Notice < is labeled LESS. The Scanner doesn't know if this < means inheritance or comparison — that's the Parser's problem. class and fun are recognized as keywords, not generic identifiers. The entire scanning process is one giant switch statement, dispatching on the first character:

auto Scanner::scan_token() -> void {
    char ch = advance();
    switch (ch) {
    case '(': add_token(LEFT_PAREN); break;   // Single-char: produce directly
    case '"': scan_string(); break;             // String: while peek != '"'
    case '0'...'9': scan_number(); break;       // Number: greedy consume
    case '!': add_token(match('=')?BANG_EQUAL:BANG); break; // Two-char
    case '/': if (match('/')) skip_comment(); else add_token(SLASH); break;
    default:
        if (is_alpha(ch)) {                     // Identifier / keyword
            while (is_alphanumeric(peek())) advance();
            auto it = kKeywords.find(lexeme);
            add_token(it != end ? it->second : IDENTIFIER);
        } else error("Unexpected character.");  // Illegal character
    }
}

A keyword map decides which identifiers are "built-in" — sixteen entries:

const unordered_map<string, TokenType> kKeywords = {
    {"print", PRINT}, {"var", VAR}, {"if", IF}, {"while", WHILE},
    {"class", CLASS}, {"fun", FUN}, {"and", AND}, {"or", OR},
    {"return", RETURN}, {"super", SUPER}, {"this", THIS}, ...
};

This is also why print becomes the keyword PRINT while clock remains a plain IDENTIFIER — clock isn't in the table. The Scanner doesn't know about it. It only works as a function call because the Interpreter, at startup, manually stuffs a clock function object into the global environment. That's where compile time and run time part ways.

Encounter @ or some other character Lox doesn't use? The Scanner prints an error to stderr, sets a flag, and keeps going. Unlike the Parser — which throws exceptions to unwind — lexical errors don't cascade. The next token is still valid, so keep scanning.

Parser: Token Stream → AST

The Scanner produces a flat, one-dimensional list of tokens. The Parser's job is to shape them into a nested, tree-structured AST (Abstract Syntax Tree). The statement print 2 + 3 * 4; isn't a pile of independent tokens; it's a print statement wrapping an addition whose right-hand side is itself a multiplication.

From a black-box perspective: the Parser takes std::vector<Token> in, and produces std::vector<Stmt> out:

auto Parser::parse_statements() -> std::vector<ast::Stmt> {
    std::vector<ast::Stmt> statements;
    while (!is_at_end()) {
        statements.push_back(declaration());  // Parse one top-level decl at a time
    }
    return statements;
}

Each top-level construct enters declaration(), which looks at the current token and dispatches:

declaration()
  ├─ Sees VAR    → var_declaration()          // var x = 1;
  ├─ Sees FUN    → function_declaration()     // fun foo(a,b) { ... }
  ├─ Sees CLASS  → class_declaration()        // class Dog < Animal { ... }
  └─ Otherwise   → statement()                // print / if / while / for / return / expression

Let's trace an expression statement: print 2 + 3 * 4;. statement() sees PRINT and enters print_statement(), which calls expression() to parse the right-hand side 2 + 3 * 4.

Expression parsing is where the Parser earns its keep. The technique is called recursive descent — lower-precedence operators wrap the higher-precedence ones. The parse of 2 + 3 * 4 goes like this:

expression() descends to term(). term() handles + and -:
  1. Call factor() for the left operand. factor() handles * and /.
     factor() descends to primary(), gets Literal(2). No * or / in sight → returns Literal(2).
  2. term() gets Literal(2) as the left operand. Check the next token: is it + or -?
     Yes — it's PLUS. Gobble the PLUS, record op = "+".
  3. term() calls factor() for the right operand.
     factor() descends to primary(), gets Literal(3). Then sees STAR. Gobbles it.
     Descends again, gets Literal(4). Returns Binary(*, 3, 4).
  4. term() wraps everything: Binary(+, Literal(2), Binary(*, 3, 4)).

The resulting tree — note how * sits deeper, ensuring it evaluates first:

Binary
├── left:  Literal(2)
├── op:    "+"
└── right: Binary
           ├── left:  Literal(3)
           ├── op:    "*"
           └── right: Literal(4)

Every precedence level follows the same template — recursively grab the left operand, then loop matching your own operators, recursively grab the right operand, wrap into a node:

auto term() -> Expr {
    auto expr = factor();                  // Left operand → delegate to the next-higher-precedence level
    while (match(PLUS) || match(MINUS)) {   // Loop: my operators
        auto op = previous();
        auto right = factor();             // Right operand → delegate again
        expr = Binary(expr, op, right);    // Wrap
    }
    return expr;                           // My operators gone → return
}
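
To see the template run end to end, here is a minimal, self-contained sketch. It is not LoxInterp's code: it assumes single-digit operands with no whitespace, reads characters instead of Tokens, and evaluates on the fly rather than building Binary nodes. MiniParser and evaluate_expression are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical mini-parser: single-digit operands, + - * / only, no spaces.
struct MiniParser {
    std::string_view src;
    std::size_t pos = 0;

    // Consume the next character if it equals c.
    bool match(char c) {
        if (pos < src.size() && src[pos] == c) { ++pos; return true; }
        return false;
    }

    // primary: exactly one digit.
    double primary() {
        assert(pos < src.size() && src[pos] >= '0' && src[pos] <= '9');
        return static_cast<double>(src[pos++] - '0');
    }

    // factor: primary (('*' | '/') primary)*
    double factor() {
        double value = primary();
        for (;;) {
            if (match('*')) value *= primary();
            else if (match('/')) value /= primary();
            else return value;
        }
    }

    // term: factor (('+' | '-') factor)*
    double term() {
        double value = factor();
        for (;;) {
            if (match('+')) value += factor();
            else if (match('-')) value -= factor();
            else return value;
        }
    }
};

// Evaluate a whole expression such as "2+3*4".
inline double evaluate_expression(std::string_view text) {
    MiniParser parser{text};
    return parser.term();
}
```

Because term() asks factor() for both operands, 3*4 is folded into a single value before the addition ever sees it. The while-style loop is also what makes 8-2-1 group from the left.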

Class declarations follow the same recursive pattern. class Dog < Animal { speak() { ... } } enters class_declaration(): grab the class name, check for < (superclass), then loop inside the braces until the closing }. Methods carry no fun keyword, so each iteration simply calls function_declaration() to parse one method's name, parameters, and body:

auto class_declaration() -> Stmt {
    auto name = consume(IDENTIFIER);            // "Dog"
    optional<Expr> superclass;
    if (match(LESS))                            // Has superclass?
        superclass = Variable(consume(IDENTIFIER));  // "Animal"
    consume(LEFT_BRACE);
    vector<FunctionStmt> methods;
    while (!check(RIGHT_BRACE))
        methods.push_back(function_declaration());  // "speak() {...}"
    consume(RIGHT_BRACE);
    return ClassStmt{name, superclass, methods};
}

One elegant design choice: for loops don't get their own AST node. for_statement() desugars them directly into while + block at parse time. The Interpreter never needs to know for exists — it only handles while and block. Fewer node types, simpler backend.

Resolver: Compile-Time Binding

Between the Parser finishing and the Interpreter starting sits one more compile-time pass — the Resolver. Its core task: pre-compute, at compile time, where every variable lives, and stamp that information directly into the AST.

Why? Consider this innocent-looking code:

var x = "global";
{
    fun f() { print x; }
    f();              // Prints "global" ✓
    var x = "local";   // A new x in the same scope
    f();              // Should still print "global" (closure semantics)
}

If the Interpreter just naively walks up the scope chain at runtime, the second call to f() would find the newly-declared x = "local" and print the wrong thing. The Resolver prevents this.

It maintains a scope stack, scopes_. Each frame is a map<name, bool> where true means "fully defined, ready to use" and false means "declared but not yet initialized" (the window between var x and the = sign). As it walks the AST, it records what it finds in a depth field on each Variable node:

struct Variable {
    Token name;
    int depth = -1;  // -1 = unresolved
    // >= 0 = "skip this many environment frames to find this variable"
};

When it enters {, it pushes an empty frame. When it exits }, it pops. For each Variable node, it scans the stack from the innermost frame outward, counts how many frames separate the variable from its declaration, and writes that depth into the node. The Interpreter then uses env->get_at(depth, name): it hops exactly depth frames instead of searching the chain by name, so there is no ambiguity and no pollution from later declarations.

The Resolver also catches a bunch of semantic errors at compile time — things that are syntactically valid but logically wrong:

return 42;                  // At top level? → Error
this.x = 1;                 // Outside a class? → Error
{ var x = x; }             // Self-reference in initializer? → Error
class Foo < Foo {}          // Inheriting from yourself? → Error
class Bar { m() { super.m(); } } // super without superclass? → Error
class Baz { init() { return 1; } } // Returning value from init? → Error

All caught with exit code 65 before a single line of code runs.

Interpreter: The Real Thing

The Interpreter is the backend — it takes the depth-annotated AST from the Resolver, recursively walks the tree, and executes. Three functions, that's it:

auto interpret(statements)  → void           // Create global env, execute each stmt
auto execute(stmt, env)     → void           // Execute one statement (side effects)
auto evaluate(expr, env)    → LoxLiteral     // Evaluate one expression (produces a value)

execute handles all statements. print x → evaluate x and cout the result. var x = 1 → evaluate the right side, then env->define("x", 1.0). if (cond) { A } else { B } → evaluate the condition, check truthiness, execute the chosen branch. while (cond) { body } → loop evaluating the condition and executing the body until false.

evaluate handles all expressions. Literal → return the value directly. Binary → evaluate left and right, then apply the operator. Variable → skip depth frames and grab it with env->get_at(depth, name). Call → evaluate the callee (must be a Callable), evaluate each argument, then invoke .call().

The Environment Chain: Heart of the Interpreter

Every time the program enters { } (a block), a fresh Environment is created:

struct Environment {
    unordered_map<string, LoxLiteral> values_;     // Variables in this scope
    shared_ptr<Environment> enclosing_;            // Pointer to the outer scope
};

These link into a scope chain:

Global env:        {clock: Callable}
                       ↑
Block env:         {a: 1.0}           enclosing_ → global
                       ↑
Function body env: {x: 42.0}         enclosing_ → block

Variable lookup walks the chain. Variable definition only writes in the current frame. Assignment walks the chain to find where it was defined.
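
Those three rules, plus the Resolver-driven get_at, can be sketched as follows. This is a simplified illustration rather than LoxInterp's Environment: values are plain doubles instead of LoxLiteral, and Env with its member names is hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Simplified environment: values are doubles instead of LoxLiteral.
struct Env {
    std::unordered_map<std::string, double> values;
    std::shared_ptr<Env> enclosing;

    explicit Env(std::shared_ptr<Env> parent = nullptr) : enclosing(std::move(parent)) {}

    // Definition only ever writes to the current frame.
    void define(const std::string& name, double value) { values[name] = value; }

    // Lookup walks outward until the name is found.
    double get(const std::string& name) const {
        for (const Env* env = this; env != nullptr; env = env->enclosing.get()) {
            if (auto it = env->values.find(name); it != env->values.end()) return it->second;
        }
        throw std::runtime_error("Undefined variable '" + name + "'.");
    }

    // Assignment updates the frame where the name was defined.
    void assign(const std::string& name, double value) {
        for (Env* env = this; env != nullptr; env = env->enclosing.get()) {
            if (auto it = env->values.find(name); it != env->values.end()) {
                it->second = value;
                return;
            }
        }
        throw std::runtime_error("Undefined variable '" + name + "'.");
    }

    // Resolver-style lookup: hop exactly `depth` frames, then read.
    double get_at(int depth, const std::string& name) const {
        const Env* env = this;
        for (int i = 0; i < depth; ++i) env = env->enclosing.get();
        return env->values.at(name);
    }
};
```

Note the asymmetry: define can shadow an outer name by creating a new entry in the current frame, while assign never creates anything and only updates an existing entry.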

Why shared_ptr? Because closures grab the environment at definition time:

fun makeCounter() {
    var i = 0;                      // Local to makeCounter
    fun count() { i = i + 1; return i; }
    return count;                   // count escapes to the outside!
}
var c = makeCounter();
c();  // 1
c();  // 2

When count is defined, fn->closure captures makeCounter's local environment — the one holding i. After makeCounter returns, that environment would normally evaporate. But count's closure still holds a shared_ptr to it, so the reference count stays above zero and i lives on. Every call to c() enters Function::call():

auto Function::call(env, args) -> LoxLiteral {
    auto func_env = make_shared<Environment>(closure); // Parent = captured env
    for (size_t i = 0; i < params.size(); i++)
        func_env->define(params[i], args[i]);          // Args bound in this frame
    try {
        for (auto& stmt : *body) execute(stmt, func_env);
    } catch (const Return& ret) { return ret.value; }   // return unwinds here
    return monostate{};
}

Parameters are bound in func_env itself (depth 0). Outer variables are reachable through enclosing_. That's closures, in their entirety.

Classes, Inheritance, and super

A class itself is a value — it can be assigned to variables, passed as an argument, printed:

// When a class is defined:
auto klass = make_shared<LoxClass>();
klass->name = "Dog";
klass->methods_ = { {"speak", speak_fn}, {"init", init_fn} };
env->define("Dog", klass);  // Stored just like any other variable

Instantiation — Dog("Buddy") — calls LoxClass::call():

auto LoxClass::call(env, args) -> LoxLiteral {
    auto instance = make_shared<LoxInstance>();           // 1. Blank instance
    instance->klass = shared_from_this();                  // 2. Tag its class
    auto init = find_method("init");                      // 3. Find constructor
    if (init) {
        auto bound = copy(init);                          // 4. Copy the function
        bound->closure->define("this", instance);         // 5. Bind this
        if (superclass_) bound->closure->define("super", superclass_); // 6. Bind super
        bound->call(env, args);                           // 7. Execute constructor
    }
    return instance;                                      // 8. Always return instance
}

The constructor's return value is discarded — LoxClass::call() always returns the newly minted instance. That's the constructor guarantee.

Inheritance works through find_method, which walks the chain:

auto LoxClass::find_method(name) -> shared_ptr<Function> {
    if (methods_.contains(name)) return methods_[name];        // Check self
    if (superclass_) return superclass_->find_method(name);    // Check parent
    return nullptr;                                            // Not found
}

Method overriding is automatic — start from the subclass and go up; the first match wins.

How super Works — and Why declaring_class_ Exists

super has a subtle semantic. Consider this inheritance chain:

class A { say() { print "A"; } }
class B < A { test() { super.say(); }   say() { print "B"; } }
class C < B {                            say() { print "C"; } }
C().test();  // Should print "A", not "B" or "C"

C().test() → C doesn't have test(), so walk up to B → execute B's test(), which calls super.say(). The question is: which class does super refer to?

Logically, test was defined in class B, so its super should be B's parent — class A. But at runtime, this points to the C instance. If we naively compute this.klass.superclass_, C's parent is B — we'd find B's say(), print "B", and get it wrong.

So Function needs to remember which class it belongs to:

struct Function : Callable {
    weak_ptr<LoxClass> declaring_class_;  // ← "Which class defined me?"
};

Why weak_ptr rather than shared_ptr? Because LoxClass::methods_ already holds a shared_ptr<Function>. If Function held a shared_ptr back to LoxClass, we'd have a cycle — each keeps the other alive, reference counts never reach zero, memory leaks. A weak_ptr doesn't increment the reference count. It just asks: "are you still alive?" If the class is destroyed, the method doesn't need it anymore.

When a method is bound to an instance — that is, when instance.test() triggers LoxInstance::get("test") — the declaring class is used to inject super into the closure:

if (auto dc = method->declaring_class_.lock()) {           // Get defining class
    if (dc->superclass_ != nullptr)
        bound->closure->define("super", dc->superclass_);   // super = B's parent = A
}

No matter how deep the inheritance chain goes below it, super always refers to the superclass of the method's birthplace.

The init Constructor

An init method is mechanically identical to any other method — it just gets two special treatments. First, LoxClass::call() always returns the instance regardless of what init returns. Second, Function carries an is_init_ flag:

if (is_init_) {
    return func_env->get_at(1, "this");   // Return this, not nil
}

This allows chaining: instance.init(x).someProp — after init runs, the return value is the instance itself.

Why return Throws an Exception

A return statement can be buried at any depth — inside an if, inside a while, inside a block, inside another if. If we passed the return value back through the call stack layer by layer, every single execute would need to check "did something below me return?" That's noise in every handler.

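Instead, Function::call() wraps the body execution in a try-catch. The return statement throws a Return object. No matter how deep the nesting, it unwinds directly to the catch block at the top of the function. Clean, with no boilerplate in the statement handlers. The price is a C++ exception throw on every return, which is typically much slower than an ordinary function return.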
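
Here is the unwinding trick in isolation. It is a sketch under simplifying assumptions: statements are modeled as std::function<void()>, values are doubles, and ReturnSignal, Statement, and call_function are hypothetical names rather than LoxInterp's types.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical signal type; it carries the returned value up the C++ stack.
struct ReturnSignal { double value; };

using Statement = std::function<void()>;

// A "statement" that returns from inside a loop and a conditional.
inline void deeply_nested_return() {
    for (int i = 0; i < 10; ++i) {
        if (i == 3) {
            throw ReturnSignal{static_cast<double>(i)};  // plays the role of `return i;`
        }
    }
}

// Plays the role of Function::call(): run the body, catch the unwinding return.
inline double call_function(const std::vector<Statement>& body) {
    try {
        for (const auto& stmt : body) stmt();
    } catch (const ReturnSignal& ret) {
        return ret.value;
    }
    return 0.0;  // stand-in for nil when the body falls off the end
}
```

Neither the loop nor the if inside deeply_nested_return has to know that a return happened. Statements after the throwing one are never executed, which is exactly what return is supposed to mean.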

Closing

From a raw string through the Scanner's 39 token types, through the Parser's recursive descent building the AST, through the Resolver's compile-time binding and semantic checks, to the Interpreter's environment chain and recursive execution — a working interpreter gets built one layer at a time. Around 1600 lines of source, 1200 lines of tests, zero external dependencies.

The biggest takeaway from building this: interpreters aren't magic. The core engine is a few hundred lines of tree walking. Everything else — expression evaluation, scope management, function calls — is layering features on top of that kernel. After writing your own Scanner/Parser/Interpreter, you can open the CPython or V8 source code and immediately recognize which module does what.

Of course, this implementation is heavily stripped down. No JIT compilation (tree-walk interpretation is the slowest execution model), no bytecode generation, no garbage collector (reference counting via shared_ptr is the entire story), no tail-call optimization, no real error recovery (just the simplest panic-mode synchronize), no standard library beyond a single clock function. Industrial interpreters — CPython, V8, LuaJIT — invest hundreds of thousands of lines in these directions.

But that's exactly the point. By peeling away all the performance optimizations and engineering complexity, what's left is the raw, unvarnished four-layer pipeline that every interpreter shares. If you want to see what that looks like, the code is at https://github.com/Tenaryo/LoxInterp. Pull requests and nitpicks welcome.

Building SQLite from Scratch: 740 Lines of C++23 to Understand Every Byte of a .db File

Tyler Tan — Fri, 22 May 2026 11:56:56 +0000

You fire up a MySQL client, connect to port 3306, send off your SQL, and the server parses, optimizes, hits an index, fetches rows, and packs the result back to you. You can picture that entire pipeline.

SQLite has none of that. No server process, no port, no wire protocol. Just a single file: my.db.

So the real question is — what exactly is stuffed inside that file that makes SELECT * FROM apples WHERE color='Yellow' return the right answer?

TinySqlite takes this apart across 740 lines of C++23. It doesn't link against the official SQLite library. It opens a .db file's raw binary and pries the data directly out of the disk bytes. We'll follow its code path, peeling back SQLite's file format layer by layer.

This article covers: file header → B-tree pages → varint encoding → the schema table → full table scans → index scans.

What SQLite Actually Is

Let's get the definition straight first.

SQLite is an embedded relational database engine — in plain English: it's a C library you compile into your program, and once you open a file, you can run SQL against it. No server, no install, no root password.

If you're familiar with MySQL, here's the mental model. MySQL is a restaurant — a dedicated kitchen (server process), waitstaff (connection handling), a complex ordering system (query optimizer). You sit down, say "SELECT," and the back of house scrambles to bring you the dish.

SQLite is your fridge. Open it, grab what you need, nobody serves you. The entire database is a single data.db file. Copy it, carry it, done.

Traditional client-server database:           SQLite's embedded model:
┌─────────┐   TCP/network   ┌───────────┐   ┌──────────────────────────────┐
│ Your app │ ←───────────→ │ DB server  │   │ Your app                     │
└─────────┘                └───────────┘   │  ├── libsqlite.so (engine)    │
                                           │  ├── data.db (the only file)  │
                                           │  └── all ops are local calls  │
                                           └──────────────────────────────┘

Why should you care? You might not encounter MySQL every day, but you're almost certainly already using SQLite. Your phone's contacts, WeChat messages, Chrome bookmarks and browsing history — all stored in SQLite. Every iPhone, every Android device, every browser runs a SQLite instance. It's probably the most deployed database engine on the planet, bar none.

Using it is trivial. Create a database, make a table, insert data, query:

$ sqlite3 test.db
sqlite> CREATE TABLE fruits (name TEXT, price INT);
sqlite> INSERT INTO fruits VALUES ('apple', 5);
sqlite> SELECT * FROM fruits WHERE price < 10;
apple|5

If you're writing C/C++, include sqlite3.h and a handful of lines embed a full database in your program.

Great. You're using it comfortably. But do you actually know — what do the bytes inside test.db look like?

Now flip roles. Stop being the user, become the reverse engineer. TinySqlite is a set of reverse-engineering notes that dissects the .db file's binary structure piece by piece. Let's begin.

Opening the File — How a .db File Is Organized

The Entire File Is a Chain of Pages

At the macro level, a .db file is astoundingly simple: it's a sequence of fixed-size pages laid end to end. Every page is the same size (typically 4096 bytes), numbered starting from page 1.

Picture a bookshelf where every shelf slot is the same width. To find the 3rd book, you start from the shelf edge and skip 2 × slot_width. SQLite pages work the same way: the data for page N starts at file offset (N-1) × page_size.
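
The offset rule is a one-liner. A minimal sketch follows; page_file_offset is a hypothetical helper name, not something taken from TinySqlite.

```cpp
#include <cassert>
#include <cstdint>

// File offset of page `page_number` (1-based) for a given page size.
constexpr std::uint64_t page_file_offset(std::uint32_t page_number, std::uint32_t page_size) {
    assert(page_number >= 1);  // there is no page 0
    return static_cast<std::uint64_t>(page_number - 1) * page_size;
}
```

The widening to 64 bits happens before the multiplication, so large files with many pages do not overflow a 32-bit intermediate.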

my.db file:
┌─────── page 1 ───────┐┌─────── page 2 ───────┐┌─────── page 3 ───────┐┌── ...
│ file header (1st 100B)││  page header         ││  page header         │
│ page_size = 4096      ││  type = 0x0D (leaf)  ││  type = 0x05 (interior)
│ num_tables = 3        ││  cells = [row1,row2] ││  child page ptrs     │
│ ...                   ││  ...                 ││  ...                 │
└───────────────────────┘└──────────────────────┘└──────────────────────┘

Page 1 is special — its first 100 bytes form the file header, storing global metadata. Every page after that has only a page header followed by actual data.

What the File Header Carries

The first 100 bytes of page 1 in every SQLite file follow a fixed format. The first 16 bytes are the magic string "SQLite format 3\000" — the file's "ID card." It tells any program that tries to read the file: hey, I'm a SQLite 3 format database.

Bytes 16–17 store the page size. Note that this is stored in big-endian — high byte first. If these two bytes read 0x10 0x00, that's 4096. If the page size is 512, they'd be 0x02 0x00.

Here's how TinySqlite reads the page size:

static constexpr size_t kPageSizeOffset = 16;

auto read_u16_be(size_t offset) const noexcept -> uint16_t {
    return static_cast<uint16_t>(data_[offset]) << 8
         | static_cast<uint16_t>(data_[offset + 1]);
}

// In the constructor:
page_size_(read_u16_be(kPageSizeOffset))

Two bytes assembled into a uint16_t. No magic.

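Another critical number hides at file byte offset 103: the table count. It is not part of the 100-byte file header. It sits at offset 3 of the B-tree page header that starts right after the file header (100 + 3), which makes it the 2-byte cell count of page 1. Each cell on page 1 is one sqlite_master row, so TinySqlite reads this to know how many schema entries the database holds. Strictly speaking, that count includes indexes, views, and triggers as well as tables, and it assumes the whole schema fits on page 1.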

static constexpr size_t kSchemaCountOffset = 103;
num_tables_(read_u16_be(kSchemaCountOffset))

What a B-tree Page Looks Like

Now that you know a file is a chain of pages, the next question is — what's inside a page? How is data actually organized?

SQLite uses a B-tree to organize data. Each table is a B-tree, and each node in that tree is a page. Pages have two roles:

Interior pages (type 0x05): These don't store actual data rows. They store "signposts" — child page numbers and key ranges. Their job is navigation: which subtree contains the data you're looking for.
Leaf pages (type 0x0D): These hold the real row data. Every INSERT ultimately lands in a cell on some leaf page.

Visually:

              Page 2 (interior, 0x05)
               /                      \
      Page ? (leaf, 0x0D)        Page ? (leaf, 0x0D)
       [Granny Smith]              [Fuji]
       [Golden Delicious]          [Honeycrisp]

A page's internal structure (the page header) starts at page offset 0:

Offset	Size	Meaning
0	1 byte	page type (`0x05`=interior table, `0x0D`=leaf table, `0x02`=interior index, `0x0A`=leaf index)
1	2 bytes	offset of first freeblock
3	2 bytes	number of cells
5	2 bytes	start of cell content area
7	1 byte	number of fragmented free bytes
8	4 bytes	rightmost child pointer (interior pages only)

Leaf pages stop after offset 7, so their header is 8 bytes long. On interior pages the header is 12 bytes: its last 4 bytes (offset 8) are the rightmost child pointer, pointing to the rightmost child page. The cell pointer array follows the header, 2 bytes per cell, each entry pointing to that cell's actual location within the page.

This "rightmost child + cell pointer array" structure is what enables the B-tree to hop between pages. We'll expand on this when we cover full table scans.

At this point, you know three key facts:

A .db file is a sequence of fixed-size pages
The file header tells you the page size, and page 1's cell count tells you how many schema entries exist
A B-tree organizes the data — interior pages navigate, leaf pages store rows

The next natural question — how does SQLite know which tables exist and where each table's B-tree root lives? The answer is tucked inside a special table.

sqlite_master — The Database's "Table of Contents"

The Table of Tables

SQLite has a hidden system table called sqlite_master. Think of it as the table of contents at the front of a book — it doesn't store your business data, it describes the structure of the entire database.

sqlite_master (system table, exists in every .db file)
┌──────────┬───────────┬──────────┬─────────────────────────────┐
│ type     │ name      │ rootpage │ sql                         │
├──────────┼───────────┼──────────┼─────────────────────────────┤
│ "table"  │ "apples"  │ 2        │ "CREATE TABLE apples(...)"  │
│ "table"  │ "oranges" │ 4        │ "CREATE TABLE oranges(...)" │
│ "index"  │ "idx_..." │ 6        │ "CREATE INDEX..."           │
└──────────┴───────────┴──────────┴─────────────────────────────┘

Each row represents a table, index, view, or trigger. The two most important columns are name (the table name) and rootpage (the root page number of that table's B-tree). When you later run SELECT * FROM apples, that rootpage = 2 is how the engine finds the entry point to the apples table's data.

So how do you find sqlite_master itself? Its data lives at a fixed location: page 1. The 2-byte cell count at kSchemaCountOffset (offset 103, inside page 1's B-tree page header, which follows the 100-byte file header) tells you how many rows there are. Right after that 8-byte page header, starting at kCellPtrArrayStart (offset 108), is the cell pointer array: each 2-byte pointer references a cell within page 1 that belongs to sqlite_master.

But before we can actually parse those cells, we need two encoding tools — varint and serial type. They're how SQLite "writes numbers" and "describes types" on disk.

Varint: Waste Not, Want Not

SQLite's on-disk format leans heavily on a variable-length integer encoding called varint. The core idea is simple: small numbers take less space, big numbers take more.

The rule: each byte contributes its lower 7 bits as data, and the highest bit (bit 7) is a "continue" flag. If bit 7 is 1, there are more bytes coming. If it's 0, this is the last byte. Up to 9 bytes, with the 9th byte using all 8 bits.

Value	Encoding (hex)	Bytes
5	`05`	1
300	`82 2C`	2
1000000	`BD 84 40`	3

300 = 0b0000_0001_0010_1100. Split the low 14 bits into 7-bit groups: 0000010 and 0101100. Add the continue bit: the high group gets 1 (10000010 = 0x82), the low group gets 0 (00101100 = 0x2C). read_varint does the reverse, pulling the value back out of the byte stream:

auto read_varint(size_t offset) const noexcept -> VarintResult {
    uint64_t value = 0;
    for (int i = 0; i < 9; ++i) {
        auto byte = static_cast<uint8_t>(data_[offset + i]);
        if (i == 8) {
            value = (value << 8) | byte;
            return {value, 9};
        }
        value = (value << 7) | static_cast<uint64_t>(byte & 0x7F);
        if ((byte & 0x80) == 0)
            return {value, static_cast<size_t>(i) + 1};
    }
    std::unreachable();
}

Each iteration grabs 7 bits and shifts them into the result. The loop stops when it hits a byte whose high bit is 0. If it reaches the 9th byte without stopping, it uses the full 8 bits — that's varint's maximum width.
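
Writing the inverse is a good way to check your understanding of the format. TinySqlite only reads files, so the encoder below is purely an illustration; encode_varint is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode `value` in SQLite's big-endian varint format (1 to 9 bytes).
inline std::vector<std::uint8_t> encode_varint(std::uint64_t value) {
    // Values that need more than 56 bits use the 9-byte form:
    // eight bytes of 7 bits each, then a final byte holding the low 8 bits.
    if ((value >> 56) != 0) {
        std::vector<std::uint8_t> out(9);
        out[8] = static_cast<std::uint8_t>(value & 0xFF);
        value >>= 8;
        for (int i = 7; i >= 0; --i) {
            out[static_cast<std::size_t>(i)] = static_cast<std::uint8_t>((value & 0x7F) | 0x80);
            value >>= 7;
        }
        return out;
    }
    // Otherwise collect 7-bit groups from the low end, then reverse them.
    std::vector<std::uint8_t> groups;
    do {
        groups.push_back(static_cast<std::uint8_t>(value & 0x7F));
        value >>= 7;
    } while (value != 0);
    std::vector<std::uint8_t> out(groups.rbegin(), groups.rend());
    // Set the continue bit on every byte except the last.
    for (std::size_t i = 0; i + 1 < out.size(); ++i) out[i] |= 0x80;
    return out;
}
```

Feeding the output back through a decoder like read_varint should return the original value, which makes for an easy round-trip test.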

Serial Type: What Exactly Is in This Column

Every column of a record carries a serial type code on disk. It tells the parser: is this column NULL, an integer, text, and how many bytes does it occupy?

Type code	Meaning	Size in bytes
0	NULL	0
1 ~ 4	1/2/3/4-byte integer	equals the type code
5	6-byte integer	6
6	8-byte integer	8
7	IEEE float	8
8	literal 0	0
9	literal 1	0
≥12 and even	BLOB	(N - 12) / 2
≥13 and odd	text string	(N - 13) / 2

Notice the two special values 8 and 9 — the integer values 0 and 1 take up zero bytes on disk; the serial type alone encodes the value. SQLite's disk format is this miserly.

The corresponding code in TinySqlite:

static constexpr auto serial_type_size(uint64_t serial_type) noexcept -> size_t {
    if (serial_type <= 4)
        return std::array{0, 1, 2, 3, 4}[serial_type];
    if (serial_type == 5) return 6;
    if (serial_type == 6 || serial_type == 7) return 8;
    if (serial_type >= 12 && serial_type % 2 == 0)
        return static_cast<size_t>((serial_type - 12) / 2);
    if (serial_type >= 13 && serial_type % 2 == 1)
        return static_cast<size_t>((serial_type - 13) / 2);
    return 0;
}

Decoding a Schema Row in Real Time

With varint and serial type in hand, we can now decode a single row from sqlite_master. Say this row describes the apples table: type="table", name="apples", rootpage=2, sql="CREATE TABLE apples(...)".

A schema table cell roughly follows this layout: [payload size (varint)] [rowid (varint)] [header size (varint)] [5 serial types (varint)] [body: actual data for 5 columns]

TinySqlite's parse_schema_entry does exactly this: skip payload size and rowid → read 5 serial types → compute per-column byte sizes from each serial type → sequentially read type, name, tbl_name, rootpage, and sql from the body region.

You don't need to memorize every step — just know that this is how you get a table name and its rootpage out of raw binary. Once you have the rootpage, you start traversing that table's B-tree from the corresponding page.

SELECT + WHERE — How Data Actually Gets Found

The first three sections laid all the groundwork: file format, B-tree, schema table, encoding primitives. Now let's answer the question that's been hanging since the beginning —

You type SELECT name FROM apples WHERE color='Yellow'. What does SQLite actually do?

SQL parsing? TinySqlite handles it with string search — find FROM, split column names, find WHERE, extract conditions. Let's skip past that in one sentence.

The interesting part is what comes next: finding the data.

Full Table Scan: Recursive Descent, Zero Missed Rows

Without an index, SQLite's only option is a full table scan — read every row of the target table top to bottom, then filter with the WHERE clause. Sounds simple enough, but when a table spans multiple pages, how do you guarantee nothing is skipped?

The answer is in the B-tree traversal algorithm. The entry point is the rootpage — the page number recorded in sqlite_master. TinySqlite starts here:

auto rp = db.rootpage(sel.table);

From the rootpage, read_columns_values recursively walks the entire B-tree. The core logic:

If it's an interior page (0x05): each cell has a left child pointer pointing to a subtree. Iterate all cells, recursively process each subtree. There's also a rightmost child pointer pointing to the rightmost subtree — can't forget that one.
If it's a leaf page (0x0D): each cell is a data row. Read them one by one and match against the WHERE condition.

The rightmost child is the most overlooked piece of design — that 4-byte pointer sitting before the cell pointer array on interior pages. It points to the subpage covering the range "to the right" of all cells. Without it, the rightmost chunk of data simply gets dropped.

// Interior page processing logic
auto num_cells = read_u16_be(page_offset + 3);
auto right_child = read_u32_be(page_offset + 8);  // ← don't forget this one

for (uint32_t i = 0; i < num_cells; ++i) {
    auto cell_ptr = page_offset + read_u16_be(page_offset + 12 + i * 2);
    auto child = read_u32_be(cell_ptr);  // ← left child
    auto child_rows = read_columns_values(child, column_indices, filter);
    // ...collect rows from the child page...
}
auto right_rows = read_columns_values(right_child, column_indices, filter);
// ...collect rows from the rightmost child...

The recursion keeps diving, stops at leaf pages, reads data, and bubbles it back up. The entire B-tree gets fully traversed — not a single row is missed.

Column value extraction for each row relies on the serial type machinery from the previous section: read varints for serial types and byte widths, then pull data out of the payload region. If a column happens to be the one in the WHERE clause, compare on the spot.
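
As a reminder of how a serial type turns into a byte width, here is a minimal sketch of the mapping from the SQLite file format documentation. The name serial_type_size is hypothetical and is not taken from TinySqlite; treat it as an illustration rather than the project's actual helper.

```cpp
#include <cstdint>

// Payload bytes occupied by a value with the given serial type, following the
// record format table in the SQLite file format documentation.
// serial_type_size is a hypothetical name used only for this sketch.
inline uint64_t serial_type_size(uint64_t serial_type) {
    switch (serial_type) {
        case 0: return 0;  // NULL
        case 1: return 1;  // 8-bit signed integer
        case 2: return 2;  // 16-bit signed integer
        case 3: return 3;  // 24-bit signed integer
        case 4: return 4;  // 32-bit signed integer
        case 5: return 6;  // 48-bit signed integer
        case 6: return 8;  // 64-bit signed integer
        case 7: return 8;  // IEEE 754 double
        case 8: return 0;  // the integer constant 0, no payload bytes
        case 9: return 0;  // the integer constant 1, no payload bytes
        case 10:
        case 11: return 0;  // reserved, should not appear in a valid file
        default:
            // Even values >= 12 are BLOBs, odd values >= 13 are text.
            return serial_type % 2 == 0 ? (serial_type - 12) / 2
                                        : (serial_type - 13) / 2;
    }
}
```

For example, the six-character string Yellow is stored with serial type 13 + 2 × 6 = 25, and the function maps 25 back to a width of 6 bytes.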

Index Scan: Taking the B-tree Shortcut

The problem with full table scans is obvious: you're only looking for color='Yellow', yet you're reading every apple of every color. When a table has hundreds of thousands of rows, this hurts.

SQLite's solution is an index. An index is itself a B-tree, but with a few twists:

Page types are 0x02 (interior index page) and 0x0A (leaf index page)
Cells don't carry full rows. They carry index column values + rowid
The B-tree is sorted by the indexed column

Back to WHERE color='Yellow'. If the apples table has an index on the color column, TinySqlite's path becomes:

Step 1: Search the index B-tree, collect matching rowids.

The index_search function traverses the index B-tree to find every entry where color='Yellow'. Because index pages are sorted by the color column, the search can skip whole subtrees: on each page it compares the target against the stored keys, descends into a left child only when the target could lie in that subtree, collects the rowid on a match, and stops scanning once the keys exceed the target. Vastly more efficient than a full table scan.

Step 2: Use rowids to locate individual rows in the table B-tree.

With a list of rowids in hand, read_row_by_rowid does point lookups on the table B-tree. Each point lookup follows a path similar to the recursive scan — compare rowids on interior pages to decide which child page to descend into — but it hunts for a single row rather than traversing every cell.
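
The child-selection step of that point lookup can be sketched on its own. This is a minimal in-memory illustration, assuming the interior page has already been decoded into (left child, key) pairs; InteriorCell and pick_child are hypothetical names, and TinySqlite's read_row_by_rowid may be organized differently. It relies on the file format rule that every rowid in a cell's left subtree is less than or equal to that cell's key.

```cpp
#include <cstdint>
#include <vector>

// One decoded cell of an interior table B-tree page (type 0x05):
// a left child page number plus an integer key.
struct InteriorCell {
    uint32_t left_child;
    uint64_t key;  // every rowid in the left subtree is <= key
};

// Returns the page to descend into when hunting for `rowid`.
// Hypothetical helper for illustration only.
inline uint32_t pick_child(const std::vector<InteriorCell>& cells,
                           uint32_t right_child, uint64_t rowid) {
    for (const auto& cell : cells) {
        if (rowid <= cell.key) {
            return cell.left_child;  // first key that covers the rowid
        }
    }
    return right_child;  // larger than every key: rightmost subtree
}
```

A linear scan keeps the sketch short; since the keys on a page are sorted, a binary search over the cells would work just as well.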

if (sel.where) {
    auto idx_rp = db.index_rootpage(sel.table, sel.where->column);
    if (idx_rp) {
        std::vector<uint64_t> rowids;
        db.index_search(*idx_rp, sel.where->value, rowids);
        for (auto rid : rowids) {
            auto row = db.read_row_by_rowid(*rp, rid, col_indices);
            // ...collect results...
        }
        return result;
    }
}
// No index, fall back to full table scan
return QueryResult{.rows = db.read_columns_values(*rp, col_indices, filter)};

The code above is the core decision logic in TinySqlite's execute_query: if there's an index, use it; otherwise, do a full scan. SQL parsing, schema rootpage lookup, B-tree traversal, WHERE filtering, index search — everything covered so far converges into these dozen lines.

Of course, production-grade SQLite is far more complex. It has a query planner to pick among indexes, WAL journaling for crash recovery and for letting readers work from a snapshot while a writer is active, B-tree page splits and rebalancing. But the core skeleton (file header → page → B-tree → schema → full scan / index scan) is exactly this.

Databases Aren't That Mysterious

Starting from the client-server architecture contrast with MySQL, we've peeled all the way down to the .db file's bedrock: fixed-size pages strung together into a tree, a handful of header bytes telling you page size and table count, a schema table whose varint-encoded rows describe where every table lives, interior B-tree pages serving as signposts and leaf pages holding the actual data, and indexes as separate B-trees already sorted for you.

740 lines of C++23, zero external dependencies, spanning the full path from a binary file header to a SELECT query result. It won't run TPC-C, and it's not going to replace libsqlite3 in your project. But if you want to see what every byte in a .db file is doing, it's exactly enough.

Code at https://github.com/Tenaryo/TinySqlite — this is a teaching-grade implementation, not an industrial one. Issues and feedback are still welcome.

Building Kafka from Scratch: A Message Broker in 1800 Lines of C++23

Tyler Tan — Thu, 21 May 2026 02:43:49 +0000

You wrote a web scraper. It crawls product pages and pipes the results downstream for processing. You wired it up with raw TCP, fire-and-forget style. Then the downstream service crashed. After the restart, the messages were gone, and your scraper had no idea what was sent and what wasn't.

You need something that holds onto messages until the consumer is ready to pick them up. In other words, you need a message queue.

Enter Kafka.

Kafka is the most widely deployed distributed messaging engine on the planet, powering data pipelines at LinkedIn, Uber, Netflix, and basically anywhere that moves serious volume. You give it a topic (say, crawler-results), producers push messages in, consumers pull them out. In between sits the Broker, handling connections, persisting data to disk, and routing traffic. Messages don't get lost, ordering is preserved, and scaling is just a matter of adding more machines.

But real Kafka clocks in at roughly half a million lines of Java. Even browsing the source tree is enough to make most people close the tab.

So I stripped it to the bone. The result is TinyKafka, written from scratch in C++23, ~1,800 lines of core code, zero external dependencies, pure standard library and POSIX sockets. It implements four essential APIs, Produce, Fetch, DescribeTopicPartitions, and ApiVersions, plus a hand-rolled Kafka binary protocol stack and disk-backed persistence. Over 3,200 lines of tests verify every byte on the wire and every write to disk.

Running it is trivial:

$ cd TinyKafka && ./build.sh && ./build/kafka
Waiting for clients to connect...

It binds port 9092 and sits there waiting for Kafka clients to show up.

So what actually happens to a message from the moment it arrives to the moment it leaves? Let's crack it open, layer by layer.

What Kafka Actually Is

Let's get the concepts out of the way in two paragraphs.

Kafka's core model has three pieces: a Producer writes messages into a logical channel called a Topic, a Consumer reads messages from that Topic, and the Broker in the middle stores and forwards everything. A Topic can be split into multiple Partitions, spreading data across them so throughput scales horizontally.

It solves three problems: decoupling (producers and consumers don't need to know about each other), buffering (messages pile up on disk and get consumed at the consumer's pace), and durability (messages hit the disk and survive restarts).

Alright, concepts done. Now let's see what a Broker looks like when you remove everything that isn't essential.

The While Loop Is the Whole Engine

Here's TinyKafka's entire flow in pseudocode:

1. Startup: read metadata from disk → now you know what topics and partitions exist
2. Bind port 9092, start accepting client connections
3. For each connected client, detach a thread:
   while (connection alive) {
       read 4 bytes → now you know the message length
       read the full message body
       parse_request()   → binary blob becomes typed struct
       Broker::handle()  → do the actual work
       serialize()       → response back to binary
       send_all()        → fire it back to the client
   }

That's main.cpp in its entirety, 88 lines. Here's the real thing, condensed to its skeleton:

int main() {
    // 1. Read KRaft metadata from disk
    auto metadata = parse_cluster_metadata_file(
        "/tmp/kraft-combined-logs/__cluster_metadata-0/00000000000000000000.log");

    // 2. Start TCP server on port 9092
    auto server = Server::create(9092);

    // 3. Accept loop: accept → hand off to thread
    while (true) {
        auto client = server->accept();
        int client_fd = *client;

        std::thread([client_fd, &metadata] {
            Broker broker(metadata, "/tmp/kraft-combined-logs");

            while (true) {
                // Receive: first read 4-byte length prefix
                std::array<uint8_t, 4> len_buf{};
                recv_all(client_fd, len_buf);
                auto message_length = decode_int32_be(len_buf);

                // Read the message body
                std::vector<uint8_t> buf(message_length);
                recv_all(client_fd, buf);

                // Binary → typed request → handle → serialize → send back
                auto req  = parse_request(buf);
                auto resp = broker.handle(*req);
                auto bytes = serialize(resp);
                send_all(client_fd, bytes);
            }
        }).detach();
    }
}

Notice the detach(). Each client gets its own thread running its own receive-parse-handle-send loop. Multiple producers and consumers can connect simultaneously without stepping on each other. Simple, blunt, and effective.

Real Kafka is far less cowboy about this. It uses thread pools and a Reactor pattern to avoid the cost of constantly spawning and tearing down threads, and Java NIO for non-blocking I/O. TinyKafka's one-thread-per-connection model is more of a proof of concept: it lets you see the concurrency model at a glance. The real thing adds enormous engineering (zero-copy sendfile, memory-mapped index files, sparse offset indices that turn an offset lookup into a binary search, and a dozen other things), but the skeleton loop (receive request, process, send response) is identical.

Speaking Kafka's Language: The Binary Protocol

So what's parse_request() actually doing? Turning raw network bytes into C++ structs, and that's where TinyKafka's grittiest code lives: a hand-built implementation of the Kafka binary wire protocol.

Every Kafka message is structured as three parts:

+------------------+------------------+------------------+
|  message_size    |     Header       |      Body        |
|  (4 bytes, BE)   |  (variable)      |  (API-dependent)  |
+------------------+------------------+------------------+

message_size: a 4-byte big-endian integer that tells the other side "here's how many more bytes to read." TCP is a stream protocol with no built-in message boundaries. This length prefix is a simple framing layer; without it you'd never know when one message ends and the next begins.

Header carries three critical fields:

api_key (2 bytes): what kind of message this is. 0 = Produce, 1 = Fetch, 18 = ApiVersions, 75 = DescribeTopicPartitions. Real Kafka defines dozens of API keys; we implemented four.
api_version (2 bytes): the version of this API. Kafka keeps multiple versions of the same API alive simultaneously. An old client speaks v0, a newer one speaks v16, and each client picks the highest version that both it and the broker support.
correlation_id (4 bytes): a sequence number for matching responses to requests. The client stamps it on the request, the broker echoes it back, and the client uses it to figure out "this response goes with that request I sent earlier."

Body: varies by api_key. A Produce body carries a topic name and a blob of record batch bytes. A Fetch body carries a topic UUID and a list of partitions. The structure is strictly defined by the Kafka protocol spec.
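
The fixed-size start of a request, as described above, can be decoded in a few lines. This is a minimal sketch, assuming the bytes begin at the length prefix; RequestPrefix and parse_request_prefix are hypothetical names rather than TinyKafka's own, and the variable-length remainder of the header is deliberately left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// The fixed-size fields at the start of every request (hypothetical struct).
struct RequestPrefix {
    int32_t message_size;    // bytes 0-3: how many more bytes follow
    int16_t api_key;         // bytes 4-5
    int16_t api_version;     // bytes 6-7
    int32_t correlation_id;  // bytes 8-11
};

inline uint16_t be16(const uint8_t* p) {
    return static_cast<uint16_t>((p[0] << 8) | p[1]);
}

inline uint32_t be32(const uint8_t* p) {
    return (static_cast<uint32_t>(p[0]) << 24) | (static_cast<uint32_t>(p[1]) << 16) |
           (static_cast<uint32_t>(p[2]) << 8) | static_cast<uint32_t>(p[3]);
}

// Returns std::nullopt when fewer than 12 bytes are available.
inline std::optional<RequestPrefix> parse_request_prefix(const std::vector<uint8_t>& bytes) {
    if (bytes.size() < 12) {
        return std::nullopt;
    }
    RequestPrefix prefix{};
    prefix.message_size = static_cast<int32_t>(be32(&bytes[0]));
    prefix.api_key = static_cast<int16_t>(be16(&bytes[4]));
    prefix.api_version = static_cast<int16_t>(be16(&bytes[6]));
    prefix.correlation_id = static_cast<int32_t>(be32(&bytes[8]));
    return prefix;
}
```

Feeding it the bytes 00 00 00 23 00 12 00 04 6f 7f c6 61 yields message_size 35, api_key 18 (ApiVersions), api_version 4, and correlation_id 0x6f7fc661.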

Why binary instead of a text format like HTTP headers or JSON? Because it's compact. The int32 value 2147483647 written as text is the string "2147483647", ten bytes. In binary it's always exactly four bytes. Kafka moves trillions of messages a day; that difference is not academic. And fixed-position binary fields mean no per-character scanning like JSON parsing: bytes 4-5 are always the api_key, bytes 6-7 are always the api_version, a couple of shifts and you're done.

Since network byte order is big-endian everywhere, TinyKafka has a small arsenal of hand-rolled encode/decode primitives. Reading an int32, for instance:

auto decode_int32_be(std::span<const uint8_t, 4> data) -> int32_t {
    return (static_cast<int32_t>(data[0]) << 24) |
           (static_cast<int32_t>(data[1]) << 16) |
           (static_cast<int32_t>(data[2]) << 8)  |
           static_cast<int32_t>(data[3]);
}

Four bytes. Most significant in data[0], least significant in data[3]. This function looks trivial, but the entire Kafka protocol stack is built out of hundreds of calls just like it.

On top of these primitives sit ByteReader and ByteWriter, two utility classes that read and write int16/int32/varints/compact strings sequentially over std::span. parser.cpp runs 290 lines, serializer.cpp 225, both standing on the shoulders of these two helpers.
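
To give a feel for what such a reader looks like, here is a minimal sketch of a sequential byte reader. It is not TinyKafka's ByteReader: the class name, method names, and error handling are assumptions, and it takes a std::vector instead of a std::span to keep the example short. The unsigned varint and compact string encodings follow the Kafka protocol guide (7 bits per byte, low bits first, and a string length stored as N + 1).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sequential reader; the caller must keep `data` alive.
class ByteReaderSketch {
public:
    explicit ByteReaderSketch(const std::vector<uint8_t>& data) : data_(data) {}

    int16_t read_int16() {
        require(2);
        auto v = static_cast<uint16_t>((data_[pos_] << 8) | data_[pos_ + 1]);
        pos_ += 2;
        return static_cast<int16_t>(v);
    }

    int32_t read_int32() {
        require(4);
        uint32_t v = 0;
        for (int i = 0; i < 4; ++i) {
            v = (v << 8) | data_[pos_ + i];  // big-endian: most significant first
        }
        pos_ += 4;
        return static_cast<int32_t>(v);
    }

    // Unsigned varint: 7 payload bits per byte, high bit set means "more follows".
    uint32_t read_unsigned_varint() {
        uint32_t value = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            require(1);
            const uint8_t b = data_[pos_++];
            value |= static_cast<uint32_t>(b & 0x7f) << shift;
            if ((b & 0x80) == 0) return value;
        }
        throw std::runtime_error("varint too long");
    }

    // Compact string: unsigned varint holding length + 1, then the bytes.
    std::string read_compact_string() {
        const uint32_t n = read_unsigned_varint();
        if (n == 0) throw std::runtime_error("null compact string");
        const std::size_t len = n - 1;
        require(len);
        std::string s(reinterpret_cast<const char*>(data_.data() + pos_), len);
        pos_ += len;
        return s;
    }

private:
    void require(std::size_t n) const {
        if (data_.size() - pos_ < n) throw std::runtime_error("truncated input");
    }

    const std::vector<uint8_t>& data_;
    std::size_t pos_ = 0;
};
```

Every read checks the remaining length first, so a truncated message surfaces as an exception rather than an out-of-bounds access.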

The api_key in the header determines how the body gets parsed and how the request gets handled. We implemented four of them, 0, 1, 18, and 75. Let's take them one at a time.

Four APIs, Four Kinds of Work

Open broker.cpp and you'll find a single method, Broker::handle(), that does all the heavy lifting. But before we look at the dispatch mechanism, let's understand what each of the four APIs actually does.

ApiVersions (api_key = 18): The Handshake

The first thing a Kafka client typically does after connecting is ask the broker: "What APIs do you support, and which version ranges?"

The response is a table:

struct ApiVersionEntry {
    int16_t api_key;       // API number
    int16_t min_version;   // lowest supported version
    int16_t max_version;   // highest supported version
};

struct ApiVersionsResponse {
    int32_t correlation_id;
    int16_t error_code;
    std::vector<ApiVersionEntry> api_keys;  // our four entries
    int32_t throttle_time_ms;
};

TinyKafka's answer is this compile-time table:

inline constexpr std::array<ApiVersionEntry, 4> kSupportedApis{{
    {.api_key = 0,  .min_version = 0, .max_version = 11},  // Produce
    {.api_key = 1,  .min_version = 0, .max_version = 16},  // Fetch
    {.api_key = 18, .min_version = 0, .max_version = 4},   // ApiVersions
    {.api_key = 75, .min_version = 0, .max_version = 0},   // DescribeTopicPartitions
}};

If the client sends a version outside [0, 4], the broker fires back error_code = 35 (UNSUPPORTED_VERSION) and that's the end of the conversation. That's Kafka version negotiation in its entirety, simpler than HTTP Content-Negotiation by a mile.
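
The range check itself is tiny. Below is a minimal sketch, assuming the same four-entry table; kSupportedApisSketch and check_version are hypothetical names, and returning 35 for an unknown api_key is an assumption made for this example, not something the article states about TinyKafka.

```cpp
#include <array>
#include <cstdint>

struct ApiVersionEntry {
    int16_t api_key;
    int16_t min_version;
    int16_t max_version;
};

// Same values as the table above, written without designated initializers.
inline constexpr std::array<ApiVersionEntry, 4> kSupportedApisSketch{{
    {0, 0, 11},  // Produce
    {1, 0, 16},  // Fetch
    {18, 0, 4},  // ApiVersions
    {75, 0, 0},  // DescribeTopicPartitions
}};

inline constexpr int16_t kNoError = 0;
inline constexpr int16_t kUnsupportedVersion = 35;

// Returns 0 when (api_key, api_version) is inside a supported range, else 35.
// Treating an unknown api_key as UNSUPPORTED_VERSION is an assumption here.
inline int16_t check_version(int16_t api_key, int16_t api_version) {
    for (const auto& entry : kSupportedApisSketch) {
        if (entry.api_key == api_key) {
            const bool in_range = api_version >= entry.min_version &&
                                  api_version <= entry.max_version;
            return in_range ? kNoError : kUnsupportedVersion;
        }
    }
    return kUnsupportedVersion;
}
```

With this table, check_version(18, 4) returns 0 and check_version(18, 5) returns 35.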

DescribeTopicPartitions (api_key = 75): "What partitions does this topic have?"

A client wants to know about a topic's metadata. Does it exist? What partitions does it have? Who's the leader of each partition?

TinyKafka handles this by looking up the topic name in an in-memory ClusterMetadata structure. That structure gets built at startup by parsing a KRaft metadata log file, __cluster_metadata-0/00000000000000000000.log, which contains the canonical record of every topic and partition.

Found it? Here's the partition list. Didn't find it? error_code = 3 (UNKNOWN_TOPIC_OR_PARTITION). Results come back sorted alphabetically by topic name, because the Kafka spec demands it. You can see this sorting behavior verified byte-for-byte in the integration tests: send ["zebra", "apple"], get back ["apple", "zebra"].

Fetch (api_key = 1): Consumer Pulling Messages

A consumer says: "Give me the messages for partition 0 of the topic with UUID a1b2c3d4...."

A Fetch request comes with a pile of fields (max_wait_ms, min_bytes, max_bytes, isolation_level, session_id, session_epoch), all controlling fetch behavior. TinyKafka keeps only the two that matter for the minimal path: topic UUID and partition_index. Everything else gets skipped. Real Kafka uses those extra fields for long-polling, transactional isolation, and other advanced features, but our goal is just to get the bytes flowing.

struct FetchTopicRequest {
    std::array<uint8_t, 16> topic_id;           // 16-byte UUID
    std::vector<FetchPartitionRequest> partitions;
};

struct FetchPartitionResponse {
    int32_t partition_index;
    int16_t error_code;
    std::vector<uint8_t> records;  // the payload: raw record batch bytes
};

Once the broker finds the topic, it reads the entire partition log file from disk and stuffs it directly into the records field. The consumer gets raw record batch bytes and does its own decoding. This is Kafka's philosophy: the broker should touch message content as little as possible. It stores, it forwards. Decoding is the client's problem.

Produce (api_key = 0): Producer Sending Messages

This is where the scraper's data from the opening story finally enters Kafka:

struct ProduceTopicRequest {
    std::string topic_name;
    std::vector<ProducePartitionRequest> partitions;
    // each partition carries a blob of records (record batch bytes)
};

TinyKafka looks up the topic by name, verifies the partition exists, and appends the record batch bytes to a disk file. Kafka doesn't store messages one at a time. They're packed into record batches, each batch containing multiple records, with a magic byte tagging the format version. This batching is a big part of what gives Kafka its legendary throughput: it dramatically reduces the number of disk I/O operations.

Four APIs down. Now let's see how the broker routes a request to the right handler.

variant + visit + overloaded: The Compiler Won't Let You Forget

TinyKafka models the four request types as a std::variant:

using Request = std::variant<
    ApiVersionsRequest,
    DescribeTopicPartitionsRequest,
    FetchRequest,
    ProduceRequest
>;

Response works the same way. Then Broker::handle() dispatches every case in one shot using std::visit with the overloaded pattern:

auto Broker::handle(const Request& req) -> Response {
    return std::visit(overloaded{
        [](const ApiVersionsRequest& r) -> Response { /* version negotiation */ },
        [this](const DescribeTopicPartitionsRequest& r) -> Response { /* lookup */ },
        [this](const FetchRequest& r) -> Response { /* read from disk */ },
        [this](const ProduceRequest& r) -> Response { /* write to disk */ },
    }, req);
}

In the MoonieCode post we called std::variant a "paranoid envelope": it holds exactly one of the declared types, nothing else, and the compiler forces you to handle every single case. Forget to write the Produce handler? Your build breaks. No virtual function overhead, no if-else chain, no missed cases. It's cleaner and safer than traditional OOP with virtual dispatch.

Where Messages Live: The Disk Layout

DescribeTopicPartitions depends on metadata read from disk. Fetch reads from disk. Produce writes to disk. They all converge on TinyKafka's storage layer. So what exactly is sitting on that filesystem?

Directory Structure: Exactly Like Real Kafka

/tmp/kraft-combined-logs/
├── __cluster_metadata-0/
│   └── 00000000000000000000.log    ← KRaft metadata
├── orders-0/
│   └── 00000000000000000000.log    ← partition 0 of the "orders" topic
├── crawler-results-0/
│   └── 00000000000000000000.log    ← partition 0 of "crawler-results"
└── ...

The naming convention is {topic}-{partition}/00000000000000000000.log, identical to real Kafka. Offset starts at zero. Single segment file per partition. No rolling segments (real Kafka would rotate to 00000000000000000020.log after hitting a size threshold).

The Starting Point: Metadata Files

That __cluster_metadata-0/00000000000000000000.log is the KRaft metadata log. KRaft mode, introduced as early access in Kafka 2.8 and production-ready since 3.3, lets you run without ZooKeeper: cluster metadata lives as record batches inside this file.

At startup, TinyKafka slurps it into memory and walks the record batch v2 format layer by layer: first identify record batch boundaries (magic byte = 2), then parse each record's value. The value itself is a compact varint-encoded frame where the critical field is type: type=2 means it's a topic record (name + UUID), type=3 means it's a partition record (partition ID + parent topic UUID). Topics and partitions get linked by UUID.

The parsed result becomes a ClusterMetadata struct:

struct ClusterMetadata {
    std::vector<TopicInfo> topics;                              // all topics
    std::unordered_map<std::string, size_t> name_to_topic;     // lookup by name
    std::unordered_map<std::array<uint8_t, 16>, size_t, UuidHash> uuid_to_topic;  // lookup by UUID
};

Three tables. Two lookup paths. O(1) on average to find any topic. With this map built, all routing falls into place:

Produce routing: topic name → name_to_topic → find partition list → verify partition exists → write to disk
Fetch routing: topic UUID → uuid_to_topic → find topic name → construct file path → read from disk

Notice that Produce uses the name and Fetch uses the UUID. This isn't arbitrary: it follows the Kafka protocol. Producers send by topic name, while recent Fetch versions (v13 and later) identify topics by UUID, which the consumer already learned from the DescribeTopicPartitions step.
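
The Fetch routing path above can be sketched end to end. This is a minimal illustration under stated assumptions: the article does not show TopicInfo or UuidHash, so TopicInfoSketch, UuidHashSketch (an FNV-1a hash over the 16 bytes), ClusterMetadataSketch::add, and log_path_for are all hypothetical stand-ins.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Uuid = std::array<uint8_t, 16>;

// Hypothetical hash: FNV-1a over the UUID bytes.
struct UuidHashSketch {
    std::size_t operator()(const Uuid& id) const noexcept {
        uint64_t h = 0xcbf29ce484222325ULL;
        for (uint8_t b : id) {
            h ^= b;
            h *= 0x100000001b3ULL;
        }
        return static_cast<std::size_t>(h);
    }
};

// Hypothetical layout; the real TopicInfo may carry more fields.
struct TopicInfoSketch {
    std::string name;
    Uuid id;
    std::vector<int32_t> partitions;
};

struct ClusterMetadataSketch {
    std::vector<TopicInfoSketch> topics;
    std::unordered_map<std::string, std::size_t> name_to_topic;
    std::unordered_map<Uuid, std::size_t, UuidHashSketch> uuid_to_topic;

    // Registers a topic in the vector and in both lookup maps.
    void add(TopicInfoSketch topic) {
        const std::size_t index = topics.size();
        name_to_topic[topic.name] = index;
        uuid_to_topic[topic.id] = index;
        topics.push_back(std::move(topic));
    }
};

// Fetch-style routing: UUID -> topic -> partition check -> log file path.
inline std::optional<std::string> log_path_for(const ClusterMetadataSketch& md,
                                               const std::string& root,
                                               const Uuid& id, int32_t partition) {
    const auto it = md.uuid_to_topic.find(id);
    if (it == md.uuid_to_topic.end()) return std::nullopt;  // unknown topic
    const TopicInfoSketch& topic = md.topics[it->second];
    for (int32_t p : topic.partitions) {
        if (p == partition) {
            return root + "/" + topic.name + "-" + std::to_string(partition) +
                   "/00000000000000000000.log";
        }
    }
    return std::nullopt;  // topic exists, partition does not
}
```

Storing indices into the topics vector, rather than copies of the topic, lets both maps point at the same record without duplicating it.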

Writing and Reading

Writing (the Produce path) takes only a few lines:

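std::error_code ec;                              // the non-throwing overload reports failure here
auto dir = std::format("{}/{}-{}", root_path, topic_name, partition);
std::filesystem::create_directories(dir, ec);   // auto-create directory tree
auto path = std::format("{}/00000000000000000000.log", dir);
std::ofstream file(path, std::ios::binary | std::ios::app);  // append mode
// records holds raw bytes (uint8_t), so cast for the char-based stream API
file.write(reinterpret_cast<const char*>(records.data()),
           static_cast<std::streamsize>(records.size()));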

create_directories handles the first write to a new partition by building the directory tree automatically. ios::app means every write appends to the end of the file, never overwriting existing data.

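Reading (for Fetch and metadata) is just as short: open the ifstream positioned at the end, tellg for the size, seek back to the start, then one shot into a vector<uint8_t>:

std::ifstream file(path, std::ios::binary | std::ios::ate);  // open positioned at the end
auto sz = file.tellg();                                       // so tellg() is the file size
file.seekg(0, std::ios::beg);                                 // rewind, or read() gets nothing
std::vector<uint8_t> data(static_cast<size_t>(sz));
file.read(reinterpret_cast<char*>(data.data()), sz);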

Real Kafka would never read entire files into memory like this. It keeps memory-mapped offset indexes for each segment and uses them to find the exact byte range it needs for a given offset. But for an 1,800-line prototype, simple and direct is exactly the right call.

The Best Way to Understand Something Is to Build It

That's TinyKafka, every layer from network protocol to disk storage, peeled open. In 1,800 lines it packs a full binary protocol stack, handlers for four APIs, KRaft metadata parsing, and disk-backed persistence. Think of it as a miniature Kafka anatomy model.

I didn't build TinyKafka to create a production-grade broker. I built it to understand. Kafka's documentation and source code are intimidatingly large, but once you've built a minimal version yourself, you realize the core skeleton isn't that complicated: accept binary requests over TCP, dispatch by api_key, route through metadata to the right disk file, read or write. Everything else, zero-copy, segmented indices, replica synchronization, ISR management, transactional support, is engineering built on top of that skeleton.

There's a learning philosophy here: instead of letting half a million lines of source code intimidate you, spend a day building a minimal prototype. Afterwards, those massive codebases stop looking like alien artifacts. You recognize the bones.

Code is on GitHub. Stars, issues, and ruthless code review are all welcome.

Building Claude Code from Scratch: A Minimal Agent in 393 Lines of C++

Tyler Tan — Wed, 20 May 2026 04:38:16 +0000

An AI coding assistant that reads your files, writes code, and runs shell commands. The core logic? A single while loop. I thought it was bullshit too, until I built one myself.

The project is called MoonieCode, and the code lives here: https://github.com/Tenaryo/MoonieCode. Written in C++23, clocking in at 393 lines of source (637 if you count tests). Here's what it looks like in action:

$ ./moonie-code -p "list all .cpp files in the project"

A few seconds later Claude spits back your file list. What just happened? You gave it a sentence, it threw that sentence into an HTTP request, shipped it off to a Claude Haiku model somewhere in the cloud, Claude decided it needed to run find, MoonieCode ran it for Claude, fed the output back, and Claude formatted it into something human-readable.

That first step wasn't running bash. First it had to talk to the LLM. So let's start there: how do you get C++ and Claude to shake hands?

Shaking Hands with Claude

Talking to an LLM boils down to two moves: you HTTP POST a blob of JSON at it, and it sends a blob of JSON back. MoonieCode's HttpClient is a 25-line class whose guts are basically this:

cpr::Response response = cpr::Post(
    cpr::Url{base_url_ + "/chat/completions"},
    cpr::Header{{"Authorization", "Bearer " + api_key_},
                {"Content-Type", "application/json"}},
    cpr::Body{request_body.dump()}
);

cpr is a C++ wrapper around libcurl that handles the HTTP plumbing so you don't have to. You stuff your API key into the Authorization header, pack your JSON into the body, and POST to OpenRouter, an LLM API gateway that forwards the request to Claude for you.

So what's in that JSON? Two things: messages and tools.

messages is an array holding the conversation history between you and Claude. At the start it's just one entry:

{"role": "user", "content": "list all .cpp files in the project"}

tools is another array that tells Claude "here's what you have at your disposal." Each tool is a JSON object with a name, a description, and a parameter schema. Claude scans the list and goes, alright, I can ask this program to read files, write files, and run commands for me.

After you fire off the request, Claude sends back a JSON response. And here's where it gets fun: Claude's response comes in exactly two flavors.

Flavor one, straight text. You ask "what's 1+1" and it just answers:

{"choices": [{"message": {"content": "1+1 equals 2"}}]}

Flavor two, tool call. You ask it to "list all cpp files" and it can't answer directly, so it asks for help:

{"choices": [{"message": {"tool_calls": [{
  "id": "call_abc123",
  "function": {
    "name": "Bash",
    "arguments": "{\"command\": \"find . -name '*.cpp'\"}"
  }
}]}}]}

It's saying "I can't do this myself, but run this command for me and I'll take it from there." Notice that arguments is a string containing more JSON: Claude packed a shell command inside it.

Now the hard part: how does your code tell these two cases apart? If Claude gives you text, print it. If it wants a tool run, execute the tool. You need those two paths separated cleanly.

MoonieCode solves this with a very C++ move:

using ParsedResponse = std::variant<ContentResult, std::vector<ToolCall>>;

std::variant works like a paranoid envelope: it contains either a letter (ContentResult) or a toolbox (a list of ToolCall objects), never both, never neither. And the compiler makes sure you handle both cases. Omit one, and your build fails.

Handling the variant means pairing it with std::visit and a classic C++ pattern called overloaded:

template <class... Ts>
struct overloaded : Ts... { using Ts::operator()...; };

Two lines of template code that let you dispatch elegantly with lambdas:

std::visit(overloaded{
    [&](const ContentResult& r) { /* Claude answered, print it */ },
    [&](const std::vector<ToolCall>& tcs) { /* Claude wants tools, run them */ },
}, parsed);

The beauty of this pattern is type safety. You physically cannot write code that forgets to handle one of the two possibilities. The compiler will chase you down until every branch exists. People love to complain that C++ is verbose, but this flavor of compile-time guardrail is genuinely satisfying when you're building something that has to not crash.

Alright, your program now knows what Claude wants. Next question: if Claude asked for a tool, what happens?

The While Loop Is the Soul of the Agent

Here's the entire agent loop in pseudocode:

push the user's prompt into messages
while (not done) {
    pack messages + tools into JSON
    POST to Claude
    parse Claude's response
    if (response is text) {
        print it, we're done
    } else if (response is tool calls) {
        append Claude's tool call records to messages
        for (each tool call) {
            execute it locally
            append the result to messages
        }
    }
}

That's it. No black magic, no secret sauce. Peel back the marketing and you find a while loop wrapping a four-step cycle: ask the LLM, see what it wants, if it answered you're done, if it asked for a tool you run it and ask again.

One detail that's easy to overlook: that messages array keeps growing. The "conversation history" with Claude isn't wiped between rounds; it just piles up layer by layer:

Starts with one role: "user" message
Claude says "run this command," so you append a role: "assistant" message with tool_calls
Command finishes, you append a role: "tool" message with the output
Next request carries the entire history, so Claude sees "last time I told you to run this, the result was this, now I will..."

That's the agent's "memory." No vector database, no fancy RAG pipeline, just push_back on a JSON array. Claude reads the full history and naturally chains multi-step reasoning.

What about stopping? MoonieCode has maxIterations = 30. If Claude chains 30 tool calls without giving a final answer, the program pulls the plug. It's a safety fuse that keeps the agent from spinning its wheels forever.
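
The whole loop, including the growing history and the iteration cap, fits in a short self-contained sketch. This is not MoonieCode's code: the HTTP call and the tool executor are replaced by std::function callbacks so the example runs without a network, messages are plain structs instead of JSON, and names such as run_agent, AskLlm, and RunTool are hypothetical.

```cpp
#include <functional>
#include <string>
#include <variant>
#include <vector>

struct Message {
    std::string role;     // "user", "assistant", or "tool"
    std::string content;
};
struct ToolCall {
    std::string name;
    std::string arguments;  // kept as a plain string in this sketch
};
struct ContentResult {
    std::string text;
};
using ParsedResponse = std::variant<ContentResult, std::vector<ToolCall>>;

// Stand-ins for the HTTP round trip and the tool executor (hypothetical).
using AskLlm = std::function<ParsedResponse(const std::vector<Message>&)>;
using RunTool = std::function<std::string(const ToolCall&)>;

// Runs the loop until the model answers in text or the cap is hit.
// Returns the final answer, or an empty string if the cap was reached.
inline std::string run_agent(std::vector<Message>& messages, const AskLlm& ask_llm,
                             const RunTool& run_tool, int max_iterations = 30) {
    for (int i = 0; i < max_iterations; ++i) {
        const ParsedResponse parsed = ask_llm(messages);  // sends the full history
        if (const auto* content = std::get_if<ContentResult>(&parsed)) {
            return content->text;  // plain text: we're done
        }
        const auto& calls = std::get<std::vector<ToolCall>>(parsed);
        for (const auto& call : calls) {
            // Record what the model asked for...
            messages.push_back({"assistant", "tool_call: " + call.name + " " + call.arguments});
        }
        for (const auto& call : calls) {
            // ...then run it locally and record the result.
            messages.push_back({"tool", run_tool(call)});
        }
    }
    return {};  // safety fuse: give up after max_iterations rounds
}
```

Passing the model and the tool runner in as callbacks is a choice made for this sketch: it lets a test script a fake model that first asks for a Bash call and then answers in text, and check that the history grew from one message to three.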

Of course, the real Claude Code is a different beast. Public information suggests its repo weighs in at over half a million lines of TypeScript. It doesn't use a crude 30-iteration cap; it runs a dynamic token budget system. It dispatches sub-agents to handle different tasks in parallel. It asks for confirmation before doing anything dangerous. It supports checkpointing so you can roll back when things explode. It speaks MCP to plug into external data sources. MoonieCode is roughly three orders of magnitude away from the real thing.

And yet. No matter how many layers of engineering get piled on top, the skeleton underneath is the same loop: ask the LLM, check what it wants, execute on its behalf, feed the result back in. That's what MoonieCode strips bare and shows you.

Doing Claude's Dirty Work

Claude says "I want to run find." That intent arrives as a JSON blob. Who turns it into an actual system call? ToolExecutor.

MoonieCode gives Claude three weapons: Read, Write, and Bash. When a tool call comes in, ToolExecutor::execute checks the name field and routes it:

auto ToolExecutor::execute(const ToolCall& tool_call) -> std::string {
    if (tool_call.name == "Read")  return handle_read(tool_call.arguments);
    if (tool_call.name == "Write") return handle_write(tool_call.arguments);
    if (tool_call.name == "Bash")  return handle_bash(tool_call.arguments);
    throw std::runtime_error("Unknown tool: " + tool_call.name);
}

That's it. A plain if-else chain mapping an LLM's "intent" to local C++ functions. No reflection. No plugin registry. No factory pattern. A 393-line project doesn't need design patterns.

Of the three tools, Bash is the star because it hands Claude the nuclear launch codes: it can run literally any command. Read and Write could technically be emulated with Bash (read with cat, write with tee), but they got their own tools because file I/O is so frequent it'd be wasteful, and error-prone, to channel it all through a shell.

Here's what's inside Bash:

auto ToolExecutor::handle_bash(const nlohmann::json& arguments) -> std::string {
    const auto command = arguments["command"].get<std::string>();
    const auto full_cmd = command + " 2>&1";  // capture stderr too

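    FILE* pipe = popen(full_cmd.c_str(), "r");
    if (pipe == nullptr) {  // popen can fail; reading from a null FILE* is undefined
        throw std::runtime_error("popen failed: " + command);
    }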
    std::string output;
    std::array<char, 4096> buffer{};
    std::size_t bytes_read = 0;
    while ((bytes_read = fread(buffer.data(), 1, buffer.size(), pipe)) > 0) {
        output.append(buffer.data(), bytes_read);
    }

    int status = pclose(pipe);
    int exit_code = WIFEXITED(status) ? WEXITSTATUS(status) : status;
    output += "\n[exit code: " + std::to_string(exit_code) + "]";
    return output;
}

Pull the command field out of the JSON, tack on 2>&1 to swallow stderr too, popen it, loop fread until the pipe runs dry, pclose to clean up and grab the exit code, then mash stdout, stderr, and exit code into one string and toss it back.

Where does that string go? Right back into the messages array, wearing the role: "tool" badge. Next time Claude gets a request, it reads that message and knows exactly what happened when the command ran. Loop this, and Claude starts to feel like a pilot in a cockpit: the dashboard (messages) shows current state, the joystick (tools) lets it take action.

Read and Write follow the exact same formula: yank parameters from JSON, do local I/O, return a result string. Read uses ifstream to slurp files whole. Write uses ofstream and auto-creates parent directories with create_directories. So clean there's not much else to say.
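
For completeness, here's roughly what that formula looks like for the two file tools. This is a minimal sketch, not MoonieCode's actual code: read_file and write_file are hypothetical names, and they take a path and a string directly instead of pulling them out of the JSON arguments.

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: slurp a whole file into a string.
auto read_file(const fs::path& path) -> std::string {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open for reading: " + path.string());
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: write a string, creating parent directories first.
auto write_file(const fs::path& path, const std::string& content) -> void {
    if (path.has_parent_path()) fs::create_directories(path.parent_path());
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) throw std::runtime_error("cannot open for writing: " + path.string());
    out << content;
}
```

In a tool handler, the thrown message (or the file content) would simply become the result string that goes back to the model.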

What 393 Lines Actually Mean

The real Claude Code is reportedly over half a million lines of TypeScript. It has sub-agent dispatching, permission gatekeeping, checkpoint rollback, MCP multi-protocol adaptation, multi-model routing, context window compression, and a long list of features you won't find anywhere in MoonieCode. In terms of capabilities, MoonieCode isn't even a rounding error.

But here's the counterintuitive part: no matter how much engineering gets layered on, the agent loop at the center is the same one. Ask the LLM, receive tool calls, execute locally, feed results back. Those four steps are the Newton's laws of this space. Everything else is engineering.

MoonieCode's 393 lines don't have the right to be compared to Claude Code on features. But they do one thing well: they strip the agent skeleton down to the bone, rip off every layer of engineering skin, and let you stare directly at the heartbeat of an AI coding assistant. Once you've internalized those 393 lines, every AI coding tool you encounter will auto-decompile in your head into "okay, the permissions system is on top, sub-agent scheduling underneath, and at the very bottom... still a while loop."

Building BitTorrent from Scratch: What 2500 Lines of Modern C++ Can Do

Tyler Tan — Mon, 18 May 2026 15:51:50 +0000

A working BitTorrent downloader — from raw TCP sockets to SHA-1 hashing, all written by hand.

This project starts at the socket level: I wrote my own SHA-1, hand-rolled HTTP requests, implemented bencoding from scratch, defined all seven peer wire protocol message types one by one, and finally spawned multiple peer connections with std::jthread for parallel downloading. It supports both .torrent files and magnet links, and comes with 83 unit tests. Apart from a JSON formatting library and the test framework, it has zero external dependencies.

GitHub: TinyBitTorrent, built with C++23.

What BitTorrent Is

Before diving into the implementation, let's take a minute to understand what BitTorrent actually does.

The traditional file download model is straightforward: you click a link, your browser sends an HTTP request, and the server pushes the file to you. The bottleneck is equally straightforward — all the bandwidth pressure sits on a single server. More users means slower speeds, and if the server goes down, the file is gone.

BitTorrent turns this model on its head by making every downloader an uploader at the same time. A file is split into many small chunks called pieces, each with its own SHA-1 hash. Instead of downloading all pieces from one central server, you grab a few from each of dozens — or hundreds — of peers who are also downloading, or have already finished. Meanwhile, the pieces you already have can be uploaded to other peers. Paradoxically, the more people participate, the faster the entire distribution network becomes.

To implement this protocol, the first problem to solve is: how do you encode and transmit data and metadata? BitTorrent uses a format called bencoding — simple, compact, and unambiguous. Let's start there.

Bencoding: BitTorrent's JSON

Bencoding is BitTorrent's native serialization format. You can think of it as JSON's binary cousin. Where JSON uses curly braces and square brackets to mark structure, bencoding uses type prefixes and length prefixes. There are only four types.

The first is the string, formatted as length:content. For example, 4:spam means the string "spam", and 11:hello world means "hello world". The number before the colon must be a decimal integer with no leading zeros.

The second type is the integer, wrapped in i and e. So i42e is 42, and i-3e is -3. Leading zeros are forbidden, and i-0e is not allowed either.

The third type is the list, wrapped in l and e, containing any number of bencoded values. For instance, l4:spami42ee is a list with the string "spam" and the integer 42. Lists can nest other lists and dictionaries.

The fourth type is the dictionary, wrapped in d and e, with keys and values alternating. Keys must be strings; values can be any type. Something like d3:foo3:bar4:infod6:lengthi1024eee represents {"foo": "bar", "info": {"length": 1024}}. Dictionary keys must be sorted in lexicographic order when encoding — the protocol explicitly requires this.

At this point you might wonder — why not just use JSON? Two reasons. First, JSON can't directly represent binary data like SHA-1 hashes without Base64 encoding, which is costly. Second, bencoding is extremely simple to parse — no quote escaping, no Unicode handling, none of the complexity a JSON parser has to deal with. For BitTorrent in 2001, a format with zero library dependencies was the right call.

My implementation uses std::variant as the data model. Each of the four types is a struct, all wrapped together in a variant:

using String = std::string;
using Integer = int64_t;

struct List;  // forward declarations so that Value can name both types
struct Dict;

using Value = std::variant<String, Integer, List, Dict>;

struct List { std::vector<Value> items_; };
struct Dict { std::vector<std::pair<String, Value>> items_; };

There's an interesting circular dependency here: the definition of Value uses List and Dict, and both List and Dict contain Value. Strictly speaking, this is an incomplete type issue in C++, but std::variant and std::vector implementations since C++17 actually support this recursive pattern in practice, so the compiler lets it through. It's the cleanest way to write it, so that's what I went with.
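
To show the model in use, here's a minimal encoder sketch over it. It's an illustration under assumptions, not the project's encoder: it builds strings with std::to_string and std::sort (so it compiles as plain C++17), and encode_string is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>
#include <vector>

struct List;
struct Dict;
using String = std::string;
using Integer = std::int64_t;
using Value = std::variant<String, Integer, List, Dict>;
struct List { std::vector<Value> items_; };
struct Dict { std::vector<std::pair<String, Value>> items_; };

// Hypothetical helper: "<length>:<bytes>"
auto encode_string(const String& str) -> std::string {
    return std::to_string(str.size()) + ":" + str;
}

auto encode(const Value& value) -> std::string {
    return std::visit([](const auto& v) -> std::string {
        using T = std::decay_t<decltype(v)>;
        if constexpr (std::is_same_v<T, String>) {
            return encode_string(v);
        } else if constexpr (std::is_same_v<T, Integer>) {
            return "i" + std::to_string(v) + "e";
        } else if constexpr (std::is_same_v<T, List>) {
            std::string out = "l";
            for (const auto& item : v.items_) out += encode(item);
            return out + "e";
        } else {
            auto entries = v.items_;  // copy, so the caller's Dict keeps its order
            std::sort(entries.begin(), entries.end(),
                      [](const auto& a, const auto& b) { return a.first < b.first; });
            std::string out = "d";
            for (const auto& [key, val] : entries) out += encode_string(key) + encode(val);
            return out + "e";
        }
    }, value);
}
```

Sorting a copy of the entries means a Dict can be built in any order and still encode with its keys in sorted order.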

The parser is a recursive descent design that takes a mutable string_view reference and dispatches on the first character:

auto decode(std::string_view& data) -> Value {
    if (data[0] >= '0' && data[0] <= '9') [[likely]] {
        auto colon = data.find(':');
        auto len = /* parse int from data[0..colon) */;
        data.remove_prefix(colon + 1);
        auto str = std::string{data.substr(0, len)};
        data.remove_prefix(len);
        return str;
    }
    if (data[0] == 'i') [[unlikely]] { /* parse integer... */ }
    if (data[0] == 'l') [[unlikely]] { /* parse list... */ }
    if (data[0] == 'd') [[unlikely]] { /* parse dict... */ }
    // ...
}

If the first character is a digit, we enter the string branch (the most common case, marked [[likely]]); i means integer, and so on. The encoder runs in reverse, using std::format_to to assemble the prefix strings, with dict keys sorted via std::ranges::sort before encoding.
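
The elided branches can be filled in along these lines. Treat it as a sketch under assumptions rather than the project's parser: parse_int and decode_string are hypothetical helpers, errors are reported with std::runtime_error, and it does not reject leading zeros or i-0e the way a strict decoder should.

```cpp
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>
#include <system_error>
#include <utility>
#include <variant>
#include <vector>

struct List;
struct Dict;
using String = std::string;
using Integer = std::int64_t;
using Value = std::variant<String, Integer, List, Dict>;
struct List { std::vector<Value> items_; };
struct Dict { std::vector<std::pair<String, Value>> items_; };

// Hypothetical helper: parse a decimal integer that spans all of `text`.
auto parse_int(std::string_view text) -> Integer {
    Integer out{};
    auto [ptr, ec] = std::from_chars(text.data(), text.data() + text.size(), out);
    if (ec != std::errc{} || ptr != text.data() + text.size())
        throw std::runtime_error("bencode: bad integer");
    return out;
}

// Hypothetical helper: decode "<len>:<bytes>" and advance `data` past it.
auto decode_string(std::string_view& data) -> String {
    auto colon = data.find(':');
    if (colon == std::string_view::npos) throw std::runtime_error("bencode: missing ':'");
    auto len = parse_int(data.substr(0, colon));
    data.remove_prefix(colon + 1);
    if (len < 0 || static_cast<std::size_t>(len) > data.size())
        throw std::runtime_error("bencode: truncated string");
    String str{data.substr(0, static_cast<std::size_t>(len))};
    data.remove_prefix(static_cast<std::size_t>(len));
    return str;
}

auto decode(std::string_view& data) -> Value {
    if (data.empty()) throw std::runtime_error("bencode: unexpected end");
    const char tag = data[0];
    if (tag >= '0' && tag <= '9') return decode_string(data);
    data.remove_prefix(1);  // consume 'i', 'l' or 'd'
    if (tag == 'i') {
        auto end = data.find('e');
        if (end == std::string_view::npos) throw std::runtime_error("bencode: unterminated integer");
        auto value = parse_int(data.substr(0, end));
        data.remove_prefix(end + 1);
        return value;
    }
    if (tag != 'l' && tag != 'd') throw std::runtime_error("bencode: unknown type tag");
    List list;
    Dict dict;
    while (!data.empty() && data[0] != 'e') {
        if (tag == 'l') {
            list.items_.push_back(decode(data));
        } else {
            auto key = decode_string(data);  // dictionary keys are always strings
            dict.items_.emplace_back(std::move(key), decode(data));
        }
    }
    if (data.empty()) throw std::runtime_error("bencode: unterminated container");
    data.remove_prefix(1);  // consume the closing 'e'
    if (tag == 'l') return list;
    return dict;
}
```

Because decode advances the string_view it is given, a fully consumed input leaves the view empty, which is a cheap way to detect trailing garbage.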

.torrent Files: the Download Shopping List

With bencoding in place, parsing .torrent files is the natural next step. So why do we even need a torrent file? The answer is simple: to download something, you have to know what it is, how big it is, and where to find people who have it. A .torrent file is exactly that shopping list — it tells you the file size, how many pieces it's split into, the hash of each piece, and the tracker URL for finding peers.

A .torrent file is essentially a single bencoded dictionary. At the top level there are two critical keys: announce, which is the tracker URL, and info, a sub-dictionary containing everything directly related to the download — length (total file size in bytes), piece length (the size of each piece, typically 256 KB to 1 MB), and pieces (a long string of all 20-byte SHA-1 hashes concatenated together).

My Metainfo struct captures exactly these five fields:

struct Metainfo {
    std::string announce_;              // tracker URL
    int64_t length_{};                  // total file size
    std::string info_hash_;             // 20-byte raw SHA1
    int64_t piece_length_{};            // size of each piece
    std::vector<std::string> piece_hashes_;  // hex hashes per piece
};

Parsing is a two-level iteration. First pass over the top-level dict grabs announce and info; second pass over the info sub-dict extracts length, piece length, and pieces. The pieces field needs a bit of special handling — the raw data is every 20-byte SHA-1 hash concatenated end-to-end. I slice it into 20-byte chunks and convert each one into a 40-character hex string for storage.
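
The slicing step is small enough to sketch in full. The names to_hex and split_piece_hashes are hypothetical, and rejecting a pieces string whose length isn't a multiple of 20 is my assumption about sensible error handling, not something taken from the project.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: lowercase hex encoding of raw bytes.
auto to_hex(std::string_view bytes) -> std::string {
    static constexpr char digits[] = "0123456789abcdef";
    std::string hex;
    hex.reserve(bytes.size() * 2);
    for (unsigned char byte : bytes) {
        hex.push_back(digits[byte >> 4]);
        hex.push_back(digits[byte & 0x0f]);
    }
    return hex;
}

// Cut the concatenated `pieces` field into 20-byte SHA-1 digests,
// each stored as a 40-character hex string.
auto split_piece_hashes(std::string_view pieces) -> std::vector<std::string> {
    constexpr std::size_t kHashSize = 20;
    if (pieces.size() % kHashSize != 0)
        throw std::runtime_error("pieces length is not a multiple of 20");
    std::vector<std::string> hashes;
    hashes.reserve(pieces.size() / kHashSize);
    for (std::size_t off = 0; off < pieces.size(); off += kHashSize)
        hashes.push_back(to_hex(pieces.substr(off, kHashSize)));
    return hashes;
}
```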

The most noteworthy step is computing the info_hash. This isn't just any hash — you re-bencode the entire info dictionary, then compute SHA-1 over the encoded result. Think of it as taking a "fingerprint" of the info dict. Everything downstream — tracker requests, peer handshakes — identifies the file by this fingerprint. The info_hash is the file's universal identity card in the BitTorrent world.

util::Sha1 hasher;
hasher.update(bencode::encode(Value{*info}));
info_hash = hasher.finalize();

As a side note, there's also a from_info_dict function that reconstructs a Metainfo from an info dictionary obtained through the ut_metadata extension protocol. This comes into play with magnet link downloads, which I'll cover later.

Finding Peers, Shaking Hands, Downloading

Once you have the metadata from a .torrent file, the first order of business is finding who has your data. That's the tracker's job.

A tracker is essentially an HTTP service. You send it your info_hash and your peer_id (a 20-byte random string that identifies you), and it returns a list of peers currently downloading or seeding that file. I construct an HTTP GET request with the info_hash, peer_id, port number, and download progress as URL parameters, appending compact=1 at the end. compact=1 means "give me the peer list in compact form" — 6 bytes per peer: 4 for the IP address and 2 for the port. This keeps the tracker response tiny; even dozens of peers fit in a few hundred bytes. After parsing the response, I split the peers field into 6-byte chunks, extract the IP and port from each, and the peer list is ready.
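
Decoding the compact list looks roughly like this. It's a sketch: parse_compact_peers is a hypothetical name, and the Peer struct just mirrors the ip_ and port_ fields used later on. The one detail worth noticing is that the port arrives in network byte order, high byte first.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

struct Peer {
    std::string ip_;
    std::uint16_t port_{};
};

// Each compact entry is 6 bytes: 4 for the IPv4 address, 2 for the port.
auto parse_compact_peers(std::string_view peers) -> std::vector<Peer> {
    constexpr std::size_t kEntrySize = 6;
    if (peers.size() % kEntrySize != 0)
        throw std::runtime_error("compact peer list length is not a multiple of 6");
    std::vector<Peer> out;
    out.reserve(peers.size() / kEntrySize);
    for (std::size_t off = 0; off < peers.size(); off += kEntrySize) {
        auto byte = [&](std::size_t i) {
            return static_cast<unsigned>(static_cast<unsigned char>(peers[off + i]));
        };
        Peer peer;
        peer.ip_ = std::to_string(byte(0)) + "." + std::to_string(byte(1)) + "." +
                   std::to_string(byte(2)) + "." + std::to_string(byte(3));
        peer.port_ = static_cast<std::uint16_t>((byte(4) << 8) | byte(5));  // big-endian
        out.push_back(std::move(peer));
    }
    return out;
}
```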

With a peer's IP and port in hand, the next step is to open a TCP connection and perform the BitTorrent handshake. The handshake packet is a neat 68 bytes, each segment with a clear purpose. Byte 1 is the protocol string length (always 19). The next 19 bytes are "BitTorrent protocol". Then 8 reserved bytes (where bit 4 of byte 26, if set, signals extension protocol support). Then 20 bytes of info_hash. Finally, 20 bytes of peer_id. The peer responds with an identically formatted packet; I verify the protocol string and info_hash match, and the handshake is done. The code for this is shorter than the description:

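auto make_handshake(string_view info_hash, string_view peer_id,
                    bool reserve_extensions) -> string {
    // A string_view holds exactly 19 bytes; copying the raw literal would also copy its '\0'
    constexpr string_view protocol = "BitTorrent protocol";
    string msg(68, '\0');
    msg[0] = 19;                                     // protocol string length
    copy(protocol, msg.begin() + 1);                 // offsets 1..19
    msg[25] = reserve_extensions ? '\x10' : '\x00';  // reserved[5], bit 4: extension protocol
    copy(info_hash, msg.begin() + 28);               // offsets 28..47
    copy(peer_id, msg.begin() + 48);                 // offsets 48..67
    return msg;
}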

After the handshake, both sides enter a simple state machine. The peer first sends a bitfield message — a bitmap where each bit indicates whether the peer has the corresponding piece. After inspecting the bitfield, I send an interested message, essentially saying "I'd like to download from you." Then I wait for an unchoke message. Only after receiving it am I officially granted permission to request data. Choke and unchoke form BitTorrent's flow control mechanism; a peer can choke you at any time to deny transfers, though in practice most peers unchoke right after receiving an interested message.

The logic for actually downloading a piece is the most interesting part of the entire project. A piece can be several megabytes; you can't just request it all at once — that would be slow, and the retransmission cost after packet loss would be punishing. BitTorrent's approach is to split each piece into 16 KB blocks, sending a separate Request message for each block with the piece index, the block's offset within the piece, and its length. But waiting for one block to arrive before requesting the next wastes network bandwidth. The better approach is pipelining: keep up to 5 requests in flight at all times. Whenever a Piece message arrives, I copy the data into the piece buffer at the correct offset and immediately send a new Request to fill the freed slot.

// Fill the pipeline: send up to 5 block requests at once
while (pending < 5 && send_idx < total_blocks) {
    send(encode(Request{piece_index, blocks[send_idx].begin_,
                        blocks[send_idx].length_}));
    ++send_idx; ++pending;
}

// Event loop: receive Pieces, fill buffer, replenish requests
while (received < total_blocks) {
    auto msg = recv_message();
    visit(Overloaded{
        [&](const Piece& pce) {
            copy(pce.block_, piece_data.begin() + pce.begin_);
            ++received; --pending;
            if (send_idx < total_blocks && pending < 5) { /* send next request */ }
        },
        [](const auto&) {}  // ignore other message types
    }, msg);
}

// After all blocks arrive, verify SHA-1
if (sha1_hex(piece_data) != expected_hash) throw ...;

Once all blocks are in, I run SHA-1 over the assembled piece and compare it against the hash recorded in the .torrent file. A match means the piece is good. A mismatch means something went wrong in transit or the peer gave us bad data, so we throw an exception.
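
The blocks array used in the pipeline can be computed up front. A minimal sketch, assuming a Block struct with the begin_ and length_ fields seen above; piece_size and make_blocks are hypothetical names. The two edge cases are that the last piece of a file is usually shorter than piece length, and the last block of a piece is usually shorter than 16 KB.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Block {
    std::uint32_t begin_{};   // offset of the block inside its piece
    std::uint32_t length_{};  // number of bytes to request
};

// Size of piece `index`: every piece is piece_length bytes except possibly the last.
auto piece_size(std::int64_t total_length, std::int64_t piece_length,
                std::int64_t index) -> std::int64_t {
    return std::min(piece_length, total_length - index * piece_length);
}

// Cut a piece of `size` bytes into blocks of at most `block_size` bytes.
auto make_blocks(std::int64_t size, std::int64_t block_size = 16 * 1024)
    -> std::vector<Block> {
    std::vector<Block> blocks;
    for (std::int64_t begin = 0; begin < size; begin += block_size) {
        auto length = std::min(block_size, size - begin);
        blocks.push_back({static_cast<std::uint32_t>(begin),
                          static_cast<std::uint32_t>(length)});
    }
    return blocks;
}
```

For a 1,000,000-byte file with 256 KB pieces, the fourth and last piece is 213,568 bytes, which splits into 13 full blocks plus one 576-byte tail.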

Multithreading: Full-Speed Download

If you can download one piece, you can download the entire file. The logic connecting these two concepts is surprisingly straightforward.

First, I set the output file to its final size with ftruncate. Think of this as "reserving your spot": the file already has its full length (on most filesystems as a sparse file, so disk blocks are only allocated once data actually lands), and each piece's data just gets written to its correct offset with pwrite. No need to accumulate a file-sized buffer in memory.

Then comes the multithreading. I spawn one std::jthread worker per peer, each responsible for a contiguous range of pieces. Within a thread, a single TCP connection is established and reused for all pieces in that worker's range (saving handshake overhead). Across threads, everything runs in parallel, each talking to a different peer. The core logic is clean enough to fit in a handful of lines:

// Pre-allocate the file
ftruncate(fd, metainfo.length_);

for (size_t i = 0; i < num_workers; ++i) {
    workers.emplace_back([&, start, end, peer_idx = i]() {
        auto conn = establish_connection(
            peers[peer_idx].ip_, peers[peer_idx].port_, ...);
        for (int pi = start; pi < end; ++pi) {
            auto data = download_piece_on_connection(conn, metainfo, pi);
            pwrite(fd, data.data(), data.size(),
                   pi * metainfo.piece_length_);
        }
    });
}

Error handling is taken care of too. If any worker throws, I capture the first exception with std::exception_ptr behind a mutex, and rethrow it after all threads have joined. This ensures a single failure doesn't crash the whole process before other threads have a chance to clean up their resources.
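
Here's the shape of that pattern in isolation: contiguous piece ranges plus first-exception capture. It's a sketch under assumptions, not the project's code. split_ranges and run_workers are hypothetical names, the per-piece work is passed in as a callback, and I use std::thread with an explicit join so the snippet compiles as C++17; std::jthread would do the joining in its destructor.

```cpp
#include <cstddef>
#include <exception>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Range = std::pair<std::size_t, std::size_t>;  // [first, second)

// Split [0, num_pieces) into contiguous ranges whose sizes differ by at most one.
auto split_ranges(std::size_t num_pieces, std::size_t num_workers) -> std::vector<Range> {
    std::vector<Range> ranges;
    if (num_workers == 0) return ranges;
    const std::size_t base = num_pieces / num_workers;
    const std::size_t extra = num_pieces % num_workers;
    std::size_t start = 0;
    for (std::size_t i = 0; i < num_workers; ++i) {
        const std::size_t count = base + (i < extra ? 1 : 0);
        ranges.emplace_back(start, start + count);
        start += count;
    }
    return ranges;
}

// Run job(worker_index, piece_index) on one thread per range.
// The first exception is remembered and rethrown once every thread has joined.
auto run_workers(std::size_t num_pieces, std::size_t num_workers,
                 const std::function<void(std::size_t, std::size_t)>& job) -> void {
    std::mutex mutex;
    std::exception_ptr first_error;
    const auto ranges = split_ranges(num_pieces, num_workers);
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < ranges.size(); ++w) {
        workers.emplace_back([&, w, range = ranges[w]] {
            try {
                for (auto pi = range.first; pi < range.second; ++pi) job(w, pi);
            } catch (...) {
                std::lock_guard<std::mutex> lock{mutex};
                if (!first_error) first_error = std::current_exception();
            }
        });
    }
    for (auto& worker : workers) worker.join();
    if (first_error) std::rethrow_exception(first_error);
}
```

In the downloader, the callback would be the "download piece pi on worker w's connection and pwrite it" step from the loop above.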

Magnet Links: Throw Away the Torrent File

The .torrent file path is done, but there's another — arguably more common — way to start a download: magnet links. You've definitely seen something like this: magnet:?xt=urn:btih:abc123...&dn=filename&tr=tracker_url. At its core, it's just a URL embedding the file's info_hash, a suggested display name, and one or more tracker addresses.

Why magnet links? A .torrent file may be small, but it's still a file — you have to get it from a website, a forum, or some other channel first. A magnet link is just a string. Sharing a link is infinitely more convenient than sharing a file. For the BitTorrent network, magnet links are also more decentralized: even if every torrent index site goes down, as long as someone is still seeding, pasting a link is enough to start downloading.

The full magnet download flow adds one critical step compared to the .torrent path: since you don't have a torrent file, you have no idea how big the file is or what its piece hashes are — you have to ask a peer for this information. The rough flow goes like this: parse the magnet link to extract the info_hash and tracker URL, query the tracker for a peer list, establish a TCP connection and perform the base handshake, then use the extension protocol to request the info dictionary from the peer. Once the info_dict passes verification, the rest is exactly the same as the .torrent path — download all the pieces as usual.

Parsing the magnet link itself is straightforward string processing: check for the magnet:? prefix, find the 40-character hex hash after xt=urn:btih: and convert it to 20 raw bytes, locate the tracker URL after tr= and URL-decode it. Compared to bencoding, this is about as hard as drinking a glass of water.
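
As a rough sketch of that string processing (parse_magnet, url_decode, and hex_value are hypothetical names, and this version only handles the 40-character hex form of the hash and keeps the first tr= parameter):

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

struct Magnet {
    std::string info_hash_;    // 20 raw bytes
    std::string tracker_url_;  // first tr= parameter, URL-decoded (may be empty)
};

auto hex_value(char c) -> int {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Replace %XX escapes with the byte they encode; everything else is copied as-is.
auto url_decode(std::string_view in) -> std::string {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size() &&
            hex_value(in[i + 1]) >= 0 && hex_value(in[i + 2]) >= 0) {
            out.push_back(static_cast<char>(hex_value(in[i + 1]) * 16 + hex_value(in[i + 2])));
            i += 2;
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}

auto parse_magnet(std::string_view link) -> std::optional<Magnet> {
    constexpr std::string_view kScheme = "magnet:?";
    constexpr std::string_view kHashKey = "xt=urn:btih:";
    if (link.substr(0, kScheme.size()) != kScheme) return std::nullopt;
    link.remove_prefix(kScheme.size());
    Magnet magnet;
    while (!link.empty()) {
        const auto amp = link.find('&');
        const auto param = link.substr(0, amp);
        link.remove_prefix(amp == std::string_view::npos ? link.size() : amp + 1);
        if (param.substr(0, kHashKey.size()) == kHashKey) {
            const auto hex = param.substr(kHashKey.size());
            if (hex.size() != 40) return std::nullopt;
            for (std::size_t i = 0; i < 40; i += 2) {
                const int hi = hex_value(hex[i]);
                const int lo = hex_value(hex[i + 1]);
                if (hi < 0 || lo < 0) return std::nullopt;
                magnet.info_hash_.push_back(static_cast<char>(hi * 16 + lo));
            }
        } else if (param.substr(0, 3) == "tr=" && magnet.tracker_url_.empty()) {
            magnet.tracker_url_ = url_decode(param.substr(3));
        }
    }
    if (magnet.info_hash_.size() != 20) return std::nullopt;
    return magnet;
}
```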

The Extension Handshake

The core challenge of magnet link downloads is "without a torrent file, how do I know what to download?" BitTorrent's answer is the ut_metadata extension defined in BEP 9, which allows peers to exchange torrent info dictionaries. But to use ut_metadata, you first need to complete the extension handshake defined in BEP 10.

The extension handshake is an extra round of negotiation that happens immediately after the standard handshake. First, bit 4 of byte 26 (the 6th byte of the reserved field, reserved[5] counting from zero) in my handshake packet is set to 1. This flag tells the peer "I speak the extension protocol." If the peer also supports it, it will set the same bit in its handshake response.

Right after the handshake, I send an extension handshake message. This message has type Extended (peer wire message ID 20) with extended message ID 0 (by convention, ID 0 is always the extension handshake), and its payload is the bencoded dictionary {"m": {"ut_metadata": 1}}. This says "I want to use the ut_metadata extension, and I'll call it ID 1." The peer responds with a similarly structured dictionary {"m": {"ut_metadata": N}}, telling me what message ID it has assigned to ut_metadata: it might be 1, 2, or some other number. From this point on, every ut_metadata message I send to that peer uses its ID N, while the ones it sends to me arrive tagged with my ID 1.

// After the standard handshake, check if the peer supports extensions
bool has_ext = (hs_buf[25] & 0x10) != 0;

if (has_ext) {
    // Send extension handshake: {"m": {"ut_metadata": 1}}
    Dict ext{{"m", Dict{{"ut_metadata", Integer{1}}}}};
    send(sock, encode(Extended{0, bencode::encode(Value{ext})}));

    // Parse the response to get the peer's ut_metadata message ID
    auto msg = recv_message(sock);
    auto metadata_ext_id = parse_ext_handshake_response(msg.payload_);
}

Once this step completes, I know exactly which message ID to use when requesting metadata.

Asking a Peer for Metadata

With the peer's ut_metadata message ID in hand, requesting metadata means constructing the bencoded dictionary {"msg_type": 0, "piece": 0} and sending it as an Extended message. msg_type=0 means "this is a request," and piece=0 means "give me chunk 0 of the metadata." (The ut_metadata protocol splits the info dictionary into 16 KB chunks for transmission. A small torrent's info dict fits in a single chunk, in which case piece=0 is all you need; 16 KB holds roughly 800 of the 20-byte piece hashes, so a larger torrent needs one request per chunk.)
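
On the wire, that request is tiny. The sketch below spells out the bytes; metadata_request_payload and frame_extended are hypothetical names. What comes from the specs is the layout: a 4-byte big-endian length prefix, the Extended message ID 20, the peer's extended ID, then the bencoded dictionary.

```cpp
#include <cstdint>
#include <string>

// Bencoded {"msg_type": 0, "piece": <piece>}; the keys are already in sorted order.
auto metadata_request_payload(int piece) -> std::string {
    return "d8:msg_typei0e5:piecei" + std::to_string(piece) + "ee";
}

// Wrap a payload as an Extended message addressed to the peer's `ext_id`.
auto frame_extended(std::uint8_t ext_id, const std::string& payload) -> std::string {
    constexpr char kExtendedMessageId = 20;
    // The length prefix counts everything after itself: two ID bytes plus the payload.
    const auto length = static_cast<std::uint32_t>(payload.size() + 2);
    std::string msg;
    for (int shift = 24; shift >= 0; shift -= 8)
        msg.push_back(static_cast<char>((length >> shift) & 0xff));  // big-endian
    msg.push_back(kExtendedMessageId);
    msg.push_back(static_cast<char>(ext_id));
    msg += payload;
    return msg;
}
```

For piece 0 the payload is 25 bytes, so the whole framed request comes to 31 bytes.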

The peer responds with an Extended message whose payload is the bencoded dictionary {"msg_type": 1, "piece": 0, "total_size": N}, followed immediately by the raw bencoded info dict data. msg_type=1 means this is a response, and total_size tells me how many bytes the info dictionary's bencoded form takes. The key operation is extracting the last total_size bytes from the payload: when the info dict fits in one chunk, that's the complete bencoded info dictionary.

Once I have info_bencode, I do two things. First, bencode-decode it and feed it into from_info_dict to reconstruct a Metainfo — now I have piece_hashes, length, and piece_length, everything I need. Second, and this is the critical part, I compute SHA-1 over info_bencode and compare it against the info_hash from the magnet link. If they don't match, the peer gave me bogus data — throw an exception, try a different peer. This is "trust but verify"; the entire BitTorrent protocol's security rests on hash verification.

// Request metadata
send_metadata_request(sock, metadata_ext_id, 0);

// Receive the response
auto msg = recv_message(sock);
auto info_bencode = parse_metadata_data(msg.payload_);

// Parse the info dict
auto info_dict = std::get<Dict>(bencode::decode(info_bencode));
auto metainfo = from_info_dict(info_dict);

// Verify: info_hash must match the magnet link
if (sha1(info_bencode) != info_hash) throw ...;

Once verification passes, the path forward is identical to the .torrent download: use the Metainfo to query the tracker for a peer list, spawn multiple threads for parallel piece download, and pwrite everything to disk. Magnet links and .torrent files converge on the same destination.

Wrapping Up

From raw TCP sockets to bencoding, from .torrent parsing and tracker communication to the peer wire protocol's handshake and block pipelining, from multithreaded parallel download to magnet links and extension protocols — bit by bit, a working BitTorrent client came together. Around 2500 lines of source code, just under 3500 including tests and build configuration.

The biggest takeaway from this project is that the best way to understand a protocol or a system is to implement it yourself. The BitTorrent protocol specification is only a handful of pages. But there's an ocean of difference between calling someone else's library and filling every byte of a socket buffer by hand, cross-referencing BEP documents to figure out why the peer won't send an unchoke.

Of course, this implementation is aggressively minimal. No seeding (download-only), no DHT for decentralized peer discovery (fully tracker-dependent), no UDP tracker support (HTTP only), no rarest-first piece selection (just sequential assignment), no PEX peer exchange, and no end-game mode. These are the clear dividing lines between a production-grade BitTorrent client and a "learning wheel." As a practical tool, it doesn't hold a candle to qBittorrent or Transmission. As a learning exercise, it did everything I wanted it to do.

If the project interests you, the code is at https://github.com/Tenaryo/TinyBitTorrent. Feedback and drive-by comments welcome.