Why does your pointer not point where you think it does?
In the previous article, What Every Programmer Should Know About Memory (Part 1), we covered sections 2 and 3 of Ulrich Drepper's paper What Every Programmer Should Know About Memory. In this article, we continue from where we left off and cover section 4 (yes, section 4 only).
The previous article explored memory hierarchies from the ground up — how DRAM hardware works, why CPU caches exist, and practical optimization techniques like cache-line awareness and data structure layout. We examined the physical reality behind the "flat array" abstraction and learned why memory access patterns matter for performance.
In this article, we continue with section 4 of Ulrich Drepper's paper, diving deep into Virtual Memory — the translation layer that gives every process its own address space while sharing physical RAM.
Table of Contents
- 0.1. Paging?
- 0.2. More Concepts
- 1.1. The Sandbox: How the MMU makes every process believe it owns the entire RAM
- 1.2. The Cost of Translation: Why a single virtual address might require 4+ physical memory accesses before you even touch your data
- 2.1. Why We Can't Use Flat Tables: The impossibility of a 4MB directory for every process
- 2.2. The Multi-Level Solution: Breaking addresses into directories (L4 → L3 → L2 → L1)
- 2.3. The Hardware Walker: How the processor "walks the tree" to find physical pages
- 3.1. Caching the Address: The TLB as a tiny, ultra-fast cache specifically for address translations
- 3.2. TLB Thrashing: A Practical Example
- 3.3. The Context Switch Penalty: Why switching processes forces a TLB flush (and why it's expensive)
- 4.1. The Page Size Limit: Why 4KB pages clog up the TLB
- 4.2. Huge Pages (2MB/1GB): Increasing the range of a single TLB entry to reduce misses
0 Prerequisites: The Basics
Before diving into the details of Virtual Memory, let's define a few key concepts that will help you understand the rest of the article.
0.1 Paging?
A page is a fixed-length contiguous block of virtual memory. In most systems, the default page size is 4KB (4096 bytes), although larger page sizes (like 2MB or 1GB) can also be used for specific applications.
Paging: A memory management scheme that eliminates the need for contiguous allocation of physical memory. It divides virtual memory into pages and maps them to physical memory frames, allowing for more efficient use of RAM and making virtual memory possible in the first place.
Physical Frame: A fixed-length block of physical memory, the same size as a page, into which a virtual page is mapped. The operating system maintains the mapping between virtual pages and physical frames, so a process never needs to know where its data actually lives in RAM.
Frames VS Pages: A "page" is a block of virtual memory, while a "frame" is a block of physical memory. Pages are what a process sees; frames are where the data actually sits in RAM.
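To make the split concrete, here is a minimal C sketch (assuming the common 4KB page size) that breaks a virtual address into its page number and the offset within that page:
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE  4096UL   /* the common default: 4KB pages */
#define PAGE_SHIFT 12       /* log2(4096) */

int main(void) {
    uintptr_t addr = 0x7ffd1234;                /* an arbitrary virtual address */
    uintptr_t vpn  = addr >> PAGE_SHIFT;        /* virtual page number */
    uintptr_t off  = addr & (PAGE_SIZE - 1);    /* byte offset inside the page */
    printf("page number: %#lx, offset: %#lx\n", (unsigned long)vpn, (unsigned long)off);
    return 0;
}
The page number is what gets translated to a physical frame; the offset is carried over unchanged.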
0.2 More Concepts
There is much more to virtual memory than this, but to save time, here are one-line definitions of a few concepts that will help you follow the rest of the article.
Address Space: The range of memory addresses that a process can use. Each process has its own virtual address space, which is mapped to physical memory by the operating system.
Memory Management Unit (MMU): A hardware component that handles the translation of virtual addresses to physical addresses. It works in conjunction with the operating system to manage memory access and enforce protection.
Page Table: A data structure used by the operating system to keep track of the mapping between virtual pages and physical frames. Each process has its own page table.
TLB (Translation Lookaside Buffer): A small, fast cache that stores recent translations of virtual addresses to physical addresses. It helps speed up the address translation process by reducing the number of memory accesses needed.
1 The Illusion of Ownership: Virtual vs. Physical
As you probably understand by now (from the Paging? prerequisite section above), virtual memory creates the illusion that each process has its own dedicated physical memory. In reality, the operating system manages the physical memory and allocates it to processes as needed.
Now, we will explore in a bit more detail how this illusion is created and maintained.
1.1 The Sandbox: How the MMU makes every process believe it owns the entire RAM
We know what an MMU is and what it does (see More Concepts), but how does it handle this translation? How does it know which virtual address maps to which physical address?
Let's talk about the Levels Of Translation:
- Single-Level Translation: In a simple system, the MMU uses a single-level page table to map virtual addresses to physical addresses. Each entry in the page table corresponds to a virtual page and contains the physical frame number where that page is stored.
- Multi-Level Translation: In more complex systems, the MMU uses a multi-level page table to reduce memory overhead. The virtual address is divided into multiple parts, each part indexing into a different level of the page table (see the sketch just below). This hierarchical structure allows for more efficient use of memory.
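As an illustration of the multi-level case: on x86-64 with 4KB pages, a 48-bit virtual address is split into four 9-bit indices (one per level) plus a 12-bit page offset. A small C sketch of that split:
#include <stdint.h>
#include <stdio.h>

/* x86-64 4-level paging with 4KB pages: 9 index bits per level + 12-bit offset */
#define IDX(addr, level) (((addr) >> (12 + 9 * (level))) & 0x1FF)

int main(void) {
    uint64_t vaddr = 0x00007f5a12345678ULL;   /* an example virtual address */
    printf("L4=%lu L3=%lu L2=%lu L1=%lu offset=%lu\n",
           (unsigned long)IDX(vaddr, 3), (unsigned long)IDX(vaddr, 2),
           (unsigned long)IDX(vaddr, 1), (unsigned long)IDX(vaddr, 0),
           (unsigned long)(vaddr & 0xFFF));
    return 0;
}
Each 9-bit index selects one of 512 entries at its level; the 12-bit offset addresses a byte within the final 4KB page.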
1.2 The Cost of Translation: Why a single virtual address might require 4+ physical memory accesses before you even touch your data
Let's discuss the trade-offs between single-level and multi-level page tables.
Single-level tables are simple and fast (one lookup) but waste a massive amount of RAM for the table itself. Multi-level tables save RAM by only allocating what is needed, but they are slower because they require multiple memory lookups to find the address.
The Math of Latency:
Imagine a single memory access takes 100ns. If you have a 4-level page table and a TLB miss, you don't just wait 100ns for your data. You wait:
100ns (L4) + 100ns (L3) + 100ns (L2) + 100ns (L1) + 100ns (Actual Data) = 500ns.
That is a 5x slowdown just for translation!
2 The Page Table Walk: A Tree Structure
The page table walk is the process by which the MMU translates a virtual address to a physical address using the page table. In a multi-level page table, this involves traversing a tree-like structure to find the correct mapping.
Think of it like a Library Index:
If you had a single flat list of every book in the world, it would be far too large to manage. Instead, we use a hierarchy:
- L4: Which Floor?
- L3: Which Aisle?
- L2: Which Shelf?
- L1: Which Book?
2.1 Why We Can't Use Flat Tables: The impossibility of a 4MB directory for every process
As we discussed before, using a flat page table for every process would require a massive amount of memory, especially for systems with large address spaces. For example, in a 32-bit system with 4KB pages, a flat page table would require 4MB of memory per process (2^20 entries * 4 bytes per entry). This is impractical for systems with many processes or limited memory, and on 64-bit systems with their 48-bit (or larger) address spaces, a flat table would need tens of billions of entries, which is completely out of the question.
2.2 The Multi-Level Solution: Breaking addresses into directories (L4 → L3 → L2 → L1)
To address the memory overhead issue, multi-level page tables break down the virtual address into multiple parts, each part indexing into a different level of the page table. This hierarchical structure allows the operating system to allocate page table entries only for used virtual pages, significantly reducing memory usage.
2.3 The Hardware Walker: How the processor "walks the tree" to find physical pages
This is the interesting part! Here, we learn how the CPU finds the physical address corresponding to a given virtual address using the multi-level page table structure.
The Hardware Walker is a component of the MMU that is responsible for traversing the multi-level page table to find the physical address corresponding to a given virtual address.
CR3 Register (or TTBR): This special CPU register holds the physical address of the root of the page table (Level 4). When a context switch occurs, the operating system updates this register to point to the page table of the new process.
When a process accesses a virtual address, the hardware walker performs the following steps:
1. Extract the Indices: The hardware walker extracts the indices for each level of the page table from the virtual address. For example, in a 4-level page table, it would extract indices for L4, L3, L2, and L1.
2. Traverse the Page Table: Starting from the root of the page table (L4), the hardware walker uses the extracted indices to navigate through each level of the page table. At each level, it reads the corresponding entry to find the address of the next level's page table.
3. Find the Physical Address: Once the hardware walker reaches the final level (L1), it retrieves the physical frame number from the page table entry. It then combines this frame number with the offset from the original virtual address to compute the final physical address.
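To make the walk concrete, here is a minimal software simulation of a 4-level walk. This is a sketch, not real MMU code: the table_t layout and the way entries store raw pointers are hypothetical stand-ins for real page table entries, which also carry present and permission bits.
#include <stdint.h>
#include <stdio.h>

#define ENTRIES 512   /* 9 index bits per level -> 512 entries per table */

/* Hypothetical in-memory page-table node: each entry holds either a pointer to
 * the next level's table or, at the last level, the physical frame's base address. */
typedef struct table { uint64_t entry[ENTRIES]; } table_t;

/* Simulated walk: 'cr3' plays the role of the CR3 register (root of the tree). */
uint64_t walk(const table_t *cr3, uint64_t vaddr) {
    const table_t *t = cr3;
    for (int level = 3; level >= 1; level--) {            /* L4 -> L3 -> L2 */
        uint64_t idx = (vaddr >> (12 + 9 * level)) & 0x1FF;
        t = (const table_t *)t->entry[idx];                /* follow to the next level */
    }
    uint64_t frame = t->entry[(vaddr >> 12) & 0x1FF];      /* L1 entry: frame base */
    return frame | (vaddr & 0xFFF);                        /* add the page offset */
}

int main(void) {
    static table_t l4, l3, l2, l1;                 /* one table per level */
    uint64_t vaddr = 0x00007f0000001234ULL;        /* example virtual address */

    /* Wire up a single translation: vaddr's page -> physical frame 0x5000. */
    l4.entry[(vaddr >> 39) & 0x1FF] = (uint64_t)&l3;
    l3.entry[(vaddr >> 30) & 0x1FF] = (uint64_t)&l2;
    l2.entry[(vaddr >> 21) & 0x1FF] = (uint64_t)&l1;
    l1.entry[(vaddr >> 12) & 0x1FF] = 0x5000;

    printf("virtual %#llx -> physical %#llx\n",
           (unsigned long long)vaddr, (unsigned long long)walk(&l4, vaddr));
    return 0;
}
Notice that resolving a single address costs four dependent memory reads before the data itself can be fetched, which is exactly the latency problem the TLB exists to solve.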
3 The Accelerator: The TLB (Translation Look-Aside Buffer)
To avoid the performance hit of walking page tables for every access, processors cache the computed physical addresses in a specialized cache called the TLB.
3.1 Caching the Address: The TLB as a tiny, ultra-fast cache specifically for address translations
The TLB (Translation Lookaside Buffer) is a small, fast cache that stores recent translations of virtual addresses to physical addresses. It is designed to speed up the address translation process by reducing the number of memory accesses needed to translate a virtual address.
When a process accesses a virtual address, the MMU first checks the TLB to see if the translation for that address is already cached. If it is (a TLB hit), the MMU retrieves the corresponding physical address directly from the TLB and skips the page table walk entirely. If it is not (a TLB miss), the MMU has to walk the page table and then caches the result for next time.
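Conceptually, the lookup behaves like the sketch below. This is a toy, fully associative software TLB for illustration only; a real TLB is a hardware structure, typically set-associative, and the 64-entry size here is just an assumption in the right ballpark for a first-level data TLB.
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 64   /* deliberately small, like the real thing */

typedef struct {
    uint64_t vpn;        /* virtual page number */
    uint64_t pfn;        /* physical frame number */
    bool     valid;
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Returns true on a hit and fills *pfn; false means we must walk the page table. */
bool tlb_lookup(uint64_t vaddr, uint64_t *pfn) {
    uint64_t vpn = vaddr >> 12;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;   /* hit: translation served without touching memory */
            return true;
        }
    }
    return false;                /* miss: fall back to the page table walk */
}

int main(void) {
    uint64_t pfn;
    tlb[0] = (tlb_entry_t){ .vpn = 0x7ffd1, .pfn = 0x42, .valid = true };
    if (tlb_lookup(0x7ffd1a30, &pfn))
        printf("hit: frame %#llx\n", (unsigned long long)pfn);
    else
        printf("miss: walk the page tables\n");
    return 0;
}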
3.2 TLB Thrashing: A Practical Example
This is where theory meets practice. If you access memory in a pattern that constantly jumps to new pages, you will cause TLB Thrashing. The TLB is small; if you touch too many pages too quickly, you evict useful entries.
Consider iterating over a large 2D array:
// Assume a large square matrix: static int matrix[N][N]; and long sum = 0;

// Fast: Row-major access (Sequential)
// We access matrix[0][0], matrix[0][1], matrix[0][2]...
// These are all on the same page: one TLB miss per page (4096 bytes).
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        sum += matrix[i][j];
    }
}

// Slow: Column-major access (Strided)
// We access matrix[0][0], matrix[1][0], matrix[2][0]...
// Each access jumps N * sizeof(int) bytes forward.
// Once a row spans a full page (N >= 1024 for 4-byte ints), nearly every
// access touches a NEW page. High TLB miss rate!
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        sum += matrix[i][j];
    }
}
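On Linux you can observe the difference directly: running each version under perf stat -e dTLB-loads,dTLB-load-misses (event names vary a little between CPUs) will typically show the column-major loop missing the data TLB far more often than the row-major one.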
3.3 The Context Switch Penalty: Why switching processes forces a TLB flush (and why it's expensive)
We did not discuss context switching before, so let's define it first:
Context Switching: The process of saving the state of the currently running process and loading the state of another process so that multiple processes can share a single CPU. This involves saving and restoring the CPU registers, program counter, and other process-specific information.
When a context switch occurs, the TLB must be flushed (cleared) because the cached translations in the TLB are specific to the virtual address space of the currently running process. If the TLB were not flushed, the new process could potentially access incorrect physical addresses based on stale TLB entries from the previous process, leading to data corruption or security vulnerabilities.
Flushing the TLB is expensive not because the flush itself takes long, but because the new process then starts with a cold TLB: the MMU has to walk the page tables again for the first access to every page the process touches, resulting in increased latency and reduced performance. This is particularly problematic in systems with frequent context switches, as the repeated cold starts can significantly impact overall system performance.
THE OPTIMIZATION: Modern processors and operating systems implement various techniques to mitigate the performance impact of TLB flushes during context switches. One common approach is to use Address Space Identifiers (ASIDs) or Process Context Identifiers (PCIDs), which tag each TLB entry with the address space it belongs to, allowing the TLB to retain entries for multiple processes simultaneously. This way, when a context switch occurs, the TLB does not need to be completely flushed; entries belonging to other processes remain in place and simply cannot match while a different address space is active. This significantly reduces the overhead of context switches and improves overall system performance.
Note on Threads vs. Processes:
It is important to note that Threads within the same process share the same Page Table (and thus the same TLB entries). Context switching between threads is much cheaper than switching between processes because the TLB does not need to be flushed.
4 Optimization: Making the TLB Bigger (Without Hardware Changes)
To improve TLB performance without changing the hardware, operating systems can use techniques like huge pages to increase the amount of memory covered by each TLB entry, effectively making the TLB "bigger" in reach if not in entry count.
4.1 The Page Size Limit: Why 4KB pages clog up the TLB
The default page size of 4KB means each TLB entry covers only a tiny slice of the address space. A typical first-level data TLB holds on the order of 64 entries, so with 4KB pages it can map only about 64 * 4KB = 256KB at a time; a process whose working set is larger than that will take frequent TLB misses and pay for page table walks.
4.2 Huge Pages (2MB/1GB): Increasing the range of a single TLB entry to reduce misses
Huge pages are larger memory pages that reduce the number of TLB entries needed to cover a given address range. By using huge pages (e.g., 2MB or 1GB), a single TLB entry covers a much larger portion of the address space (one 2MB entry maps as much memory as 512 separate 4KB entries), reducing the likelihood of TLB misses and improving performance.
The Problem with Huge Pages: While huge pages can improve TLB performance, they also come with some challenges. Allocating large contiguous blocks of physical memory can be difficult, especially on systems with fragmented memory. Additionally, huge pages can waste memory: if a process does not fully use an allocated huge page, the unused portion is lost to internal fragmentation.
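On Linux, one way to request huge pages explicitly is mmap with the MAP_HUGETLB flag. The sketch below is a minimal example, assuming the kernel has huge pages reserved (for instance via /proc/sys/vm/nr_hugepages); if the request fails, it simply falls back to ordinary 4KB pages.
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define LEN (128UL * 1024 * 1024)   /* 128MB, a multiple of the 2MB huge page size */

int main(void) {
    /* Try to back the mapping with huge pages. */
    void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        /* No huge pages reserved (or not permitted): fall back to normal pages. */
        perror("mmap(MAP_HUGETLB)");
        buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    ((char *)buf)[0] = 1;   /* touch the memory so it is actually faulted in */
    munmap(buf, LEN);
    return 0;
}
Alternatively, transparent huge pages can be requested for a normal mapping with madvise(buf, LEN, MADV_HUGEPAGE), which asks the kernel to use 2MB pages where it can without requiring an explicit reservation.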
Real World Use Case:
Database engines like PostgreSQL or Oracle often manage buffer pools (cached data) that are dozens of GBs in size. Mapping 64GB of RAM using 4KB pages would require millions of TLB entries, causing constant thrashing. Using Huge Pages makes this manageable and significantly improves database throughput.
5 Conclusion: Respecting the Translation Layer
Virtual memory and the associated translation mechanisms are fundamental to modern computing. Understanding how virtual addresses are translated to physical addresses, the role of the TLB, and optimization techniques like huge pages is crucial for developers aiming to write efficient software.





