Hamza Hasanain

What Every Programmer Should Know About Memory (Part 1)

I recently came across an interesting paper titled
What Every Programmer Should Know About Memory by Ulrich Drepper. The paper dives into the structure of the memory subsystems used in modern commodity hardware, and what programs should do to achieve optimal performance by utilizing them.

What I will be doing here is summarizing what I (as a semi-intelligent being) have learned from reading the paper. I highly recommend reading the paper itself; as the title says, it is what every programmer should know about memory.

Needless to say, some parts of the paper were quite complex for my brain. I did my best to understand everything, but I might have missed some details. If you find any mistakes, please let me know!

What I Will Cover Here

I just finished reading the first 3 sections of the paper, which cover the following topics:

  1. Basic Architecture of Modern Computers
  2. Main Memory
  3. Caches

This post will be structured around these topics, so here is my table of contents:

Table of Contents

1- Introduction: The Lie of the Flat Array

2- The Hardware Reality (RAM Physics)

3- The Caching Solution (CPU Caches)

4- Programmer Takeaways

5- Conclusion


Introduction: The Lie of the Flat Array

The Flat Memory Model (also known as the Linear Memory Model) is one of the most fundamental lies operating systems tell programmers. It is an abstraction that presents memory to your program as a single, contiguous tape of bytes, addressable from 0 to 2^N - 1.

This abstraction is crucial for programming, as it allows us to think of memory in simple terms, using pointers and offsets to access data. However, in reality, memory is far from flat. It is a complex hierarchy of storage types, each with different speeds, sizes, and costs.

You might ask the OS for a 2GB contiguous block for a game engine or database. The limitation: the OS might not have 2GB of physically contiguous RAM. It might have 2GB free, but scattered in 4KB chunks all over the physical chips.

Reality: The OS uses Virtual Memory Paging to map your flat virtual addresses to scattered physical addresses. The abstraction holds, but the OS has to work hard (using Page Tables and the TLB - Translation Lookaside Buffer) to maintain the illusion. If you access memory too sporadically, you thrash the TLB, causing performance degradation.

This will be discussed in more detail in upcoming posts (where we will talk about Virtual Memory and NUMA support), but for now, just remember: memory is not flat. Access times vary wildly depending on where your data resides in the memory hierarchy.

1.1 The O(1) Myth of Pointer Access

When talking about pointer access, most programmers assume that dereferencing a pointer is an O(1) operation - meaning it takes a constant amount of time regardless of the size of the dataset or where the data is located. However, in modern systems, that constant time can vary by a factor of 1,000,000 depending on where the data physically lives.

If your data is in the L1 Cache (closest to the CPU core), the dereference is nearly unnoticeable. If it is in Main RAM, the CPU must stall and wait. If it is swapped out to the Disk, the CPU could execute millions of instructions in the time it takes to fetch that one value.

1.2 The Latency Numbers (Approximate)

To put this in perspective, let's look at the cost in CPU cycles:

Location of Data | Approximate Latency | Simple Analogy
---------------- | ------------------- | --------------
L1 Cache | 3-4 cycles | Grabbing a pen from your desk.
L2 Cache | 10-12 cycles | Picking a book off a nearby shelf.
L3 Cache | 30-70 cycles | Walking to a colleague's desk.
Main RAM | 100-300 cycles | Walking to the coffee machine down the hall.
SSD/NVMe | 10,000+ cycles | Driving to the supermarket.
HDD (Page Fault) | 10,000,000+ cycles | Flying to the moon.

The Hardware Reality (RAM Physics)

If you have ever come across the terms CPU cache and RAM and wondered what they are, why they are different, and why we don't just use one type of memory, this section will give you a basic understanding of how modern memory systems are structured.

CPU Caches are referred to as fast memory because they are built using SRAM (Static RAM) technology, while the Main RAM is built using DRAM (Dynamic RAM) technology. The two have different trade-offs in terms of speed, cost, and density.

2.1 SRAM vs. DRAM: Why Main Memory Uses Leaky Capacitors

Let's start with SRAM and DRAM and see how each is built:

SRAM uses a set of transistors to store each bit of data. A typical SRAM cell uses 6 transistors to store a single bit, forming a flip-flop circuit that can hold its state as long as power is supplied. This design allows for very fast access times (on the order of nanoseconds) because the data can be read or written directly without any additional steps.

SRAM Cell Diagram

DRAM, on the other hand, uses a single transistor and a capacitor to store each bit of data. The capacitor holds an electrical charge to represent a 1 and no charge to represent a 0. However, capacitors leak charge over time, so the data must be periodically refreshed (every few milliseconds) to maintain its integrity. This refresh process introduces latency and complexity but allows DRAM to be much denser and cheaper than SRAM.

DRAM Cell Diagram

Check this article if you want to know why real-world capacitors are not perfect insulators

2.2 The Refresh Tax: Why Execution Stalls for Memory Maintenance

As mentioned earlier, the problem with DRAM is that capacitors are imperfect: they leak electrons and lose their charge over time. To prevent data loss, the MC (Memory Controller) must read every single row of memory and write it back (recharge it) before the data fades away.

This maintenance operation is called a Refresh Cycle. During a refresh cycle, the memory controller temporarily halts normal memory operations to perform the refresh. This can lead to delays in servicing memory requests from the CPU, causing stalls in execution.

Keep in mind that this does not halt the entire RAM chip at once; instead, rows are refreshed sequentially. However, the refresh operation occupies the Bank's sense amplifiers. Therefore, if the CPU requests data from ANY row within that specific Bank (not just the row being refreshed), it must wait until the Bank becomes available again.

What is a Bank? A Bank is a subdivision of the DRAM chip that can be accessed independently. Each Bank has its own sense amplifiers and can be refreshed or accessed separately from other Banks. This allows for some level of parallelism and reduces the impact of refresh cycles on overall memory access latency.

What are sense amplifiers? Sense amplifiers are specialized circuits within the DRAM that detect and amplify the small voltage changes on the bit lines during read operations. They are crucial for accurately reading the data stored in the capacitors of the DRAM cells.

Actual DRAM

2.3 Address Multiplexing: Sharing Pins to Save Money (and Costing Time)

Before we dive into the latency chain, why it exists, and how it works, we need to understand why DRAM chips use Address Multiplexing.

Address Multiplexing is a technique used in DRAM to reduce the number of pins required on the chip package. Instead of having separate pins for each bit of the address, the address is sent in two parts: the Row Address and the Column Address. This allows the same set of pins to be reused for both parts of the address, effectively halving the number of pins needed (while simultaneously increasing the time it takes to access data).

Why is this important? The number of pins on a chip package directly impacts its cost and complexity. By using address multiplexing, manufacturers can produce DRAM chips that are more affordable and easier to integrate into systems.
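To make this concrete, here is a minimal sketch of the split; the 13-bit column width is an invented illustrative number, not a real chip's geometry:

#include <cstdint>
#include <cstdio>

// Illustrative only: assume a bank with 2^13 columns, so the low 13 bits
// select the column and the remaining bits select the row.
constexpr unsigned kColumnBits = 13;

struct RowCol {
    uint32_t row;
    uint32_t col;
};

RowCol split_address(uint32_t addr) {
    return {
        addr >> kColumnBits,              // sent first, with RAS
        addr & ((1u << kColumnBits) - 1)  // sent second, with CAS, over the same pins
    };
}

int main() {
    RowCol rc = split_address(0x00ABCDEF);
    std::printf("row = %u, col = %u\n", rc.row, rc.col);
}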

2.4 The Latency Chain: The Precharge → RAS → CAS Protocol

To understand why memory latency exists, you have to stop thinking of RAM as a magic bucket and start thinking of it as a physical Matrix (a grid of rows and columns).

To read a single byte of data, the Memory Controller cannot just say Give me index 400. It has to manipulate the physical grid using a strict three-step protocol. This sequence is determined by the physical construction of the DRAM chip and the need to minimize the number of pins on the chip package.

Imagine a massive warehouse (the DRAM Bank):

The Cells: The data lives in millions of tiny boxes (capacitors) arranged in Rows and Columns.

The Row Buffer (Sense Amps): There is a single loading dock (the Row Buffer) where one full row at a time can be staged.

The Rule: You cannot read a box while it is on the shelf. You must first move the entire row of boxes to the loading dock.

As we said before, when the CPU asks for a memory address, the controller breaks it down into a Row Address and a Column Address.

Step 1: Precharge: If there is already a row loaded in the Row Buffer, it must be precharged (written back to the shelf) before loading a new row. This step ensures that the Row Buffer is ready for the next operation.

Step 2: RAS (Row Address Strobe): The controller sends the Row Address to the DRAM chip, which activates the corresponding row and loads it into the Row Buffer (sense amplifiers). This step is crucial because it prepares the data for access.

Step 3: CAS (Column Address Strobe): Finally, the controller sends the Column Address to select the specific byte within the loaded row. The data is then read from the Row Buffer and sent back to the CPU.
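As a rough mental model of the whole chain (the cycle costs below are invented, and real memory controllers are far more sophisticated), the controller only pays the Precharge + RAS cost when the requested row is not already sitting in the Row Buffer:

#include <cstdint>
#include <optional>

// Toy model of one DRAM bank. Latencies are made-up illustrative numbers.
struct Bank {
    std::optional<uint32_t> open_row;  // row currently held in the Row Buffer

    // Returns the cost (in "cycles") of reading from (row, col).
    int access(uint32_t row, uint32_t /*col*/) {
        constexpr int tPRE = 10, tRAS = 10, tCAS = 10;
        if (open_row == row)            // row hit: the row is already in the buffer
            return tCAS;
        int cost = open_row ? tPRE : 0; // write the old row back first, if any
        cost += tRAS + tCAS;            // then activate the new row and select the column
        open_row = row;
        return cost;
    }
};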

2.5 Burst Mode: Why We Never Read Just One Byte

We have established that accessing a single byte from DRAM involves a multi-step process that carries a significant tax (latency). If we paid that tax every time we wanted a single byte (8 bits), our computers would be extraordinarily slow. So engineers came up with a clever solution called Burst Mode.

Burst Mode allows the memory controller to read or write multiple consecutive bytes in a single operation after the initial access. Instead of fetching just one byte, the controller fetches a block of data (typically 64 bytes) in one go. That means (for simplicity) that if you have an array arr[0..63] of 32-bit integers (4 bytes each) and you request arr[0], the memory controller will fetch arr[0] through arr[15] in one operation, because they all sit in the same row and can be streamed out sequentially.
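A quick back-of-the-envelope check of that claim, assuming a 64-byte burst and a 4-byte int (typical, but not guaranteed by the language):

#include <cstddef>

constexpr std::size_t kCacheLineBytes = 64;                          // typical, not guaranteed
constexpr std::size_t kIntsPerLine = kCacheLineBytes / sizeof(int);  // 16 on most platforms

// Requesting arr[0] pulls in arr[0] .. arr[kIntsPerLine - 1] in one burst.
static_assert(kIntsPerLine == 16, "assumes 4-byte int and 64-byte lines");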

Why 64 bytes? See section 3.2 for details

The Caching Solution (CPU Caches)

Time for a quick history lesson.

In the early days of computers (1940s-1970s), CPUs and RAM were about the same speed. The CPU could ask for data and get it right away, so there was no waiting. Life was simple, and there was no need for a cache because the CPU wasn't sitting around with nothing to do.

But in the 1980s, things changed. CPUs started getting much, much faster every year, while RAM speed only improved a little. This created a huge speed difference, known as the Memory Wall. The super-fast CPU now had to spend most of its time waiting for the slower RAM to deliver data, like a sports car stuck in a traffic jam.

The Memory Wall

To solve this problem, engineers invented the cache. A cache is a small, extremely fast memory that lives right next to the CPU. It started with big, expensive computers in the 60s. By 1989, the Intel 486 brought a small L1 cache to personal computers. As the speed gap grew, we added a bigger, slightly slower L2 cache, and later an even bigger L3 cache for multiple CPU cores to share. The idea is to keep the most frequently used data in the fastest memory, so the CPU can keep working instead of waiting.

The next few sections will explain how this cache system works.

3.1 The Hierarchy: L1 (Brain), L2 (Buffer), L3 (Bridge)

CPU Cache Hierarchy

First, if you look at the image, you will notice that the L1 cache has two parts: a D-Cache (Data Cache) and an I-Cache (Instruction Cache), while the L2 and L3 caches are unified (they store both instructions and data). The split L1 follows what is known as a (modified) Harvard architecture, which separates instructions and data to allow simultaneous access. This improves performance by allowing the CPU to fetch an instruction and read data in parallel.

The cache hierarchy is designed to balance speed, size, and cost:

  • L1 Cache: This is the smallest and fastest cache, located directly on the CPU core. It typically ranges from 16KB to 64KB in size and has the lowest latency (around 3-4 cycles). The L1 cache is split into two parts: one for instructions (I-Cache) and one for data (D-Cache). Its primary role is to provide the CPU with the most frequently accessed data and instructions as quickly as possible.

  • L2 Cache: This cache is larger than L1, typically ranging from 256KB to 1MB, and is still located on the CPU core. It has slightly higher latency (around 10-12 cycles) but can store more data. The L2 cache acts as a buffer between the fast L1 cache and the slower L3 cache or main memory, holding data that is not as frequently accessed as that in L1 but still needs to be retrieved quickly.

  • L3 Cache: This is the largest and slowest cache in the hierarchy, typically ranging from a few megabytes to tens of megabytes (hundreds of megabytes on some server chips). It is usually shared among multiple CPU cores and has higher latency (around 30-70 cycles). The L3 cache serves as a bridge between the CPU cores and the main memory, storing data that is less frequently accessed but still benefits from being cached. It exists only because main memory is so terrifyingly slow compared to the CPU.
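If you want to see these sizes for your own machine, glibc on Linux exposes them through sysconf (the _SC_LEVEL* constants are a glibc extension, so treat this as a Linux-only sketch):

#include <unistd.h>
#include <cstdio>

int main() {
    // Values are in bytes; sysconf returns -1 if the value is unavailable.
    std::printf("L1d: %ld\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    std::printf("L2:  %ld\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    std::printf("L3:  %ld\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
}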

3.2 Spatial Locality: How Cache Lines (64 Bytes) Hide Latency

As we discussed in the section Burst Mode: Why We Never Read Just One Byte, when the CPU requests data from memory, it doesn't just fetch a single byte. Instead, it fetches a block of data known as a cache line. In modern systems, a cache line is typically 64 bytes in size.

Why 64 bytes? The cache line size is a trade-off: it must be large enough to exploit spatial locality and amortize the cost of a DRAM burst transfer, but small enough that a single fetch does not waste cache space and memory bandwidth on data the program will never touch. 64 bytes has been the sweet spot on mainstream CPUs for years, which is why DRAM bursts and cache lines line up at that size.

Cache Lines: A cache line is the smallest unit of data that can be transferred between the CPU cache and main memory. Modern CPUs typically use a cache line size of 64 bytes. When the CPU requests data from memory, it fetches an entire cache line, even if only a small portion of that data is needed.

Spatial Locality: Spatial locality refers to the tendency of programs to access data locations that are close to each other within a short time frame. When a program accesses a particular memory address, it is likely to access nearby addresses soon after. By fetching data in blocks (cache lines), the system can take advantage of this behavior, reducing the number of memory accesses and improving overall performance.
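A classic way to feel spatial locality (a rough sketch with made-up sizes; the exact numbers depend entirely on your machine) is to do the same N*N additions in two different orders: one walks memory sequentially, the other jumps a whole row between consecutive accesses:

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    constexpr int N = 4096;
    std::vector<int> m(static_cast<std::size_t>(N) * N, 1);

    auto time_it = [&](bool row_major) {
        auto start = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                sum += row_major ? m[i * N + j]   // sequential: friendly to cache lines
                                 : m[j * N + i];  // strided: a new cache line almost every access
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                           std::chrono::steady_clock::now() - start).count();
        std::printf("%s: sum=%lld, %lld ms\n",
                    row_major ? "row-major" : "col-major", sum, ms);
    };

    time_it(true);   // typically much faster
    time_it(false);
}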

3.3 Associativity: The Parking Lot Problem

We keep saying that caches are fast, but they are also small, so the hardware needs a strategy for deciding where to place data when it enters the cache and where to look for it when it is needed again.

Let's discuss 3 strategies for organizing data in the cache, known as cache associativity:

  • Direct-Mapped Cache: In a direct-mapped cache, each block of main memory maps to exactly one location in the cache. This is like having a parking lot where each car has a designated parking spot. If two cars (memory blocks) want to park in the same spot, one has to leave (be evicted). This method is simple and fast but can lead to many conflicts if multiple frequently accessed memory blocks map to the same cache line.

  • Fully Associative Cache: In a fully associative cache, any block of main memory can be stored in any location in the cache. This is like having a parking lot where cars can park anywhere. This method minimizes conflicts but requires more complex hardware to search the entire cache for a block, which can slow down access times.

  • Set-Associative Cache: This is a compromise between direct-mapped and fully associative caches. The cache is divided into several sets, and each block of main memory maps to a specific set but can be stored in any location within that set. This is like having a parking lot divided into sections, where cars can park anywhere within their designated section. This method balances the speed of direct-mapped caches with the flexibility of fully associative caches.
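To make the parking-lot picture concrete, here is how a simplified set-associative cache might map an address to a set; the 32 KiB / 64-byte / 8-way parameters are purely illustrative, and real index functions can be more elaborate:

#include <cstdint>

// Illustrative parameters: 32 KiB cache, 64-byte lines, 8-way set associative.
constexpr uint64_t kLineSize = 64;
constexpr uint64_t kWays     = 8;
constexpr uint64_t kNumSets  = (32 * 1024) / (kLineSize * kWays);  // 64 sets

// Which set (parking section) a given address maps to; the line can then
// be placed in any of the kWays slots within that set.
uint64_t set_index(uint64_t addr) {
    return (addr / kLineSize) % kNumSets;
}

// The tag is what the cache stores to remember *which* line occupies a slot.
uint64_t tag(uint64_t addr) {
    return addr / (kLineSize * kNumSets);
}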

3.4 Write Policies: Write-Through, Write-Back, Dirty Bits, Lazy Eviction

We have established that reading from RAM is slow. Writing to RAM is just as slow.

If your program executes a loop that increments a counter i++ one million times, and every single increment forces a write to physical RAM, your CPU will spend 99.9% of its time waiting for the memory bus.

To solve this, hardware engineers created two main policies for handling writes: Write-Through (safe but slow) and Write-Back (complex but fast).

Write-Through: In a write-through cache, every time the CPU writes data to the cache, it also immediately writes that data to the main memory. This ensures that the main memory always has the most up-to-date data, which is important for data integrity. However, this approach can be slow because every write operation incurs the latency of writing to main memory.

Write-Back: In a write-back cache, when the CPU writes data to the cache, it does not immediately write that data to main memory. Instead, it marks the cache line as dirty, indicating that it has been modified. The data is only written back to main memory when the cache line is evicted (replaced) or when certain conditions are met (like a flush operation). This approach improves performance by reducing the number of write operations to main memory, but it introduces complexity in managing dirty cache lines and ensuring data consistency.

The Dirty Bit: The dirty bit is a flag associated with each cache line that indicates whether the data in that cache line has been modified (written to) since it was loaded from main memory. If the dirty bit is set, it means the cache line contains data that is different from what is in main memory, and it must be written back to main memory before being evicted.

The Eviction Policy: When the cache is full and a new block of data needs to be loaded, the cache must evict (remove) an existing block to make room. In a write-back cache, if the evicted block is marked as dirty, the cache must first write the modified data back to main memory before loading the new block. This process is known as lazy eviction because the write-back to main memory is deferred until eviction, rather than occurring immediately on every write.
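To tie the dirty bit and lazy eviction together, here is a toy sketch (nothing like real hardware, just the bookkeeping) of a single write-back cache line:

#include <cstdint>
#include <cstdio>

// Toy write-back cache line: one slot, one dirty bit.
struct CacheLine {
    uint64_t tag   = 0;
    bool     valid = false;
    bool     dirty = false;
    int      data  = 0;
};

void write(CacheLine& line, uint64_t tag, int value) {
    if (line.valid && line.tag != tag && line.dirty) {
        // Lazy eviction: only now, when the slot is needed for a different
        // address, do we pay the cost of writing the old data back to RAM.
        std::printf("write-back of tag %llu to RAM\n", (unsigned long long)line.tag);
    }
    line.tag   = tag;
    line.valid = true;
    line.data  = value;
    line.dirty = true;   // modified in cache; RAM is now stale
}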

As you might guess, the write-back policy is generally preferred in modern CPUs due to its performance advantages, despite the added complexity of managing dirty cache lines and ensuring data consistency. This introduces the problem of cache coherence in multi-core systems, which we will discuss next.

3.5 Multi-Core Complexity: MESI Protocol and False Sharing

The problem with lazy eviction (Write-Back) in a multi-core system is coherence. If Core 1 holds a Dirty version of a variable and Core 2 reads that variable from RAM, Core 2 will get a stale value.

To solve this, hardware engineers implemented a Social Contract between cores. They don't just talk to RAM; they talk to each other. The most common standard for this negotiation is the MESI Protocol.

However, this strict protocol has a nasty side effect called False Sharing.

Under MESI, every Cache Line (that 64-byte chunk) has a 2-bit state tag attached to it. These bits tell the core what rights it has over that data.

  • Modified (M): The cache line is dirty (modified) and is the only valid copy. Other caches do not have this data.

  • Exclusive (E): The cache line is clean (not modified) and is the only valid copy. Other caches do not have this data.

  • Shared (S): The cache line is clean and may be present in other caches. Multiple caches can read this data.

  • Invalid (I): The cache line is not valid. It may have been modified by another core or is not present in this cache.

Now that we know the states and what they mean, let's see how two cores interact when accessing shared data.

The Snooping Bus: All cores are connected to a shared communication channel called the snooping bus. Whenever a core wants to read or write data, it broadcasts its intention on this bus. Other cores listen (snoop) to these broadcasts and respond accordingly to maintain coherence.

The Snooping Bus

Simple example:

  • Core 1 wants to write to a variable X. It checks its cache and finds that X is in the Shared (S) state. To modify it, Core 1 must first broadcast an Invalidate message on the snooping bus, telling all other cores to mark their copies of X as Invalid (I). Once all other cores acknowledge the invalidation, Core 1 can change the state of X to Modified (M) and proceed with the write.

Note that in the example we say it checks its cache and finds that X... This is not entirely accurate: it does not find just X, it finds the entire cache line that contains X. This is where False Sharing comes into play, and it is a performance disaster.

False Sharing occurs when two or more cores modify different variables that happen to reside in the same cache line. Even though the variables are independent, the MESI protocol forces the cores to invalidate each other's cache lines because they share the same cache line. This is known widely as Cache Line Ping-Pong.

Cache Line Ping-Pong
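The standard way to see false sharing in action (a rough sketch; the actual slowdown depends entirely on your hardware) is two threads hammering two independent counters. When the counters share a cache line, every increment invalidates the other core's copy; forcing each counter onto its own 64-byte line with alignas makes the contention disappear:

#include <atomic>
#include <thread>

// Two independent counters that happen to share one cache line...
struct Shared {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// ...versus the same counters forced onto separate 64-byte lines.
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename T>
void hammer(T& s) {
    std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) s.a++; });
    std::thread t2([&] { for (int i = 0; i < 10'000'000; ++i) s.b++; });
    t1.join();
    t2.join();
}

int main() {
    Shared shared;   // cores invalidate each other's line on every increment
    Padded padded;   // each core keeps its own line undisturbed
    hammer(shared);
    hammer(padded);  // time the two calls: the padded version is typically much faster
}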

Programmer Takeaways

Here is where the real fun begins. Now that we understand how memory works under the hood, let's discuss some practical takeaways for programmers to optimize their code for better performance.

4.1 Data Placement: Why std::vector Beats std::list

This is the Hello World of memory optimization. It teaches the fundamental rule: Linked Lists are cache poison.

Why: A linked list scatters nodes across the heap (0x1000, 0x8004, 0x200). The CPU cannot predict the next address, breaking the Hardware Prefetcher. You pay the full RAM latency tax for every node.

In contrast, std::vector stores elements contiguously in memory (0x1000, 0x1004, 0x1008). Accessing one element brings the next few into the cache line, leveraging spatial locality and prefetching. This drastically reduces cache misses and improves performance.

Needless to say, prefer std::vector over std::list for performance-critical code unless you have a specific reason to use a linked list (like frequent insertions/deletions in the middle of the list).

Bad Code Example: Using std::list

#include <list>

// Each node lives at an unrelated heap address, so nearly every step
// of the iteration is a potential cache miss.
long long sum_list(const std::list<int>& l) {
    long long sum = 0;
    for (int val : l) sum += val;
    return sum;
}

Good Code Example: Using std::vector

#include <vector>

// Elements are contiguous: a single 64-byte cache line typically holds
// 16 ints, and the hardware prefetcher can stream the rest ahead of the loop.
long long sum_vector(const std::vector<int>& v) {
    long long sum = 0;
    for (int val : v) sum += val;
    return sum;
}

4.2 The Double Indirection Trap: std::vector<std::vector<T>>

Developers often use std::vector<std::vector<int>> for grids. This is a pointer to an array of pointers.

Why: To access grid[i][j], the CPU must fetch grid -> fetch pointer at grid[i] (cache miss 1) -> fetch data at [j] (cache miss 2). Rows are not contiguous in physical memory.

To solve this, we use a clever trick: flatten the 2D structure into a 1D vector.

Bad Code Example: Using std::vector<std::vector<T>>


/*
[Row 0 Data ...] -> 0x1000
[Row 1 Data ...] -> 0x8004
[Row 2 Data ...] -> 0x2000
[Row 3 Data ...] -> 0x4008

grid = [0x1000, 0x8004, 0x2000, 0x4008]
*/


std::vector<std::vector<int>> grid(rows, std::vector<int>(cols));
int value = grid[i][j]; // Double indirection, two cache misses

Good Code Example: Flattening the 2D Structure

#include <utility>
#include <vector>

// Map (i, j) to a single index in the flattened array.
inline int idx(int i, int j, int cols) {
    return i * cols + j;
}
// Recover (i, j) from a flat index, if you ever need to go back.
inline std::pair<int, int> to_2d(int index, int cols) {
    return {index / cols, index % cols};
}

// [Row 0 Data... | Row 1 Data... | Row 2 Data...] (Contiguous)
std::vector<int> grid(rows * cols);
int value = grid[idx(i, j, cols)]; // One indirection instead of two, better cache locality

4.3 The Tetris Game: Struct Packing and Alignment

The compiler aligns data to memory boundaries. If you order your variables poorly, you create holes (padding) in your cache lines.

Why does the compiler add padding? To ensure that data types are aligned to their natural boundaries (e.g., 4-byte integers on 4-byte boundaries). Misaligned accesses can be slower or even cause hardware exceptions on some architectures.

Bad Code Example: Poorly Ordered Struct

struct Bad {
    char a;      // 1 byte
    // 7 bytes padding
    double c;    // 8 bytes
    int b;       // 4 bytes
    // 4 bytes padding
};
// Size: 24 bytes

Good Code Example: Well-Ordered Struct

struct Good {
    double c;    // 8 bytes
    int b;       // 4 bytes
    char a;      // 1 byte
    // 3 bytes padding
};
// Size: 16 bytes (only 3 bytes of tail padding)
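If you want the compiler to keep you honest about layout, you can assert your expectations directly. The sizes below assume the Bad and Good structs above on a typical 64-bit ABI (4-byte int, 8-byte-aligned double), so they may differ on other platforms:

#include <cstddef>

static_assert(sizeof(Bad)  == 24, "expected 7 + 4 bytes of padding");
static_assert(sizeof(Good) == 16, "expected only 3 bytes of tail padding");
static_assert(offsetof(Good, b) == 8, "int b starts right after the double");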

4.4 Spatial Locality: Hot/Cold Data Splitting

Objects often contain data we check frequently (ID, Health) and data we rarely check (Name, Biography).

Why: If a struct is 200 bytes (mostly text strings), each object spans more than three cache lines. Iterating over an array of them fills the cache with Cold text data you aren't reading, flushing out useful data.

What to do: Move rare data to a separate pointer or array.

Bad Code Example: Mixed Hot/Cold Data

struct User {
    int id;              // HOT
    int balance;         // HOT
    char username[128];  // COLD (Pollutes cache)
    char bio[256];       // COLD (Pollutes cache)
};

Good Code Example: Split Hot/Cold Data

struct UserCold {
    char username[128];
    char bio[256];
};
struct UserHot {
    int id;
    int balance;
    UserCold* coldData; // Pointer to the rarely-touched data
};

4.5 Data Oriented Design: AoS (Array of Struct) vs. SoA (Struct of Arrays)

Imagine you are building a game with thousands of entities, each with position and color.

How do you store them?

4.5.1 Array of Structs (AoS)

struct Entity {
    float x, y, z;      // Position
    float r, g, b;      // Color
};

The problem with such design is that when you want to update positions, you load entire cache lines with color data you don't need.

So here comes the alternative:

4.5.2 Struct of Arrays (SoA)

struct Entities {
    std::vector<float> x,y,z;
    std::vector<float> r,g,b;
};

Essentially, you separate data by usage patterns. When updating positions, you only load position arrays into the cache, maximizing cache utilization and minimizing cache misses.
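As a sketch of what that looks like in code (the velocity arrays and the dt time step are hypothetical additions for the example, not part of the struct above):

#include <cstddef>
#include <vector>

struct Entities {
    std::vector<float> x, y, z;   // positions
    std::vector<float> r, g, b;   // colors (never touched below)
};

void update_positions(Entities& e,
                      const std::vector<float>& vx,
                      const std::vector<float>& vy,
                      const std::vector<float>& vz,
                      float dt) {
    for (std::size_t i = 0; i < e.x.size(); ++i) {
        // Each of these streams is contiguous, so cache lines and the
        // prefetcher are used at full efficiency; r/g/b stay cold.
        e.x[i] += vx[i] * dt;
        e.y[i] += vy[i] * dt;
        e.z[i] += vz[i] * dt;
    }
}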

4.6 Hardware Topology: NUMA & Context Switching

This will be covered in more detail in the next post.

Conclusion

Understanding how memory works at a low level is crucial for writing high-performance software. By leveraging knowledge about caches, memory hierarchies, and data locality, programmers can make informed decisions that lead to significant performance improvements.

In this post, we covered the basics of modern memory systems, including the differences between SRAM and DRAM, the structure of CPU caches, and practical programming techniques to optimize memory access patterns. In the upcoming parts of this series, we will dive deeper into virtual memory, NUMA architectures, and advanced optimization strategies.
