Hamza Hasanain

What Every Programmer Should Know About Memory Part 3

Geography Matters: NUMA Support

In the previous article What Every Programmer Should Know About Memory Part 2, we talked about Virtual Memory and how it translates the lies of the OS into physical reality. We covered page tables, the TLB, and how the hardware walks the tree to find your data.

In this article, we continue from where we left off and cover section 5 from the paper What Every Programmer Should Know About Memory by Ulrich Drepper.

Up until now, we've mostly pretended that all RAM is created equal. We assumed that if you have 16GB of RAM, accessing byte 0 is just as fast as accessing byte 15,999,999,999. In the old days of SMP (Symmetric Multi-Processing), this was true. All CPUs connected to a single memory controller via a single bus.

But as core counts exploded, that single bus became a bottleneck. The solution was to split the memory up and give each CPU its own local memory. This created NUMA (Non-Uniform Memory Access).

Table of Contents

  1. UMA vs. NUMA: The Death of Equality
  2. The Cost of Remote Access
  3. OS Policies: The "First Touch" Trap
  4. Tools of the Trade
  5. Conclusion

1. UMA vs. NUMA: The Death of Equality

To understand why modern servers behave the way they do, we need to look at the evolution of memory architectures.

(Figure: UMA vs. NUMA architecture)

1.1 UMA (Uniform Memory Access)

The Old Way: In the days of SMP (Symmetric Multi-Processing), we had a single memory controller and a single system bus. All CPUs connected to this bus.

  • What: "Uniform" means the cost to access RAM is the same for every core. Accessing address 0x0 takes 100ns for Core 0 and 100ns for Core 1.
  • Why it failed: The shared bus became a bottleneck. As we added more cores (2, 4, 8...), they all fought for the same bandwidth. It was like having 64 cars trying to use a single-lane highway.

1.2 NUMA (Non-Uniform Memory Access)

The New Way: To solve the bottleneck, hardware architects split the memory up.

  • What: Instead of one giant bank of RAM, we attach a dedicated chunk of RAM to each processor socket. Each Processor + its Local RAM is called a NUMA Node.
  • How: The nodes are connected by a high-speed interconnect (like Intel UPI or AMD Infinity Fabric). If CPU 0 needs data from CPU 1's memory, it asks CPU 1 to fetch it and ship it over the wire.

This architecture solves the bandwidth problem (multiple highways!) but introduces a new problem: Physics.

2. The Cost of Remote Access

Now that memory is physically distributed, distance matters.

(Figure: NUMA local vs. remote access)

If a CPU on Node 0 needs data located in Node 0's RAM, the path is short and fast.
If a CPU on Node 0 needs data located in Node 1's RAM, the request must travel over the interconnect to Node 1, wait for Node 1's memory controller to fetch it, and ship it back.

2.1 The Latency Penalty

We often measure this cost as a "latency factor."

  • Local Access: 1.0 (baseline)
  • Remote Access: 1.5x to 2.0x slower

That means every cache miss that lands in remote memory can cost up to twice as much as a local miss. In high-performance computing (HPC) or low-latency trading, that is a disaster.
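
To make the numbers concrete on your own machine, here is a rough micro-benchmark sketch (my illustration, not code from the paper): it pins the measuring thread to Node 0 with libnuma, builds one pointer-chasing buffer in Node 0's RAM and one in Node 1's RAM, and times both. It assumes at least two NUMA nodes and an installed libnuma (link with -lnuma); the file name, buffer size, and iteration counts are arbitrary, and error handling is omitted for brevity.

// numa_latency.cpp -- rough local vs. remote latency probe (illustrative only)
// Build: g++ -O2 -std=c++17 numa_latency.cpp -lnuma
#include <numa.h>
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Pointer-chase through a randomly permuted cycle so (almost) every access misses cache.
static double ns_per_access(const size_t* chain, size_t steps) {
    size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < steps; ++i) idx = chain[idx];
    auto t1 = std::chrono::steady_clock::now();
    volatile size_t sink = idx; (void)sink;   // keep the loop from being optimized away
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

// Allocate the chain on a specific node; numa_alloc_onnode pins the pages there
// no matter which thread touches them first.
static size_t* make_chain(size_t n, int node) {
    auto* chain = static_cast<size_t*>(numa_alloc_onnode(n * sizeof(size_t), node));
    std::vector<size_t> order(n);
    for (size_t i = 0; i < n; ++i) order[i] = i;
    std::shuffle(order.begin() + 1, order.end(), std::mt19937_64{42});
    for (size_t i = 0; i + 1 < n; ++i) chain[order[i]] = order[i + 1];
    chain[order[n - 1]] = order[0];
    return chain;
}

int main() {
    if (numa_available() == -1 || numa_max_node() < 1) return 1;   // need >= 2 nodes
    numa_run_on_node(0);                          // keep this thread on Node 0's cores
    const size_t n = size_t{1} << 25;             // 256 MB per chain, far larger than cache
    size_t* local  = make_chain(n, 0);
    size_t* remote = make_chain(n, 1);
    std::printf("local : %.1f ns/access\n", ns_per_access(local,  20'000'000));
    std::printf("remote: %.1f ns/access\n", ns_per_access(remote, 20'000'000));
    numa_free(local,  n * sizeof(size_t));
    numa_free(remote, n * sizeof(size_t));
}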

2.2 Bandwidth Saturation: The Clogged Pipe

It's not just about speed; it's about capacity. The interconnect between sockets has a limited bandwidth.

If you write a program where all threads on all 64 cores are aggressively reading from Node 0's memory, you create a traffic jam. The local cores on Node 0 might get their data fine, but the remote cores on other nodes will see massive stalls as they fight for space on the interconnect.

3. OS Policies: The "First Touch" Trap

So how does the OS decide where to put your memory? If you malloc(1GB), does it go to Node 0 or Node 1?

Linux uses a policy called First-Touch Allocation.

3.1 How Linux Allocates Memory

When you call malloc(1GB), the kernel doesn't actually give you physical RAM. It gives you a promise (Virtual Memory).
The physical RAM is allocated only when you write to that page for the first time. This is called a Page Fault.

At that exact moment, the kernel looks at which CPU triggered the page fault. It says, "Ah, you are running on CPU 5, which belongs to Node 0. I will allocate this physical page from Node 0's RAM to make it fast for you."
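
You can watch this happen. The sketch below is my own illustration (not from the paper): it makes a large allocation, touches one byte, and then uses the Linux move_pages(2) call in query mode (a null node list) to ask the kernel which node the freshly faulted page landed on. It assumes glibc (which backs large mallocs with fresh mmap'd pages) and libnuma's headers; link with -lnuma.

// first_touch.cpp -- Build: g++ -O2 first_touch.cpp -lnuma   (Linux only)
#include <numaif.h>    // move_pages(2)
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

int main() {
    const uintptr_t page = sysconf(_SC_PAGESIZE);

    // A large malloc is just a promise: glibc mmaps fresh, untouched pages.
    char* buf = static_cast<char*>(std::malloc(64 * 1024 * 1024));

    buf[0] = 1;   // first touch -> page fault -> physical page allocated
                  // on the node of the CPU this thread is running on

    // Round down to the page boundary and ask the kernel where that page lives.
    void* pages[1]  = { reinterpret_cast<void*>(
                          reinterpret_cast<uintptr_t>(buf) & ~(page - 1)) };
    int   status[1] = { -1 };
    // A null node list means: don't move anything, just report the node.
    if (move_pages(0, 1, pages, nullptr, status, 0) == 0)
        std::printf("first-touched page is on NUMA node %d\n", status[0]);

    std::free(buf);
}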

This is normally good, but it leads to a deadly trap.

3.2 The Trap: Main Thread Initialization

This policy leads to one of the most common performance bugs in high-performance applications.

The Scenario:

  1. You start your program. The Main Thread (running on Node 0) allocates a huge array and initializes it to zero (memset).
  2. Because the Main Thread touched all the pages, the OS dutifully allocates 100% of the RAM on Node 0.
  3. You spawn 64 worker threads (spread across Node 0, 1, 2, 3) to process the data in parallel.

(Figure: The first-touch trap)

The Result:

  • Threads on Node 0 are happy (Local access).
  • Threads on Node 1, 2, 3 are miserable. They are all being forced to fetch data remotely from Node 0.
  • The interconnect to Node 0 becomes saturated.
  • Performance scales poorly, and you wonder why adding more cores made it slower.

The Fix:
Parallel Initialization. Don't let the main thread memset everything. Have your worker threads initialize the specific chunks of data they will be working on. This ensures the physical memory pages are allocated on the local nodes where the workers live.
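
As a sketch of what that can look like (my illustration, not the paper's code), the OpenMP version below uses the same schedule(static) for the initialization loop and the compute loop, so every thread first-touches exactly the pages it will later read. It assumes OpenMP (-fopenmp) and pinned threads (e.g. OMP_PROC_BIND=close OMP_PLACES=cores) so workers don't migrate between nodes; with plain std::thread the idea is identical: give each worker its own chunk to zero and to process.

// parallel_init.cpp -- Build: g++ -O2 -fopenmp parallel_init.cpp
// Run:   OMP_PROC_BIND=close OMP_PLACES=cores ./a.out
#include <omp.h>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t n = std::size_t{1} << 26;   // 512 MB of doubles
    double* a = new double[n];                    // virtual memory only; nothing touched yet

    // BAD: a single memset / std::fill here would first-touch every page
    // from the main thread and pile all 512 MB onto its node.

    // GOOD: initialize with the same static schedule the compute loop uses,
    // so each worker thread faults in its own chunk on its own node.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        a[i] = 0.0;

    // Compute pass: identical schedule, so every thread's data stays local.
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];

    std::printf("sum = %f\n", sum);
    delete[] a;
}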

3.3 The "Spillover" Behavior (Zone Reclaim)

What happens if Node 0 is full? By default (zone_reclaim_mode = 0), if a thread on Node 0 requests memory and Node 0 is full, Linux will allocate from Node 1 rather than fail or aggressively reclaim local pages.

This creates unpredictable latency spikes. Your application runs fast for the first 30 minutes, fills up Node 0, and suddenly slows down by 50% because new allocations are silently spilling over to Node 1. Watching the numa_miss counters (via the numastat tool or /sys/devices/system/node/node*/numastat) is how you catch this.
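
For example (numastat ships with the numactl package; node0 here is just the node you suspect is filling up):

# Per-node counters; a steadily growing numa_miss means allocations are
# spilling onto a node the faulting CPU did not ask for
numastat
cat /sys/devices/system/node/node0/numastat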

4. Tools of the Trade

How do you know if you are running on a NUMA machine?

4.1 Analyzing with lscpu

Open your terminal and type lscpu. It reveals the truth about your hardware.

$ lscpu
...
NUMA node(s):          2
NUMA node0 CPU(s):     0-31
NUMA node1 CPU(s):     32-63
  • NUMA node(s): 2 -> You have 2 distinct memory banks.
  • NUMA node0 CPU(s): 0-31 -> If you run a thread on Core 5, its local memory is Node 0. If it accesses Node 1, it pays the penalty.

4.2 The Distance Matrix (numactl)

To see exactly how "remote" a node is, use numactl --hardware. The "node distances" table at the bottom is key:

node distances:
node   0   1
  0:  10  21
  1:  21  10

(Figure: Node distance map)

  • 10: Represents local access (the baseline cost).
  • 21: Represents the cost to cross the interconnect.

If you saw a value like 30 or 40, that would imply an even longer path (for example, a node that is two hops away in a 4-socket server).
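
If your program needs the same topology information at runtime instead of from the shell, libnuma exposes it. A minimal sketch (assuming libnuma is installed; link with -lnuma):

// numa_topo.cpp -- Build: g++ -O2 numa_topo.cpp -lnuma
#include <numa.h>
#include <cstdio>

int main() {
    if (numa_available() == -1) {   // no NUMA support on this kernel/machine
        std::puts("NUMA not available");
        return 0;
    }
    std::printf("configured nodes : %d\n", numa_num_configured_nodes());
    std::printf("core 5 lives on  : node %d\n", numa_node_of_cpu(5));
    // Same scale as the `numactl --hardware` table: 10 = local
    std::printf("distance(0, 1)   : %d\n", numa_distance(0, 1));
}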

4.3 Controlling Policy with numactl

You can override the default OS behavior using numactl.

Interleaving:
If you have a read-only lookup table that every thread accesses randomly, "First Touch" is bad (it unfairly burdens one node). Instead, you can force the OS to spread the pages round-robin across all nodes.

# Interleave memory allocation across all nodes
numactl --interleave=all ./my_application

Binding:
You can also strictly bind a process to a specific node, ensuring it never inadvertently runs on a remote core or allocates remote memory.

# Run only on Node 0's CPUs, allocate only from Node 0's RAM
numactl --cpunodebind=0 --membind=0 ./my_application

4.4 Programming with libnuma

Sometimes you can't control how the user runs your binary. You can enforce memory policy directly in C++ using libnuma:

#include <numa.h>

// Always check for NUMA support first (returns -1 on non-NUMA systems)
if (numa_available() == -1) { /* fall back to plain malloc */ }

// Allocate 10MB specifically on Node 0
void* data = numa_alloc_onnode(10 * 1024 * 1024, 0);

// Or run this thread only on Node 0's CPUs
numa_run_on_node(0);

// Memory from numa_alloc_* must be released with numa_free, not free()
numa_free(data, 10 * 1024 * 1024);

Note: This requires linking with -lnuma.
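
libnuma also has a programmatic counterpart to the --interleave flag from section 4.3, handy when only one structure (say, that shared lookup table) should be spread across nodes while the rest of the process stays node-local. A minimal sketch with an arbitrary size:

// Spread just this allocation round-robin across all nodes, instead of
// interleaving the whole process with `numactl --interleave=all`.
const size_t bytes = 256ull * 1024 * 1024;
void* table = numa_alloc_interleaved(bytes);

// ... all threads can now read the table with roughly even interconnect load ...

numa_free(table, bytes);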

5. Conclusion

Ignoring NUMA is ignoring the laws of physics in your server. As programmers, we can't change the hardware, but we can change how we behave on it.

By respecting concepts like First-Touch, understanding the Interconnect Penalty, and pinning our threads appropriately, we can stop fighting the hardware and start working with it.

In the next and final part, we will cover Section 6: What Programmers Can Do. This will be a massive deep dive into cache blocking, data layout (SoA vs AoS), and the infamous False Sharing effect.
