What Programmers Can Do: Writing Hardware-Sympathetic Code
In the previous article, we learned that memory geography matters. Now, we arrive at the finale—the most actionable part of Ulrich Drepper's paper: Section 6.
This is not about choosing a better algorithm (O(n) vs O(log n)). This is about writing code that respects how the hardware physically works. We will cover Cache Bypassing, TLB Optimization, Concurrency Pitfalls, and Code Layout.
Table of Contents
- Subsection A: Cache Optimization
- Subsection B: Virtual Memory & the TLB
- Subsection C: Data & Code Layout
- Subsection D: Concurrency & NUMA
- Subsection E: Prefetching
- Tools for Performance Engineers
- Quick Cheat Sheet
- Conclusion
Subsection A: Cache Optimization
The most significant performance cliff in modern computing is missing the L1 Cache. Accessing L1 takes ~4 cycles. Accessing RAM takes ~200+ cycles. Your goal is to keep data in L1 as long as possible (Temporal Locality) and use every byte you load (Spatial Locality).
1.1 Data Placement: std::vector Beats std::list
This is the Hello World of memory optimization. It teaches the fundamental rule: Linked Lists are cache poison.
Why: A linked list scatters nodes across the heap (0x1000, 0x8004, 0x200). The CPU cannot predict the next address, breaking the Hardware Prefetcher. You pay the full RAM latency tax for every node.
In contrast, std::vector stores elements contiguously in memory (0x1000, 0x1004, 0x1008). Accessing one element brings the next few into the cache line, leveraging spatial locality and prefetching. This drastically reduces cache misses and improves performance.
Bad Code Example: Using std::list
long long sum_list(const std::list<int>& l) {
long long sum = 0;
for (int val : l) sum += val;
return sum;
}
Good Code Example: Using std::vector
long long sum_vector(const std::vector<int>& v) {
long long sum = 0;
for (int val : v) sum += val;
return sum;
}
1.2 The Double Indirection Trap: std::vector<std::vector<T>>
Developers often use std::vector<std::vector<int>> for grids. This is a pointer to an array of pointers.
Why: To access grid[i][j], the CPU must fetch grid -> fetch pointer at grid[i] (cache miss 1) -> fetch data at [j] (cache miss 2). Rows are not contiguous in physical memory.
To solve this, we use a clever trick: flatten the 2D structure into a 1D vector.
Bad Code Example: Using std::vector<std::vector<T>>
std::vector<std::vector<int>> grid(rows, std::vector<int>(cols));
int value = grid[i][j]; // Double indirection, two cache misses
Good Code Example: Flattening the 2D Structure
inline int idx(int i, int j, int cols) {
return i * cols + j;
}
// [Row 1 Data... | Row 2 Data... | Row 3 Data...] (Contiguous)
std::vector<int> grid(rows * cols);
int value = grid[idx(i, j, cols)]; // Single access, better cache locality
1.3 Bypassing the Cache (Non-Temporal Stores)
The Hidden Cost of Writing:
Normally, when you write to memory (e.g., data[i] = 0), the CPU must ensure cache coherency. Since it operates on whole 64-byte cache lines, it must first perform a Read-For-Ownership (RFO): it fetches the existing 64 bytes from RAM into L1, modifies the 4 bytes you changed, and marks the line as "Modified".
The Problem (Cache Pollution):
If you are initializing a massive array (e.g., memset of 1GB), the CPU will:
- Read 1GB of old data from RAM (wasting bandwidth).
- Fill almost the entire L1/L2/L3 cache with this zeroed data.
- Evict your application's hot data (code, stack, other variables) to make room.
This is called Cache Pollution, and it destroys performance for code running immediately after the write.
The Solution: Non-Temporal Stores (Streaming Stores)
You can instruct the CPU to use a Write-Combining Buffer (WCB) instead of the cache. You tell the CPU: "I promise I will overwrite this entire line. Don't read it. Do not pollute the cache with it. Just write it to RAM."
Code Example (Intel Intrinsics):
#include <immintrin.h>
void stream_memset(int* data, int size, int value) {
// 1. Create a 128-bit vector filled with 'value' (4 integers)
__m128i v = _mm_set1_epi32(value);
// Note: ensure 'size' is a multiple of 4 integers (16 bytes)
for (int i = 0; i < size; i += 4) {
// 2. The Streaming Store (The Magic)
// Writes to 16-byte aligned memory, bypassing L1/L2.
// It tells the CPU to NOT fetch the old data (No Read-For-Ownership).
_mm_stream_si128((__m128i*)&data[i], v);
}
// 3. The Fence
// Streaming stores are "weakly ordered". This instruction
// Forces all Write-Combining Buffers to flush to RAM immediately.
_mm_sfence();
}
The Constraint: Memory Alignment
The specific intrinsic _mm_stream_si128 physically requires the memory address to be 16-byte aligned (divisible by 16).
- If you access address 0x1000, it works (divisible by 16).
- If you access address 0x1004, it crashes (segfault).
Using standard new or malloc does not guarantee this alignment. You must use specific allocators:
1. Modern C++ (C++17):
#include <cstdlib>
// std::aligned_alloc(alignment, size)
int* data = (int*)std::aligned_alloc(16, 1000 * sizeof(int));
std::free(data);
2. The "Intel" Way (Intrinsics):
#include <immintrin.h>
int* data = (int*)_mm_malloc(1000 * sizeof(int), 16);
_mm_free(data); // Must use _mm_free matching _mm_malloc
3. The POSIX Way (Linux/Unix):
#include <cstdlib>
void* ptr;
if (posix_memalign(&ptr, 16, 1000 * sizeof(int)) == 0) {
int* data = (int*)ptr;
free(data);
}
For AVX/AVX2, use _mm256_stream_si256 which requires 32-byte alignment.
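If your target supports AVX, an analogous sketch looks like the following (a hypothetical variant of stream_memset: data must be 32-byte aligned, size a multiple of 8 ints, and the file compiled with -mavx):
#include <immintrin.h>

// Hypothetical AVX version: 32-byte streaming stores, no RFO, no cache pollution.
void stream_memset_avx(int* data, int size, int value) {
    __m256i v = _mm256_set1_epi32(value);            // 8 copies of 'value'
    for (int i = 0; i < size; i += 8) {
        _mm256_stream_si256((__m256i*)&data[i], v);  // requires 32-byte alignment
    }
    _mm_sfence();  // order the weakly-ordered streaming stores before continuing
}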
1.4 Access Patterns & Blocking (Tiling)
Hardware prefetchers are good at linear access (Row-Major), but they fail when access patterns are strided (Column-Major) or random.
Row-Major vs Column-Major:
// Fast: Row-major access (Sequential)
// All on the same page/cache line.
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++) {
sum += matrix[i][j];
}
}
// Slow: Column-major access (Strided)
// High Cache miss rate & TLB miss rate!
for (int j = 0; j < N; j++) {
for (int i = 0; i < N; i++) {
sum += matrix[i][j];
}
}
The Fix: Blocking (Loop Tiling)
Divide the problem into small sub-problems that fit entirely inside the L1 Cache.
Choosing the Block Size (B):
For matrix multiplication with square B x B tiles, you want the combined working set of the three tiles (one each from A, B, and the result), 3 * B^2 * sizeof(element), to fit in L1.
B ≈ sqrt( L1_Size / (3 * Element_Size) )
Example: L1 = 32KB, float = 4B -> B ≈ sqrt(32768 / 12) ≈ 52. Choose B=48 or B=32 for alignment.
The Algorithm:
- Load a small
B x Bblock of A and B into L1. - Compute all possible results for that block.
- Only move to the next block when finished.
This maximizes Temporal Locality (reuse). The data goes into L1 and stays there.
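To make this concrete, here is a minimal sketch of blocked (tiled) matrix multiplication on flat, row-major vectors; the dimension N, the block size BLK, and the idx() helper from earlier are assumptions for illustration, and the caller is expected to zero-initialize C:
#include <algorithm>
#include <vector>

constexpr int BLK = 32; // block edge; ~3 * BLK^2 * sizeof(float) should fit in L1

inline int idx(int i, int j, int n) { return i * n + j; }

// C += A * B, all N x N, row-major, stored as flat vectors.
void matmul_blocked(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int N) {
    for (int ii = 0; ii < N; ii += BLK)
        for (int kk = 0; kk < N; kk += BLK)
            for (int jj = 0; jj < N; jj += BLK)
                // All work stays inside three BLK x BLK tiles that fit in L1.
                for (int i = ii; i < std::min(ii + BLK, N); ++i)
                    for (int k = kk; k < std::min(kk + BLK, N); ++k) {
                        float a_ik = A[idx(i, k, N)];
                        for (int j = jj; j < std::min(jj + BLK, N); ++j)
                            C[idx(i, j, N)] += a_ik * B[idx(k, j, N)];
                    }
}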
Subsection B: Virtual Memory & the TLB
This topic is critical yet routinely ignored by developers. Every time your code touches a virtual address, the CPU must translate it to a physical address using the TLB (Translation Lookaside Buffer).
2.1 The High Cost of Translation
The TLB is a tiny cache for virtual-to-physical address translations. It typically has distinct levels (L1/L2) with entry counts in the dozens to hundreds (e.g., 64 L1 entries, 512 L2 entries).
Standard memory pages are 4KB. If you access 2GB of memory sequentially, you need 524,288 page table entries. Your TLB will thrash constantly.
2.2 The Solution: Huge Pages
Modern CPUs support Huge Pages (e.g., 2MB or 1GB).
Using 2MB pages for that same 2GB array reduces entries to just 1,024. The entire mapping can now fit in the L2 TLB.
Enabling Huge Pages (Linux):
# Allocate 512 hugepages of 2MB each (Total 1GB)
sysctl -w vm.nr_hugepages=512
# Verify
grep Huge /proc/meminfo
Code Example (Using mmap):
#include <sys/mman.h>
// Request a 2MB Huge Page explicitly
void* huge_data = mmap(NULL, 2 * 1024 * 1024,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
-1, 0);
if (huge_data == MAP_FAILED) {
// Fallback (or check if user has privileges/OS support)
}
Note: Linux also supports Transparent Huge Pages (THP), a feature that tries to use huge pages automatically. However, explicit mmap or madvise gives you deterministic control, as sketched below.
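Here is a hedged sketch of the madvise route: map anonymous memory normally, then ask the kernel to back the range with transparent huge pages (the function name is a placeholder, and the kernel is free to ignore the hint):
#include <sys/mman.h>
#include <cstddef>

void* alloc_with_thp_hint(std::size_t size) {
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    // Best effort: asks the kernel to promote this range to 2MB pages.
    madvise(p, size, MADV_HUGEPAGE);
    return p;
}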
Subsection C: Data & Code Layout
3.1 The Tetris Game: Struct Packing
The compiler aligns data to memory boundaries. If you order your variables poorly, you create holes (padding) in your cache lines.
Why does the compiler add padding? To ensure that data types are aligned to their natural boundaries (e.g., 4-byte integers on 4-byte boundaries).
Bad Code Example: Poorly Ordered Struct
struct Bad {
char a; // 1 byte
// 7 bytes padding
double c; // 8 bytes
int b; // 4 bytes
// 4 bytes padding
};
// Size: 24 bytes
Good Code Example: Well-Ordered Struct
struct Good {
double c; // 8 bytes
int b; // 4 bytes
char a; // 1 byte
// 3 bytes padding
};
// Size: 16 bytes (only 3 bytes of tail padding)
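On a typical 64-bit target (8-byte double, 4-byte int, 1-byte char), you can lock these sizes in with a compile-time check so a later reordering doesn't silently reintroduce padding:
static_assert(sizeof(Bad) == 24, "Bad has unexpected padding");
static_assert(sizeof(Good) == 16, "Good has unexpected padding");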
3.2 Hot/Cold Data Splitting
Objects often contain data we check frequently (ID, Health) and data we rarely check (Name, Biography).
Why: If a struct is 200 bytes (mostly text strings), each object spans more than three 64-byte cache lines. Iterating over the hot fields drags all that Cold text data through the cache, flushing out useful data.
What to do: Move rare data to a separate pointer or array.
Bad Code Example: Mixed Hot/Cold Data
struct User {
int id; // HOT
int balance; // HOT
char username[128]; // COLD (Pollutes cache)
};
Good Code Example: Split Hot/Cold Data
struct UserCold; // Forward declaration (definition below)
struct UserHot {
int id;
int balance;
UserCold* coldData; // Pointer to cold data
};
struct UserCold {
char username[128];
};
3.3 Struct of Arrays (SoA) vs Array of Structs (AoS)
This is a classic battle in Game Development and Data-Oriented Design.
Array of Structs (AoS) - The OOP Way:
struct Point {
int x, y, z;
};
Point points[1000];
This is good if you always access x, y, and z together. But often, you loop over just x to do a physics calculation.
The cost: Every time you load points[i].x, you also load y and z into the cache line, wasting 66% of your bandwidth.
Struct of Arrays (SoA) - The Data-Oriented Way:
struct Points {
int x[1000];
int y[1000];
int z[1000];
};
Now, x values are packed contiguously. One cache line load brings in 16 x values at once. This is also perfect for SIMD (Single Instruction Multiple Data) auto-vectorization.
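To see the pay-off, here is a minimal sketch using the Point/Points types above (the element count n is a hypothetical parameter); the SoA loop reads only densely packed x values and is a natural target for auto-vectorization:
// AoS: each 64-byte line also drags in unused y and z values.
long long sum_x_aos(const Point* pts, int n) {
    long long sum = 0;
    for (int i = 0; i < n; ++i) sum += pts[i].x;
    return sum;
}

// SoA: every byte loaded is an x value; 16 ints per cache line.
long long sum_x_soa(const Points& p, int n) {
    long long sum = 0;
    for (int i = 0; i < n; ++i) sum += p.x[i];
    return sum;
}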
3.4 Alignment Matters
CPUs love boundaries. Ideally, your data structures should start at addresses divisible by 64 (cache line size).
C++ Solution:
struct alignas(64) AlignedData {
int critical_value;
// ...
};
3.5 Instruction Cache & Branch Prediction
It's not just data that gets cached—instructions do too (L1i Cache). If your code jumps around unpredictably, the CPU pipeline stalls.
Branch Hints:
Modern CPUs have powerful dynamic branch predictors that often figure out patterns better than you can. However, for static branches (like error checking), you can give the compiler a hint to move cold code away from hot code.
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
void process_transaction(Transaction* t) {
// "Cold" path: Compiler moves this assembly block to the end of the function
if (unlikely(t == nullptr)) {
handle_error();
return;
}
// "Hot" path: Continues immediately in memory, keeping L1i efficient
do_math(t);
}
Subsection D: Concurrency & NUMA
4.1 The Silent Killer: False Sharing
This is the most insidious performance bug in multithreading.
Two threads on different cores modify variables that happen to sit on the same 64-byte cache line. The cache coherence protocol (MESI) forces that line to bounce back and forth between the cores' caches ("ping-ponging"), so writes that should be independent end up serialized and slow.
The Fix (Padding):
Align critical shared data to 64 bytes to ensure it lives on its own island.
struct PaddedCounter {
alignas(64) std::atomic<int> value;
// Padding is implicit due to alignas, but explicit padding
// can also be used: char pad[60];
};
PaddedCounter counters[NUM_THREADS]; // Each counter is now on a separate line
Result: eliminating the ping-ponging can speed up heavily contended write workloads by an order of magnitude or more.
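A minimal usage sketch, assuming the PaddedCounter type above and hypothetical thread/iteration counts; each thread hammers only its own slot, and thanks to the 64-byte alignment no two slots share a cache line:
#include <atomic>
#include <thread>
#include <vector>

struct PaddedCounter {                      // same layout as above
    alignas(64) std::atomic<int> value{0};  // C++17: std::vector honors alignas(64)
};

void count_in_parallel(int num_threads, int iterations) {
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, iterations] {
            for (int i = 0; i < iterations; ++i)
                // Each thread touches only its own line: no MESI ping-pong.
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
}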
4.2 Thread Affinity (Pinning)
In a NUMA system, memory is local to a specific CPU socket. If the OS scheduler moves your thread to a different socket, it must access memory remotely (high latency).
The Solution: Pin the thread to a specific core (or socket).
#include <pthread.h>
void pin_thread_to_core(int core_id) {
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);
pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}
Tooling: Use numactl to bind processes: numactl --physcpubind=0-3 --membind=0 ./myapp.
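As a hedged sketch of how these pieces combine, the hypothetical start_workers below pins each worker thread before it allocates, so Linux's first-touch policy places that memory on the worker's local NUMA node (pin_thread_to_core is the helper defined above):
#include <thread>
#include <vector>

void start_workers(int num_cores) {
    std::vector<std::thread> workers;
    for (int core = 0; core < num_cores; ++core) {
        workers.emplace_back([core] {
            pin_thread_to_core(core);            // pin first...
            std::vector<double> local(1 << 20);  // ...then allocate & touch: pages land locally
            // ... do NUMA-local work on 'local' ...
        });
    }
    for (auto& w : workers) w.join();
}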
Subsection E: Prefetching
5.1 Helping the Hardware
Hardware prefetchers are great at standard patterns (i++), but they struggle with pointer lookups (p = p->next).
Software Prefetching:
You can issue a non-blocking instruction to fetch a line into L1 before you need it. Use __builtin_prefetch(addr, rw, locality).
while (node) {
// locality: 3 = heavy reuse (L1), 0 = no reuse (streaming)
// rw: 0 = read, 1 = write
__builtin_prefetch(node->next, 0, 3);
do_heavy_work(node->value);
// By the time work is done, node->next is hopefully in L1.
node = node->next;
}
Warning: Tuning this is hard. Prefetch too early, and you evict useful data. Prefetch too late, and it hasn't arrived. Measure everything.
Tools for Performance Engineers
Don't guess—measure.
- perf (Linux): The gold standard.
  - perf stat -e cycles,cache-misses,instructions ./app: Check IPC and miss rates.
  - perf record -g ./app followed by perf report: Find exactly where cache misses happen.
- valgrind (Cachegrind): valgrind --tool=cachegrind ./app. Slow, but gives deterministic cache simulation.
- lscpu / hwloc: View your topology (L1 sizes, NUMA nodes).
Quick Cheat Sheet
| Mechanic | Do ... | Don't ... |
|---|---|---|
| Containers | Prefer std::vector (Contiguous). | Use std::list (Linked Lists are cache poison). |
| Indirection | Flatten 2D arrays to 1D vectors. | Use vector<vector<T>> (Double Indirection). |
| Struct Packing | Order members: Largest to Smallest. | Order randomly (creates padding/holes). |
| Hot/Cold Data | Split rare fields into separate structs. | Pollute cache lines with unused data strings. |
| Data Layout | Use Struct of Arrays (SoA) for bulk processing. | Use Array of Structs (AoS) for everything. |
| Alignment | Align structs/arrays to 64B. | Use unaligned addresses for SIMD/Streaming. |
| Concurrency | Pad atomic counters to 64B. | Let threads fight over the same cache line. |
| Huge Pages | Use 2MB pages for >100MB arrays. | Rely on 4KB pages for massive working sets. |
Conclusion
I hope this overview of Drepper's work helps you write code that the hardware loves. Happy Coding!