Let me break down two fundamental concepts that are crucial for writing high-performance code: contiguous memory and cache locality.
Contiguous Memory
Contiguous memory means data is stored in consecutive memory addresses, with no gaps between elements.
Example:
Array: [10, 20, 30, 40]
Memory addresses: 0x1000, 0x1004, 0x1008, 0x100C
Each integer takes 4 bytes, and they're placed right next to each other in memory. This is contiguous storage.
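You can see this layout directly by printing element addresses. A minimal C sketch (the exact addresses will differ on your machine, but each should be sizeof(int) bytes after the previous one):

#include <stdio.h>

int main(void) {
    int array[4] = {10, 20, 30, 40};
    // Consecutive elements sit at consecutive addresses, 4 bytes apart
    for (int i = 0; i < 4; i++) {
        printf("array[%d] = %d at %p\n", i, array[i], (void*)&array[i]);
    }
    return 0;
}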
Contiguous vs Non-Contiguous Structures
Contiguous structures:
- Arrays
- Vectors (dynamic arrays)
- Strings (typically)
Non-contiguous structures:
- Linked lists (nodes scattered across memory)
- Hash tables (with separate chaining)
- Trees (nodes allocated separately)
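To make the contrast concrete, here is a minimal C sketch of both layouts (the Node type and push helper are just illustrative names): the array is a single block, while each list node lives wherever malloc happens to place it.

#include <stdlib.h>

// Contiguous: 1000 ints in one block; element i+1 directly follows element i
int arr[1000];

// Non-contiguous: each node is a separate allocation somewhere on the heap
struct Node {
    int data;
    struct Node* next; // may point to a distant address
};

struct Node* push(struct Node* head, int value) {
    struct Node* node = malloc(sizeof *node);
    if (node == NULL) return head; // allocation failed; keep the old list
    node->data = value;
    node->next = head;
    return node;
}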
Why Contiguous Memory Matters
When you access array[0], then array[1], then array[2], you're accessing consecutive memory locations. Modern CPUs are optimized for this pattern, which brings us to cache locality.
Cache Locality
Cache locality refers to how effectively your program uses the CPU cache by accessing data that's "nearby" in memory.
The Memory Hierarchy
Your computer has a memory hierarchy with vastly different speeds:
- CPU Registers: ~0.5 ns (fastest, tiny storage)
- L1 Cache: ~1 ns (per core, ~32-64 KB)
- L2 Cache: ~3-4 ns (per core, ~256-512 KB)
- L3 Cache: ~10-20 ns (shared, ~8-32 MB)
- Main RAM: ~100 ns (slowest, GBs of storage)
The speed difference between L1 cache and RAM is roughly 100x. This is enormous.
How Caches Work
Caches don't fetch individual bytes. They fetch entire cache lines (typically 64 bytes). When you access one byte, the CPU automatically loads the surrounding 63 bytes into cache, betting you'll need them soon.
Example: If you access array[0] (a 4-byte int), the CPU loads array[0] through array[15] (64 bytes total) into a cache line, assuming the array starts on a cache line boundary.
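You can observe this with a strided-sum microbenchmark. A minimal sketch (the array size, strides, and the strided_sum name are arbitrary choices here): going from stride 1 to stride 16 cuts the number of additions 16x, but on typical hardware the runtime falls far less, because every access up to stride 16 still pulls in a fresh 64-byte line.

#include <stdio.h>
#include <time.h>

#define N (1 << 22) // 16 MB of ints, far larger than any cache
static int data[N];

// Sum every stride-th element; up to stride 16 (16 ints * 4 bytes = 64 bytes),
// each access still touches a new cache line
long strided_sum(int stride) {
    long sum = 0;
    for (int i = 0; i < N; i += stride)
        sum += data[i];
    return sum;
}

int main(void) {
    for (int stride = 1; stride <= 32; stride *= 2) {
        clock_t t0 = clock();
        long s = strided_sum(stride);
        double ms = 1000.0 * (clock() - t0) / CLOCKS_PER_SEC;
        printf("stride %2d: %6.1f ms (sum=%ld)\n", stride, ms, s);
    }
    return 0;
}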
Types of Cache Locality
1. Spatial Locality (Sequential Access)
Accessing memory locations that are close to each other.
Good spatial locality:
// Sequential array traversal
for (int i = 0; i < n; i++) {
    sum += array[i]; // Accessing consecutive elements
}
Poor spatial locality:
// Linked list traversal
Node* current = head;
while (current != NULL) {
    sum += current->data; // Nodes scattered in memory
    current = current->next;
}
2. Temporal Locality (Reuse)
Accessing the same memory locations repeatedly within a short time.
Good temporal locality:
int temp = array[i]; // Load once
result = temp * temp + temp * 2; // Reuse multiple times
Poor temporal locality:
// Processing array once, then never again
processOnce(array);
// Hours later...
processAgain(array); // Likely evicted from cache
Real-World Performance Impact
Example 1: Matrix Traversal
Matrices are stored in row-major order in C and C++ (and in NumPy by default), meaning each row is contiguous in memory.
Fast (row-wise, good cache locality):
// Traverse row-by-row
for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
        sum += matrix[i][j]; // Sequential memory access
    }
}
Slow (column-wise, poor cache locality):
// Traverse column-by-column
for (int j = 0; j < cols; j++) {
    for (int i = 0; i < rows; i++) {
        sum += matrix[i][j]; // Jumping across rows
    }
}
For a 1000×1000 matrix, the row-wise version can easily be 10x faster than the column-wise one (sometimes far more), purely due to cache effects.
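A minimal self-contained benchmark of the two traversals (the matrix size and the use of clock() are arbitrary choices; an optimizing compiler may restructure these loops, so treat the numbers as illustrative):

#include <stdio.h>
#include <time.h>

#define ROWS 1000
#define COLS 1000
static int matrix[ROWS][COLS];

int main(void) {
    long sum = 0;

    clock_t t0 = clock();
    for (int i = 0; i < ROWS; i++)     // row-wise: sequential access
        for (int j = 0; j < COLS; j++)
            sum += matrix[i][j];
    double row_ms = 1000.0 * (clock() - t0) / CLOCKS_PER_SEC;

    t0 = clock();
    for (int j = 0; j < COLS; j++)     // column-wise: jumps COLS * 4 bytes per access
        for (int i = 0; i < ROWS; i++)
            sum += matrix[i][j];
    double col_ms = 1000.0 * (clock() - t0) / CLOCKS_PER_SEC;

    printf("row-wise: %.2f ms, column-wise: %.2f ms (sum=%ld)\n",
           row_ms, col_ms, sum);
    return 0;
}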
Example 2: Array of Structures vs Structure of Arrays
Array of Structures (AoS) - Poor locality if you only need one field:
struct Particle {
    float x, y, z;    // position
    float vx, vy, vz; // velocity
    float mass;
};

Particle particles[1000];

// Only need to update positions
for (int i = 0; i < 1000; i++) {
    particles[i].x += particles[i].vx; // Loads entire 28-byte struct
}
Structure of Arrays (SoA) - Better locality:
struct Particles {
    float x[1000], y[1000], z[1000];
    float vx[1000], vy[1000], vz[1000];
    float mass[1000];
};

Particles particles;

// Only access what you need
for (int i = 0; i < 1000; i++) {
    particles.x[i] += particles.vx[i]; // Sequential access to x and vx arrays
}
The SoA version can be significantly faster because you're loading only the data you need, and it's contiguous.
Cache Misses
When data isn't in cache, you get a cache miss, forcing the CPU to fetch from slower memory.
Types of cache misses:
- Compulsory miss: First access to data (unavoidable)
- Capacity miss: Cache too small for working set
- Conflict miss: Multiple addresses map to the same cache set and evict each other
Practical Optimization Tips
- Use arrays instead of linked lists when possible
- Access data sequentially rather than randomly
- Keep working set small to fit in cache
- Process data in blocks (loop tiling/blocking); see the sketch after this list
- Prefetch data manually in critical code (advanced)
- Align data structures to cache line boundaries
- Avoid false sharing in multithreaded code (different threads accessing different variables on same cache line)
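As one concrete tiling example, here is a minimal sketch of a blocked matrix transpose (the transpose_blocked name, N, and BLOCK are assumptions; tune the block size to your cache):

#define N 1024
#define BLOCK 32 // assumed tile size; pick it so one tile of src and dst fits in L1

// Transpose src into dst one BLOCK x BLOCK tile at a time, so the reads and
// the writes both stay inside a small, cache-resident working set
void transpose_blocked(float src[N][N], float dst[N][N]) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    dst[j][i] = src[i][j];
}

A naive transpose writes dst column-by-column across the whole matrix, missing the cache on nearly every store; the blocked version keeps each tile hot until it is finished.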
Measuring Cache Performance
You can measure cache effects using:
- Performance counters (perf on Linux; example below)
- Profiling tools (Intel VTune, Valgrind's Cachegrind)
- Timing comparisons of different access patterns
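For example, on Linux you can count misses with perf, or simulate the cache in detail with Cachegrind (./your_program is a placeholder for your binary):

# Count cache references and misses while the program runs
perf stat -e cache-references,cache-misses ./your_program

# Simulate cache behavior line-by-line with Valgrind's Cachegrind
valgrind --tool=cachegrind ./your_program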
The performance difference between cache-friendly and cache-unfriendly code can easily be 10x or more, making this one of the most important considerations in high-performance computing.