Let me break down two fundamental concepts that are crucial for writing high-performance code: contiguous memory and cache locality.
Contiguous Memory
Contiguous memory means data is stored in consecutive memory addresses, with no gaps between elements.
Example:
Array: [10, 20, 30, 40]
Memory addresses: 0x1000, 0x1004, 0x1008, 0x100C
Each integer takes 4 bytes, and they're placed right next to each other in memory. This is contiguous storage.
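You can see this layout directly by printing element addresses. A minimal C sketch (the exact addresses will differ on your machine, but each should be sizeof(int) bytes after the previous one):

#include <stdio.h>

int main(void) {
    int array[4] = {10, 20, 30, 40};
    // Consecutive elements sit at consecutive addresses, 4 bytes apart
    for (int i = 0; i < 4; i++) {
        printf("array[%d] = %d at %p\n", i, array[i], (void*)&array[i]);
    }
    return 0;
}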
Contiguous vs Non-Contiguous Structures
Contiguous structures:
- Arrays
- Vectors (dynamic arrays)
- Strings (typically)
Non-contiguous structures:
- Linked lists (nodes scattered across memory)
- Hash tables (with separate chaining)
- Trees (nodes allocated separately)
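To make the contrast concrete, here is a minimal C sketch of both layouts (the Node type and push helper are just illustrative names): the array is a single block, while each list node lives wherever malloc happens to place it.

#include <stdlib.h>

// Contiguous: 1000 ints in one block; element i+1 directly follows element i
int arr[1000];

// Non-contiguous: each node is a separate allocation somewhere on the heap
struct Node {
    int data;
    struct Node* next; // may point to a distant address
};

struct Node* push(struct Node* head, int value) {
    struct Node* node = malloc(sizeof *node);
    if (node == NULL) return head; // allocation failed; keep the old list
    node->data = value;
    node->next = head;
    return node;
}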
Why Contiguous Memory Matters
When you access array[0], then array[1], then array[2], you're accessing consecutive memory locations. Modern CPUs are optimized for this pattern, which brings us to cache locality.
Cache Locality
Cache locality refers to how effectively your program uses the CPU cache by accessing data that's "nearby" in memory.
The Memory Hierarchy
Your computer has a memory hierarchy with vastly different speeds:
- CPU Registers: ~0.5 ns (fastest, tiny storage)
- L1 Cache: ~1 ns (per core, ~32-64 KB)
- L2 Cache: ~3-4 ns (per core, ~256-512 KB)
- L3 Cache: ~10-20 ns (shared, ~8-32 MB)
- Main RAM: ~100 ns (slowest, GBs of storage)
The speed difference between L1 cache and RAM is roughly 100x. This is enormous.
How Caches Work
Caches don't fetch individual bytes. They fetch entire cache lines (typically 64 bytes). When you access one byte, the CPU automatically loads the surrounding 63 bytes into cache, betting you'll need them soon.
Example: If you access array[0] (a 4-byte int), the CPU loads array[0] through array[15] (64 bytes total) into a cache line, assuming the array starts on a cache line boundary.
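You can observe this with a strided-sum microbenchmark. A minimal sketch (the array size, strides, and the strided_sum name are arbitrary choices here): going from stride 1 to stride 16 cuts the number of additions 16x, but on typical hardware the runtime falls far less, because every access up to stride 16 still pulls in a fresh 64-byte line.

#include <stdio.h>
#include <time.h>

#define N (1 << 22) // 16 MB of ints, far larger than any cache
static int data[N];

// Sum every stride-th element; up to stride 16 (16 ints * 4 bytes = 64 bytes),
// each access still touches a new cache line
long strided_sum(int stride) {
    long sum = 0;
    for (int i = 0; i < N; i += stride)
        sum += data[i];
    return sum;
}

int main(void) {
    for (int stride = 1; stride <= 32; stride *= 2) {
        clock_t t0 = clock();
        long s = strided_sum(stride);
        double ms = 1000.0 * (clock() - t0) / CLOCKS_PER_SEC;
        printf("stride %2d: %6.1f ms (sum=%ld)\n", stride, ms, s);
    }
    return 0;
}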
Types of Cache Locality
1. Spatial Locality (Sequential Access)
Accessing memory locations that are close to each other.
Good spatial locality:
// Sequential array traversal
for (int i = 0; i < n; i++) {
    sum += array[i]; // Accessing consecutive elements
}
Poor spatial locality:
// Linked list traversal
Node* current = head;
while (current != NULL) {
    sum += current->data; // Nodes scattered in memory
    current = current->next;
}
2. Temporal Locality (Reuse)
Accessing the same memory locations repeatedly within a short time.
Good temporal locality:
int temp = array[i]; // Load once
result = temp * temp + temp * 2; // Reuse multiple times
Poor temporal locality:
// Processing array once, then never again
processOnce(array);
// Hours later...
processAgain(array); // Likely evicted from cache
Real-World Performance Impact
Example 1: Matrix Traversal
Matrices are stored in row-major order in C and C++ (and in NumPy by default), meaning each row is contiguous in memory.
Fast (row-wise, good cache locality):
// Traverse row-by-row
for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
        sum += matrix[i][j]; // Sequential memory access
    }
}
Slow (column-wise, poor cache locality):
// Traverse column-by-column
for (int j = 0; j < cols; j++) {
    for (int i = 0; i < rows; i++) {
        sum += matrix[i][j]; // Jumping across rows
    }
}
For a 1000×1000 matrix, the row-wise version can easily be 10x faster than the column-wise one (sometimes far more), purely due to cache effects.
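A minimal self-contained benchmark of the two traversals (the matrix size and the use of clock() are arbitrary choices; an optimizing compiler may restructure these loops, so treat the numbers as illustrative):

#include <stdio.h>
#include <time.h>

#define ROWS 1000
#define COLS 1000
static int matrix[ROWS][COLS];

int main(void) {
    long sum = 0;

    clock_t t0 = clock();
    for (int i = 0; i < ROWS; i++)     // row-wise: sequential access
        for (int j = 0; j < COLS; j++)
            sum += matrix[i][j];
    double row_ms = 1000.0 * (clock() - t0) / CLOCKS_PER_SEC;

    t0 = clock();
    for (int j = 0; j < COLS; j++)     // column-wise: jumps COLS * 4 bytes per access
        for (int i = 0; i < ROWS; i++)
            sum += matrix[i][j];
    double col_ms = 1000.0 * (clock() - t0) / CLOCKS_PER_SEC;

    printf("row-wise: %.2f ms, column-wise: %.2f ms (sum=%ld)\n",
           row_ms, col_ms, sum);
    return 0;
}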
Example 2: Array of Structures vs Structure of Arrays
Array of Structures (AoS) - Poor locality if you only need one field:
struct Particle {
    float x, y, z;    // position
    float vx, vy, vz; // velocity
    float mass;
};

Particle particles[1000];

// Only need to update positions
for (int i = 0; i < 1000; i++) {
    particles[i].x += particles[i].vx; // Loads entire 28-byte struct
}
Structure of Arrays (SoA) - Better locality:
struct Particles {
    float x[1000], y[1000], z[1000];
    float vx[1000], vy[1000], vz[1000];
    float mass[1000];
};

Particles particles;

// Only access what you need
for (int i = 0; i < 1000; i++) {
    particles.x[i] += particles.vx[i]; // Sequential access to x and vx arrays
}
The SoA version can be significantly faster because you're loading only the data you need, and it's contiguous.
Cache Misses
When data isn't in cache, you get a cache miss, forcing the CPU to fetch from slower memory.
Types of cache misses:
- Compulsory miss: First access to data (unavoidable)
- Capacity miss: Cache too small for working set
- Conflict miss: Multiple addresses map to the same cache set and evict each other
Practical Optimization Tips
- Use arrays instead of linked lists when possible
- Access data sequentially rather than randomly
- Keep working set small to fit in cache
- Process data in blocks (loop tiling/blocking); see the sketch after this list
- Prefetch data manually in critical code (advanced)
- Align data structures to cache line boundaries
- Avoid false sharing in multithreaded code (different threads accessing different variables on same cache line)
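As one concrete tiling example, here is a minimal sketch of a blocked matrix transpose (the transpose_blocked name, N, and BLOCK are assumptions; tune the block size to your cache):

#define N 1024
#define BLOCK 32 // assumed tile size; pick it so one tile of src and dst fits in L1

// Transpose src into dst one BLOCK x BLOCK tile at a time, so the reads and
// the writes both stay inside a small, cache-resident working set
void transpose_blocked(float src[N][N], float dst[N][N]) {
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    dst[j][i] = src[i][j];
}

A naive transpose writes dst column-by-column across the whole matrix, missing the cache on nearly every store; the blocked version keeps each tile hot until it is finished.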
Measuring Cache Performance
You can measure cache effects using:
- Performance counters (perf on Linux; example below)
- Profiling tools (Intel VTune, Valgrind's Cachegrind)
- Timing comparisons of different access patterns
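For example, on Linux you can count misses with perf, or simulate the cache in detail with Cachegrind (./your_program is a placeholder for your binary):

# Count cache references and misses while the program runs
perf stat -e cache-references,cache-misses ./your_program

# Simulate cache behavior line-by-line with Valgrind's Cachegrind
valgrind --tool=cachegrind ./your_program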
The performance difference between cache-friendly and cache-unfriendly code can easily be 10x or more, making this one of the most important considerations in high-performance computing.