## TL;DR
- Contiguous = items stored back‑to‑back in memory (arrays). Continuous = unbroken in time/space (a continuous sound).
- Arrays are fast because they exploit O(1) indexing and spatial locality: when you load one element, the CPU also pulls nearby bytes into very fast caches.
- Your data flows through a memory hierarchy: Registers → L1 → L2 → L3 → DRAM (RAM) → Disk (SSD/HDD). The closer to the CPU core, the faster and smaller it is.
- A typical cache line is 64 bytes. For 4‑byte `int`s, one line holds 16 ints—so a single miss often makes the next 15 ints effectively “free.”
- Misses happen when data isn’t in cache: cold, capacity, conflict. Write code that walks memory linearly and keeps hot working sets small.
## Contiguous vs. Continuous — use the right word
Contiguous memory means items are adjacent in RAM with a fixed stride. If an array `a` starts at `base` and stores 4‑byte ints, then the address of `a[i]` is `base + i * 4`.
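If you want to see that arithmetic spelled out, here is a minimal sketch using a `DataView` over an `ArrayBuffer`, where the explicit byte offset plays the role of the address (the buffer and the values stored are hypothetical, purely for illustration):

```ts
// A minimal sketch of base + i * 4 addressing: the byte offset into the
// buffer stands in for the element's address in RAM.
const buf = new ArrayBuffer(8 * 4);      // room for 8 four-byte ints
const view = new DataView(buf);

for (let i = 0; i < 8; i++) {
  view.setInt32(i * 4, i * i, true);     // "address" of element i = i * 4
}

// Reading element 3 is one offset computation, no traversal needed.
console.log(view.getInt32(3 * 4, true)); // 9
```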
Continuous means “without interruption,” usually in time. A continuous beep, a continuous line. Not a memory term.
Analogy:
- Contiguous: movie seats 12–27 are touching; you and your friends can sit together.
- Continuous: a movie plays without a break.
Arrays are contiguous. Linked lists are not; each node can be anywhere, connected by pointers.
## Why arrays feel fast in practice
- O(1) indexing: compute the address and load it directly.
- Spatial locality: programs that loop `i = 0..n-1` touch nearby addresses; the CPU anticipates this and fetches bigger chunks (cache lines).
- Hardware prefetching + vectorization: linear scans are easy for the CPU to accelerate.
Result: sorting, scanning, two‑pointers, sliding windows, prefix sums, DP, and many string routines run great on arrays.
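For instance, a prefix-sum pass is nothing but one forward walk over a contiguous buffer, so almost every access after the first in each cache line hits L1 (a minimal sketch; the helper name and inputs are my own):

```ts
// Prefix sums over a typed array: a single linear pass.
// prefix[i] holds a[0] + ... + a[i], so any range sum becomes O(1) later.
function prefixSums(a: Int32Array): Float64Array {
  const prefix = new Float64Array(a.length); // doubles avoid 32-bit overflow
  let running = 0;
  for (let i = 0; i < a.length; i++) {       // linear scan → great locality
    running += a[i];
    prefix[i] = running;
  }
  return prefix;
}

// Usage: sum of a[2..4] is prefix[4] - prefix[1].
const p = prefixSums(new Int32Array([3, 1, 4, 1, 5]));
console.log(p[4] - p[1]); // 4 + 1 + 5 = 10
```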
## Spatial locality in one minute
When you read `a[i]`, the CPU brings the entire cache line containing that element into a tiny, ultra‑fast memory called the L1 cache. If a line is 64 bytes and each `int` is 4 bytes, that line holds 16 consecutive ints: `a[i] … a[i+15]`.
If your next iterations read `a[i+1]`, `a[i+2]`, … they usually hit in L1—super fast.
Spatial locality: use one thing → likely use its neighbors next.
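You can feel the effect by summing the same elements twice: once sequentially, once in 16 strided passes so each access lands on a different cache line. This is a rough sketch, not a rigorous microbenchmark (JIT warm-up and prefetchers will blur the numbers), but on most machines the sequential walk wins clearly:

```ts
// Rough locality demo: identical work, different access order.
const N = 1 << 24;                        // 16M ints = 64 MB, larger than L3
const data = new Int32Array(N);

function sumSequential(a: Int32Array): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i]; // next element, same line
  return s;
}

function sumStrided(a: Int32Array): number {
  let s = 0;
  // 16 passes with stride 16: every access touches a new 64-byte line,
  // and the 64 MB working set is re-fetched from DRAM on each pass.
  for (let start = 0; start < 16; start++) {
    for (let i = start; i < a.length; i += 16) s += a[i];
  }
  return s;
}

console.time("sequential"); sumSequential(data); console.timeEnd("sequential");
console.time("strided");    sumStrided(data);    console.timeEnd("strided");
```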
## Meet the memory hierarchy (closest to farthest from the core)
```
CPU Core
 ├─ Registers (tiny, per core, fastest)
 ├─ L1 Cache  (per core, data & instruction, ~tens of KB)
 ├─ L2 Cache  (per core, ~hundreds of KB to ~1 MB)
 └─ L3 Cache  (last-level, shared across cores, many MB)
       ↓
 DRAM (RAM)     (gigabytes, managed by OS)
       ↓
 Disk (SSD/HDD) (terabytes, persistent; OS loads to RAM first)
```
- Registers live inside the core and hold operands for the instructions in flight.
- L1/L2 are private to a core; L3 is typically shared among cores on the chip.
- RAM is large but far; Disk is larger and much slower. The CPU does not load straight from disk—data must be in RAM first.
## CPU cycle latency (what “two orders of magnitude” feels like)
A CPU cycle is one tick of the CPU clock (e.g., 3.0 GHz ≈ 0.33 ns per cycle). Access costs vary by level. Ballpark latencies:
| Level | Typical Size | Approx. Latency |
| --- | --- | --- |
| Register | dozens of slots | ~1 cycle |
| L1 cache | ~32 KB per core | ~4 cycles |
| L2 cache | 256 KB–1 MB per core | ~10–15 cycles |
| L3 cache | many MB (shared) | ~30–50 cycles |
| DRAM (RAM) | GBs | ~200–300 cycles (~70–100 ns) |
| SSD | TBs | ~50–100 µs |
| HDD | TBs | ~5–10 ms |
Exact numbers vary by CPU, but the shape is consistent: the farther you go, the slower it gets—often by orders of magnitude (×10, ×100, ×1,000…).
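To turn cycles into wall-clock time, divide by the clock rate; a tiny helper (hypothetical, just restating the arithmetic above) makes the table concrete at the 3 GHz example rate:

```ts
// cycles / GHz gives nanoseconds: at 3 GHz, one cycle ≈ 1/3 ns.
const cyclesToNs = (cycles: number, ghz: number): number => cycles / ghz;

console.log(cyclesToNs(4, 3.0));    // L1 hit: ~1.3 ns
console.log(cyclesToNs(40, 3.0));   // L3 hit: ~13.3 ns
console.log(cyclesToNs(300, 3.0));  // DRAM:   ~100 ns
```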
## The 64‑byte cache line & your array
“Arrays take 64 bytes” is not accurate—the cache line is 64 bytes. The array itself can be any size.
- When you first access `a[i]`, the core loads the entire 64‑byte line containing it into L1.
- With 4‑byte ints, that line packs 16 consecutive ints.
- The next 15 accesses—`a[i+1] … a[i+15]`—often hit in L1 (unless evicted), which is why linear loops scream.
Example (TypeScript with a TypedArray):
```ts
const n = 1_000_000;
const a = new Int32Array(n); // contiguous buffer

let sum = 0;
for (let i = 0; i < n; i++) sum += a[i]; // linear scan → great locality
```
Typed arrays use contiguous storage (like C arrays), which lets the engine and CPU make these locality wins reliable.
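One caveat worth knowing (hedged, since engines differ in the details): a plain JavaScript `Array` can lose that dense layout if you create holes, whereas a typed array is always one contiguous buffer. A sketch of the difference:

```ts
// A plain Array can go "holey": engines may switch a very sparse array to a
// dictionary-like representation, giving up the contiguous fast path.
const holey: number[] = [];
holey[0] = 1;
holey[1_000_000] = 2;      // a huge hole; likely stored sparsely

// A typed array is always one dense buffer of fixed-size slots.
const dense = new Int32Array(1_000_001); // every slot exists, zero-filled
dense[0] = 1;
dense[1_000_000] = 2;
```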
## When (and why) caches miss
Three classic “C” misses:
- Cold (compulsory): first time you ever touch a line. Unavoidable—but only once per line.
- Capacity: your working set is larger than the cache; older lines get evicted before reuse.
- Conflict: different addresses compete for the same cache slots (set‑associativity limits).
How to miss less:
- Walk memory linearly (row‑major for row‑major data).
- Keep hot working sets small; reuse data soon.
- Avoid unnecessary pointer chasing and sparse/holed arrays in hot loops.
- For 2D data, iterate with the inner loop over contiguous elements.
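Concretely, for a 2D matrix stored in one flat row-major buffer (the layout below is an assumption; match the loop order to however yours is actually laid out), keep the inner loop on the contiguous axis:

```ts
// Row-major layout: element (r, c) lives at index r * cols + c.
const rows = 2048, cols = 2048;
const m = new Float64Array(rows * cols);

// Cache-friendly: the inner loop moves along a row (stride 1).
let good = 0;
for (let r = 0; r < rows; r++)
  for (let c = 0; c < cols; c++)
    good += m[r * cols + c];

// Cache-hostile: the inner loop jumps a whole row (cols * 8 bytes) per step,
// so nearly every access lands on a different cache line.
let bad = 0;
for (let c = 0; c < cols; c++)
  for (let r = 0; r < rows; r++)
    bad += m[r * cols + c];
```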
## How the core “connects” to each level
- The core executes your instructions using registers.
- Loads/stores first consult L1 (per core). On a miss, hardware checks L2 → L3 → RAM.
- Prefetchers watch patterns (like `i++`) and fetch the next lines early.
- Coherence keeps each core’s caches consistent when multiple cores touch the same addresses (you’ll hear “MESI” in advanced discussions).
- Disk is managed by the OS: files are read into RAM; only then can the CPU load them into caches/registers.
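This is also why pointer chasing is slow: each hop is a load whose address depends on the previous load, so neither the prefetcher nor out-of-order execution can run ahead. A hedged sketch (object layout is up to the engine):

```ts
// Linked list: every node is a separate heap object; node.next can point
// anywhere, and each hop must wait for the previous load to finish.
type ListNode = { value: number; next: ListNode | null };

function sumList(head: ListNode | null): number {
  let s = 0;
  for (let n = head; n !== null; n = n.next) s += n.value; // pointer chasing
  return s;
}

// Contiguous array: the next address is always "this one + 4 bytes",
// which prefetchers handle effortlessly.
function sumArray(a: Int32Array): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i];
  return s;
}
```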
## Conclusion
Arrays are fast not because of magic, but because they’re contiguous. That design plays perfectly with the CPU’s memory hierarchy and spatial locality: one cache miss brings a 64‑byte chunk that feeds many subsequent accesses from ultra‑fast caches. Code that scans linearly, keeps working sets tight, and respects data layout will feel dramatically faster—even with the same Big‑O.
## TL;DR for practice
- Prefer arrays/typed arrays for hot loops.
- Iterate linearly; match loop order to layout.
- Know your hierarchy; expect ~100× gap between L1 and DRAM.
- Think in cache lines, not just elements.