Your CPU is phenomenally fast. Your RAM is phenomenally slow. The gap between them is the reason your code runs at 5% of the speed it theoretically could.
Object-oriented design teaches us to model concepts: Users, Orders, Products. We build elegant hierarchies, encapsulate state, chain abstractions. The code is readable, maintainable, testable. It's also scattering your data across memory like shrapnel, forcing the CPU to wait hundreds of cycles for each piece.
This article is a map of the Data-Oriented Design (DOD) territory: what it is, why it matters, and which concepts you need to know. I'm not going to dive deep into every topic—that would take a whole book—but you will walk away knowing exactly what to look for when you need to squeeze out real performance.
1. Why Your Code Is Slower Than It Should Be
The CPU Waits for Memory
Here is the fact that changes everything: your processor is between 100 and 1000 times faster than your RAM.
When the CPU needs a piece of data that isn't in the cache, it waits. Literally: hundreds of clock cycles doing absolutely nothing while the data travels in from RAM.
- Register access: ~1 cycle
- L1 Cache access: ~4 cycles
- L2 Cache access: ~12 cycles
- L3 Cache access: ~40 cycles
- RAM access: ~200-300 cycles
Read those numbers again. The difference between having data in L1 vs. RAM is two orders of magnitude. Your code can be algorithmically perfect and still be slow because it spends 90% of its time waiting for data.
This is the "Von Neumann Bottleneck": the processor and memory are separated, and that channel between them is the bottleneck of modern computing.
Abstractions Have a Cost
Object-Oriented Programming (OOP) taught us to model the world: a User has Orders, each Order has Products. Beautiful, readable, maintainable.
The problem is, hardware doesn't understand objects. It understands contiguous bytes in memory.
When you create scattered objects, use inheritance with virtual methods, or chain pointers, you are generating:
- Scattered memory: Each allocation can place the object anywhere in the heap.
- Indirections: Pointers pointing to pointers pointing to data.
- Vtables: Tables of virtual functions that the CPU must look up at runtime.
All of this scatters your data across the heap. That destroys spatial locality, which is what makes cache lines efficient.
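To make that concrete, here is a minimal sketch of the two extremes (my own illustration; the Updatable trait and Particle struct are invented for the example, not taken from any particular codebase):

// OOP-style: every entity is its own heap allocation behind a pointer,
// and every call to update() goes through a vtable lookup.
trait Updatable {
    fn update(&mut self, dt: f32);
}

struct ScatteredWorld {
    entities: Vec<Box<dyn Updatable>>, // pointers to objects anywhere on the heap
}

// Data-oriented: plain structs packed back to back in a single allocation,
// updated with a direct, inlinable function call.
struct Particle {
    x: f32,
    vx: f32,
}

struct PackedWorld {
    particles: Vec<Particle>, // contiguous bytes, no indirection, no vtable
}

impl PackedWorld {
    fn update(&mut self, dt: f32) {
        for p in &mut self.particles {
            p.x += p.vx * dt; // straight-line loop over contiguous memory
        }
    }
}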
Spatial Locality: The Cache Line Is the Unit of Transfer
When the CPU requests a byte at address 1000, it doesn't fetch just that byte. It brings in the entire cache line (typically 64 bytes) starting from an aligned address. If the next thing you need is at byte 1001, you already have it for free. If it's at byte 50000, you pay for another trip to RAM.
Spatial locality is everything. Contiguous data in memory = fast access. Scattered data = constant cache misses.
That is why a linked list is orders of magnitude slower to iterate than a vector, even if both are O(n). The vector is contiguous; the list has every node in some random location on the heap.
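A minimal sketch of that contrast (my own illustration; the Node type and both sum functions are invented for the example): both traversals are O(n), but only one gives the hardware a predictable, contiguous access pattern.

struct Node {
    value: f32,
    next: Option<Box<Node>>, // every node is a separate heap allocation
}

fn sum_vec(values: &[f32]) -> f32 {
    // Sequential reads over contiguous memory: the prefetcher spots the
    // pattern and streams the next cache lines before the loop asks for them.
    values.iter().sum()
}

fn sum_list(mut node: Option<&Node>) -> f32 {
    // Every step is a dependent load from a potentially distant address,
    // so the CPU stalls on a cache miss before it can even start the next one.
    let mut total = 0.0;
    while let Some(n) = node {
        total += n.value;
        node = n.next.as_deref();
    }
    total
}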
The CPU Tries to Predict What You'll Do
The modern processor doesn't execute instructions one by one. It has a pipeline that processes multiple instructions in parallel, in different stages.
To keep the pipeline full, the CPU does two things:
Prefetching: It tries to guess what memory you are going to ask for next. If you are iterating through an array sequentially, it detects the pattern and pulls in the next cache lines before you even ask for them. Random pointers break this mechanism because there is no pattern to detect.
Branch prediction: When it reaches an if, it can't wait for the condition to be evaluated, because the pipeline would empty out. So it guesses which branch it will take based on history. If it guesses right, perfect. If it guesses wrong (a branch misprediction), it has to throw away all the speculative work and start over. Typical penalty: 15-20 cycles.
// The CPU can predict this easily
for _i in 0..1_000_000 {
    // always enters the loop
}

// This is unpredictable if data is random
for &value in &data {
    if value > threshold { // 50% true, 50% false
        sum += value;
    }
}
3. Organize Your Data For the Machine
AoS vs SoA: The Fundamental Shift
This is the heart of Data-Oriented Design.
Array of Structs (AoS) — How you naturally think:
struct Entity {
    x: f32,
    y: f32,
    z: f32,
    vx: f32,
    vy: f32,
    vz: f32,
    health: i32,
    team: i32,
    name: String,
}
let entities: Vec<Entity> = Vec::with_capacity(10_000);
In memory it looks like this:
[x,y,z,vx,vy,vz,health,team,name][x,y,z,vx,vy,vz,health,team,name]...
Struct of Arrays (SoA) — How hardware thinks:
struct Entities {
    x: Vec<f32>,
    y: Vec<f32>,
    z: Vec<f32>,
    vx: Vec<f32>,
    vy: Vec<f32>,
    vz: Vec<f32>,
    health: Vec<i32>,
    team: Vec<i32>,
    names: Vec<String>,
}

impl Entities {
    fn with_capacity(cap: usize) -> Self {
        Self {
            x: Vec::with_capacity(cap),
            y: Vec::with_capacity(cap),
            z: Vec::with_capacity(cap),
            vx: Vec::with_capacity(cap),
            vy: Vec::with_capacity(cap),
            vz: Vec::with_capacity(cap),
            health: Vec::with_capacity(cap),
            team: Vec::with_capacity(cap),
            names: Vec::with_capacity(cap),
        }
    }
}
In memory:
[x,x,x,x,x,...][y,y,y,y,y,...][z,z,z,z,z,...]...
Why does this matter? Imagine you want to update only the positions:
// With AoS: for each entity you fetch ~60 bytes, but use only 24
for entity in &mut entities {
    entity.x += entity.vx * dt;
    entity.y += entity.vy * dt;
    entity.z += entity.vz * dt;
}

// With SoA: you fetch exactly what you need
for i in 0..entities.x.len() {
    entities.x[i] += entities.vx[i] * dt;
    entities.y[i] += entities.vy[i] * dt;
    entities.z[i] += entities.vz[i] * dt;
}
With AoS, each 64-byte cache line brings in ~1 entity. With SoA, it brings in ~16 values of x. You are utilizing 100% of every cache line instead of ~20%.
Hot/Cold Splitting
Not all fields are used equally. In the previous example, name is probably only read when rendering UI. Why pull it into the cache every time you update physics?
// "Hot" data - accessed constantly
struct EntitiesHot {
x: Vec<f32>,
y: Vec<f32>,
z: Vec<f32>,
vx: Vec<f32>,
vy: Vec<f32>,
vz: Vec<f32>,
}
// "Cold" data - accessed rarely
struct EntitiesCold {
health: Vec<i32>,
team: Vec<i32>,
names: Vec<String>,
}
Separating hot and cold avoids "polluting" the cache with data you don't need in the hot path.
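As a usage sketch (reusing EntitiesHot from above; the function name and dt parameter are my own illustration), the physics loop never even sees the cold arrays:

fn update_physics(hot: &mut EntitiesHot, dt: f32) {
    // Only the position/velocity arrays are touched here, so health, team
    // and names never compete for cache lines in this hot path.
    for i in 0..hot.x.len() {
        hot.x[i] += hot.vx[i] * dt;
        hot.y[i] += hot.vy[i] * dt;
        hot.z[i] += hot.vz[i] * dt;
    }
}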
Alignment and Padding
Compilers align each field of your structs to an address that is a multiple of its alignment (for primitive types, usually its size). This can leave gaps:
use std::mem::size_of;

// #[repr(C)] forces declaration order, which is what C and C++ always use.
// (Rust's default layout is allowed to reorder fields and often fixes this
// for you automatically.)
#[repr(C)]
struct Bad {
    a: u8,  // 1 byte + 7 bytes of padding
    b: f64, // 8 bytes
    c: u8,  // 1 byte + 7 bytes of padding
}

#[repr(C)]
struct Good {
    b: f64, // 8 bytes
    a: u8,  // 1 byte
    c: u8,  // 1 byte + 6 bytes of padding
}

fn main() {
    println!("Bad: {} bytes", size_of::<Bad>());   // 24 bytes
    println!("Good: {} bytes", size_of::<Good>()); // 16 bytes
}
Same content, 8 bytes less. In a vector of millions of elements, that's megabytes of memory and cache lines wasted.
Rule of thumb: in layouts that follow declaration order (C, C++, #[repr(C)]), order fields from largest size to smallest.
In Rust, the default layout already reorders fields to minimize padding; use #[repr(C)] when you need to pin down the exact layout, and tools like cargo-bloat to see where your binary's bytes are going.
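If a struct sits on a hot path, a compile-time size assertion catches padding regressions before they ever reach a profiler (a small sketch of mine, reusing Good from the example above):

// Evaluated at compile time; the build fails if the size ever drifts.
const _: () = assert!(std::mem::size_of::<Good>() == 16);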
4. "Free" Gains By Ordering Well
SIMD: Processing Multiple Data at Once
Your CPU has special registers that can operate on multiple values simultaneously:
- SSE: 128 bits → 4 floats at once
- AVX: 256 bits → 8 floats at once
- AVX-512: 512 bits → 16 floats at once
Scalar Operation: 4 instructions
a[0] += b[0]
a[1] += b[1]
a[2] += b[2]
a[3] += b[3]
SIMD Operation: 1 instruction
[a0,a1,a2,a3] += [b0,b1,b2,b3]
But there is a requirement: data must be contiguous and aligned in memory. Remember SoA? Exactly. With SoA, the compiler can auto-vectorize your loops. With AoS, it can't.
// SoA: the compiler can vectorize this automatically
for i in 0..n {
    x[i] = x[i] + vx[i] * dt;
}

// AoS: much harder for the compiler to vectorize
for entity in &mut entities {
    entity.x += entity.vx * dt;
}
You don't need to write intrinsics by hand. Compile with --release (opt-level 3 by default) and let LLVM do the work. But your data layout must allow it.
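For example, a loop shaped like this (a sketch of mine; the function name and parameters are illustrative) is exactly what the auto-vectorizer likes: two contiguous slices walked in lockstep, with no branches inside.

fn integrate(x: &mut [f32], vx: &[f32], dt: f32) {
    // With --release, LLVM typically turns this into SIMD code on its own:
    // contiguous data, a simple access pattern, nothing to predict.
    for (x, vx) in x.iter_mut().zip(vx.iter()) {
        *x += *vx * dt;
    }
}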
Branchless: Stopping the CPU from Guessing Wrong
When you have conditionals in hot loops with unpredictable data, the branch predictor will fail ~50% of the time.
// Branch version - constant mispredictions
let mut sum = 0i64;
for &value in &data {
    if value >= 128 {
        sum += value as i64;
    }
}

// Branchless version - no prediction needed
let mut sum = 0i64;
for &value in &data {
    let mask = (value >= 128) as i64; // 0 or 1
    sum += mask * value as i64;
}
The branchless version performs more operations per iteration, but it eliminates mispredictions. On random data, it can be 2-5x faster.
Common techniques:
- Arithmetic operations instead of an if.
- Bit masks.
- Sorting the data first: sometimes simply sorting the array makes the if predictable (all false first, then all true) and the code flies without changing the logic.
// The sort trick: makes the branch 100% predictable
data.sort_unstable();
for &value in &data {
    if value >= 128 { // now it's: false,false,...,true,true,true
        sum += value as i64;
    }
}
5. Smart Memory Management
Why Frequent Allocations Are Slow
Every call to the allocator:
- Searches for a free slot in the heap (complex algorithm).
- Updates internal allocator structures.
- Possibly asks the OS for more memory (syscall).
- Can place your data at any random address.
Furthermore, freeing memory frequently causes fragmentation: the heap ends up full of small holes between used blocks, worsening locality.
Arena Allocators
The idea is simple: you ask for a large block of memory at the start and then hand out pieces of that block.
use bumpalo::Bump;

// Create an arena; it grabs memory from the OS in large chunks
let arena = Bump::new();

// O(1) allocations - each one just bumps a pointer
let x = arena.alloc(42i32);
let y = arena.alloc([1, 2, 3, 4, 5]);
let s = arena.alloc_str("hello");

// When the arena goes out of scope, EVERYTHING is freed at once:
// no individual free(), no fragmentation
Advantages:
- Allocation O(1): just increment a counter.
- Deallocation O(1): reset or drop the entire arena.
- Guaranteed locality: everything is contiguous.
- Zero fragmentation.
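A typical pattern is a per-frame arena: a minimal sketch, assuming the bumpalo crate (the frame loop and scratch buffer are my own illustration).

use bumpalo::Bump;

fn run_frames(frames: usize) {
    let mut frame_arena = Bump::new();
    for _ in 0..frames {
        // Scratch data that only needs to live for this one frame
        let scratch: &mut [f32; 1024] = frame_arena.alloc([0.0f32; 1024]);
        scratch[0] = 1.0;

        // One O(1) reset instead of thousands of individual frees
        frame_arena.reset();
    }
}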
Handles vs. Pointers
When you need "references" between entities, pointers (or, in Rust, Box<T> and Rc<T> graphs) scatter your accesses all over memory.
The alternative is to use handles: indices into dense arrays.
use std::cell::RefCell;
use std::rc::Rc;

// With scattered references - memory all over the place
struct Entity {
    target: Option<Rc<RefCell<Entity>>>,
}

// With handles - contiguous memory, cache-friendly
#[derive(Clone, Copy)]
struct EntityHandle(u32);

struct World {
    entities: Vec<Entity>,
}

impl World {
    fn get(&self, handle: EntityHandle) -> &Entity {
        &self.entities[handle.0 as usize]
    }

    fn get_mut(&mut self, handle: EntityHandle) -> &mut Entity {
        &mut self.entities[handle.0 as usize]
    }
}
Handles keep your data in contiguous arrays while allowing relationships between entities. Plus:
- They make serialization easy (an index is trivial to save).
- They allow you to detect invalid references (you can validate the index; see the sketch below).
- They avoid Rust's lifetime headaches with reference graphs.
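The "validate the index" point is usually implemented with generational handles. A minimal sketch (my own, not the API of any specific library): each slot carries a generation counter, so a stale handle is detected instead of silently reading whatever entity reused the slot.

#[derive(Clone, Copy, PartialEq, Eq)]
struct Handle {
    index: u32,
    generation: u32,
}

struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

struct Pool<T> {
    slots: Vec<Slot<T>>,
}

impl<T> Pool<T> {
    fn get(&self, h: Handle) -> Option<&T> {
        let slot = self.slots.get(h.index as usize)?;
        // A handle from a previous "generation" of this slot fails the check
        if slot.generation == h.generation {
            slot.value.as_ref()
        } else {
            None
        }
    }
}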
This pattern is the foundation of most ECS (Entity Component Systems) like bevy_ecs, hecs, or legion.
6. When To Apply This (And When Not To)
The Honest Filter
Data-Oriented Design is not a silver bullet. Before rewriting your code, ask yourself:
Where is the bottleneck?
| Bottleneck Type | Does DOD help? |
|---|---|
| I/O bound (network, disk, DB) | ❌ No |
| Memory bound | ✅ Exactly for this |
| CPU bound (pure math) | ⚠️ Helps, but check algorithms first |
If your API spends 95% of its time waiting for database responses, optimizing the memory layout of your objects is like polishing the windshield of a car with no engine.
Is it in a hot path?
90% of execution time usually happens in 10% of the code. Optimize that 10%.
// This runs once at startup - DO NOT optimize
let config = parse_config_file()?;

// This runs 60 times per second with 100K entities - YES optimize
for i in 0..entities.len() {
    update_physics(&mut entities, i, dt);
}
How many elements are you processing?
| Quantity | Is DOD worth it? |
|---|---|
| < 1,000 | Probably doesn't matter |
| 1,000 - 100,000 | Can help in hot paths |
| > 100,000 | Makes a real difference |
7. What If I Don't Use Rust?
DOD principles are universal. Here is how to apply them in other languages:
Java
// Primitive arrays instead of boxed collections like ArrayList<Float>
float[] x = new float[10000]; // contiguous, cache-friendly
// vs
ArrayList<Float> x; // each Float is a separate object on the heap

// For SoA, use parallel arrays
float[] posX = new float[n];
float[] posY = new float[n];
float[] posZ = new float[n];
Note: Project Valhalla (in development) will bring Value Types to Java, allowing inline structs like in C#.
Python
# NumPy IS Data-Oriented Design
import numpy as np
# This is SoA under the hood - contiguous arrays in C
positions = np.zeros((10000, 3), dtype=np.float32)
velocities = np.zeros((10000, 3), dtype=np.float32)
# Vectorized operations - automatic SIMD
positions += velocities * dt # Processes the whole array at once
The reason NumPy/Pandas are fast is exactly DOD: contiguous arrays in C that leverage cache and SIMD.
JavaScript/TypeScript
// TypedArrays are the way to get contiguous memory
const positions = new Float32Array(10000 * 3);
const velocities = new Float32Array(10000 * 3);

// Instead of:
const entities = []; // array of scattered objects
for (let i = 0; i < 10000; i++) {
    entities.push({ x: 0, y: 0, z: 0, vx: 0, vy: 0, vz: 0 });
}
In WebGL/WebGPU, TypedArrays are mandatory. That is not a coincidence.
Conclusion
Data-Oriented Design comes down to one idea: organize your data for how you process it, not for how you conceptualize it.
Modern hardware is brutally fast if you feed it correctly. Cache lines, prefetching, SIMD, branch prediction — all these features exist and are waiting. You just need contiguous data and predictable access.
You don't need to rewrite all your code. Start by:
- Identifying hot paths with a profiler.
- Measuring cache misses in those sections.
- Considering SoA where you iterate over specific fields.
- Using contiguous vectors instead of structures with scattered pointers.
Performance isn't magic. It's physics: bytes traveling through wires, transistors switching in nanoseconds. When you understand the machine, you can write code that flies.
Resources To Go Deeper
You've seen the map. Here are the paths to explore each territory:
- Hardware Fundamentals
  - What Every Programmer Should Know About Memory — Ulrich Drepper. The canonical document.
  - Computer Architecture: A Quantitative Approach — Hennessy & Patterson.
- Data-Oriented Design
  - Data-Oriented Design (free book) — Richard Fabian.
  - CppCon 2014: Data-Oriented Design and C++ — Mike Acton. The talk that popularized DOD.
- Rust Specific
  - The Rust Performance Book
  - Rust SIMD Performance Guide
- Real Implementations (ECS)
  - Bevy ECS — Game engine in Rust with ECS as its core.
  - EnTT — Modern C++ ECS.
  - Flecs — C ECS with excellent documentation.