Rafa Calderon

Your Code Is Slow Because You Think in Objects, Not Data

Your CPU is phenomenally fast. Your RAM is phenomenally slow. The gap between them is the reason your code runs at 5% of the speed it theoretically could.

Object-oriented design teaches us to model concepts: Users, Orders, Products. We build elegant hierarchies, encapsulate state, chain abstractions. The code is readable, maintainable, testable. It's also scattering your data across memory like shrapnel, forcing the CPU to wait hundreds of cycles for each piece.

This article is a map of the Data-Oriented Design (DOD) territory: what it is, why it matters, and which concepts you need to know. I'm not going to dive deep into every topic—that would take a whole book—but you will walk away knowing exactly what to look for when you need to squeeze out real performance.

1. Why Your Code Is Slower Than It Should Be

The CPU Waits for Memory

Here is the fact that changes everything: your processor is between 100 and 1000 times faster than your RAM.

When the CPU needs a piece of data that isn't in the cache, it waits. Literally. Hundreds of clock cycles doing absolutely nothing while the data travels from RAM.

  • Register access: ~1 cycle
  • L1 Cache access: ~4 cycles
  • L2 Cache access: ~12 cycles
  • L3 Cache access: ~40 cycles
  • RAM access: ~200-300 cycles

Read those numbers again. The difference between having data in L1 and having to fetch it from RAM is close to two orders of magnitude. Your code can be algorithmically perfect and still be slow because it spends most of its time waiting for data.

This is the "Von Neumann Bottleneck": the processor and memory are separated, and that channel between them is the bottleneck of modern computing.

Abstractions Have a Cost

Object-Oriented Programming (OOP) taught us to model the world: a User has Orders, each Order has Products. Beautiful, readable, maintainable.

The problem is, hardware doesn't understand objects. It understands contiguous bytes in memory.

When you create scattered objects, use inheritance with virtual methods, or chain pointers, you are generating:

  1. Scattered memory: Each allocation can place the object anywhere in the heap.
  2. Indirections: Pointers pointing to pointers pointing to data.
  3. Vtables: Tables of virtual functions that the CPU must look up at runtime.

All of this scatters your data across the heap. That destroys spatial locality, which is what makes cache lines efficient.
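
To make that concrete, here is a rough sketch of the difference (Shape and Circle are illustrative names): a vector of boxed trait objects versus a plain vector of structs. The first pays one heap allocation and one vtable lookup per element; the second is a single contiguous block.

use std::f32::consts::PI;

trait Shape {
    fn area(&self) -> f32;
}

struct Circle { r: f32 }

impl Shape for Circle {
    fn area(&self) -> f32 { PI * self.r * self.r }
}

fn main() {
    // Scattered: every Box is its own heap allocation, and every call
    // to area() goes through the trait object's vtable.
    let shapes: Vec<Box<dyn Shape>> = (0..1_000)
        .map(|i| Box::new(Circle { r: i as f32 }) as Box<dyn Shape>)
        .collect();
    let total: f32 = shapes.iter().map(|s| s.area()).sum();

    // Contiguous: all circles live side by side in one allocation,
    // and the call is direct (and can be inlined).
    let circles: Vec<Circle> = (0..1_000).map(|i| Circle { r: i as f32 }).collect();
    let total_direct: f32 = circles.iter().map(|c| c.area()).sum();

    println!("{} {}", total, total_direct);
}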

Spatial Locality: The Cache Line Is the Unit of Transfer

When the CPU requests a byte at address 1000, it doesn't fetch just that byte. It brings in the entire cache line (typically 64 bytes) starting from an aligned address. If the next thing you need is at byte 1001, you already have it for free. If it's at byte 50000, you pay for another trip to RAM.

Spatial locality is everything. Contiguous data in memory = fast access. Scattered data = constant cache misses.

That is why a linked list is orders of magnitude slower to iterate than a vector, even if both are O(n). The vector is contiguous; the list has every node in some random location on the heap.
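
A minimal sketch you can run to see this yourself (exact numbers depend on your machine and allocator, but the gap is typically large):

use std::collections::LinkedList;
use std::time::Instant;

fn main() {
    let n = 1_000_000u64;
    let vec: Vec<u64> = (0..n).collect();         // one contiguous block
    let list: LinkedList<u64> = (0..n).collect(); // one heap node per element

    let t = Instant::now();
    let sum_vec: u64 = vec.iter().sum();
    println!("vec:  {:?} (sum = {})", t.elapsed(), sum_vec);

    let t = Instant::now();
    let sum_list: u64 = list.iter().sum();
    println!("list: {:?} (sum = {})", t.elapsed(), sum_list);
}

Both loops are O(n); the difference is purely where the bytes live.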

The CPU Tries to Predict What You'll Do

The modern processor doesn't execute instructions one by one. It has a pipeline that processes multiple instructions in parallel, in different stages.

To keep the pipeline full, the CPU does two things:

  1. Prefetching: It tries to guess what memory you are going to ask for next. If you are iterating through an array sequentially, it detects the pattern and pulls in the next cache lines before you even ask for them. Random pointers break this mechanism because there is no pattern to detect.

  2. Branch prediction: When it reaches an if, it can't wait to evaluate it—the pipeline would empty out. So it guesses which branch it will take based on history. If it guesses right, perfect. If it guesses wrong (branch misprediction), it has to throw away all the speculative work and start over. Typical penalty: 15-20 cycles.

// The CPU predicts this branch easily: the loop's back-edge is taken
// a million times in a row, then not taken once.
let mut count = 0u64;
for _ in 0..1_000_000 {
    count += 1;
}

// This branch is unpredictable if `data` holds random values:
// roughly half taken, half not taken, so the predictor keeps missing.
let mut sum = 0u64;
for &value in &data {
    if value > threshold {  // ~50% true, ~50% false
        sum += value as u64;
    }
}

2. Organize Your Data For the Machine

AoS vs SoA: The Fundamental Shift

This is the heart of Data-Oriented Design.

Array of Structs (AoS) — How you naturally think:

struct Entity {
    x: f32,
    y: f32,
    z: f32,
    vx: f32,
    vy: f32,
    vz: f32,
    health: i32,
    team: i32,
    name: String,
}

let entities: Vec<Entity> = Vec::with_capacity(10_000);

In memory it looks like this:

[x,y,z,vx,vy,vz,health,team,name][x,y,z,vx,vy,vz,health,team,name]...

Struct of Arrays (SoA) — How hardware thinks:

struct Entities {
    x: Vec<f32>,
    y: Vec<f32>,
    z: Vec<f32>,
    vx: Vec<f32>,
    vy: Vec<f32>,
    vz: Vec<f32>,
    health: Vec<i32>,
    team: Vec<i32>,
    names: Vec<String>,
}

impl Entities {
    fn with_capacity(cap: usize) -> Self {
        Self {
            x: Vec::with_capacity(cap),
            y: Vec::with_capacity(cap),
            z: Vec::with_capacity(cap),
            vx: Vec::with_capacity(cap),
            vy: Vec::with_capacity(cap),
            vz: Vec::with_capacity(cap),
            health: Vec::with_capacity(cap),
            team: Vec::with_capacity(cap),
            names: Vec::with_capacity(cap),
        }
    }
}

In memory:

[x,x,x,x,x,...][y,y,y,y,y,...][z,z,z,z,z,...]...

Why does this matter? Imagine you want to update only the positions:

// With AoS: for each entity you fetch ~60 bytes, but use 24
for entity in &mut entities {
    entity.x += entity.vx * dt;
    entity.y += entity.vy * dt;
    entity.z += entity.vz * dt;
}

// With SoA: you fetch exactly what you need
for i in 0..entities.x.len() {
    entities.x[i] += entities.vx[i] * dt;
    entities.y[i] += entities.vy[i] * dt;
    entities.z[i] += entities.vz[i] * dt;
}

With AoS, each 64-byte cache line brings in roughly one entity. With SoA, it brings in 16 values of x. You are utilizing 100% of every cache line instead of less than half of it.

Hot/Cold Splitting

Not all fields are used equally. In the previous example, name is probably only read when rendering UI. Why pull it into the cache every time you update physics?

// "Hot" data - accessed constantly
struct EntitiesHot {
    x: Vec<f32>,
    y: Vec<f32>,
    z: Vec<f32>,
    vx: Vec<f32>,
    vy: Vec<f32>,
    vz: Vec<f32>,
}

// "Cold" data - accessed rarely
struct EntitiesCold {
    health: Vec<i32>,
    team: Vec<i32>,
    names: Vec<String>,
}

Separating hot and cold avoids "polluting" the cache with data you don't need in the hot path.
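
As a sketch of what the split buys you in practice (integrate is a hypothetical system, reusing the EntitiesHot struct above): the physics loop only ever touches the hot arrays, so names and teams never enter the cache on this path.

// Only the hot arrays are read and written here; the cold data
// (health, team, names) stays out of the cache entirely.
fn integrate(hot: &mut EntitiesHot, dt: f32) {
    for i in 0..hot.x.len() {
        hot.x[i] += hot.vx[i] * dt;
        hot.y[i] += hot.vy[i] * dt;
        hot.z[i] += hot.vz[i] * dt;
    }
}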

Alignment and Padding

Compilers align each field of a struct to a multiple of its alignment (for primitive types, usually the type's size). With a fixed, C-style field order, that can leave gaps:

use std::mem::size_of;

// #[repr(C)] keeps the declared field order. Rust's default layout is
// free to reorder fields and would remove this padding on its own.
#[repr(C)]
struct Bad {
    a: u8,       // 1 byte + 7 bytes of padding
    b: f64,      // 8 bytes
    c: u8,       // 1 byte + 7 bytes of padding
}

#[repr(C)]
struct Good {
    b: f64,      // 8 bytes
    a: u8,       // 1 byte
    c: u8,       // 1 byte + 6 bytes of padding
}

fn main() {
    println!("Bad: {} bytes", size_of::<Bad>());   // 24 bytes
    println!("Good: {} bytes", size_of::<Good>()); // 16 bytes
}

Same content, 8 bytes less. In a vector of millions of elements, that's megabytes of memory and cache lines wasted.

Rule of thumb: order fields from largest size to smallest.

In Rust, the default layout is free to reorder fields, which usually removes this padding for you; use #[repr(C)] when you need a fixed, C-compatible layout, and tools like the nightly -Zprint-type-sizes flag (or cargo-bloat, for overall binary size) to see where your bytes actually go.

3. "Free" Gains By Ordering Well

SIMD: Processing Multiple Data at Once

Your CPU has special registers that can operate on multiple values simultaneously:

  • SSE: 128 bits → 4 floats at once
  • AVX: 256 bits → 8 floats at once
  • AVX-512: 512 bits → 16 floats at once

Scalar Operation: 4 instructions

a[0] += b[0]
a[1] += b[1]
a[2] += b[2]
a[3] += b[3]

SIMD Operation: 1 instruction

[a0,a1,a2,a3] += [b0,b1,b2,b3]

But there is a requirement: data must be contiguous (and ideally aligned) in memory. Remember SoA? Exactly. With SoA, the compiler can often auto-vectorize your loops. With AoS, it usually can't.

// SoA: the compiler can vectorize this automatically
for i in 0..n {
    x[i] = x[i] + vx[i] * dt;
}

// AoS: the strided, mixed layout defeats auto-vectorization
for entity in &mut entities {
    entity.x += entity.vx * dt;
}

You don't need to write intrinsics by hand. Compile with --release (which uses opt-level 3) and, if you can target the machine you build on, add -C target-cpu=native so LLVM can use wider SIMD extensions like AVX. But your data layout must allow it.

Branchless: Stopping the CPU from Guessing Wrong

When you have conditionals in hot loops with unpredictable data, the branch predictor will fail ~50% of the time.

// Branch version - constant mispredictions
let mut sum = 0i64;
for &value in &data {
    if value >= 128 {
        sum += value as i64;
    }
}

// Branchless version - no prediction needed
let mut sum = 0i64;
for &value in &data {
    let mask = (value >= 128) as i64;  // 0 or 1
    sum += mask * value as i64;
}

The branchless version performs more operations per iteration, but it eliminates mispredictions. On random data, it can be 2-5x faster.

Common techniques:

  • Arithmetic operations instead of if
  • Bit masks
  • Sorting data first: Sometimes, simply sorting the array makes the if predictable (all false first, all true later) and the code flies without changing the logic.
// The sort trick: makes the branch 100% predictable
data.sort_unstable();
for &value in &data {
    if value >= 128 {  // Now it's: false,false,...,true,true,true
        sum += value as i64;
    }
}

4. Smart Memory Management

Why Frequent Allocations Are Slow

Every call to the allocator:

  1. Searches for a free slot in the heap (complex algorithm).
  2. Updates internal allocator structures.
  3. Possibly asks the OS for more memory (syscall).
  4. Can place your data at any random address.

Furthermore, freeing memory frequently causes fragmentation: the heap ends up full of small holes between used blocks, worsening locality.
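
Before reaching for a custom allocator, the cheapest fix is often just to stop allocating inside the loop. A minimal sketch (normalize_slow and normalize_fast are hypothetical names):

// Allocates a fresh Vec on every call: allocator work + scattered memory.
fn normalize_slow(samples: &[f32]) -> Vec<f32> {
    samples.iter().map(|&s| s / 32768.0).collect()
}

// Reuses a caller-owned buffer: no allocation after the first call.
fn normalize_fast(samples: &[f32], out: &mut Vec<f32>) {
    out.clear(); // keeps the existing capacity
    out.extend(samples.iter().map(|&s| s / 32768.0));
}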

Arena Allocators

The idea is simple: you ask for a large block of memory at the start and then hand out pieces of that block.

use bumpalo::Bump;

// Create an arena with a pre-reserved chunk (it grows in further chunks if needed)
let arena = Bump::with_capacity(1024 * 1024);

// O(1) Allocations - just increments a pointer
let x = arena.alloc(42i32);
let y = arena.alloc([1, 2, 3, 4, 5]);
let s = arena.alloc_str("hello");

// When exiting scope, EVERYTHING is freed at once
// No individual free(), no fragmentation

Advantages:

  • Allocation O(1): just increment a counter.
  • Deallocation O(1): reset or drop the entire arena.
  • Guaranteed locality: everything is contiguous.
  • Zero fragmentation.
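
A common way to use this is a per-frame (or per-request) arena: allocate freely while doing the work, then reclaim everything with a single reset. A sketch using bumpalo:

use bumpalo::Bump;

fn main() {
    let mut frame_arena = Bump::new();
    for _frame in 0..3 {
        // Scratch data that only needs to live for this frame
        let scratch: &mut [f32] = frame_arena.alloc_slice_fill_copy(1024, 0.0);
        scratch[0] = 42.0;
        // ... run the frame using `scratch` ...

        frame_arena.reset(); // frees every allocation from this frame at once
    }
}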

Handles vs. Pointers

When you need "references" between entities, pointers (or, in Rust, Box<T> and Rc<T>) scatter your accesses all over memory.

The alternative is to use handles: indices into dense arrays.

use std::cell::RefCell;
use std::rc::Rc;

// With scattered references - memory all over the place
struct Entity {
    target: Option<Rc<RefCell<Entity>>>,
}

// With handles - contiguous memory, cache-friendly
#[derive(Clone, Copy)]
struct EntityHandle(u32);

struct World {
    entities: Vec<Entity>,
}

impl World {
    fn get(&self, handle: EntityHandle) -> &Entity {
        &self.entities[handle.0 as usize]
    }

    fn get_mut(&mut self, handle: EntityHandle) -> &mut Entity {
        &mut self.entities[handle.0 as usize]
    }
}

Handles keep your data in contiguous arrays while allowing relationships between entities. Plus:

  • They make serialization easy (an index is trivial to save).
  • They allow you to detect invalid references (you can validate the index).
  • They avoid Rust's lifetime headaches with reference graphs.

This pattern is the foundation of most ECS (Entity Component Systems) like bevy_ecs, hecs, or legion.
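
As a usage sketch (redefining Entity with hypothetical hp and target fields to show the idea): relationships are stored as handles, and all entity data stays in one contiguous Vec.

struct Entity {
    hp: i32,
    target: Option<EntityHandle>, // a handle, not a pointer
}

fn main() {
    let mut world = World { entities: Vec::new() };
    world.entities.push(Entity { hp: 100, target: None });
    world.entities.push(Entity { hp: 80, target: Some(EntityHandle(0)) });

    // Follow the relationship through the dense array, not through a pointer.
    let attacker = world.get(EntityHandle(1));
    if let Some(target) = attacker.target {
        println!("target hp: {}", world.get(target).hp);
    }
}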

5. When To Apply This (And When Not To)

The Honest Filter

Data-Oriented Design is not a silver bullet. Before rewriting your code, ask yourself:

Where is the bottleneck?

Bottleneck type                  Does DOD help?
I/O bound (network, disk, DB)    ❌ No
Memory bound                     ✅ Exactly for this
CPU bound (pure math)            ⚠️ Helps, but check algorithms first

If your API spends 95% of its time waiting for database responses, optimizing the memory layout of your objects is like polishing the windshield of a car with no engine.

Is it in a hot path?

90% of execution time usually happens in 10% of the code. Optimize that 10%.

// This runs once at startup - DO NOT optimize
let config = parse_config_file()?;

// This runs 60 times per second with 100K entities - YES optimize
for i in 0..entities.len() {
    update_physics(&mut entities, i, dt);
}

How many elements are you processing?

Element count        Is DOD worth it?
< 1,000              Probably doesn't matter
1,000 - 100,000      Can help in hot paths
> 100,000            Makes a real difference

6. What If I Don't Use Rust?

DOD principles are universal. Here is how to apply them in other languages:

Java

// Primitive arrays instead of ArrayList<Integer>
float[] x = new float[10000];  // Contiguous, cache-friendly
// vs
ArrayList<Float> x;  // Each Float is a separate object on the heap

// For SoA, use parallel arrays
float[] posX = new float[n];
float[] posY = new float[n];
float[] posZ = new float[n];

Note: Project Valhalla (in development) will bring Value Types to Java, allowing inline structs like in C#.

Python

# NumPy IS Data-Oriented Design
import numpy as np

# Contiguous, typed C buffers under the hood - not scattered Python objects
positions = np.zeros((10000, 3), dtype=np.float32)
velocities = np.zeros((10000, 3), dtype=np.float32)

# Vectorized operations - automatic SIMD
positions += velocities * dt  # Processes the whole array at once

The reason NumPy/Pandas are fast is exactly DOD: contiguous arrays in C that leverage cache and SIMD.

JavaScript/TypeScript

// TypedArrays are the way to have contiguous memory
const positions = new Float32Array(10000 * 3);
const velocities = new Float32Array(10000 * 3);

// Instead of:
const entities = []; // Array of scattered objects
for (let i = 0; i < 10000; i++) {
    entities.push({ x: 0, y: 0, z: 0, vx: 0, vy: 0, vz: 0 });
}

In WebGL/WebGPU, TypedArrays are mandatory. That is not a coincidence.

Conclusion

Data-Oriented Design comes down to one idea: organize your data for how you process it, not for how you conceptualize it.

Modern hardware is brutally fast if you feed it correctly. Cache lines, prefetching, SIMD, branch prediction — all these features exist and are waiting. You just need contiguous data and predictable access.

You don't need to rewrite all your code. Start by:

  1. Identifying hot paths with a profiler.
  2. Measuring cache misses in those sections.
  3. Considering SoA where you iterate over specific fields.
  4. Using contiguous vectors instead of structures with scattered pointers.

Performance isn't magic. It's physics: bytes traveling through wires, transistors switching in nanoseconds. When you understand the machine, you can write code that flies.

Resources To Go Deeper

You've seen the map. Here are the paths to explore each territory:

  • Hardware Fundamentals

    • What Every Programmer Should Know About Memory — Ulrich Drepper. The canonical document.
    • Computer Architecture: A Quantitative Approach — Hennessy & Patterson.
  • Data-Oriented Design

    • Data-Oriented Design (free book) — Richard Fabian.
    • CppCon 2014: Data-Oriented Design and C++ — Mike Acton. The talk that popularized DOD.
  • Rust Specific

    • The Rust Performance Book
    • Rust SIMD Performance Guide
  • Real Implementations (ECS)

    • Bevy ECS — Game engine in Rust with ECS as its core.
    • EnTT — Modern C++ ECS.
    • Flecs — C ECS with excellent documentation.
