Greetings to everyone who wants to write fast and efficient code. In this article, we'll look at a few straightforward ways to optimize your programs when working with structs.
Data Placement in Memory: L1, L2, L3 Caches and RAM
We all know that data (variables, class fields, etc.) is stored in "memory." But most programmers don't give much thought to what this abstract "memory" actually is. Let's dig a little deeper, because understanding this can speed up your code by double-digit percentages.
A computer's memory doesn't consist solely of RAM and files — it also includes so-called L1, L2, and L3 caches. We won't dive into their internal architecture; what matters for us is the fact that they are significantly faster than main memory.
The tradeoff for that speed is limited capacity. The exact numbers vary by CPU model, but the approximate sizes and latencies are:
- L1: ~100 KB, 2–3 cycles (16–100× faster than RAM);
- L2: ~500 KB, 3–5 cycles (10–66× faster than RAM);
- L3: ~10–15 MB, 30–50 cycles (1–6.6× faster than RAM).
Cache Lines and Cache Misses
Data doesn't end up in these caches by magic. The CPU reads from RAM in fixed-size blocks called cache lines. On modern x86/x64 architectures, a single cache line is typically 64 bytes.
(Figure: how the CPU fetches data from RAM.)
This means that if the CPU needs to read a 1-byte variable from RAM, it won't read just 1 byte — it will fetch an entire 64-byte cache line.
Here's where it gets interesting for us C++ programmers. If the data we need is packed tightly (within those 64 bytes), the CPU processes it almost instantly. If the data is scattered, we get a cache miss, and the CPU stalls for a hundred or so cycles waiting for the next cache line from RAM.
But that's not all. The CPU also needs to move data from caches into registers to perform computations.
Machine Word
Without going too deep into CPU internals, the key point is this: data moves from caches into registers not as 64-byte cache lines, but as machine words, whose size depends on the register width (either 32 or 64 bits). This rigid "grid-aligned" reading creates two possible scenarios:
- Good scenario. The data fits entirely within a single machine word — no issues, no delays.
- Bad scenario. The data straddles a machine word boundary. The CPU must then read two machine words and "stitch" them together using bit shifts. Example: an int occupies 1 byte in one machine word and 3 bytes in the next.
To prevent the bad scenario, programming languages include built-in mechanisms. Let's look at how C++ handles this.
C++: Alignment, Padding, and Wasted Space Out of Nowhere
To avoid the straddling problem, C++ uses padding and alignment. You can find formal definitions in the standard, but let's look at how they work in practice.
Consider a simple struct with fields in an arbitrary (spoiler: worst possible) order:
struct BadStruct {
    bool active;      // 1 byte
    double position;  // 8 bytes
    int id;           // 4 bytes
    bool is_liquid;   // 1 byte
    int energy;       // 4 bytes
};
At first glance, this struct should be 18 bytes. But if we check sizeof(BadStruct), the result is, to put it mildly, not quite that:
std::cout << sizeof(BadStruct);
// Output: 32
32 bytes instead of 18 — a 44% difference! Where does all that extra size come from? It's that machine word issue and the alignment that follows from it.
To prevent data from straddling machine word boundaries, C++ enforces an alignment rule: a variable's address in memory must be a multiple of its alignment requirement (alignof), which for fundamental types typically equals the type's size. For example, an int (4 bytes) can only reside at addresses 0, 4, 8, 12, and so on. A double (8 bytes) can only be at addresses 0, 8, 16, etc.
(Figure: size and alignment for each fundamental type.)
When the compiler sees that the next field wouldn't land on a properly aligned address, it inserts empty bytes — that's the padding. Let's trace through each byte of our struct:
- bool active (1 byte) — occupies address 0.
- double position (8 bytes) — must be at an address divisible by 8. The nearest such address is 8, so the compiler inserts 7 bytes of padding (addresses 1–7); the double occupies 8–15.
- int id (4 bytes) — lands at addresses 16–19. Address 16 is divisible by 4 — perfect.
- bool is_liquid (1 byte) — occupies address 20.
- int energy (4 bytes) — requires an address divisible by 4. The nearest is 24, so the compiler inserts 3 bytes of padding (addresses 21–23); the int occupies 24–27.
Our data and internal padding end at byte 27 (current size: 28 bytes). But why did sizeof report 32?
This is where the non-obvious tail alignment rule kicks in. The total size of a struct must be a multiple of its strictest member alignment — here that's double (8 bytes). The nearest multiple of 8 that is ≥ 28 is 32.
The compiler adds 4 more bytes of padding at the end. This ensures that in an array of such structs (BadStruct array[2]), the second element also starts at an address divisible by 8.
The fix is simple — sort the fields in descending order of size:
struct GoodStruct {
    double position;  // 8 bytes
    int id;           // 4 bytes
    int energy;       // 4 bytes
    bool active;      // 1 byte
    bool is_liquid;   // 1 byte
};
Let's check the size:
std::cout << sizeof(GoodStruct);
// Output: 24
Remarkable — just by reordering the fields, we reduced the struct's memory footprint by 25%! But let's not take the theory at face value — let's back it up with benchmarks.
C++: The Cost of Padding — Benchmarks
We'll write a simple performance test for our "bad" and "good" structs using Google Benchmark. The test iterates over arrays of 10,000 to 1,000,000 structs and performs a trivial operation: adding 1.0 to the position field of every active element.
Test for the BadStruct array:
#include <benchmark/benchmark.h>
#include <vector>

static void BM_BadStructIteration(benchmark::State& state) {
    std::vector<BadStruct> data(state.range(0));
    for (auto _ : state) {
        for (auto& item : data) {
            if (item.active) {
                benchmark::DoNotOptimize(item.position += 1.0);
            }
        }
        benchmark::ClobberMemory();
    }
}
Test for the GoodStruct array:
static void BM_GoodStructIteration(benchmark::State& state) {
    std::vector<GoodStruct> data(state.range(0));
    for (auto _ : state) {
        for (auto& item : data) {
            if (item.active) {
                benchmark::DoNotOptimize(item.position += 1.0);
            }
        }
        benchmark::ClobberMemory();
    }
}
Note: benchmark::DoNotOptimize is there to prevent the compiler from eliminating the loop entirely (Dead Code Elimination).
Running the benchmarks:
BENCHMARK(BM_BadStructIteration)->Range(10000, 1000000);
BENCHMARK(BM_GoodStructIteration)->Range(10000, 1000000);
BENCHMARK_MAIN();
Results:
-------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------
BM_BadStructIteration/10000 4011 ns 4011 ns 165946
BM_BadStructIteration/32768 13408 ns 13407 ns 50500
BM_BadStructIteration/262144 107153 ns 107146 ns 6240
BM_BadStructIteration/1000000 940122 ns 939709 ns 1013
BM_GoodStructIteration/10000 4230 ns 4226 ns 169877
BM_GoodStructIteration/32768 14302 ns 14302 ns 48910
BM_GoodStructIteration/262144 119729 ns 119669 ns 6144
BM_GoodStructIteration/1000000 579492 ns 579507 ns 1103
The Iterations column shows how many times Google Benchmark ran the loop to gather statistically reliable data. Fast tests (10,000 elements) ran over 160,000 times; heavy ones (1,000,000 elements) ran about a thousand. The Time and CPU columns show the average time per single function execution.
Hard to interpret raw numbers, right? Let's plot them.
What do we see? The results are nonlinear. Up to 262,144 elements, the difference is minimal. But at 1,000,000 elements, it reaches 38%! What causes this?
It's all about data volume. Arrays of 10,000 and 32,768 bad structs (312.5 KB and 1,024 KB respectively) fit comfortably in the cache. But once the element count reaches 262,144 (8,192 KB), the L3 cache starts running out of space, and data has to spill into slow RAM. That's where the cache line becomes critical.
Let's recall the 64-byte cache line:
- BadStruct is 32 bytes. Exactly 2 structs fit in one cache line. To process a million elements, the CPU must make 500,000 requests to RAM.
- GoodStruct is 24 bytes. About 2.66 structs fit in one cache line. To process a million elements, the CPU only needs about 375,000 requests.
See what happened? We cut the number of accesses to the slowest memory in the computer by a quarter — just by sorting the variables in our class from largest to smallest. No changes to logic, no fancy algorithms — pure Data-Oriented Design.
Note: Why do we count RAM reads for the entire million elements? Wouldn't some stay in the L3 cache? They will, but not for long. The array size exceeds the CPU's cache capacity. By the time the CPU reaches the end of the million-element array, the beginning has already been evicted. On the next benchmark iteration, everything has to be fetched from RAM again.
Conclusion
In this part of the "DOD Principles" series, we looked at a simple way to optimize struct sizes, tested its real impact on performance, and explored why it works the way it does.
TL;DR: Sort your struct/class fields from largest to smallest, and you'll be fine.
In the next part, we'll go further and examine the AoS (Array of Structures) and SoA (Structure of Arrays) patterns, which let us squeeze even more performance out of the CPU — for instance, when building physics engines and complex simulations.
Thanks for reading — write fast code and enjoy the process!
If you found this useful, a ❤️ and a follow would mean a lot. See you in Part 2!