DEV Community: Vladimir Zem

Data-Oriented 3D Math: Structuring quaternions and matrices for auto-vectorization in C++

Vladimir Zem — Wed, 17 Jun 2026 05:57:21 +0000

Rendering pipelines, spatial audio, physics solvers. In these areas the CPU is chewing through millions of matrix mults and quaternion rotations. Every single frame. Hardware is monstrously fast today. But somehow, math routines still manage to bottleneck the whole application.

Actually the bottleneck is almost never the math itself. It’s the memory layout. Wrap geometry primitives in heavy object-oriented abstractions, and you basically throw sand in the gears. You stop the CPU from doing the one thing it is actually built for. Blasting instructions over flat, contiguous memory.

The OOP Penalty

The standard textbook way to write a 3D math lib is all about encapsulation. You hide data to keep state safe. So you get classes with private members, custom constructors, getters, setters. Maybe even a virtual destructor if someone wanted to build a polymorphic hierarchy. Looks correct on a UML diagram. But the hardware penalty is brutal.

Give an object a non-trivial constructor, a v-table or just some padding for alignment - and you instantly break the CPU’s data locality assumptions. CPUs operate in cache lines. Usually that’s 64 bytes fetched from RAM straight into L1 cache. Let’s say a 16-byte quaternion gets padded to 24 bytes just to hold a virtual table pointer. Cache line utilization drops. You burn memory bandwidth loading structural garbage. Stuff that has absolutely zero to do with the actual math. And worse. This OOP boilerplate actively blocks the compiler from touching SIMD registers.
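A minimal sketch of that size penalty, assuming a typical 64-bit ABI where the v-table pointer takes 8 bytes. The names `PlainQuat` and `VirtualQuat` are hypothetical and not taken from any library.

```cpp
#include <type_traits>

// Flat quaternion: four floats and nothing else.
struct PlainQuat {
    float w, x, y, z;
};

// Same data, but the virtual destructor makes the compiler add a hidden
// v-table pointer to every instance.
struct VirtualQuat {
    virtual ~VirtualQuat() = default;
    float w, x, y, z;
};

static_assert(sizeof(PlainQuat) == 16);
static_assert(std::is_trivially_copyable_v<PlainQuat>);
static_assert(std::is_standard_layout_v<PlainQuat>);

// The polymorphic version is larger (24 bytes on common 64-bit ABIs)
// and loses both layout guarantees.
static_assert(sizeof(VirtualQuat) > sizeof(PlainQuat));
static_assert(!std::is_trivially_copyable_v<VirtualQuat>);
static_assert(!std::is_standard_layout_v<VirtualQuat>);
```

Four `PlainQuat` values fill a 64-byte line exactly. At 24 bytes, the same line holds two whole objects and part of a third.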

Compiler Paranoia

Clang, GCC and MSVC are highly aggressive at auto-vectorizing loops nowadays. But they are deeply paranoid. They operate strictly inside the ABI bounds and static analysis limits. For the auto-vectorizer to safely replace scalar float ops with vectorized instructions (like the FMA instruction vfmadd231ps), the compiler needs hard proof of two things. First, contiguous memory. Flat layout with zero hidden padding. Second, type transparency. Meaning it can prove that the memory ranges it reads and writes don't overlap. That is where the strict aliasing rules come in.
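As a rough illustration of the loop shape the vectorizer looks for, here is a minimal sketch over flat float arrays. The names `axpy` and `axpy_demo` are assumptions made for this example. Whether SIMD code is actually emitted depends on compiler, flags and target; GCC's -fopt-info-vec and Clang's -Rpass=loop-vectorize report which loops were vectorized.

```cpp
#include <cstddef>

// y[i] += a * x[i] over two flat arrays.
// The caller is expected to pass non-overlapping ranges. The compiler cannot
// assume that from the signature alone and may emit a runtime overlap check.
constexpr void axpy(float a, const float* x, float* y, std::size_t n) noexcept {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}

// Compile-time demo on two small local arrays.
constexpr float axpy_demo() noexcept {
    float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    axpy(2.0f, x, y, 4);
    return y[3];  // 40 + 2 * 4
}

static_assert(axpy_demo() == 48.0f);
```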

If a C++ class is not trivially copyable (std::is_trivially_copyable_v<T> is false), the compiler gets scared. It emits defensive machine code. It might pass the object by a hidden pointer instead of shoving it directly into CPU registers like XMM/YMM. Iterate over a big array of matrices, and these memory indirection chains basically stall the hardware prefetcher. The CPU just sits there. Waiting for RAM fetches. Total pipeline starvation.

To get maximum throughput, math primitives must map directly to raw memory blocks. No exceptions.

DOD in C++23

Let’s look at how hardware-sympathetic geometry works in practice. If you inspect the core headers of modern C++23 math libs like Dichotomia (quat.hpp, mat4.hpp), you see strict Data-Oriented Design. No heavy classes. Primitives are just flat standard-layout structs constrained by C++ concepts. Roughly looks like this:

#include <concepts>
#include <type_traits>
#include <cstddef>

namespace dich {

// Constraining the primitive to floats
template <typename T>
concept floating_point = std::is_floating_point_v<T>;

// Flat, data-oriented Quaternion
template <floating_point T>
struct quat {
    T w, x, y, z;

    constexpr quat() noexcept = default;
    constexpr quat(T _w, T _x, T _y, T _z) noexcept 
        : w(_w), x(_x), y(_y), z(_z) {}
};

// Flat Matrix 4x4
template <floating_point T>
struct alignas(alignof(T) * 4) mat4 {
    T data[16];

    constexpr mat4() noexcept = default;
};

// Forcing compiler layout guarantees
static_assert(std::is_standard_layout_v<quat<float>>);
static_assert(std::is_trivially_copyable_v<quat<float>>);
static_assert(sizeof(quat<float>) == 16); // fits cleanly in a 128-bit register

static_assert(std::is_standard_layout_v<mat4<float>>);
static_assert(std::is_trivially_copyable_v<mat4<float>>);
static_assert(sizeof(mat4<float>) == 64); // same size as one 64-byte L1 cache line

} // namespace dich

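Notice what is missing here. No private members, zero virtual functions, no user-defined destructors. By enforcing std::is_trivially_copyable_v and standard layout rules, the code guarantees a mat4<float> is exactly 64 bytes, the size of one cache line. One caveat: with the 16-byte alignment used here an instance can still straddle two lines. It sits in exactly one only when the storage is 64-byte aligned. And because the types are trivially copyable, the ABI is free to pass small ones directly in registers. On x86-64 System V a 16-byte quat<float> travels in XMM registers, while a 64-byte mat4 still goes through memory when passed by value.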
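Size alone does not pin an object to a single cache line; alignment matters too. Below is a minimal sketch, assuming 64-byte lines, that contrasts a 16-byte-aligned and a 64-byte-aligned 4x4 float matrix. `Mat4A16`, `Mat4A64` and `may_straddle_line` are hypothetical names, not Dichotomia's API.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

struct alignas(16) Mat4A16 { float data[16]; };
struct alignas(64) Mat4A64 { float data[16]; };

static_assert(sizeof(Mat4A16) == 64 && alignof(Mat4A16) == 16);
static_assert(sizeof(Mat4A64) == 64 && alignof(Mat4A64) == 64);

// True if some validly aligned address lets a T span two cache lines.
// Relies on alignments being powers of two and sizeof(T) being a
// multiple of alignof(T), which the language guarantees.
template <typename T>
constexpr bool may_straddle_line() noexcept {
    if (sizeof(T) > kCacheLine) return true;     // cannot fit in one line at all
    if (alignof(T) >= kCacheLine) return false;  // always starts on a line boundary
    return sizeof(T) > alignof(T);               // e.g. 64 bytes starting at offset 16
}

static_assert(may_straddle_line<Mat4A16>());
static_assert(!may_straddle_line<Mat4A64>());
static_assert(!may_straddle_line<float>());
```

The trade-off with `alignas(64)` is that any struct embedding such a matrix inherits the 64-byte alignment, which can add padding around it.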

Write a matrix multiplication over these structs, and the compiler easily sees the independent arithmetic ops and the strict alignment.

template<floating_point T>
[[nodiscard]] constexpr mat4<T> multiply(const mat4<T>& a, const mat4<T>& b) noexcept {
    mat4<T> result{};
    // Because both 'a' and 'b' are contiguous float arrays with fixed trip counts,
    // Clang/GCC can fully unroll these loops and map them to SIMD instructions.
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            T sum = 0;
            for (std::size_t k = 0; k < 4; ++k) {
                sum += a.data[i * 4 + k] * b.data[k * 4 + j];
            }
            result.data[i * 4 + j] = sum;
        }
    }
    return result;
}
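A quick way to sanity-check a routine like this is to evaluate it at compile time. The sketch below is self-contained and uses a simplified, non-template `Mat4` as a hypothetical stand-in for the struct above. It checks that multiplying by the identity leaves a matrix unchanged.

```cpp
#include <cstddef>

struct Mat4 { float data[16]; };  // row-major, hypothetical stand-in

constexpr Mat4 multiply(const Mat4& a, const Mat4& b) noexcept {
    Mat4 result{};
    for (std::size_t i = 0; i < 4; ++i) {
        for (std::size_t j = 0; j < 4; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < 4; ++k) {
                sum += a.data[i * 4 + k] * b.data[k * 4 + j];
            }
            result.data[i * 4 + j] = sum;
        }
    }
    return result;
}

// Identity matrix: ones on the diagonal, zeros elsewhere.
constexpr Mat4 identity() noexcept {
    Mat4 m{};
    for (std::size_t i = 0; i < 4; ++i) m.data[i * 4 + i] = 1.0f;
    return m;
}

// Test matrix holding the values 1..16 in row-major order.
constexpr Mat4 sample() noexcept {
    Mat4 m{};
    for (std::size_t i = 0; i < 16; ++i) m.data[i] = static_cast<float>(i + 1);
    return m;
}

constexpr bool equal(const Mat4& a, const Mat4& b) noexcept {
    for (std::size_t i = 0; i < 16; ++i) {
        if (a.data[i] != b.data[i]) return false;
    }
    return true;
}

static_assert(equal(multiply(identity(), sample()), sample()));
static_assert(equal(multiply(sample(), identity()), sample()));
```

Exact comparison is safe here only because every product involves 0 or 1; general matrix tests would need a tolerance.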

Compile this with -O3 -march=native on a CPU with FMA support, and Clang typically spits out vectorized FMA instructions. The C++23 abstractions cost literally zero cycles at runtime. Those static_assert statements? They act as a hard compile-time regression test. If a future developer accidentally adds a virtual method, the build just fails instantly. Performance baseline protected.

The Hardware Reality

Dropping OOP for a flat DOD layout gives very predictable hardware-level returns. Run bulk operations - say, applying transforms to a massive array of entities. The lack of hidden pointers basically kills cache line thrashing completely. The hardware prefetcher predicts the linear memory access pattern like it’s supposed to.
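The bulk operation described here might look like the sketch below: one matrix applied to a contiguous array of points in a single linear pass. `Vec4`, `Mat4`, `transform`, `transform_all` and `translation` are illustrative names, and the matrix is assumed to be row-major with column vectors.

```cpp
#include <cstddef>

struct Vec4 { float x, y, z, w; };
struct Mat4 { float m[16]; };  // row-major, hypothetical stand-in

// Matrix times column vector.
constexpr Vec4 transform(const Mat4& a, const Vec4& v) noexcept {
    return Vec4{
        a.m[0]  * v.x + a.m[1]  * v.y + a.m[2]  * v.z + a.m[3]  * v.w,
        a.m[4]  * v.x + a.m[5]  * v.y + a.m[6]  * v.z + a.m[7]  * v.w,
        a.m[8]  * v.x + a.m[9]  * v.y + a.m[10] * v.z + a.m[11] * v.w,
        a.m[12] * v.x + a.m[13] * v.y + a.m[14] * v.z + a.m[15] * v.w};
}

// Linear walk over a flat array: no pointers to chase, stride of 16 bytes.
constexpr void transform_all(const Mat4& a, Vec4* points, std::size_t count) noexcept {
    for (std::size_t i = 0; i < count; ++i) {
        points[i] = transform(a, points[i]);
    }
}

constexpr Mat4 translation(float tx, float ty, float tz) noexcept {
    return Mat4{{1, 0, 0, tx,
                 0, 1, 0, ty,
                 0, 0, 1, tz,
                 0, 0, 0, 1}};
}

// Compile-time demo: move two points by (1, 2, 3) and return the second one.
constexpr Vec4 demo() noexcept {
    Vec4 pts[2] = {{0, 0, 0, 1}, {1, 1, 1, 1}};
    transform_all(translation(1, 2, 3), pts, 2);
    return pts[1];
}

static_assert(demo().x == 2.0f && demo().y == 3.0f && demo().z == 4.0f && demo().w == 1.0f);
```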

In benchmarks against standard OOP wrappers, instruction cache misses drop massively because branch validation and stack teardown logic are just gone. Throughput scales up hard. And if you check the generated assembly, it confirms a clean 1:1 translation to vfmadd231ps instructions. Basically intrinsic-level performance out of pure standard C++.

To give you an idea of the raw throughput difference on a modern CPU (e.g., AMD Ryzen 7 5800X, compiled with gcc 14 -O3):

Standard OOP Matrix (Scalar): 18.5 ms per 1,000,000 multiplications.
Dichotomia DOD Matrix (Auto-Vectorized): ~4.2 ms per 1,000,000 multiplications.

Compilers are smart, but they are deeply conservative. Give them opaque or fragmented memory layouts and the optimizer will always fall back to the safe, slow scalar path. Performance here is just about structuring data so the hardware reads it without friction. High-level developer ergonomics don’t actually need runtime overhead. Using standard layouts and C++23 constraints, you can build robust math tools. But under the hood, they just act as transparent data pipes for the CPU.

If you are interested in examining the complete data-oriented implementation of these primitives, including the Python bindings for zero-copy FFI, you can inspect the architecture in the Dichotomia repository on GitHub.

zem-invictus / dichotomia

A high-performance, header-only C++23 math library for 3D game engines. Built from scratch focusing on modern C++ features (Value Semantics, Deducing this), strict angle typing, and performance.

Dichotomia

A minimalistic, modern C++23 math library for basic 3D graphics applications. It provides core linear algebra components with an emphasis on constexpr and modern C++ features, alongside seamless, high-performance Python bindings via nanobind (with full NumPy buffer protocol support).

Features

Vectors (Vec2, Vec3, Vec4): Fully templated, constexpr arithmetic, strict ISO C++ operator[] using std::unreachable().
Matrices (Mat4): 4x4 matrix operations, fast Inverse and Determinant, Perspective, Orthographic, LookAt (RH Zero-to-One standard).
Quaternions (Quat): Fast Euler-to-Quaternion conversion, Spherical Linear Interpolation (Slerp), rotation matrices.
Angles (Radians, Degrees): Type-safe angle structs with user-defined literals (180.0_deg, 3.14_rad).
Standardized: Zero-warning compilation, 100% Google C++ Style Guide compliant, complete Google Test coverage.

Performance

Dichotomia leverages C++23 [[assume]] contracts and explicit object parameters (Deducing This) to achieve zero-overhead abstractions. Thanks to aggressive compiler auto-vectorization (tested…


DOD Principles in C++: Part 1. Struct Optimization

Vladimir Zem — Fri, 13 Mar 2026 06:43:07 +0000

Greetings to everyone who wants to write fast and efficient code. In this article, we'll look at a few straightforward ways to optimize your programs when working with structs.

Data Placement in Memory: L1, L2, L3 Caches and RAM

We all know that data (variables, class fields, etc.) is stored in "memory." But most programmers don't give much thought to what this abstract "memory" actually is. Let's dig a little deeper, because understanding this can speed up your code by double-digit percentages.

A computer's memory doesn't consist solely of RAM and files — it also includes so-called L1, L2, and L3 caches. We won't dive into their internal architecture; what matters for us is the fact that they are significantly faster than main memory.

The tradeoff for that speed is limited capacity. The exact numbers vary by CPU model, but the approximate sizes and latencies are:

L1: ~32–64 KB per core, ~4–5 cycles (10–50× faster than RAM);
L2: ~0.5–2 MB per core, ~10–15 cycles (3–20× faster than RAM);
L3: ~10–15 MB, 30–50 cycles (1–6.6× faster than RAM).

Cache Lines and Cache Misses

Data doesn't end up in these caches by magic. The CPU reads from RAM in fixed-size blocks called cache lines. On modern x86/x64 architectures, a single cache line is typically 64 bytes.

Figure: How the CPU fetches data from RAM.

This means that if the CPU needs to read a 1-byte variable from RAM, it won't read just 1 byte — it will fetch an entire 64-byte cache line.

Here's where it gets interesting for us C++ programmers. If the data we need is packed tightly (within those 64 bytes), the CPU processes it almost instantly. If the data is scattered, we get a cache miss, and the CPU stalls for a hundred or so cycles waiting for the next cache line from RAM.

But that's not all. The CPU also needs to move data from caches into registers to perform computations.

Machine Word

Without going too deep into CPU internals, the key point is this: data moves from caches into registers not as 64-byte cache lines, but as machine words, whose size depends on the register width (either 32 or 64 bits). This rigid "grid-aligned" reading creates two possible scenarios:

Good scenario. The machine word boundary can fully contain the data — no issues, no delays.
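Bad scenario. The data straddles a machine word boundary. The hardware must then perform two reads and stitch the pieces together. Example: an int occupies 1 byte in one machine word and 3 bytes in the next. (x86 handles this transparently at some cost, mostly when the access also crosses a cache line; some other architectures refuse misaligned accesses outright.)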
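The two scenarios boil down to a bit of integer arithmetic. Here is a minimal sketch, assuming 8-byte machine words; `straddles_word` is a hypothetical helper for illustration only.

```cpp
#include <cstddef>

constexpr std::size_t kWord = 8;  // assumed machine word size in bytes

// True if `size` bytes starting at byte offset `offset` cross a word boundary.
constexpr bool straddles_word(std::size_t offset, std::size_t size) noexcept {
    return offset / kWord != (offset + size - 1) / kWord;
}

// Good scenario: a 4-byte int at offset 4 sits inside one word (bytes 4..7).
static_assert(!straddles_word(4, 4));

// Bad scenario from the text: a 4-byte int at offset 7 has 1 byte in the
// first word and 3 bytes in the next (bytes 7..10).
static_assert(straddles_word(7, 4));
```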

To prevent the bad scenario, programming languages include built-in mechanisms. Let's look at how C++ handles this.

C++: Alignment, Padding, and Wasted Space Out of Nowhere

To avoid the straddling problem, C++ uses padding and alignment. You can find formal definitions in the standard, but let's look at how they work in practice.

Consider a simple struct with fields in an arbitrary (spoiler: worst possible) order:

struct BadStruct {
  bool active;     // 1 byte
  double position; // 8 bytes
  int id;          // 4 bytes
  bool is_liquid;  // 1 byte
  int energy;      // 4 bytes
};

At first glance, this struct should be 18 bytes. But if we check sizeof(BadStruct), the result is, to put it mildly, not quite that:

std::cout << sizeof(BadStruct);
// Output: 32

32 bytes instead of 18: 14 of those 32 bytes (about 44%) are pure padding! Where does all that extra size come from? It's that machine word issue and the alignment that follows from it.

To prevent data from straddling machine word boundaries, C++ enforces an alignment rule: a variable's address in memory must be a multiple of its alignment (alignof), which for fundamental types on mainstream platforms equals their size. For example, an int (4 bytes) can only reside at addresses 0, 4, 8, 12, and so on. A double (8 bytes) can only be at addresses 0, 8, 16, etc.
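The rule is easy to check directly with `alignof`. In this small sketch, the concrete values asserted hold on mainstream 64-bit platforms (x86-64, AArch64) but are implementation-defined in general, and `is_aligned` is a hypothetical helper.

```cpp
#include <cstddef>

// Typical sizes and alignments on mainstream 64-bit platforms.
static_assert(sizeof(bool) == 1 && alignof(bool) == 1);
static_assert(sizeof(int) == 4 && alignof(int) == 4);
static_assert(sizeof(double) == 8 && alignof(double) == 8);

// True if `address` is a valid location for an object with this alignment.
constexpr bool is_aligned(std::size_t address, std::size_t alignment) noexcept {
    return address % alignment == 0;
}

static_assert(is_aligned(12, alignof(int)));      // 0, 4, 8, 12, ... work for int
static_assert(!is_aligned(12, alignof(double)));  // double needs 0, 8, 16, ...
static_assert(is_aligned(16, alignof(double)));
```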

Figure: Size and alignment for each type.

When the compiler sees that the next field wouldn't land on a properly aligned address, it inserts empty bytes — that's the padding. Let's trace through each byte of our struct:

bool active (1 byte) — occupies address 0.
double position (8 bytes) — must be at an address divisible by 8. The nearest such address is 8. The compiler inserts 7 bytes of padding (addresses 1–7).
int id (4 bytes) — lands at addresses 16..19. Address 16 is divisible by 4 — perfect.
bool is_liquid (1 byte) — occupies address 20.
int energy (4 bytes) — requires an address divisible by 4. The nearest is 24. The compiler inserts 3 bytes of padding (addresses 21–23).

Our data and internal padding end at byte 27 (current size: 28 bytes). But why did sizeof report 32?

This is where the non-obvious tail alignment rule kicks in. The total size of a struct must be a multiple of the strictest alignment among its fields. In our case, that's the 8-byte alignment of double. The nearest multiple of 8 that is ≥ 28 is 32.

The compiler adds 4 more bytes of padding at the end. This ensures that in an array of such structs (BadStruct array[2]), the second element also starts at an address divisible by 8.
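The trace can be verified with `offsetof` and `sizeof`. The asserted numbers assume a mainstream 64-bit ABI, where int is 4 bytes and double is 8 bytes with matching alignment.

```cpp
#include <cstddef>

struct BadStruct {
    bool active;
    double position;
    int id;
    bool is_liquid;
    int energy;
};

static_assert(offsetof(BadStruct, active) == 0);
static_assert(offsetof(BadStruct, position) == 8);    // 7 bytes of padding before it
static_assert(offsetof(BadStruct, id) == 16);
static_assert(offsetof(BadStruct, is_liquid) == 20);
static_assert(offsetof(BadStruct, energy) == 24);     // 3 bytes of padding before it
static_assert(sizeof(BadStruct) == 32);               // 4 bytes of tail padding
static_assert(alignof(BadStruct) == 8);               // inherited from double
```

Leaving asserts like these next to a hot struct turns an accidental layout change into a build error.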

The fix is simple: sort the fields in descending order of size (strictly speaking, of alignment):

struct GoodStruct {
  double position; // 8 bytes
  int id;          // 4 bytes
  int energy;      // 4 bytes
  bool active;     // 1 byte
  bool is_liquid;  // 1 byte
};

Let's check the size:

std::cout << sizeof(GoodStruct);
// Output: 24

Remarkable — just by reordering the fields, we reduced the struct's memory footprint by 25%! But let's not take the theory at face value — let's back it up with benchmarks.
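The layout rules are mechanical enough to write down as a small function. The sketch below replays them for both field orders. The `Field` type, `round_up` and `layout_size` are made up for illustration and only model plain fields with typical sizes and alignments.

```cpp
#include <cstddef>

struct Field {
    std::size_t size;
    std::size_t align;
};

constexpr std::size_t round_up(std::size_t value, std::size_t multiple) noexcept {
    return (value + multiple - 1) / multiple * multiple;
}

// Places each field at the next offset that satisfies its alignment, then
// rounds the total up to the strictest alignment seen (tail padding).
template <std::size_t N>
constexpr std::size_t layout_size(const Field (&fields)[N]) noexcept {
    std::size_t offset = 0;
    std::size_t max_align = 1;
    for (std::size_t i = 0; i < N; ++i) {
        offset = round_up(offset, fields[i].align) + fields[i].size;
        if (fields[i].align > max_align) max_align = fields[i].align;
    }
    return round_up(offset, max_align);
}

constexpr Field kBool{1, 1};
constexpr Field kInt{4, 4};
constexpr Field kDouble{8, 8};

// Field order of BadStruct and GoodStruct respectively.
constexpr Field kBadOrder[] = {kBool, kDouble, kInt, kBool, kInt};
constexpr Field kGoodOrder[] = {kDouble, kInt, kInt, kBool, kBool};

static_assert(layout_size(kBadOrder) == 32);
static_assert(layout_size(kGoodOrder) == 24);
```

Both orders carry the same 18 bytes of data; the difference is 14 bytes of padding versus 6.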

C++: The Cost of Padding — Benchmarks

We'll write a simple performance test for our "bad" and "good" structs using Google Benchmark. The test iterates over arrays of 10,000 to 1,000,000 structs and performs a trivial math operation: adding 1 to the position field of every struct whose active flag is set.

Test for the BadStruct array:

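#include <benchmark/benchmark.h>
#include <vector>

static void BM_BadStructIteration(benchmark::State& state) {
    // Elements are value-initialized, so `active` is false unless it is set
    // before the timing loop.
    std::vector<BadStruct> data(state.range(0));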
    for (auto _ : state) {
        for (auto& item : data) {
            if (item.active) {
                benchmark::DoNotOptimize(item.position += 1.0);
            }
        }
        benchmark::ClobberMemory();
    }
}

Test for the GoodStruct array:

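static void BM_GoodStructIteration(benchmark::State& state) {
    // Elements are value-initialized, so `active` is false unless it is set
    // before the timing loop.
    std::vector<GoodStruct> data(state.range(0));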
    for (auto _ : state) {
        for (auto& item : data) {
            if (item.active) {
                benchmark::DoNotOptimize(item.position += 1.0);
            }
        }
        benchmark::ClobberMemory();
    }
}

Note: benchmark::DoNotOptimize is there to prevent the compiler from eliminating the loop entirely (Dead Code Elimination).

Running the benchmarks:

BENCHMARK(BM_BadStructIteration)->Range(10000, 1000000);
BENCHMARK(BM_GoodStructIteration)->Range(10000, 1000000);
BENCHMARK_MAIN();

Results:

-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
BM_BadStructIteration/10000          4011 ns         4011 ns       165946
BM_BadStructIteration/32768         13408 ns        13407 ns        50500
BM_BadStructIteration/262144       107153 ns       107146 ns         6240
BM_BadStructIteration/1000000      940122 ns       939709 ns         1013
BM_GoodStructIteration/10000         4230 ns         4226 ns       169877
BM_GoodStructIteration/32768        14302 ns        14302 ns        48910
BM_GoodStructIteration/262144      119729 ns       119669 ns         6144
BM_GoodStructIteration/1000000     579492 ns       579507 ns         1103

The Iterations column shows how many times Google Benchmark ran the loop to gather statistically reliable data. Fast tests (10,000 elements) ran over 160,000 times; heavy ones (1,000,000 elements) ran about a thousand. The Time and CPU columns show the average time per single function execution.
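To compare the two layouts row by row, the relative change in the Time column can be computed as sketched below. `percent_change` is a hypothetical helper, and the inputs are copied from the table above.

```cpp
// Relative change of GoodStruct versus BadStruct in percent.
// Negative values mean GoodStruct is faster.
constexpr double percent_change(double bad_ns, double good_ns) noexcept {
    return (good_ns - bad_ns) / bad_ns * 100.0;
}

constexpr double kAt10k  = percent_change(4011.0, 4230.0);      // about +5.5
constexpr double kAt32k  = percent_change(13408.0, 14302.0);    // about +6.7
constexpr double kAt262k = percent_change(107153.0, 119729.0);  // about +11.7
constexpr double kAt1M   = percent_change(940122.0, 579492.0);  // about -38.4

static_assert(kAt10k > 5.0 && kAt10k < 6.0);
static_assert(kAt32k > 6.0 && kAt32k < 7.0);
static_assert(kAt262k > 11.0 && kAt262k < 12.0);
static_assert(kAt1M < -38.0 && kAt1M > -39.0);
```

In these numbers GoodStruct is a few percent slower for the three smaller sizes and about 38% faster at 1,000,000 elements.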

Hard to interpret raw numbers, right? Let's plot them.

What do we see? The results are nonlinear. Up to 262,144 elements the two layouts stay within roughly 5–12% of each other, with GoodStruct actually slightly slower. But at 1,000,000 elements, GoodStruct is 38% faster! What causes this?

It's all about data volume. Arrays of 10,000, 32,768 and 262,144 bad structs (312.5 KB, 1,024 KB and 8,192 KB respectively) still fit in the cache, which is consistent with the small differences measured at those sizes. But once the element count reaches 1,000,000 (about 30.5 MB), the L3 cache runs out of space, and data has to stream in from slow RAM. That's where the cache line becomes critical.

Let's recall the 64-byte cache line:

BadStruct is 32 bytes. Exactly 2 structs fit in one cache line. To process a million elements, the CPU must fetch 500,000 cache lines from RAM.
GoodStruct is 24 bytes. About 2.67 structs fit in one cache line. To process a million elements, the CPU only needs to fetch about 375,000 cache lines.
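The same arithmetic, written out. This sketch assumes 64-byte cache lines and a tightly packed array that starts on a line boundary; `cache_lines_for` is a hypothetical helper.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// Number of distinct cache lines covered by `count` contiguous elements
// of `elem_size` bytes each, rounded up.
constexpr std::size_t cache_lines_for(std::size_t count, std::size_t elem_size) noexcept {
    return (count * elem_size + kCacheLine - 1) / kCacheLine;
}

static_assert(cache_lines_for(1'000'000, 32) == 500'000);  // BadStruct
static_assert(cache_lines_for(1'000'000, 24) == 375'000);  // GoodStruct
```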

See what happened? We cut the number of accesses to the slowest memory in the computer by a quarter — just by sorting the variables in our class from largest to smallest. No changes to logic, no fancy algorithms — pure Data-Oriented Design.

Note: Why do we count RAM reads for the entire million elements? Wouldn't some stay in the L3 cache? They will, but not for long. The array size exceeds the CPU's cache capacity. By the time the CPU reaches the end of the million-element array, the beginning has already been evicted. On the next benchmark iteration, everything has to be fetched from RAM again.

Conclusion

In this part of the "DOD Principles" series, we looked at a simple way to optimize struct sizes, tested its real impact on performance, and explored why it works the way it does.

TL;DR: Sort your struct/class fields from largest to smallest, and you'll be fine.

In the next part, we'll go further and examine the AoS (Array of Structures) and SoA (Structure of Arrays) patterns, which let us squeeze even more performance out of the CPU — for instance, when building physics engines and complex simulations.

Thanks for reading — write fast code and enjoy the process!

If you found this useful, a ❤️ and a follow would mean a lot. See you in Part 2!