<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vladimir Zem</title>
    <description>The latest articles on DEV Community by Vladimir Zem (@zem_code).</description>
    <link>https://dev.to/zem_code</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2637499%2F2268038a-0ee0-4082-9217-026bd594d8ce.jpg</url>
      <title>DEV Community: Vladimir Zem</title>
      <link>https://dev.to/zem_code</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/zem_code"/>
    <language>en</language>
    <item>
      <title>Data-Oriented 3D Math: Structuring quaternions and matrices for auto-vectorization in C++</title>
      <dc:creator>Vladimir Zem</dc:creator>
      <pubDate>Wed, 17 Jun 2026 05:57:21 +0000</pubDate>
      <link>https://dev.to/zem_code/data-oriented-3d-math-structuring-quaternions-and-matrices-for-auto-vectorization-in-c-1a01</link>
      <guid>https://dev.to/zem_code/data-oriented-3d-math-structuring-quaternions-and-matrices-for-auto-vectorization-in-c-1a01</guid>
      <description>&lt;p&gt;Rendering pipelines, spatial audio, physics solvers. In these areas the CPU is chewing through millions of matrix mults and quaternion rotations. Every single frame. Hardware is monstrously fast today. But somehow, math routines still manage to bottleneck the whole application.&lt;/p&gt;

&lt;p&gt;Actually the bottleneck is almost never the math itself. It’s the memory layout. Wrap geometry primitives in heavy object-oriented abstractions, and you basically throw sand in the gears. You stop the CPU from doing the one thing it is actually built for. Blasting instructions over flat, contiguous memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  The OOP Penalty
&lt;/h2&gt;

&lt;p&gt;The standard textbook way to write a 3D math lib is all about encapsulation. You hide data to keep state safe. So you get classes with private members, custom constructors, getters, setters. Maybe even a virtual destructor if someone wanted to build a polymorphic hierarchy. Looks correct on a UML diagram. But the hardware penalty is brutal.&lt;/p&gt;

&lt;p&gt;Give an object a non-trivial constructor, a v-table or just some padding for alignment - and you instantly break the CPU’s data locality assumptions. CPUs operate in cache lines. Usually that’s 64 bytes fetched from RAM into L1 cache. Let’s say a 16-byte quaternion grows to 24 bytes on a 64-bit target just to hold a virtual table pointer. Now only two of them fit in a cache line instead of four. Cache line utilization drops. You burn memory bandwidth loading structural garbage. Stuff that has absolutely zero to do with the actual math. And worse. This OOP boilerplate can actively block the compiler from using SIMD registers.&lt;/p&gt;
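
&lt;p&gt;To make the size penalty concrete, here is a minimal sketch. PlainQuat, VirtualQuat and objects_per_line are hypothetical names made up for this illustration, and the exact size of the polymorphic type depends on the target ABI, so the checks only assume it is bigger than 16 bytes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;type_traits&amp;gt;

// Hypothetical types for illustration only.
struct PlainQuat {
    float w, x, y, z;
};

struct VirtualQuat {
    virtual ~VirtualQuat() = default; // forces a hidden v-table pointer
    float w, x, y, z;
};

// How many whole objects fit in one cache line (64 bytes assumed).
constexpr std::size_t objects_per_line(std::size_t object_size,
                                       std::size_t line_size = 64) noexcept {
    return line_size / object_size;
}

static_assert(sizeof(PlainQuat) == 16);
static_assert(sizeof(VirtualQuat) &amp;gt; sizeof(PlainQuat));
static_assert(std::is_trivially_copyable_v&amp;lt;PlainQuat&amp;gt;);
static_assert(!std::is_trivially_copyable_v&amp;lt;VirtualQuat&amp;gt;);
static_assert(objects_per_line(sizeof(PlainQuat)) == 4);
static_assert(objects_per_line(sizeof(VirtualQuat)) &amp;lt; 4);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On a typical 64-bit target sizeof(VirtualQuat) comes out as 24, so only two of them fit in a line. The virtual destructor also makes the type non-trivially-copyable, which the last two type-trait checks confirm.&lt;/p&gt;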

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sgv2hdd1fzrvpq3yrq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2sgv2hdd1fzrvpq3yrq2.png" alt="_Memory layout comparison: Fragmented OOP allocation vs. strictly aligned Data-Oriented standard layout in a 64-byte cache line._" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Compiler Paranoia
&lt;/h2&gt;

&lt;p&gt;Clang, GCC and MSVC are highly aggressive at auto-vectorizing loops nowadays. But they are deeply paranoid. They operate strictly inside the ABI bounds and static analysis limits. For the auto-vectorizer to safely replace scalar float ops with vectorized instructions (like the FMA3 instruction vfmadd231ps), the compiler needs hard proof of two things. First, contiguous memory. Flat layout with zero hidden padding. Second, type transparency. Meaning it can see through the type and verify that the memory ranges it reads and writes don't overlap. That is alias analysis, and the strict aliasing rules are one of the tools it leans on.&lt;/p&gt;

&lt;p&gt;If a C++ class is not trivially copyable (std::is_trivially_copyable_v is false), the compiler gets scared. It emits defensive machine code. It might pass the object by a hidden pointer instead of shoving it directly into CPU registers like XMM/YMM. Iterate over a big array of matrices, and these memory indirection chains basically stall the hardware prefetcher. The CPU just sits there. Waiting for RAM fetches. Total pipeline starvation.&lt;/p&gt;

&lt;p&gt;To get maximum throughput, math primitives must map directly to raw memory blocks. No exceptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  DOD in C++23
&lt;/h2&gt;

&lt;p&gt;Let’s look at how hardware-sympathetic geometry works in practice. If you inspect the core headers of modern C++23 math libs like Dichotomia (quat.hpp, mat4.hpp), you see strict Data-Oriented Design. No heavy classes. Primitives are just flat standard-layout structs constrained by C++ concepts. Roughly looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;concepts&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;type_traits&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;cstddef&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="n"&gt;dich&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

&lt;span class="c1"&gt;// Constraining the primitive to floats&lt;/span&gt;
&lt;span class="k"&gt;template&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;typename&lt;/span&gt; &lt;span class="nc"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="k"&gt;concept&lt;/span&gt; &lt;span class="n"&gt;floating_point&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;is_floating_point_v&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Flat, data-oriented Quaternion&lt;/span&gt;
&lt;span class="k"&gt;template&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;floating_point&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;quat&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;constexpr&lt;/span&gt; &lt;span class="n"&gt;quat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;noexcept&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;constexpr&lt;/span&gt; &lt;span class="n"&gt;quat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;_w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;_x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;_y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;_z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;noexcept&lt;/span&gt; 
        &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_w&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_x&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_y&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// Flat Matrix 4x4&lt;/span&gt;
&lt;span class="k"&gt;template&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;floating_point&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;alignas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;alignof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;mat4&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="k"&gt;constexpr&lt;/span&gt; &lt;span class="n"&gt;mat4&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;noexcept&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// Forcing compiler layout guarantees&lt;/span&gt;
&lt;span class="k"&gt;static_assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;is_standard_layout_v&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;quat&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;static_assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;is_trivially_copyable_v&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;quat&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;static_assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quat&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// fits cleanly in a 128-bit register&lt;/span&gt;

&lt;span class="k"&gt;static_assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;is_standard_layout_v&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;mat4&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;static_assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;is_trivially_copyable_v&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;mat4&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;static_assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mat4&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// same size as one 64-byte cache line&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;// namespace dich&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


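&lt;p&gt;Notice what is missing here. No private members, zero virtual functions, no user-defined destructors. By enforcing std::is_trivially_copyable_v and standard layout rules, the code guarantees a mat4 of floats is exactly 64 bytes. The size of one cache line. Note that the alignas here only asks for 16-byte alignment, so an instance can still straddle two lines unless you align it to 64. And because a quat is small and trivially copyable, ABIs like System V x86-64 can pass it directly in registers. A 64-byte mat4 is too big for that and still travels through memory, which is why multiply takes its arguments by const reference.&lt;/p&gt;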
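
&lt;p&gt;Size alone does not pin an object to one cache line. Alignment matters too. Here is a minimal sketch, assuming 64-byte lines. Mat4Align16, Mat4Align64 and cache_lines_touched are hypothetical names for this illustration, not part of Dichotomia.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;

// Same payload, two different alignment requests (hypothetical types).
struct alignas(16) Mat4Align16 { float data[16]; };
struct alignas(64) Mat4Align64 { float data[16]; };

static_assert(sizeof(Mat4Align16) == 64 &amp;amp;&amp;amp; alignof(Mat4Align16) == 16);
static_assert(sizeof(Mat4Align64) == 64 &amp;amp;&amp;amp; alignof(Mat4Align64) == 64);

// Number of cache lines covered by `size` bytes starting at `address`.
// `size` must be greater than zero.
constexpr std::size_t cache_lines_touched(std::uintptr_t address,
                                          std::size_t size,
                                          std::size_t line_size = 64) noexcept {
    const std::uintptr_t first = address / line_size;
    const std::uintptr_t last = (address + size - 1) / line_size;
    return static_cast&amp;lt;std::size_t&amp;gt;(last - first + 1);
}

// A 64-byte object on a 64-byte boundary stays inside one line...
static_assert(cache_lines_touched(0, 64) == 1);
static_assert(cache_lines_touched(64, 64) == 1);
// ...but a 16-byte boundary is enough to make it span two.
static_assert(cache_lines_touched(16, 64) == 2);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the one-line guarantee matters, alignas(64) is the knob. The price is stricter placement rules for every instance of the type.&lt;/p&gt;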

&lt;p&gt;Write a matrix multiplication over these structs, and the compiler easily sees the independent arithmetic ops and the strict alignment.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;floating_point&lt;/span&gt; &lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;nodiscard&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="k"&gt;constexpr&lt;/span&gt; &lt;span class="n"&gt;mat4&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;multiply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;mat4&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;mat4&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;noexcept&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;mat4&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;{};&lt;/span&gt;
    &lt;span class="c1"&gt;// 'a', 'b' and 'result' are flat arrays of T and every trip count is a constant,&lt;/span&gt;
    &lt;span class="c1"&gt;// so Clang/GCC can fully unroll these loops and map them to SIMD instructions.&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;T&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdkl9x9dyzgwv5xl52mg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjdkl9x9dyzgwv5xl52mg.png" alt="The reality of Data-Oriented C++23: Clang completely unrolls the matrix multiplication loops into a branchless pipeline of 256-bit AVX FMA instructions (vfmadd231ps). Zero abstraction overhead, zero manual intrinsics." width="720" height="580"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Compile this with -O3 -march=native on a CPU with FMA support and Clang will typically spit out vectorized FMA instructions. The C++23 abstractions cost literally zero cycles at runtime. Those static_assert statements? They act as a hard compile-time regression test. If a future developer accidentally adds a virtual method, the build just fails instantly. Performance baseline protected.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Hardware Reality
&lt;/h2&gt;

&lt;p&gt;Dropping OOP for a flat DOD layout gives very predictable hardware-level returns. Run bulk operations - say, applying transforms to a massive array of entities. The lack of hidden pointers basically kills cache line thrashing completely. The hardware prefetcher predicts the linear memory access pattern like it’s supposed to.&lt;/p&gt;
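
&lt;p&gt;A bulk transform over contiguous arrays might look roughly like this. It is a sketch, not Dichotomia's API: Vec4, Mat4 and transform_all are hypothetical names, and the matrix is assumed to be row-major like the multiply loop above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;span&amp;gt;

// Hypothetical flat types for illustration only.
struct Vec4 { float x, y, z, w; };
struct Mat4 { float data[16]; }; // row-major (assumption)

// Applies m to every element of `in` and writes the results to `out`.
// Processes min(in.size(), out.size()) elements.
inline void transform_all(const Mat4&amp;amp; m,
                          std::span&amp;lt;const Vec4&amp;gt; in,
                          std::span&amp;lt;Vec4&amp;gt; out) noexcept {
    const std::size_t n = in.size() &amp;lt; out.size() ? in.size() : out.size();
    const float* r = m.data;
    for (std::size_t i = 0; i &amp;lt; n; ++i) {
        // Local copy, so transforming in place (in and out sharing storage) still works.
        const Vec4 v = in[i];
        out[i] = Vec4{
            r[0]  * v.x + r[1]  * v.y + r[2]  * v.z + r[3]  * v.w,
            r[4]  * v.x + r[5]  * v.y + r[6]  * v.z + r[7]  * v.w,
            r[8]  * v.x + r[9]  * v.y + r[10] * v.z + r[11] * v.w,
            r[12] * v.x + r[13] * v.y + r[14] * v.z + r[15] * v.w,
        };
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The loop walks both arrays front to back with a fixed 16-byte stride. No pointers to chase. That is the linear access pattern the prefetcher is good at. Whether a given compiler vectorizes it is worth checking in the generated assembly rather than assuming.&lt;/p&gt;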

&lt;p&gt;In benchmarks against standard OOP wrappers, instruction cache misses drop massively because branch validation and stack teardown logic are just gone. Throughput scales up hard. And if you check the generated assembly, it confirms a clean 1:1 translation to vfmadd231ps instructions. Basically intrinsic-level performance out of pure standard C++.&lt;/p&gt;

&lt;p&gt;To give you an idea of the raw throughput difference on a modern CPU (e.g., AMD Ryzen 7 5800X, compiled with gcc 14 &lt;code&gt;-O3&lt;/code&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard OOP Matrix (Scalar): 18.5 ms per 1,000,000 multiplications.&lt;/li&gt;
&lt;li&gt;Dichotomia DOD Matrix (Auto-Vectorized): ~4.2 ms per 1,000,000 multiplications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compilers are smart, but they are deeply conservative. Give them opaque or fragmented memory layouts and the optimizer will always fall back to the safe, slow scalar path. Performance here is just about structuring data so the hardware reads it without friction. High-level developer ergonomics don’t actually need runtime overhead. Using standard layouts and C++23 constraints, you can build robust math tools. But under the hood, they just act as transparent data pipes for the CPU.&lt;/p&gt;

&lt;p&gt;If you are interested in examining the complete data-oriented implementation of these primitives, including the Python bindings for zero-copy FFI, you can inspect the architecture in the Dichotomia repository on GitHub.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/zem-invictus" rel="noopener noreferrer"&gt;
        zem-invictus
      &lt;/a&gt; / &lt;a href="https://github.com/zem-invictus/dichotomia" rel="noopener noreferrer"&gt;
        dichotomia
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A high-performance, header-only C++23 math library for 3D game engines. Built from scratch focusing on modern C++ features (Value Semantics, Deducing this), strict angle typing, and performance.
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Dichotomia&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;A minimalistic, modern C++23 math library for basic 3D graphics applications. It provides core linear algebra components with an emphasis on &lt;code&gt;constexpr&lt;/code&gt; and modern C++ features, alongside seamless, high-performance Python bindings via &lt;code&gt;nanobind&lt;/code&gt; (with full NumPy buffer protocol support).&lt;/p&gt;

&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/Waldemarsch/dichotomia/actions/workflows/ci.yml/badge.svg"&gt;&lt;img src="https://github.com/Waldemarsch/dichotomia/actions/workflows/ci.yml/badge.svg" alt="CI"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/7013272bd27ece47364536a221edb554cd69683b68a46fc0ee96881174c4214c/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4d49542d626c75652e737667"&gt;&lt;img src="https://camo.githubusercontent.com/7013272bd27ece47364536a221edb554cd69683b68a46fc0ee96881174c4214c/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4d49542d626c75652e737667" alt="License"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Features&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Vectors&lt;/strong&gt; (&lt;code&gt;Vec2&lt;/code&gt;, &lt;code&gt;Vec3&lt;/code&gt;, &lt;code&gt;Vec4&lt;/code&gt;): Fully templated, &lt;code&gt;constexpr&lt;/code&gt; arithmetic, strict ISO C++ &lt;code&gt;operator[]&lt;/code&gt; using &lt;code&gt;std::unreachable()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Matrices&lt;/strong&gt; (&lt;code&gt;Mat4&lt;/code&gt;): 4x4 matrix operations, fast &lt;code&gt;Inverse&lt;/code&gt; and &lt;code&gt;Determinant&lt;/code&gt;, &lt;code&gt;Perspective&lt;/code&gt;, &lt;code&gt;Orthographic&lt;/code&gt;, &lt;code&gt;LookAt&lt;/code&gt; (RH Zero-to-One standard).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quaternions&lt;/strong&gt; (&lt;code&gt;Quat&lt;/code&gt;): Fast Euler-to-Quaternion conversion, Spherical Linear Interpolation (&lt;code&gt;Slerp&lt;/code&gt;), rotation matrices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Angles&lt;/strong&gt; (&lt;code&gt;Radians&lt;/code&gt;, &lt;code&gt;Degrees&lt;/code&gt;): Type-safe angle structs with user-defined literals (&lt;code&gt;180.0_deg&lt;/code&gt;, &lt;code&gt;3.14_rad&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standardized&lt;/strong&gt;: Zero-warning compilation, 100% Google C++ Style Guide compliant, complete Google Test coverage.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Performance&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Dichotomia leverages C++23 &lt;code&gt;[[assume]]&lt;/code&gt; contracts and explicit object parameters (&lt;code&gt;Deducing This&lt;/code&gt;) to achieve zero-overhead abstractions. Thanks to aggressive compiler auto-vectorization (tested…&lt;/p&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/zem-invictus/dichotomia" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>cpp</category>
      <category>c</category>
      <category>gamedev</category>
      <category>programming</category>
    </item>
    <item>
      <title>DOD Principles in C++: Part 1. Struct Optimization</title>
      <dc:creator>Vladimir Zem</dc:creator>
      <pubDate>Fri, 13 Mar 2026 06:43:07 +0000</pubDate>
      <link>https://dev.to/zem_code/dod-principles-in-c-part-1-struct-optimization-1anm</link>
      <guid>https://dev.to/zem_code/dod-principles-in-c-part-1-struct-optimization-1anm</guid>
      <description>&lt;p&gt;Greetings to everyone who wants to write fast and efficient code. In this article, we'll look at a few straightforward ways to optimize your programs when working with structs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Placement in Memory: L1, L2, L3 Caches and RAM
&lt;/h2&gt;

&lt;p&gt;We all know that data (variables, class fields, etc.) is stored in "memory." But most programmers don't give much thought to what this abstract "memory" actually is. Let's dig a little deeper, because understanding this can speed up your code by double-digit percentages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b0uq5mmhgshl0vassei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2b0uq5mmhgshl0vassei.png" alt="Memory hierarchy" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A computer's memory doesn't consist solely of RAM and files — it also includes so-called &lt;strong&gt;L1, L2, and L3 caches&lt;/strong&gt;. We won't dive into their internal architecture; what matters for us is the fact that they are significantly faster than main memory.&lt;/p&gt;

&lt;p&gt;The tradeoff for that speed is limited capacity. The exact numbers vary by CPU model, but the approximate sizes and latencies are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;L1:&lt;/strong&gt; ~32–64 KB per core, ~4–5 cycles (roughly 10–50× faster than RAM);&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L2:&lt;/strong&gt; ~256 KB–2 MB per core, ~12–15 cycles (roughly 3–17× faster than RAM);&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L3:&lt;/strong&gt; ~10–15 MB, 30–50 cycles (1–6.6× faster than RAM).&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Cache Lines and Cache Misses
&lt;/h2&gt;

&lt;p&gt;Data doesn't end up in these caches by magic. The CPU reads from RAM in fixed-size blocks called &lt;strong&gt;cache lines&lt;/strong&gt;. On modern x86/x64 architectures, a single cache line is typically &lt;strong&gt;64 bytes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y2uqf4uh2n2d0z1ayw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y2uqf4uh2n2d0z1ayw2.png" alt="How the CPU fetches data from RAM" width="533" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How the CPU fetches data from RAM. &lt;a href="https://pikuma.com/blog/understanding-computer-cache" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This means that if the CPU needs to read a 1-byte variable from RAM, it won't read just 1 byte — it will fetch an entire 64-byte cache line.&lt;/p&gt;

&lt;p&gt;Here's where it gets interesting for us C++ programmers. If the data we need is packed tightly (within those 64 bytes), the CPU processes it almost instantly. If the data is scattered, we get a &lt;strong&gt;cache miss&lt;/strong&gt;, and the CPU stalls for a hundred or so cycles waiting for the next cache line from RAM.&lt;/p&gt;

&lt;p&gt;But that's not all. The CPU also needs to move data from caches into registers to perform computations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Machine Word
&lt;/h2&gt;

&lt;p&gt;Without going too deep into CPU internals, the key point is this: data moves from caches into registers not as 64-byte cache lines, but as &lt;strong&gt;machine words&lt;/strong&gt;, whose size depends on the register width (either 32 or 64 bits). This rigid "grid-aligned" reading creates two possible scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Good scenario.&lt;/strong&gt; The data fits entirely within a single machine word: no issues, no delays.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bad scenario.&lt;/strong&gt; The data straddles a machine word boundary. The CPU must then read two machine words and "stitch" them together using bit shifts. Example: an &lt;code&gt;int&lt;/code&gt; occupies 1 byte in one machine word and 3 bytes in the next.&lt;/li&gt;
&lt;/ol&gt;
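
&lt;p&gt;The two scenarios can be sketched as a small compile-time check. This is an illustration only: straddles_word is a hypothetical helper, and the 8-byte word size is an assumption for a 64-bit target.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;#include &amp;lt;cstddef&amp;gt;

// True if a value of `size` bytes stored at byte `offset`
// crosses a machine-word boundary. `size` must be greater than zero.
constexpr bool straddles_word(std::size_t offset,
                              std::size_t size,
                              std::size_t word_size = 8) noexcept {
  return offset / word_size != (offset + size - 1) / word_size;
}

// Good scenario: a 4-byte int at offset 4 sits inside one 8-byte word.
static_assert(!straddles_word(4, 4));

// Bad scenario: a 4-byte int at offset 7 has 1 byte in one word
// and 3 bytes in the next.
static_assert(straddles_word(7, 4));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;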

&lt;p&gt;To prevent the bad scenario, programming languages include built-in mechanisms. Let's look at how C++ handles this.&lt;/p&gt;

&lt;h3&gt;
  
  
  C++: Alignment, Padding, and Wasted Space Out of Nowhere
&lt;/h3&gt;

&lt;p&gt;To avoid the straddling problem, C++ uses &lt;strong&gt;padding&lt;/strong&gt; and &lt;strong&gt;alignment&lt;/strong&gt;. You can find formal definitions in the standard, but let's look at how they work in practice.&lt;/p&gt;

&lt;p&gt;Consider a simple struct with fields in an arbitrary (spoiler: worst possible) order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;BadStruct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;// 1 byte&lt;/span&gt;
  &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 8 bytes&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;          &lt;span class="c1"&gt;// 4 bytes&lt;/span&gt;
  &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;is_liquid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 1 byte&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;energy&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// 4 bytes&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance, this struct should be 18 bytes. But if we check &lt;code&gt;sizeof(BadStruct)&lt;/code&gt;, the result is, &lt;strong&gt;to put it mildly&lt;/strong&gt;, not quite that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;cout&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BadStruct&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Output: 32&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;32 bytes instead of 18: the struct is about 78% larger than the sum of its fields, and almost 44% of it is empty space! Where does all that extra size come from? It's that &lt;strong&gt;machine word&lt;/strong&gt; issue and the &lt;strong&gt;alignment&lt;/strong&gt; that follows from it.&lt;/p&gt;

&lt;p&gt;To prevent data from straddling machine word boundaries, C++ enforces an &lt;strong&gt;alignment rule:&lt;/strong&gt; a variable's address in memory must be a multiple of its alignment requirement, which for fundamental types on mainstream platforms equals the type's size. For example, an &lt;code&gt;int&lt;/code&gt; (4 bytes) can only reside at addresses 0, 4, 8, 12, and so on. A &lt;code&gt;double&lt;/code&gt; (8 bytes) can only be at addresses 0, 8, 16, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdm8oy7vm4a5mzgybvzmz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdm8oy7vm4a5mzgybvzmz.png" alt="Size and alignment for each type" width="800" height="987"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Size and alignment for each type. &lt;a href="https://pvs-studio.ru/ru/blog/lessons/0021/" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
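
&lt;p&gt;The alignment rule can be checked at compile time. The sketch below is only an illustration: it assumes a typical 64-bit target (for example x86-64) where &lt;code&gt;int&lt;/code&gt; is 4 bytes and &lt;code&gt;double&lt;/code&gt; is 8 bytes, and &lt;code&gt;is_aligned&lt;/code&gt; is a hypothetical helper, not something from the standard library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;// Assumption: a typical 64-bit target. Other ABIs may report other values,
// in which case these static_asserts fail at compile time.
static_assert(alignof(bool) == 1, "bool may live at any address");
static_assert(alignof(int) == 4, "int must start at a multiple of 4");
static_assert(alignof(double) == 8, "double must start at a multiple of 8");

// Hypothetical helper: true when `address` is a multiple of `alignment`.
constexpr bool is_aligned(unsigned long long address, unsigned long long alignment) {
    return address % alignment == 0;
}

static_assert(is_aligned(12, alignof(int)), "12 is a valid address for an int");
static_assert(!is_aligned(12, alignof(double)), "12 is not valid for a double");
static_assert(is_aligned(16, alignof(double)), "16 is valid for a double");

int main() { return 0; }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the program compiles, every check passed; nothing happens at run time.&lt;/p&gt;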

&lt;p&gt;When the compiler sees that the next field wouldn't land on a properly aligned address, it inserts empty bytes — that's the &lt;strong&gt;padding&lt;/strong&gt;. Let's trace through each byte of our struct:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;bool active&lt;/code&gt; (1 byte) — occupies address &lt;code&gt;0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;double position&lt;/code&gt; (8 bytes) — must be at an address divisible by 8. The nearest such address is &lt;code&gt;8&lt;/code&gt;. The compiler inserts 7 bytes of padding (addresses 1–7).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;int id&lt;/code&gt; (4 bytes) — lands at addresses &lt;code&gt;16..19&lt;/code&gt;. Address 16 is divisible by 4 — perfect.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bool is_liquid&lt;/code&gt; (1 byte) — occupies address &lt;code&gt;20&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;int energy&lt;/code&gt; (4 bytes) — requires an address divisible by 4. The nearest is &lt;code&gt;24&lt;/code&gt;. The compiler inserts 3 bytes of padding (addresses 21–23).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our data and internal padding end at byte 27 (current size: 28 bytes). But why did &lt;code&gt;sizeof&lt;/code&gt; report 32?&lt;/p&gt;

&lt;p&gt;This is where the non-obvious &lt;strong&gt;tail alignment rule&lt;/strong&gt; kicks in. The total size of a struct must be a multiple of the struct's own alignment, which is the alignment of its &lt;em&gt;most strictly aligned field&lt;/em&gt;. In our case, that's &lt;code&gt;double&lt;/code&gt; (8 bytes). The nearest multiple of 8 that is ≥ 28 is 32.&lt;/p&gt;

&lt;p&gt;The compiler adds 4 more bytes of padding at the end. This ensures that in an array of such structs (&lt;code&gt;BadStruct array[2]&lt;/code&gt;), the second element also starts at an address divisible by 8.&lt;/p&gt;
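
&lt;p&gt;The byte-by-byte trace above can be replayed as a small compile-time calculation. This is a sketch under the same assumption of a typical 64-bit target; &lt;code&gt;align_up&lt;/code&gt; and the &lt;code&gt;*_at&lt;/code&gt; constants are hypothetical names introduced only for this illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;struct BadStruct {
    bool active;
    double position;
    int id;
    bool is_liquid;
    int energy;
};

typedef unsigned long long u64;

// Hypothetical helper: round `offset` up to the next multiple of `alignment`.
constexpr u64 align_up(u64 offset, u64 alignment) {
    return (offset + alignment - 1) / alignment * alignment;
}

// Walk the fields in declaration order, aligning each one before placing it.
constexpr u64 active_at    = 0;
constexpr u64 position_at  = align_up(active_at + sizeof(bool), alignof(double));
constexpr u64 id_at        = align_up(position_at + sizeof(double), alignof(int));
constexpr u64 is_liquid_at = align_up(id_at + sizeof(int), alignof(bool));
constexpr u64 energy_at    = align_up(is_liquid_at + sizeof(bool), alignof(int));
constexpr u64 data_end     = energy_at + sizeof(int);
// Tail padding: round the size up to the alignment of the struct itself.
constexpr u64 total        = align_up(data_end, alignof(BadStruct));

static_assert(position_at == 8, "7 bytes of padding after active");
static_assert(id_at == 16, "no padding before id");
static_assert(is_liquid_at == 20, "no padding before is_liquid");
static_assert(energy_at == 24, "3 bytes of padding after is_liquid");
static_assert(data_end == 28, "data and internal padding end here");
static_assert(total == 32, "4 bytes of tail padding");
static_assert(total == sizeof(BadStruct), "matches what the compiler produced");

int main() { return 0; }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The last assertion compares the hand-computed size with the compiler's own &lt;code&gt;sizeof&lt;/code&gt;, so a successful build confirms the trace on that platform.&lt;/p&gt;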

&lt;p&gt;The fix is simple — &lt;strong&gt;sort the fields in descending order of size&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;GoodStruct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 8 bytes&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;          &lt;span class="c1"&gt;// 4 bytes&lt;/span&gt;
  &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;energy&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// 4 bytes&lt;/span&gt;
  &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;// 1 byte&lt;/span&gt;
  &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="n"&gt;is_liquid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 1 byte&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's check the size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;cout&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GoodStruct&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Output: 24&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remarkable — just by reordering the fields, we reduced the struct's memory footprint by 25%! But let's not take the theory at face value — let's back it up with benchmarks.&lt;/p&gt;
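
&lt;p&gt;A layout like this is easy to break by accident when someone adds a field later. One possible guard, sketched here under the assumption of a typical 64-bit target, is to pin the expected sizes with &lt;code&gt;static_assert&lt;/code&gt; next to the struct definitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;struct BadStruct {
    bool active;
    double position;
    int id;
    bool is_liquid;
    int energy;
};

struct GoodStruct {
    double position;
    int id;
    int energy;
    bool active;
    bool is_liquid;
};

// Both structs carry 18 bytes of data; only the padding differs.
static_assert(sizeof(BadStruct) == 32, "14 bytes of padding");
static_assert(sizeof(GoodStruct) == 24, "GoodStruct grew: check the field order");
// Reordering does not change the alignment, only the size.
static_assert(alignof(GoodStruct) == alignof(BadStruct), "both align like double");

int main() { return 0; }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The 6 bytes still left over in &lt;code&gt;GoodStruct&lt;/code&gt; are tail padding, which reordering alone cannot remove.&lt;/p&gt;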

&lt;h2&gt;
  
  
  C++: The Cost of Padding — Benchmarks
&lt;/h2&gt;

&lt;p&gt;We'll write a simple performance test for our "bad" and "good" structs using Google Benchmark. The test iterates over an array of structs (up to 1,000,000 of them); for each element it checks the &lt;code&gt;active&lt;/code&gt; flag and, if it is set, performs a trivial math operation: adding 1 to the &lt;code&gt;position&lt;/code&gt; field.&lt;/p&gt;

&lt;p&gt;Test for the &lt;code&gt;BadStruct&lt;/code&gt; array:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;BM_BadStructIteration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;State&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;BadStruct&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;DoNotOptimize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ClobberMemory&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test for the &lt;code&gt;GoodStruct&lt;/code&gt; array:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;BM_GoodStructIteration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;State&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;GoodStruct&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;DoNotOptimize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;benchmark&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ClobberMemory&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



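&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; &lt;code&gt;benchmark::DoNotOptimize&lt;/code&gt; is there to prevent the compiler from eliminating the loop entirely (Dead Code Elimination). Also keep in mind that &lt;code&gt;std::vector&lt;/code&gt; value-initializes its elements, so &lt;code&gt;active&lt;/code&gt; is &lt;code&gt;false&lt;/code&gt; everywhere unless you set it before the timed loop; as written, each pass reads the flag of every struct and skips the addition.&lt;/p&gt;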

&lt;p&gt;Running the benchmarks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;BENCHMARK&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BM_BadStructIteration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;BENCHMARK&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BM_GoodStructIteration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;Range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;BENCHMARK_MAIN&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
BM_BadStructIteration/10000          4011 ns         4011 ns       165946
BM_BadStructIteration/32768         13408 ns        13407 ns        50500
BM_BadStructIteration/262144       107153 ns       107146 ns         6240
BM_BadStructIteration/1000000      940122 ns       939709 ns         1013
BM_GoodStructIteration/10000         4230 ns         4226 ns       169877
BM_GoodStructIteration/32768        14302 ns        14302 ns        48910
BM_GoodStructIteration/262144      119729 ns       119669 ns         6144
BM_GoodStructIteration/1000000     579492 ns       579507 ns         1103
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Iterations&lt;/code&gt; column shows how many times Google Benchmark ran the timed loop body to gather statistically reliable data. Fast tests (10,000 elements) ran over 160,000 times; heavy ones (1,000,000 elements) ran about a thousand. The &lt;code&gt;Time&lt;/code&gt; and &lt;code&gt;CPU&lt;/code&gt; columns show the average wall-clock and CPU time per iteration, that is, per full pass over the array.&lt;/p&gt;

&lt;p&gt;Hard to interpret raw numbers, right? Let's plot them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zoqnnq0agqs0fzcnur5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zoqnnq0agqs0fzcnur5.png" alt="Iteration time chart" width="800" height="506"&gt;&lt;/a&gt;&lt;/p&gt;

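&lt;p&gt;What do we see? The results are nonlinear. Up to 262,144 elements, the two layouts stay within about 12% of each other, and &lt;code&gt;GoodStruct&lt;/code&gt; is actually the slightly slower one. But at 1,000,000 elements, &lt;code&gt;GoodStruct&lt;/code&gt; takes &lt;strong&gt;38%&lt;/strong&gt; less time! What causes this?&lt;/p&gt;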

&lt;p&gt;It's all about data volume. Arrays of 10,000 and 32,768 bad structs (312.5 KB and 1,024 KB respectively) fit comfortably in the cache. At 262,144 elements (8,192 KB), the L3 cache starts running out of space, and at 1,000,000 elements (31,250 KB) the array no longer fits, so the data has to be streamed from slow RAM. That's where the cache line becomes critical.&lt;/p&gt;

&lt;p&gt;Let's recall the 64-byte cache line:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;BadStruct&lt;/code&gt; is 32 bytes. Exactly &lt;strong&gt;2 structs&lt;/strong&gt; fit in one cache line. To process a million elements, the CPU must make &lt;strong&gt;500,000&lt;/strong&gt; requests to RAM.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GoodStruct&lt;/code&gt; is 24 bytes. About &lt;strong&gt;2.67 structs&lt;/strong&gt; fit in one cache line (some structs straddle two lines). To process a million elements, the CPU only needs about &lt;strong&gt;375,000&lt;/strong&gt; requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See what happened? &lt;strong&gt;We cut the number of accesses to the slowest memory in the computer by a quarter — just by sorting the variables in our class from largest to smallest. No changes to logic, no fancy algorithms — pure Data-Oriented Design.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Why do we count RAM reads for the entire million elements? Wouldn't some stay in the L3 cache? They will, but not for long. The array size exceeds the CPU's cache capacity. By the time the CPU reaches the end of the million-element array, the beginning has already been evicted. On the next benchmark iteration, everything has to be fetched from RAM again.&lt;/em&gt;&lt;/p&gt;
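
&lt;p&gt;The cache-line arithmetic above can be reproduced as a compile-time check. This is a rough sketch, not a model of a real memory subsystem: it assumes 64-byte cache lines, a densely packed array that starts on a cache-line boundary, and no prefetching effects. &lt;code&gt;lines_touched&lt;/code&gt; is a hypothetical helper.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;typedef unsigned long long u64;

constexpr u64 kCacheLine = 64;      // bytes per cache line (assumed)
constexpr u64 kElements  = 1000000; // array length used in the benchmark

// Hypothetical helper: cache lines covered by `count` elements of `size` bytes,
// rounded up so that a partially used last line still counts.
constexpr u64 lines_touched(u64 count, u64 size) {
    return (count * size + kCacheLine - 1) / kCacheLine;
}

static_assert(lines_touched(kElements, 32) == 500000, "BadStruct: 32 bytes each");
static_assert(lines_touched(kElements, 24) == 375000, "GoodStruct: 24 bytes each");
// 375000 / 500000 = 3/4, that is, a quarter fewer cache lines to fetch.
static_assert(lines_touched(kElements, 24) * 4 == lines_touched(kElements, 32) * 3,
              "GoodStruct needs 25% fewer cache lines");

int main() { return 0; }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The 25% reduction in cache lines is the same 25% we saved in &lt;code&gt;sizeof&lt;/code&gt;: for a sequential scan, bytes per element translate directly into memory traffic.&lt;/p&gt;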

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this part of the "DOD Principles" series, we looked at a simple way to optimize struct sizes, tested its real impact on performance, and explored why it works the way it does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Sort your struct/class fields from largest to smallest, and you'll be fine.&lt;/p&gt;

&lt;p&gt;In the next part, we'll go further and examine the &lt;strong&gt;AoS (Array of Structures)&lt;/strong&gt; and &lt;strong&gt;SoA (Structure of Arrays)&lt;/strong&gt; patterns, which let us squeeze even more performance out of the CPU — for instance, when building physics engines and complex simulations.&lt;/p&gt;

&lt;p&gt;Thanks for reading — write fast code and enjoy the process! &lt;/p&gt;

&lt;p&gt;If you found this useful, a ❤️ and a follow would mean a lot. See you in Part 2!&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>performance</category>
      <category>optimization</category>
      <category>dataorienteddesign</category>
    </item>
  </channel>
</rss>
