<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hamza Hasanain</title>
    <description>The latest articles on DEV Community by Hamza Hasanain (@hamzahassanain0).</description>
    <link>https://dev.to/hamzahassanain0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F898068%2Fe02732de-5e39-4f5f-a340-cd7d4b1b9b58.png</url>
      <title>DEV Community: Hamza Hasanain</title>
      <link>https://dev.to/hamzahassanain0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hamzahassanain0"/>
    <language>en</language>
    <item>
      <title>What Every Programmer Should Know About Memory Part 4</title>
      <dc:creator>Hamza Hasanain</dc:creator>
      <pubDate>Wed, 07 Jan 2026 13:45:40 +0000</pubDate>
      <link>https://dev.to/hamzahassanain0/what-every-programmer-should-know-about-memory-part-4-4bh5</link>
      <guid>https://dev.to/hamzahassanain0/what-every-programmer-should-know-about-memory-part-4-4bh5</guid>
      <description>&lt;h2&gt;
  
  
  What Programmers Can Do: Writing Hardware-Sympathetic Code
&lt;/h2&gt;

&lt;p&gt;In the previous article, we learned that memory geography matters. Now we arrive at the finale: the most actionable part of Ulrich Drepper's paper, &lt;strong&gt;Section 6&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is not about choosing a better algorithm &lt;code&gt;(O(n) vs O(log n))&lt;/code&gt;. This is about writing code that respects how the hardware physically works. We will cover &lt;strong&gt;Cache Bypassing&lt;/strong&gt;, &lt;strong&gt;TLB Optimization&lt;/strong&gt;, &lt;strong&gt;Concurrency Pitfalls&lt;/strong&gt;, and &lt;strong&gt;Code Layout&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
Subsection A: Cache Optimization

&lt;ul&gt;
&lt;li&gt;1.1. Data Placement: std::vector vs std::list
&lt;/li&gt;
&lt;li&gt;1.2. The Double Indirection Trap
&lt;/li&gt;
&lt;li&gt;1.3. Bypassing the Cache (Non-Temporal Stores)
&lt;/li&gt;
&lt;li&gt;1.4. Access Patterns &amp;amp; Blocking (Tiling)
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
Subsection B: Virtual Memory &amp;amp; the TLB

&lt;ul&gt;
&lt;li&gt;2.1. The High Cost of Translation
&lt;/li&gt;
&lt;li&gt;2.2. The Solution: Huge Pages
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
Subsection C: Data &amp;amp; Code Layout

&lt;ul&gt;
&lt;li&gt;3.1. The Tetris Game: Struct Packing
&lt;/li&gt;
&lt;li&gt;3.2. Hot/Cold Data Splitting
&lt;/li&gt;
&lt;li&gt;3.3. Struct of Arrays (SoA) vs Array of Structs (AoS)
&lt;/li&gt;
&lt;li&gt;3.4. Alignment Matters
&lt;/li&gt;
&lt;li&gt;3.5. Instruction Cache &amp;amp; Branch Prediction
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
Subsection D: Concurrency &amp;amp; NUMA

&lt;ul&gt;
&lt;li&gt;4.1. The Silent Killer: False Sharing
&lt;/li&gt;
&lt;li&gt;4.2. Thread Affinity (Pinning)
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
Subsection E: Prefetching

&lt;ul&gt;
&lt;li&gt;5.1. Helping the Hardware
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Subsection A: Cache Optimization
&lt;/h2&gt;

&lt;p&gt;The most significant performance cliff in modern computing is missing the L1 Cache. Accessing L1 takes ~4 cycles. Accessing RAM takes ~200+ cycles. Your goal is to keep data in L1 as long as possible (Temporal Locality) and use every byte you load (Spatial Locality).&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 Data Placement: std::vector Beats std::list
&lt;/h3&gt;

&lt;p&gt;This is the &lt;strong&gt;Hello World&lt;/strong&gt; of memory optimization. It teaches the fundamental rule: Linked Lists are cache poison.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; A linked list scatters nodes across the heap (0x1000, 0x8004, 0x200). The CPU cannot predict the next address, breaking the Hardware Prefetcher. You pay the full RAM latency tax for every node.&lt;/p&gt;

&lt;p&gt;In contrast, &lt;code&gt;std::vector&lt;/code&gt; stores elements contiguously in memory (0x1000, 0x1004, 0x1008). Accessing one element brings the next few into the cache line, leveraging spatial locality and prefetching. This drastically reduces cache misses and improves performance.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bad Code Example: Using std::list
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="nf"&gt;sum_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Good Code Example: Using std::vector
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="nf"&gt;sum_vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1.2 The Double Indirection Trap: &lt;code&gt;std::vector&amp;lt;std::vector&amp;lt;T&amp;gt;&amp;gt;&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Developers often use &lt;code&gt;std::vector&amp;lt;std::vector&amp;lt;int&amp;gt;&amp;gt;&lt;/code&gt; for grids. This is a pointer to an array of pointers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; To access &lt;code&gt;grid[i][j]&lt;/code&gt;, the CPU must fetch grid &lt;code&gt;-&amp;gt;&lt;/code&gt; fetch pointer at &lt;code&gt;grid[i]&lt;/code&gt; (cache miss 1) &lt;code&gt;-&amp;gt;&lt;/code&gt; fetch data at &lt;code&gt;[j]&lt;/code&gt; (cache miss 2). Rows are not contiguous in physical memory.&lt;/p&gt;

&lt;p&gt;To solve this, we use a clever trick: flatten the 2D structure into a 1D vector.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bad Code Example: Using &lt;code&gt;std::vector&amp;lt;std::vector&amp;lt;T&amp;gt;&amp;gt;&lt;/code&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c1"&gt;// Double indirection, two cache misses&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Good Code Example: Flattening the 2D Structure
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// [Row 1 Data... | Row 2 Data... | Row 3 Data...] (Contiguous)&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;)];&lt;/span&gt; &lt;span class="c1"&gt;// Single access, better cache locality&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1.3 Bypassing the Cache (Non-Temporal Stores)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Hidden Cost of Writing:&lt;/strong&gt;&lt;br&gt;
Normally, when you write to memory (e.g., &lt;code&gt;data[i] = 0&lt;/code&gt;), the CPU must maintain cache coherency. Because writes operate on whole 64-byte cache lines, it must first perform a &lt;strong&gt;Read-For-Ownership (RFO)&lt;/strong&gt;: it fetches the existing 64 bytes from RAM into L1, modifies the 4 bytes you changed, and marks the line as "Modified".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem (Cache Pollution):&lt;/strong&gt;&lt;br&gt;
If you are initializing a massive array (e.g., &lt;code&gt;memset&lt;/code&gt; of 1GB), the CPU will:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Read 1GB of old data from RAM (wasting bandwidth).&lt;/li&gt;
&lt;li&gt; Fill almost the entire L1/L2/L3 cache with this zeroed data.&lt;/li&gt;
&lt;li&gt; Evict your application's hot data (code, stack, other variables) to make room.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is called &lt;strong&gt;Cache Pollution&lt;/strong&gt;, and it destroys performance for code running immediately after the write.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: Non-Temporal Stores (Streaming Stores)&lt;/strong&gt;&lt;br&gt;
You can instruct the CPU to use a &lt;strong&gt;Write-Combining Buffer (WCB)&lt;/strong&gt; instead of the cache. You tell the CPU: &lt;em&gt;"I promise I will overwrite this entire line. Don't read it. Do not pollute the cache with it. Just write it to RAM."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code Example (Intel Intrinsics):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;immintrin.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;stream_memset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// 1. Create a 128-bit vector filled with 'value' (4 integers)&lt;/span&gt;
    &lt;span class="n"&gt;__m128i&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_set1_epi32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Note: ensure 'size' is a multiple of 4 integers (16 bytes)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// 2. The Streaming Store (The Magic)&lt;/span&gt;
        &lt;span class="c1"&gt;// Writes to 16-byte aligned memory, bypassing L1/L2.&lt;/span&gt;
        &lt;span class="c1"&gt;// It tells the CPU to NOT fetch the old data (No Read-For-Ownership).&lt;/span&gt;
        &lt;span class="n"&gt;_mm_stream_si128&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;__m128i&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// 3. The Fence&lt;/span&gt;
    &lt;span class="c1"&gt;// Streaming stores are "weakly ordered". This instruction&lt;/span&gt;
    &lt;span class="c1"&gt;// Forces all Write-Combining Buffers to flush to RAM immediately.&lt;/span&gt;
    &lt;span class="n"&gt;_mm_sfence&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  The Constraint: Memory Alignment
&lt;/h4&gt;

&lt;p&gt;The specific intrinsic &lt;code&gt;_mm_stream_si128&lt;/code&gt; physically requires the memory address to be &lt;strong&gt;16-byte aligned&lt;/strong&gt; (divisible by 16).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  If you access address &lt;code&gt;0x1000&lt;/code&gt;, it works (divisible by 16).&lt;/li&gt;
&lt;li&gt;  If you access address &lt;code&gt;0x1004&lt;/code&gt;, the instruction &lt;strong&gt;faults&lt;/strong&gt; and your program crashes (typically reported as a segfault).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using standard &lt;code&gt;new&lt;/code&gt; or &lt;code&gt;malloc&lt;/code&gt; does not guarantee this alignment. You must use specific allocators:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Modern C++ (C++17):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;cstdlib&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;span class="c1"&gt;// std::aligned_alloc(alignment, size)&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;aligned_alloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;free&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. The "Intel" Way (Intrinsics):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;immintrin.h&amp;gt;&lt;/span&gt;&lt;span class="c1"&gt; &lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;_mm_malloc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;_mm_free&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Must use _mm_free matching _mm_malloc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. The POSIX Way (Linux/Unix):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;cstdlib&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;posix_memalign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;ptr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;free&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;For AVX/AVX2, use &lt;code&gt;_mm256_stream_si256&lt;/code&gt; which requires 32-byte alignment.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1.4 Access Patterns &amp;amp; Blocking (Tiling)
&lt;/h3&gt;

&lt;p&gt;Hardware prefetchers are good at linear access (Row-Major), but they fail when access patterns are strided (Column-Major) or random.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row-Major vs Column-Major:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Fast: Row-major access (Sequential)&lt;/span&gt;
&lt;span class="c1"&gt;// All on the same page/cache line.&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Slow: Column-major access (Strided)&lt;/span&gt;
&lt;span class="c1"&gt;// High Cache miss rate &amp;amp; TLB miss rate!&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Fix: Blocking (Loop Tiling)&lt;/strong&gt;&lt;br&gt;
Divide the problem into small sub-problems that fit &lt;strong&gt;entirely&lt;/strong&gt; inside the L1 Cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choosing the Block Size (B):&lt;/strong&gt;&lt;br&gt;
For a square block of &lt;code&gt;B x B&lt;/code&gt; elements, you want the working set (&lt;code&gt;3 * B^2 * sizeof(element)&lt;/code&gt;) to fit in L1.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;B ≈ sqrt( L1_Size / (3 * Element_Size) )&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt; &lt;code&gt;L1 = 32KB&lt;/code&gt;, &lt;code&gt;float = 4B&lt;/code&gt; &lt;code&gt;-&amp;gt; B ≈ sqrt(32768 / 12) ≈ 52&lt;/code&gt;. Choose &lt;code&gt;B=48&lt;/code&gt; or &lt;code&gt;B=32&lt;/code&gt; for alignment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Algorithm:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Load small &lt;code&gt;B x B&lt;/code&gt; tiles of the input matrices (e.g., A and B in a matrix multiply) into L1.&lt;/li&gt;
&lt;li&gt; Compute &lt;em&gt;all possible results&lt;/em&gt; for those tiles.&lt;/li&gt;
&lt;li&gt; Only move to the next tile when finished.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This maximizes &lt;strong&gt;Temporal Locality&lt;/strong&gt; (reuse). The data goes into L1 and stays there.&lt;/p&gt;
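&lt;p&gt;&lt;strong&gt;Sketch:&lt;/strong&gt; the steps above, applied to a square matrix multiply on flattened row-major matrices. This is an illustrative sketch, not Drepper's exact code; the tile edge &lt;code&gt;B_SZ = 32&lt;/code&gt; is an assumed value that you would tune using the formula above.&lt;/p&gt;

```cpp
#include <vector>
#include <algorithm>

// Assumed tile edge: for 32KB L1 and 4-byte floats the formula gives
// B ~ 52, so 32 is a safe, alignment-friendly choice.
constexpr int B_SZ = 32;

// C += A * B for N x N row-major matrices stored as flat vectors.
void matmul_tiled(const std::vector<float>& A,
                  const std::vector<float>& B,
                  std::vector<float>& C, int N) {
    for (int ii = 0; ii < N; ii += B_SZ)
        for (int kk = 0; kk < N; kk += B_SZ)
            for (int jj = 0; jj < N; jj += B_SZ)
                // Work stays inside one B_SZ x B_SZ tile triplet, which
                // (for a suitable B_SZ) remains resident in L1.
                for (int i = ii; i < std::min(ii + B_SZ, N); ++i)
                    for (int k = kk; k < std::min(kk + B_SZ, N); ++k) {
                        float a = A[i * N + k];
                        for (int j = jj; j < std::min(jj + B_SZ, N); ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```

&lt;p&gt;The &lt;code&gt;std::min&lt;/code&gt; clamps handle matrix sizes that are not multiples of &lt;code&gt;B_SZ&lt;/code&gt;; hoisting &lt;code&gt;A[i * N + k]&lt;/code&gt; into a local keeps the innermost loop a pure sequential sweep over one row of B and C.&lt;/p&gt;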


&lt;h2&gt;
  
  
  Subsection B: Virtual Memory &amp;amp; the TLB
&lt;/h2&gt;

&lt;p&gt;This is a critical section often ignored by developers. Every time your code touches a virtual address, the CPU must translate it to a physical address using the &lt;strong&gt;TLB (Translation Lookaside Buffer)&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  2.1 The High Cost of Translation
&lt;/h3&gt;

&lt;p&gt;The TLB is a tiny cache of virtual-to-physical address translations. It typically has distinct levels (L1/L2) with entry counts in the dozens to hundreds (e.g., 64 L1 entries, 512 L2 entries).&lt;br&gt;
Standard memory pages are &lt;strong&gt;4KB&lt;/strong&gt;. If you access 2GB of memory sequentially, you need 524,288 page table entries. Your TLB will thrash constantly.&lt;/p&gt;
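&lt;p&gt;This arithmetic can be sanity-checked at compile time. The 64 + 512 entry counts are the example figures above, not a spec for any particular CPU:&lt;/p&gt;

```cpp
// TLB "reach" with 4KB pages: even a generous 576-entry TLB
// (64 L1 + 512 L2, example figures) maps only ~2.25MB of memory.
constexpr long long kPageSize   = 4 * 1024;                  // 4KB page
constexpr long long kTlbEntries = 64 + 512;                  // L1 + L2 TLB
constexpr long long kTlbReach   = kPageSize * kTlbEntries;   // bytes mapped
static_assert(kTlbReach == 2359296, "TLB reach is ~2.25MB");

// A 2GB array needs half a million 4KB page table entries.
constexpr long long kArraySize  = 2LL * 1024 * 1024 * 1024;  // 2GB
static_assert(kArraySize / kPageSize == 524288, "entries for 2GB");
```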
&lt;h3&gt;
  
  
  2.2 The Solution: Huge Pages
&lt;/h3&gt;

&lt;p&gt;Modern CPUs support &lt;strong&gt;Huge Pages&lt;/strong&gt; (e.g., &lt;strong&gt;2MB&lt;/strong&gt; or &lt;strong&gt;1GB&lt;/strong&gt;).&lt;br&gt;
Using 2MB pages for that same 2GB array reduces entries to just &lt;strong&gt;1,024&lt;/strong&gt;. The entire mapping can now fit in the L2 TLB.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enabling Huge Pages (Linux):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Allocate 512 hugepages of 2MB each (Total 1GB)&lt;/span&gt;
sysctl &lt;span class="nt"&gt;-w&lt;/span&gt; vm.nr_hugepages&lt;span class="o"&gt;=&lt;/span&gt;512
&lt;span class="c"&gt;# Verify&lt;/span&gt;
&lt;span class="nb"&gt;grep &lt;/span&gt;Huge /proc/meminfo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Code Example (Using &lt;code&gt;mmap&lt;/code&gt;):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;sys/mman.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="c1"&gt;// Request a 2MB Huge Page explicitly&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;huge_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                       &lt;span class="n"&gt;PROT_READ&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;PROT_WRITE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                       &lt;span class="n"&gt;MAP_PRIVATE&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;MAP_ANONYMOUS&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;MAP_HUGETLB&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                       &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;huge_data&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;MAP_FAILED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Fallback (or check if user has privileges/OS support)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: Linux also supports &lt;strong&gt;Transparent Huge Pages (THP)&lt;/strong&gt;, which tries to use huge pages automatically. However, explicit &lt;code&gt;mmap&lt;/code&gt; or &lt;code&gt;madvise&lt;/code&gt; gives you deterministic control.&lt;/em&gt;&lt;/p&gt;
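&lt;p&gt;A hedged, Linux-specific sketch of the &lt;code&gt;madvise&lt;/code&gt; route: map normal anonymous memory and hint the kernel with &lt;code&gt;MADV_HUGEPAGE&lt;/code&gt;, instead of reserving hugetlbfs pages up front. The hint is advisory; a successful call does not guarantee the region is actually backed by 2MB pages.&lt;/p&gt;

```cpp
#include <sys/mman.h>
#include <cstddef>

// Illustrative sketch (Linux-only, assumes THP support in the kernel).
void* alloc_with_thp_hint(std::size_t bytes) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    // Advisory only: ignore failure (e.g., THP disabled on this kernel).
    // The region still works, just with ordinary 4KB pages.
    madvise(p, bytes, MADV_HUGEPAGE);
    return p;
}
```

&lt;p&gt;Unlike &lt;code&gt;MAP_HUGETLB&lt;/code&gt;, this path needs no pre-reserved hugepage pool and degrades gracefully to 4KB pages, at the cost of the determinism the explicit approach provides.&lt;/p&gt;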




&lt;h2&gt;
  
  
  Subsection C: Data &amp;amp; Code Layout
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 The Tetris Game: Struct Packing
&lt;/h3&gt;

&lt;p&gt;The compiler aligns data to memory boundaries. If you order your members poorly, you create holes (padding) inside the struct, wasting space in every cache line it occupies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does the compiler add padding?&lt;/strong&gt; To ensure that data types are aligned to their natural boundaries (e.g., &lt;code&gt;4-byte&lt;/code&gt; integers on &lt;code&gt;4-byte&lt;/code&gt; boundaries).&lt;/p&gt;

&lt;h4&gt;
  
  
  Bad Code Example: Poorly Ordered Struct
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;Bad&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// 1 byte&lt;/span&gt;
    &lt;span class="c1"&gt;// 7 bytes padding&lt;/span&gt;
    &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;// 8 bytes&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// 4 bytes&lt;/span&gt;
    &lt;span class="c1"&gt;// 4 bytes padding&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="c1"&gt;// Size: 24 bytes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Good Code Example: Well-Ordered Struct
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;Good&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;// 8 bytes&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// 4 bytes&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// 1 byte&lt;/span&gt;
    &lt;span class="c1"&gt;// 3 bytes padding&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="c1"&gt;// Size: 16 bytes (no padding between members)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
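&lt;p&gt;You can let the compiler confirm these layouts at build time. A small sketch, assuming a common 64-bit ABI (x86-64 SysV, AArch64) where &lt;code&gt;double&lt;/code&gt; is 8-byte aligned; exact sizes are implementation-defined in general:&lt;/p&gt;

```cpp
#include <cstddef>

struct Bad  { char a; double c; int b; };  // char, 7B pad, double, int, 4B pad
struct Good { double c; int b; char a; };  // double, int, char, 3B trailing pad

// These hold on typical 64-bit ABIs; a failing assert would flag a
// platform where the padding assumptions in the comments don't apply.
static_assert(sizeof(Bad)  == 24, "Bad carries 11 bytes of padding");
static_assert(sizeof(Good) == 16, "Good only pads 3 bytes at the end");
```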



&lt;h3&gt;
  
  
  3.2 Hot/Cold Data Splitting
&lt;/h3&gt;

&lt;p&gt;Objects often contain data we check frequently (ID, Health) and data we rarely check (Name, Biography).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; If a struct is &lt;code&gt;200 bytes&lt;/code&gt; (mostly text strings), each object spans four &lt;code&gt;64-byte&lt;/code&gt; cache lines. Iterating over them fills the cache with Cold text data you aren't reading, evicting useful data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do&lt;/strong&gt;: Move rare data to a separate pointer or array.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bad Code Example: Mixed Hot/Cold Data
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;              &lt;span class="c1"&gt;// HOT&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;         &lt;span class="c1"&gt;// HOT&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;  &lt;span class="c1"&gt;// COLD (Pollutes cache)&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Good Code Example: Split Hot/Cold Data
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;UserHot&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;UserCold&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;coldData&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Pointer to cold data&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;UserCold&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
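&lt;p&gt;A sketch of the payoff, assuming a 64-bit target where &lt;code&gt;UserHot&lt;/code&gt; is 16 bytes (&lt;code&gt;total_balance&lt;/code&gt; is an illustrative helper, not from the paper):&lt;/p&gt;

```cpp
#include <vector>

struct UserCold { char username[128]; };                      // rarely-read payload
struct UserHot  { int id; int balance; UserCold* coldData; }; // 16B on 64-bit ABIs

// The hot loop now streams compact records: four UserHot objects fit per
// 64-byte cache line, versus one 136+-byte mixed User spanning three lines.
long total_balance(const std::vector<UserHot>& users) {
    long sum = 0;
    for (const UserHot& u : users) sum += u.balance;
    return sum;
}
```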



&lt;h3&gt;
  
  
  3.3 Struct of Arrays (SoA) vs Array of Structs (AoS)
&lt;/h3&gt;

&lt;p&gt;This is a classic battle in Game Development and Data-Oriented Design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Array of Structs (AoS) - The OOP Way:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="n"&gt;Point&lt;/span&gt; &lt;span class="n"&gt;points&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is good if you always access &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, and &lt;code&gt;z&lt;/code&gt; together. But often, you loop over just &lt;code&gt;x&lt;/code&gt; to do a physics calculation.&lt;br&gt;
&lt;strong&gt;The cost:&lt;/strong&gt; Every time you load &lt;code&gt;points[i].x&lt;/code&gt;, you also load &lt;code&gt;y&lt;/code&gt; and &lt;code&gt;z&lt;/code&gt; into the cache line, wasting 66% of your bandwidth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Struct of Arrays (SoA) - The Data-Oriented Way:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;Points&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, &lt;code&gt;x&lt;/code&gt; values are packed contiguously. One cache line load brings in 16 &lt;code&gt;x&lt;/code&gt; values at once. This is also &lt;strong&gt;perfect for SIMD&lt;/strong&gt; (Single Instruction Multiple Data) auto-vectorization.&lt;/p&gt;
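&lt;p&gt;A minimal sketch of the kind of loop that benefits (&lt;code&gt;sum_x&lt;/code&gt; is illustrative; it assumes 4-byte &lt;code&gt;int&lt;/code&gt; and 64-byte lines):&lt;/p&gt;

```cpp
#include <cstddef>

constexpr std::size_t N = 1000;
struct Points { int x[N]; int y[N]; int z[N]; };  // SoA layout

// Only x values move through the cache: 16 per 64-byte line, and the
// stride-1 loop is an easy target for compiler auto-vectorization (SIMD).
long sum_x(const Points& p) {
    long s = 0;
    for (std::size_t i = 0; i < N; ++i) s += p.x[i];
    return s;
}
```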

&lt;h3&gt;
  
  
  3.4 Alignment Matters
&lt;/h3&gt;

&lt;p&gt;CPUs love boundaries. Ideally, your data structures should start at addresses divisible by 64 (cache line size).&lt;br&gt;
&lt;strong&gt;C++ Solution:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;alignas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;AlignedData&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;critical_value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
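&lt;p&gt;A quick sketch of how to verify the guarantee (&lt;code&gt;on_own_cache_line&lt;/code&gt; is an illustrative helper; it assumes a 64-byte line):&lt;/p&gt;

```cpp
#include <cstdint>

struct alignas(64) AlignedData {
    int critical_value;
};

// alignas(64) also pads the struct out to 64 bytes, so an array of these
// places exactly one object per cache line.
static_assert(alignof(AlignedData) == 64, "objects start on a line boundary");
static_assert(sizeof(AlignedData) == 64, "padded out to a full line");

// Runtime sanity check: the object never straddles two cache lines.
bool on_own_cache_line(const AlignedData* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 64 == 0;
}
```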



&lt;h3&gt;
  
  
  3.5 Instruction Cache &amp;amp; Branch Prediction
&lt;/h3&gt;

&lt;p&gt;It's not just data that gets cached—instructions do too (L1i Cache). If your code jumps around unpredictably, the CPU pipeline stalls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Branch Hints:&lt;/strong&gt;&lt;br&gt;
Modern CPUs have powerful dynamic branch predictors that often figure out patterns better than you can. However, for static branches (like error checking), you can give the compiler a hint to move cold code away from hot code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;process_transaction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Transaction&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// "Cold" path: Compiler moves this assembly block to the end of the function&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unlikely&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nb"&gt;nullptr&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;handle_error&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; 
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// "Hot" path: Continues immediately in memory, keeping L1i efficient&lt;/span&gt;
    &lt;span class="n"&gt;do_math&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
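&lt;p&gt;If you can target C++20, the standard &lt;code&gt;[[likely]]&lt;/code&gt;/&lt;code&gt;[[unlikely]]&lt;/code&gt; attributes express the same hint without compiler-specific macros. A sketch with placeholder types (&lt;code&gt;Transaction&lt;/code&gt; here is illustrative, not the one above):&lt;/p&gt;

```cpp
// C++20 attribute form of the branch hint; pre-C++20 compilers simply
// ignore unknown attributes, so this degrades gracefully.
struct Transaction { int amount; };

int process(const Transaction* t) {
    if (t == nullptr) [[unlikely]] {   // cold path: laid out off the hot trace
        return -1;
    }
    return t->amount * 2;              // hot path: straight-line fall-through
}
```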






&lt;h2&gt;
  
  
  Subsection D: Concurrency &amp;amp; NUMA
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 The Silent Killer: False Sharing
&lt;/h3&gt;

&lt;p&gt;This is the most insidious performance bug in multithreading.&lt;br&gt;
Two threads on different cores modify variables that happen to sit on the &lt;strong&gt;same 64-byte cache line&lt;/strong&gt;. The cache coherence protocol (MESI) forces the line to bounce back and forth between cores ("ping-ponging"), stalling every write.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix (Padding):&lt;/strong&gt;&lt;br&gt;
Align critical shared data to 64 bytes to ensure it lives on its own island.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;PaddedCounter&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;alignas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;atomic&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// Padding is implicit due to alignas, but explicit padding &lt;/span&gt;
    &lt;span class="c1"&gt;// can also be used: char pad[60];&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="n"&gt;PaddedCounter&lt;/span&gt; &lt;span class="n"&gt;counters&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;NUM_THREADS&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c1"&gt;// Each counter is now on a separate line&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Result: often a 10x-50x speedup in contended write workloads.&lt;/em&gt;&lt;/p&gt;
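&lt;p&gt;If you'd rather not hard-code 64, C++17 exposes the cache-line size as &lt;code&gt;std::hardware_destructive_interference_size&lt;/code&gt;. A sketch, assuming the standard library ships the constant (some omit it, hence the fallback):&lt;/p&gt;

```cpp
#include <atomic>
#include <cstddef>
#include <new>  // std::hardware_destructive_interference_size (C++17)

// Feature-test macro guards against standard libraries that don't
// define the constant; 64 is a safe fallback on mainstream x86/ARM.
#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t kLine = std::hardware_destructive_interference_size;
#else
inline constexpr std::size_t kLine = 64;
#endif

struct PaddedCounter {
    alignas(kLine) std::atomic<int> value{0};
};

static_assert(sizeof(PaddedCounter) == kLine, "exactly one counter per line");
```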

&lt;h3&gt;
  
  
  4.2 Thread Affinity (Pinning)
&lt;/h3&gt;

&lt;p&gt;In a NUMA system, memory is local to a specific CPU socket. If the OS scheduler moves your thread to a different socket, it must access memory remotely (high latency).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution:&lt;/strong&gt; Pin the thread to a specific core (or socket).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;pthread.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;pin_thread_to_core&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;core_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;cpu_set_t&lt;/span&gt; &lt;span class="n"&gt;cpuset&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;CPU_ZERO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cpuset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;CPU_SET&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;core_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cpuset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;pthread_setaffinity_np&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pthread_self&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cpu_set_t&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cpuset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tooling:&lt;/strong&gt; Use &lt;code&gt;numactl&lt;/code&gt; to bind processes: &lt;code&gt;numactl --physcpubind=0-3 --membind=0 ./myapp&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Subsection E: Prefetching
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Helping the Hardware
&lt;/h3&gt;

&lt;p&gt;Hardware prefetchers are great at standard patterns (&lt;code&gt;i++&lt;/code&gt;), but they struggle with pointer lookups (&lt;code&gt;p = p-&amp;gt;next&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Software Prefetching:&lt;/strong&gt;&lt;br&gt;
You can issue a non-blocking instruction to fetch a line into L1 before you need it. Use &lt;code&gt;__builtin_prefetch(addr, rw, locality)&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// locality: 3 = heavy reuse (L1), 0 = no reuse (streaming)&lt;/span&gt;
    &lt;span class="c1"&gt;// rw: 0 = read, 1 = write&lt;/span&gt;
    &lt;span class="n"&gt;__builtin_prefetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;do_heavy_work&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
    &lt;span class="c1"&gt;// By the time work is done, node-&amp;gt;next is hopefully in L1.&lt;/span&gt;

    &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; Tuning this is hard. Prefetch too early, and you evict useful data. Prefetch too late, and it hasn't arrived. &lt;strong&gt;Measure everything.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Tools for Performance Engineers
&lt;/h2&gt;

&lt;p&gt;Don't guess—measure.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;perf (Linux):&lt;/strong&gt; The gold standard.

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;perf stat -e cycles,cache-misses,instructions ./app&lt;/code&gt;: Check IPC and miss rates.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;perf record -g ./app&lt;/code&gt; &amp;amp; &lt;code&gt;perf report&lt;/code&gt;: Find exactly &lt;em&gt;where&lt;/em&gt; cache misses happen.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;valgrind (Cachegrind):&lt;/strong&gt; &lt;code&gt;valgrind --tool=cachegrind ./app&lt;/code&gt;. Slow, but gives deterministic cache simulation.&lt;/li&gt;

&lt;li&gt;  &lt;strong&gt;lscpu / hwloc:&lt;/strong&gt; View your topology (L1 sizes, NUMA nodes).&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Quick Cheat Sheet
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mechanic&lt;/th&gt;
&lt;th&gt;Do ...&lt;/th&gt;
&lt;th&gt;Don't ...&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Containers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prefer &lt;code&gt;std::vector&lt;/code&gt; (Contiguous).&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;std::list&lt;/code&gt; (Linked Lists are cache poison).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Indirection&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Flatten 2D arrays to 1D vectors.&lt;/td&gt;
&lt;td&gt;Use &lt;code&gt;vector&amp;lt;vector&amp;lt;T&amp;gt;&amp;gt;&lt;/code&gt; (Double Indirection).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Struct Packing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Order members: Largest to Smallest.&lt;/td&gt;
&lt;td&gt;Order randomly (creates padding/holes).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hot/Cold Data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Split rare fields into separate structs.&lt;/td&gt;
&lt;td&gt;Pollute cache lines with unused data strings.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Layout&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Use &lt;strong&gt;Struct of Arrays (SoA)&lt;/strong&gt; for bulk processing.&lt;/td&gt;
&lt;td&gt;Use Array of Structs (AoS) for everything.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Alignment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Align structs/arrays to 64B.&lt;/td&gt;
&lt;td&gt;Use unaligned addresses for SIMD/Streaming.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Concurrency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pad atomic counters to 64B.&lt;/td&gt;
&lt;td&gt;Let threads fight over the same cache line.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Huge Pages&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Use 2MB pages for &amp;gt;100MB arrays.&lt;/td&gt;
&lt;td&gt;Rely on 4KB pages for massive working sets.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I hope this overview of Drepper's work helps you write code that the hardware loves. Happy Coding!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>computerscience</category>
      <category>ai</category>
      <category>cpp</category>
    </item>
    <item>
      <title>What Every Programmer Should Know About Memory Part 3</title>
      <dc:creator>Hamza Hasanain</dc:creator>
      <pubDate>Fri, 02 Jan 2026 08:35:00 +0000</pubDate>
      <link>https://dev.to/hamzahassanain0/what-every-programmer-should-know-about-memory-part-3-2i6k</link>
      <guid>https://dev.to/hamzahassanain0/what-every-programmer-should-know-about-memory-part-3-2i6k</guid>
      <description>&lt;h3&gt;
  
  
  Geography Matters: NUMA Support
&lt;/h3&gt;

&lt;p&gt;In the previous article What Every Programmer Should Know About Memory Part 2, we talked about Virtual Memory and how it translates the lies of the OS into physical reality. We covered page tables, the TLB, and how the hardware walks the tree to find your data.&lt;/p&gt;

&lt;p&gt;In this article, we continue from where we left off and cover &lt;strong&gt;section 5&lt;/strong&gt; from the paper &lt;a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf" rel="noopener noreferrer"&gt;What Every Programmer Should Know About Memory&lt;/a&gt; by Ulrich Drepper.&lt;/p&gt;

&lt;p&gt;Up until now, we've mostly pretended that all RAM is created equal. We assumed that if you have &lt;code&gt;16GB&lt;/code&gt; of RAM, accessing byte &lt;code&gt;0&lt;/code&gt; is just as fast as accessing byte &lt;code&gt;15,999,999,999&lt;/code&gt;. In the old days of &lt;strong&gt;SMP&lt;/strong&gt; (Symmetric Multi-Processing), this was true. All CPUs connected to a single memory controller via a single bus.&lt;/p&gt;

&lt;p&gt;But as core counts exploded, that single bus became a bottleneck. The solution was to split the memory up and give each CPU its own local memory. This created &lt;strong&gt;NUMA (Non-Uniform Memory Access)&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;UMA vs. NUMA: The Death of Equality&lt;/li&gt;
&lt;li&gt;
The Cost of Remote Access

&lt;ul&gt;
&lt;li&gt;2.1. The Latency Penalty
&lt;/li&gt;
&lt;li&gt;2.2. Bandwidth Saturation: The Clogged Pipe
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
OS Policies: The "First Touch" Trap

&lt;ul&gt;
&lt;li&gt;3.1. How Linux Allocates Memory
&lt;/li&gt;
&lt;li&gt;3.2. The Trap: Main Thread Initialization
&lt;/li&gt;
&lt;li&gt;3.3. The "Spillover" Behavior (Zone Reclaim)
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
Tools of the Trade

&lt;ul&gt;
&lt;li&gt;4.1. Analyzing with &lt;code&gt;lscpu&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;4.2. The Distance Matrix (&lt;code&gt;numactl&lt;/code&gt;)
&lt;/li&gt;
&lt;li&gt;4.3. Controlling Policy with &lt;code&gt;numactl&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;4.4. Programming with &lt;code&gt;libnuma&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. UMA vs. NUMA: The Death of Equality
&lt;/h2&gt;

&lt;p&gt;To understand why modern servers behave the way they do, we need to look at the evolution of memory architectures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuv9ey5tt8thmii0dzc6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuv9ey5tt8thmii0dzc6.jpg" alt="UMA vs NUMA Architecture" width="520" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 UMA (Uniform Memory Access)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Old Way:&lt;/strong&gt; In the days of &lt;strong&gt;SMP (Symmetric Multi-Processing)&lt;/strong&gt;, we had a single memory controller and a single system bus. All CPUs connected to this bus.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What:&lt;/strong&gt; "Uniform" means the cost to access RAM is the same for every core. Accessing address &lt;code&gt;0x0&lt;/code&gt; takes 100ns for Core 0 and 100ns for Core 1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it failed:&lt;/strong&gt; The shared bus became a bottleneck. As we added more cores (2, 4, 8...), they all fought for the same bandwidth. It was like having 64 cars trying to use a single lane highway.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1.2 NUMA (Non-Uniform Memory Access)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The New Way:&lt;/strong&gt; To solve the bottleneck, hardware architects split the memory up.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What:&lt;/strong&gt; Instead of one giant bank of RAM, we attach a dedicated chunk of RAM to each processor socket. Each Processor + its Local RAM is called a &lt;strong&gt;NUMA Node&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How:&lt;/strong&gt; The nodes are connected by a high-speed interconnect (like Intel UPI or AMD Infinity Fabric). If CPU 0 needs data from CPU 1's memory, it asks CPU 1 to fetch it and ship it over the wire.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture solves the bandwidth problem (multiple highways!) but introduces a new problem: &lt;strong&gt;Physics&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The Cost of Remote Access
&lt;/h2&gt;

&lt;p&gt;Now that memory is physically distributed, distance matters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fencrypted-tbn0.gstatic.com%2Fimages%3Fq%3Dtbn%3AANd9GcTobdgw0WuTjvyQbh306uM_CATlYDLpj8Qmkg%26s" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fencrypted-tbn0.gstatic.com%2Fimages%3Fq%3Dtbn%3AANd9GcTobdgw0WuTjvyQbh306uM_CATlYDLpj8Qmkg%26s" alt="NUMA Local vs Remote Access" width="451" height="112"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If a CPU on &lt;strong&gt;Node 0&lt;/strong&gt; needs data located in &lt;strong&gt;Node 0's&lt;/strong&gt; RAM, the path is short and fast.&lt;br&gt;
If a CPU on &lt;strong&gt;Node 0&lt;/strong&gt; needs data located in &lt;strong&gt;Node 1's&lt;/strong&gt; RAM, the request must travel over the interconnect to Node 1, wait for Node 1's memory controller to fetch it, and ship it back.&lt;/p&gt;
&lt;h3&gt;
  
  
  2.1 The Latency Penalty
&lt;/h3&gt;

&lt;p&gt;We often measure this cost as a "latency factor."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local Access:&lt;/strong&gt; 1.0 (Baseline)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remote Access:&lt;/strong&gt; 1.5x - 2.0x Slower&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means every cache miss that hits remote memory is twice as expensive as a local miss. In high-performance computing (HPC) or low-latency trading, this is a disaster.&lt;/p&gt;
&lt;h3&gt;
  
  
  2.2 Bandwidth Saturation: The Clogged Pipe
&lt;/h3&gt;

&lt;p&gt;It's not just about speed; it's about capacity. The interconnect between sockets has a limited bandwidth.&lt;/p&gt;

&lt;p&gt;If you write a program where &lt;strong&gt;all&lt;/strong&gt; threads on all 64 cores are aggressively reading from &lt;strong&gt;Node 0's&lt;/strong&gt; memory, you create a traffic jam. The local cores on Node 0 might get their data fine, but the remote cores on other nodes will see massive stalls as they fight for space on the interconnect.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. OS Policies: The "First Touch" Trap
&lt;/h2&gt;

&lt;p&gt;So how does the OS decide where to put your memory? If you &lt;code&gt;malloc(1GB)&lt;/code&gt;, does it go to Node 0 or Node 1?&lt;/p&gt;

&lt;p&gt;Linux uses a policy called &lt;strong&gt;First-Touch Allocation&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  3.1 How Linux Allocates Memory
&lt;/h3&gt;

&lt;p&gt;When you call &lt;code&gt;malloc(1GB)&lt;/code&gt;, the kernel doesn't actually give you physical RAM. It gives you a promise (Virtual Memory).&lt;br&gt;
The physical RAM is allocated &lt;strong&gt;only when you write to that page for the first time&lt;/strong&gt;. This is called a Page Fault.&lt;/p&gt;

&lt;p&gt;At that exact moment, the kernel looks at &lt;strong&gt;which CPU&lt;/strong&gt; triggered the page fault. It says, "Ah, you are running on CPU 5, which belongs to Node 0. I will allocate this physical page from Node 0's RAM to make it fast for you."&lt;/p&gt;

&lt;p&gt;This is normally good, but it leads to a deadly trap.&lt;/p&gt;
&lt;h3&gt;
  
  
  3.2 The Trap: Main Thread Initialization
&lt;/h3&gt;

&lt;p&gt;This policy leads to one of the most common performance bugs in high-performance applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Scenario:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You start your program. The &lt;strong&gt;Main Thread&lt;/strong&gt; (running on Node 0) allocates a huge array and initializes it to zero (&lt;code&gt;memset&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Because the Main Thread touched all the pages, the OS dutifully allocates &lt;strong&gt;100% of the RAM on Node 0&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;You spawn 64 worker threads (spread across Node 0, 1, 2, 3) to process the data in parallel.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh3.googleusercontent.com%2Frd-gg-dl%2FABS2GSml7htfQfBzhM0yWCtrNDoRxu2mG6BHRF-drK2WpOZwuoLsNkWwsj_XOAC0iegfodrA6G6DsZ_AAOqdpBOyKMwmunM7JBWM13wFzJdITdZr8CoiRI7RaaHm528IWkp6Pc3XWjybSPoKj8k5wmnkArBJ5tboqoj0gdB5VqC3k1QTo0KfjiwEigmkF8EiDyoqGz3V3k00ntjZAaV6-fO36KPSC4R4lSJQZ4Vt_b7or--FQj_UUP2PI7c6bwIn0ibR56uIJr5DOOnD6ZkFOO4YS40L8u8am12i7smhbruxAOmnej99pzWe56BD0CpgCIpz_qpXqjD2eT3ut6NYW8GOeEbEMD-EuG3ncthrZbxVF6vuabA2EX9-TFyUiRh2CfKGYu6uxb1NHdQzkkJuMb_9yAkwfiAqZrpP6GPyiv0iFRH-vpjBT7qX3ELLtr_Uapi6ygQOiK5qjdpBgEhVwUzylT_ll1R3Qg3keilQZs65lIV7csxBj5XMkGoeEX3sM9tGQcdjHukNl8-ZdJGi4451q0OJUrned1gaJNW_vFrQ2VAow2CaYc6pIrMSszFOiG1VtXCZUFJBmKqQPidBQr07uhBAO7M9rNYRLnp69A9c-35TbAzYh-c_HosOGN0-DuezAWcZiH5wjsa21ze_A3SYrtBTca-g4yylvWuIAdNwEIO-1qu4pZ-ut4AkXyWB6vmo0flExvSv8JZPYuMo9XT05v54BcwtnSHYrb5NJv-KGkewLAe7ZHD5WJoxZ45L5hrRNSa-pF9js6__l6zWBd-bevcDxwkgJMbK_OOe95tK3DP0x1kVmvMPeGGGQmMc8h_Bdu-kEglUM9kKgLDdTt1lV-11xg1MteaMCUSCKRbn-i4S5LLlsjP0WtU5WMuQ1hdSoPf0onGOZSpdvPOzJa0AiIMaOUiYI18exvIwzq9uLkIlp7zcVfQzMmTUNFOYqJ7iH0xAy77l3ThGGHbeS3mSt7cw_nx6GG4ZjQUlO6HsC53uiATAYakinZFxdOdaivsAyCais286twdhJgJRgavTcwidJ--x3USLAZ1kJjPLe1D0P_6p-aGSLdJbmezfqp-qbQacHwL6mTUpR-TvCFT98CGCpdHSegBMVALyBbMkx8i10Dv4ASo8_8LC-Q0F_XE2_G4fMM6BuoMLbAqsQ_XN-1lZR4JP0OOfQH4zUQrXPhmocWIf3rYEf9-iBS1r5uYlJopox99hzOYuXBNYhTn6HAl6wU2bpjYYfvUXrrSawrpTjFhwJcijIebD-QCD%3Ds1024-rj" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh3.googleusercontent.com%2Frd-gg-dl%2FABS2GSml7htfQfBzhM0yWCtrNDoRxu2mG6BHRF-drK2WpOZwuoLsNkWwsj_XOAC0iegfodrA6G6DsZ_AAOqdpBOyKMwmunM7JBWM13wFzJdITdZr8CoiRI7RaaHm528IWkp6Pc3XWjybSPoKj8k5wmnkArBJ5tboqoj0gdB5VqC3k1QTo0KfjiwEigmkF8EiDyoqGz3V3k00ntjZAaV6-fO36KPSC4R4lSJQZ4Vt_b7or--FQj_UUP2PI7c6bwIn0ibR56uIJr5DOOnD6ZkFOO4YS40L8u8am12i7smhbruxAOmnej99pzWe56BD0CpgCIpz_qpXqjD2eT3ut6NYW8GOeEbEMD-EuG3ncthrZbxVF6vuabA2EX9-TFyUiRh2CfKGYu6uxb1NHdQzkkJuMb_9yAkwfiAqZrpP6GPyiv0iFRH-vpjBT7qX3ELLtr_Uapi6ygQOiK5qjdpBgEhVwUzylT_ll1R3Qg3keilQZs65lIV7csxBj5XMkGoeEX3sM9tGQcdjHukNl8-ZdJGi4451q0OJUrned1gaJNW_vFrQ2VAow2CaYc6pIrMSszFOiG1VtXCZUFJBmKqQPidBQr07uhBAO7M9rNYRLnp69A9c-35TbAzYh-c_HosOGN0-DuezAWcZiH5wjsa21ze_A3SYrtBTca-g4yylvWuIAdNwEIO-1qu4pZ-ut4AkXyWB6vmo0flExvSv8JZPYuMo9XT05v54BcwtnSHYrb5NJv-KGkewLAe7ZHD5WJoxZ45L5hrRNSa-pF9js6__l6zWBd-bevcDxwkgJMbK_OOe95tK3DP0x1kVmvMPeGGGQmMc8h_Bdu-kEglUM9kKgLDdTt1lV-11xg1MteaMCUSCKRbn-i4S5LLlsjP0WtU5WMuQ1hdSoPf0onGOZSpdvPOzJa0AiIMaOUiYI18exvIwzq9uLkIlp7zcVfQzMmTUNFOYqJ7iH0xAy77l3ThGGHbeS3mSt7cw_nx6GG4ZjQUlO6HsC53uiATAYakinZFxdOdaivsAyCais286twdhJgJRgavTcwidJ--x3USLAZ1kJjPLe1D0P_6p-aGSLdJbmezfqp-qbQacHwL6mTUpR-TvCFT98CGCpdHSegBMVALyBbMkx8i10Dv4ASo8_8LC-Q0F_XE2_G4fMM6BuoMLbAqsQ_XN-1lZR4JP0OOfQH4zUQrXPhmocWIf3rYEf9-iBS1r5uYlJopox99hzOYuXBNYhTn6HAl6wU2bpjYYfvUXrrSawrpTjFhwJcijIebD-QCD%3Ds1024-rj" alt="First Touch Trap)" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Threads on Node 0 are happy (Local access).&lt;/li&gt;
&lt;li&gt;Threads on Nodes 1, 2, and 3 are miserable: all of them are forced to fetch data remotely from Node 0.&lt;/li&gt;
&lt;li&gt;The interconnect to Node 0 becomes saturated.&lt;/li&gt;
&lt;li&gt;Performance scales poorly, and you wonder why adding more cores made it slower.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Parallel Initialization&lt;/strong&gt;. Don't let the main thread &lt;code&gt;memset&lt;/code&gt; everything. Have your worker threads initialize the specific chunks of data they will be working on. This ensures the physical memory pages are allocated on the local nodes where the workers live.&lt;/p&gt;
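&lt;p&gt;The fix can be sketched with plain &lt;code&gt;std::thread&lt;/code&gt;. This is an illustrative example, not code from Drepper's paper; the function name &lt;code&gt;parallel_first_touch&lt;/code&gt; is made up, and it assumes the worker threads are pinned so that each runs on the node that will later use its chunk:&lt;/p&gt;

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative sketch of the first-touch fix: each worker zeroes the chunk
// it will later operate on. The kernel assigns physical frames on the first
// write (the "first touch"), so each chunk ends up backed by memory local
// to the node its worker runs on -- assuming threads are pinned per node.
void parallel_first_touch(double* data, std::size_t n, std::size_t num_workers) {
    std::vector<std::thread> workers;
    const std::size_t chunk = n / num_workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        workers.emplace_back([=] {
            const std::size_t begin = w * chunk;
            const std::size_t end = (w + 1 == num_workers) ? n : begin + chunk;
            for (std::size_t i = begin; i < end; ++i)
                data[i] = 0.0;  // page fault + local frame allocation happens here
        });
    }
    for (auto& t : workers) t.join();
}
```

&lt;p&gt;Combined with pinning (e.g. &lt;code&gt;numactl --cpunodebind&lt;/code&gt; or &lt;code&gt;pthread_setaffinity_np&lt;/code&gt;), this spreads the physical pages across nodes instead of piling them all onto Node 0.&lt;/p&gt;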
&lt;h3&gt;
  
  
  3.3 The "Spillover" Behavior (Zone Reclaim)
&lt;/h3&gt;

&lt;p&gt;What happens if Node 0 is full? By default (with &lt;code&gt;zone_reclaim_mode&lt;/code&gt; disabled), if a thread on Node 0 requests memory and Node 0 is full, Linux will allocate from Node 1 rather than failing the allocation.&lt;/p&gt;

&lt;p&gt;This creates unpredictable latency spikes. Your application runs fast for the first 30 minutes, fills up Node 0, and suddenly slows down by 50% because new allocations are silently spilling over to Node 1. Monitoring the &lt;code&gt;numa_miss&lt;/code&gt; counters (exposed under &lt;code&gt;/sys/devices/system/node/&lt;/code&gt; and summarized by &lt;code&gt;numastat&lt;/code&gt;) is the most reliable way to catch this.&lt;/p&gt;
&lt;h2&gt;
  
  
  4. Tools of the Trade
&lt;/h2&gt;

&lt;p&gt;How do you know if you are running on a NUMA machine?&lt;/p&gt;
&lt;h3&gt;
  
  
  4.1 Analyzing with &lt;code&gt;lscpu&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Open your terminal and type &lt;code&gt;lscpu&lt;/code&gt;. It reveals the truth about your hardware.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;lscpu
...
NUMA node&lt;span class="o"&gt;(&lt;/span&gt;s&lt;span class="o"&gt;)&lt;/span&gt;:          2
NUMA node0 CPU&lt;span class="o"&gt;(&lt;/span&gt;s&lt;span class="o"&gt;)&lt;/span&gt;:     0-31
NUMA node1 CPU&lt;span class="o"&gt;(&lt;/span&gt;s&lt;span class="o"&gt;)&lt;/span&gt;:     32-63
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NUMA node(s): 2&lt;/strong&gt; -&amp;gt; You have 2 distinct memory banks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NUMA node0 CPU(s): 0-31&lt;/strong&gt; -&amp;gt; If you run a thread on Core 5, its local memory is Node 0. If it accesses Node 1, it pays the penalty.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4.2 The Distance Matrix (&lt;code&gt;numactl&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;To see exactly how "remote" a node is, use &lt;code&gt;numactl --hardware&lt;/code&gt;. The "node distances" table at the bottom is key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node distances:
node   0   1
  0:  10  21
  1:  21  10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh3.googleusercontent.com%2Fgg%2FAIJ2gl_Aqgzjy6F5KIK462fWbk6xjjhWJ3A_nqf7OrhORuYPtxSRHSMFD9YVoVegmReUvx5BhWP3IW6xmuhSSJCxfO134O8k34FZ2iCgWmC1yxPcozixx2KlQKBVP23p0aWbAEIvvoVEWzWzg24k507b9D6U7q23VGVtRdABZBbT9PGtPatuA3b2%3Ds1024-rj-mp2" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh3.googleusercontent.com%2Fgg%2FAIJ2gl_Aqgzjy6F5KIK462fWbk6xjjhWJ3A_nqf7OrhORuYPtxSRHSMFD9YVoVegmReUvx5BhWP3IW6xmuhSSJCxfO134O8k34FZ2iCgWmC1yxPcozixx2KlQKBVP23p0aWbAEIvvoVEWzWzg24k507b9D6U7q23VGVtRdABZBbT9PGtPatuA3b2%3Ds1024-rj-mp2" alt="Distance Map)" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;10&lt;/strong&gt;: Represents local access (the baseline cost).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;21&lt;/strong&gt;: Represents the cost to cross the interconnect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you saw a value like 30 or 40, that would imply an even longer path (like jumping over two sockets in a 4-socket server).&lt;/p&gt;

&lt;h3&gt;
  
  
  4.3 Controlling Policy with &lt;code&gt;numactl&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;You can override the default OS behavior using &lt;code&gt;numactl&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interleaving:&lt;/strong&gt;&lt;br&gt;
If you have a read-only lookup table that every thread accesses randomly, "First Touch" is bad (it unfairly burdens one node). Instead, you can force the OS to spread the pages round-robin across all nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Interleave memory allocation across all nodes&lt;/span&gt;
numactl &lt;span class="nt"&gt;--interleave&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;all ./my_application
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Binding:&lt;/strong&gt;&lt;br&gt;
You can also strict-bind a process to a specific node, ensuring it never inadvertently runs on a remote core or allocates remote memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run only on Node 0's CPUs, allocate only from Node 0's RAM&lt;/span&gt;
numactl &lt;span class="nt"&gt;--cpunodebind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nt"&gt;--membind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 ./my_application
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.4 Programming with &lt;code&gt;libnuma&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Sometimes you can't control how the user runs your binary. You can enforce memory policy directly in C++ using &lt;code&gt;libnuma&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;numa.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="c1"&gt;// Allocate 10MB specifically on Node 0&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;numa_alloc_onnode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Or run this thread only on Node 0&lt;/span&gt;
&lt;span class="n"&gt;numa_run_on_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This requires linking with &lt;code&gt;-lnuma&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  5. Conclusion
&lt;/h2&gt;

&lt;p&gt;Ignoring NUMA is ignoring the laws of physics in your server. As programmers, we can't change the hardware, but we can change how we behave on it.&lt;/p&gt;

&lt;p&gt;By respecting concepts like &lt;strong&gt;First-Touch&lt;/strong&gt;, understanding the &lt;strong&gt;Interconnect Penalty&lt;/strong&gt;, and pinning our threads appropriately, we can stop fighting the hardware and start working with it.&lt;/p&gt;

&lt;p&gt;In the next and final part, we will cover &lt;strong&gt;Section 6: What Programmers Can Do&lt;/strong&gt;. This will be a massive deep dive into cache blocking, data layout (SoA vs AoS), and the infamous False Sharing effect.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>webdev</category>
      <category>computerscience</category>
      <category>lowcode</category>
    </item>
    <item>
      <title>What Every Programmer Should Know About Memory Part 2</title>
      <dc:creator>Hamza Hasanain</dc:creator>
      <pubDate>Tue, 25 Nov 2025 09:13:52 +0000</pubDate>
      <link>https://dev.to/hamzahassanain0/what-every-programmer-should-know-about-memory-part-2-125m</link>
      <guid>https://dev.to/hamzahassanain0/what-every-programmer-should-know-about-memory-part-2-125m</guid>
      <description>&lt;h3&gt;
  
  
Why does your pointer not point where you think it does?
&lt;/h3&gt;

&lt;p&gt;In the previous article &lt;a href="https://dev.to/hamzahassanain0/what-every-programmer-should-know-about-memory-part-1-385e"&gt;What Every Programmer Should Know About Memory (Part 1)&lt;/a&gt;, we covered sections &lt;strong&gt;2 and 3&lt;/strong&gt; from the article: &lt;strong&gt;What Every Programmer Should Know About Memory&lt;/strong&gt; by Ulrich Drepper. In this article, we will continue from where we left off and cover section &lt;strong&gt;4&lt;/strong&gt; (yes, section 4 only).&lt;/p&gt;

&lt;p&gt;The previous article explored memory hierarchies from the ground up — how DRAM hardware works, why CPU caches exist, and practical optimization techniques like cache-line awareness and data structure layout. We examined the physical reality behind the "flat array" abstraction and learned why memory access patterns matter for performance.&lt;/p&gt;

&lt;p&gt;In this article, we continue with &lt;strong&gt;section 4&lt;/strong&gt; of Ulrich Drepper's paper, diving deep into &lt;strong&gt;Virtual Memory&lt;/strong&gt; — the translation layer that gives every process its own address space while sharing physical RAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Prerequisites: The Basics&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;0.1. Paging?
&lt;/li&gt;
&lt;li&gt;0.2. More Concepts
&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;The Illusion of Ownership: Virtual vs. Physical&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;1.1. The Sandbox: How the MMU makes every process believe it owns the entire RAM
&lt;/li&gt;
&lt;li&gt;1.2. The Cost of Translation: Why a single virtual address might require 4+ physical memory accesses before you even touch your data
&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;The Page Table Walk: A Tree Structure&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;2.1. Why We Can't Use Flat Tables: The impossibility of a 4MB directory for every process
&lt;/li&gt;
&lt;li&gt;2.2. The Multi-Level Solution: Breaking addresses into directories (L4 → L3 → L2 → L1)
&lt;/li&gt;
&lt;li&gt;2.3. The Hardware Walker: How the processor "walks the tree" to find physical pages
&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;The Accelerator: The TLB (Translation Look-Aside Buffer)&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;3.1. Caching the Address: The TLB as a tiny, ultra-fast cache specifically for address translations
&lt;/li&gt;
&lt;li&gt;3.2. TLB Thrashing: A Practical Example
&lt;/li&gt;
&lt;li&gt;3.3. The Context Switch Penalty: Why switching processes forces a TLB flush (and why it's expensive)
&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Optimization: Making the TLB Bigger (Without Hardware Changes)&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;4.1. The Page Size Limit: Why 4KB pages clog up the TLB
&lt;/li&gt;
&lt;li&gt;4.2. Huge Pages (2MB/1GB): Increasing the range of a single TLB entry to reduce misses
&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Conclusion: Respecting the Translation Layer&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  0 Prerequisites: The Basics
&lt;/h2&gt;

&lt;p&gt;Before diving into the details of Virtual Memory, let's define a few key concepts that will help you understand the rest of the article.&lt;/p&gt;

&lt;h3&gt;
  
  
  0.1 &lt;strong&gt;Paging?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A page is a fixed-length contiguous block of virtual memory. In most systems, the default page size is 4KB (4096 bytes), although larger page sizes (like 2MB or 1GB) can also be used for specific applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Paging:&lt;/strong&gt; is a memory management scheme that eliminates the need for &lt;strong&gt;contiguous&lt;/strong&gt; allocation of physical memory. Instead, it divides virtual memory into pages and maps them to physical memory &lt;strong&gt;frames&lt;/strong&gt;, allowing for more efficient use of RAM and enabling features like virtual memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Physical Frame:&lt;/strong&gt; is a fixed-length block of physical memory that corresponds to a page in virtual memory. The operating system maintains a mapping between virtual pages and physical frames, allowing processes to access memory without needing to know the actual physical location of their data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frames vs. Pages:&lt;/strong&gt; In short, a "page" is a fixed-size block of &lt;strong&gt;virtual&lt;/strong&gt; memory, while a "frame" is a fixed-size block of &lt;strong&gt;physical&lt;/strong&gt; memory. A page is what the process sees; the frame it maps to is where the data actually lives in RAM.&lt;/p&gt;
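&lt;p&gt;The page-to-frame mapping boils down to simple bit arithmetic. A minimal sketch, assuming 4KB pages (the function names here are illustrative, not from the paper):&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// With 4 KB (2^12 byte) pages, the low 12 bits of an address are the offset
// within the page, and the remaining high bits select the virtual page.
constexpr std::uint64_t kPageBits = 12;
constexpr std::uint64_t kPageSize = 1ull << kPageBits;  // 4096

constexpr std::uint64_t page_number(std::uint64_t vaddr) {
    return vaddr >> kPageBits;
}
constexpr std::uint64_t page_offset(std::uint64_t vaddr) {
    return vaddr & (kPageSize - 1);
}

// Translation replaces the virtual page number with the physical frame
// number the OS chose; the offset within the page is unchanged.
constexpr std::uint64_t to_physical(std::uint64_t frame, std::uint64_t vaddr) {
    return (frame << kPageBits) | page_offset(vaddr);
}
```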

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gt29aiuu0vn79imsmui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gt29aiuu0vn79imsmui.png" alt="Diagram showing Virtual Address Space (contiguous blocks 0, 1, 2) vs Physical RAM (scattered frames 55, 12, 9)" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  0.2 &lt;strong&gt;More Concepts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;There is much more to virtualization, but for the sake of saving time, we will see only one-line definitions of some important concepts that will help you understand the rest of the article.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Address Space:&lt;/strong&gt; The range of memory addresses that a process can use. Each process has its own virtual address space, which is mapped to physical memory by the operating system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory Management Unit (MMU):&lt;/strong&gt; A hardware component that handles the translation of virtual addresses to physical addresses. It works in conjunction with the operating system to manage memory access and enforce protection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Page Table:&lt;/strong&gt; A data structure used by the operating system to keep track of the mapping between virtual pages and physical frames. Each process has its own page table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TLB (Translation Lookaside Buffer):&lt;/strong&gt; A small, fast cache that stores recent translations of virtual addresses to physical addresses. It helps speed up the address translation process by reducing the number of memory accesses needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  1 The Illusion of Ownership: Virtual vs. Physical
&lt;/h2&gt;

&lt;p&gt;As you probably understand by now (having read the Paging? prerequisite above), virtual memory creates an &lt;strong&gt;illusion&lt;/strong&gt; for each process that it has its own dedicated physical memory. In reality, the operating system manages the physical memory and allocates it to processes as needed.&lt;/p&gt;

&lt;p&gt;Now, we will explore in a bit more detail how this illusion is created and maintained.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 The Sandbox: How the MMU makes every process believe it owns the entire RAM
&lt;/h3&gt;

&lt;p&gt;We know what an &lt;strong&gt;MMU&lt;/strong&gt; is and what it does (see More Concepts), but how does it handle this translation? How does it know which virtual address maps to which physical address?&lt;/p&gt;

&lt;p&gt;Let's talk about the &lt;strong&gt;Levels Of Translation&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single-Level Translation:&lt;/strong&gt; In a simple system, the MMU uses a single-level page table to map virtual addresses to physical addresses. Each entry in the page table corresponds to a virtual page and contains the physical frame number where that page is stored.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxbmapzhexkn6mlh71k6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxbmapzhexkn6mlh71k6.png" alt="Single Level Translation" width="720" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Level Translation:&lt;/strong&gt; In more complex systems, the MMU uses a multi-level page table to reduce memory overhead. The virtual address is divided into multiple parts, each part indexing into a different level of the page table. This hierarchical structure allows for more efficient use of memory.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhaxibnyr008wd90e59e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnhaxibnyr008wd90e59e.png" alt="Multi Level Translation" width="800" height="605"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 The Cost of Translation: Why a single virtual address might require 4+ physical memory accesses before you even touch your data
&lt;/h3&gt;

&lt;p&gt;Let's discuss the trade-offs between single-level and multi-level page tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single-level tables&lt;/strong&gt; are simple and fast (one lookup) but waste a massive amount of RAM for the table itself. &lt;strong&gt;Multi-level tables&lt;/strong&gt; save RAM by only allocating what is needed, but they are slower because they require multiple memory lookups to find the address.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Math of Latency:&lt;/strong&gt;&lt;br&gt;
Imagine a single memory access takes &lt;strong&gt;100ns&lt;/strong&gt;. If you have a 4-level page table and a TLB miss, you don't just wait 100ns for your data. You wait:&lt;br&gt;
100ns (L4) + 100ns (L3) + 100ns (L2) + 100ns (L1) + 100ns (Actual Data) = &lt;strong&gt;500ns&lt;/strong&gt;.&lt;br&gt;
That is a &lt;strong&gt;5x slowdown&lt;/strong&gt; just for translation!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  2 The Page Table Walk: A Tree Structure
&lt;/h2&gt;

&lt;p&gt;The page table walk is the process by which the &lt;strong&gt;MMU&lt;/strong&gt; translates a virtual address to a physical address using the page table. In a multi-level page table, this involves traversing a tree-like structure to find the correct mapping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Think of it like a Library Index:&lt;/strong&gt;&lt;br&gt;
If you had a single flat list of every book in the world, it would be impossible to hold. Instead, we use a hierarchy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;L4:&lt;/strong&gt; Which Floor?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L3:&lt;/strong&gt; Which Aisle?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L2:&lt;/strong&gt; Which Shelf?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;L1:&lt;/strong&gt; Which Book?&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  2.1 Why We Can't Use Flat Tables: The impossibility of a 4MB directory for every process
&lt;/h3&gt;

&lt;p&gt;As we discussed before, using a flat page table for every process would require a massive amount of memory, especially for systems with large address spaces. For example, in a 32-bit system with 4KB pages, a flat page table would require 4MB of memory per process (2^20 entries * 4 bytes per entry). This is impractical for systems with many processes or limited memory resources.&lt;/p&gt;
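&lt;p&gt;The 4MB figure is easy to verify with a few constants (an illustrative calculation, not code from the paper):&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// A 32-bit address space with 4 KB pages has 2^32 / 2^12 = 2^20 virtual
// pages; at 4 bytes per entry a flat table needs 4 MB -- per process,
// whether or not most of those pages are ever used.
constexpr std::uint64_t kAddressSpace = 1ull << 32;  // 4 GB
constexpr std::uint64_t kPageSize    = 1ull << 12;   // 4 KB
constexpr std::uint64_t kEntrySize   = 4;            // bytes per PTE

constexpr std::uint64_t flat_table_bytes =
    (kAddressSpace / kPageSize) * kEntrySize;        // 4 MB
```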
&lt;h3&gt;
  
  
  2.2 The Multi-Level Solution: Breaking addresses into directories (L4 → L3 → L2 → L1)
&lt;/h3&gt;

&lt;p&gt;To address the memory overhead issue, multi-level page tables break down the virtual address into multiple parts, each part indexing into a different level of the page table. This hierarchical structure allows the operating system to allocate page table entries only for used virtual pages, significantly reducing memory usage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56dzdo8ndtxo90gtk3z2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56dzdo8ndtxo90gtk3z2.jpg" alt="Multi Level Page Table" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  2.3 The Hardware Walker: How the processor "walks the tree" to find physical pages
&lt;/h3&gt;

&lt;p&gt;This is the interesting part! Here, we learn how the &lt;strong&gt;CPU&lt;/strong&gt; finds the physical address corresponding to a given virtual address using the multi-level page table structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Hardware Walker&lt;/strong&gt; is a component of the &lt;strong&gt;MMU&lt;/strong&gt; that is responsible for traversing the multi-level page table to find the physical address corresponding to a given virtual address.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyj8xmdp2grzsilb0ldcz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyj8xmdp2grzsilb0ldcz.png" alt="Diagram showing the CR3 Register (or TTBR) pointing to the physical address of the Level 4 Page Table" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;CR3 Register (or TTBR):&lt;/strong&gt; This special CPU register holds the physical address of the root of the page table (Level 4). When a context switch occurs, the operating system updates this register to point to the page table of the new process.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When a process accesses a virtual address, the hardware walker performs the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extract the Indices:&lt;/strong&gt; The hardware walker extracts the indices for each level of the page table from the virtual address. For example, in a 4-level page table, it would extract indices for L4, L3, L2, and L1.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Traverse the Page Table:&lt;/strong&gt; Starting from the root of the page table (L4), the hardware walker uses the extracted indices to navigate through each level of the page table. At each level, it reads the corresponding entry to find the address of the next level's page table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Find the Physical Address:&lt;/strong&gt; Once the hardware walker reaches the final level (L1), it retrieves the physical frame number from the page table entry. It then combines this frame number with the offset from the original virtual address to compute the final physical address.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
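&lt;p&gt;The index extraction in step 1 can be sketched for the common x86-64 layout (48-bit virtual addresses, 4KB pages, four 9-bit indices). This is an illustrative decomposition of the bits, not production code:&lt;/p&gt;

```cpp
#include <cassert>
#include <cstdint>

// On x86-64 with 4 KB pages, a 48-bit virtual address splits into four
// 9-bit page-table indices plus a 12-bit page offset:
//   bits 47-39: L4   bits 38-30: L3   bits 29-21: L2
//   bits 20-12: L1   bits 11-0:  offset within the page
struct WalkIndices {
    unsigned l4, l3, l2, l1, offset;
};

WalkIndices split(std::uint64_t vaddr) {
    return {
        static_cast<unsigned>((vaddr >> 39) & 0x1FF),  // 9 bits each
        static_cast<unsigned>((vaddr >> 30) & 0x1FF),
        static_cast<unsigned>((vaddr >> 21) & 0x1FF),
        static_cast<unsigned>((vaddr >> 12) & 0x1FF),
        static_cast<unsigned>(vaddr & 0xFFF),          // 12-bit offset
    };
}
```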
&lt;h2&gt;
  
  
  3 The Accelerator: The TLB (Translation Look-Aside Buffer)
&lt;/h2&gt;

&lt;p&gt;To avoid the performance hit of walking page tables for every access, processors cache the computed physical addresses in a specialized cache called the TLB.&lt;/p&gt;
&lt;h3&gt;
  
  
  3.1 Caching the Address: The TLB as a tiny, ultra-fast cache specifically for address translations
&lt;/h3&gt;

&lt;p&gt;The TLB (Translation Lookaside Buffer) is a &lt;strong&gt;small&lt;/strong&gt;, fast cache that stores recent translations of &lt;strong&gt;virtual addresses&lt;/strong&gt; to &lt;strong&gt;physical addresses&lt;/strong&gt;. It is designed to speed up the address translation process by reducing the number of memory accesses needed to translate a virtual address.&lt;/p&gt;

&lt;p&gt;When a process accesses a virtual address, the MMU first checks the TLB to see if the translation for that address is already cached. If it is, the MMU quickly retrieves the corresponding physical address from the TLB &lt;strong&gt;(Cache Hit)&lt;/strong&gt;, avoiding the page table walk entirely; if not, it has to walk the page table &lt;strong&gt;(Cache Miss)&lt;/strong&gt;.&lt;/p&gt;
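&lt;p&gt;To make the hit/miss mechanics concrete, here is a toy model of a fully associative TLB with LRU eviction. Real TLBs are fixed hardware structures, usually set-associative; this sketch only illustrates the bookkeeping:&lt;/p&gt;

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Toy model, not a real TLB: a tiny fully associative cache of page
// numbers with LRU eviction, counting hits and misses for an access trace.
struct ToyTLB {
    std::size_t capacity;
    std::deque<std::uint64_t> entries;  // front = most recently used
    std::size_t hits = 0, misses = 0;

    void access(std::uint64_t vaddr) {
        std::uint64_t page = vaddr >> 12;  // 4 KB pages
        auto it = std::find(entries.begin(), entries.end(), page);
        if (it != entries.end()) {
            ++hits;
            entries.erase(it);             // will re-insert at front (MRU)
        } else {
            ++misses;                      // would trigger a page-table walk
            if (entries.size() == capacity) entries.pop_back();  // evict LRU
        }
        entries.push_front(page);
    }
};
```

&lt;p&gt;Touching many distinct pages in quick succession evicts useful entries, which is exactly the thrashing pattern described next.&lt;/p&gt;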
&lt;h3&gt;
  
  
  3.2 TLB Thrashing: A Practical Example
&lt;/h3&gt;

&lt;p&gt;This is where theory meets practice. If you access memory in a pattern that constantly jumps to new pages, you will cause &lt;strong&gt;TLB Thrashing&lt;/strong&gt;. The TLB is small; if you touch too many pages too quickly, you evict useful entries.&lt;/p&gt;

&lt;p&gt;Consider iterating over a large 2D array:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Fast: Row-major access (Sequential)&lt;/span&gt;
&lt;span class="c1"&gt;// We access matrix[0][0], matrix[0][1], matrix[0][2]...&lt;/span&gt;
&lt;span class="c1"&gt;// These are all on the same page. One TLB miss per page (4096 bytes).&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Slow: Column-major access (Strided)&lt;/span&gt;
&lt;span class="c1"&gt;// We access matrix[0][0], matrix[1][0], matrix[2][0]...&lt;/span&gt;
&lt;span class="c1"&gt;// Each access jumps N * sizeof(int) bytes forward.&lt;/span&gt;
&lt;span class="c1"&gt;// We likely hit a NEW page every single time. High TLB miss rate!&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.3 The Context Switch Penalty: Why switching processes forces a TLB flush (and why it's expensive)
&lt;/h3&gt;

&lt;p&gt;We did not discuss context switching before, so let's define it first:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Switching:&lt;/strong&gt; is the process of saving the state of a currently running process and loading the state of another process to allow multiple processes to share a single CPU. This involves saving and restoring the CPU registers, program counter, and other process-specific information.&lt;/p&gt;

&lt;p&gt;When a context switch occurs, the &lt;strong&gt;TLB&lt;/strong&gt; must be flushed &lt;strong&gt;(cleared)&lt;/strong&gt; because the cached translations in the TLB are specific to the virtual address space of the currently running process. If the TLB were not flushed, the new process could potentially access incorrect physical addresses based on stale TLB entries from the &lt;strong&gt;previous&lt;/strong&gt; process, leading to data corruption or security vulnerabilities.&lt;/p&gt;

&lt;p&gt;Flushing the TLB is &lt;strong&gt;expensive&lt;/strong&gt; not because of the flush itself, but because the new process starts with a cold TLB: the &lt;strong&gt;MMU&lt;/strong&gt; must &lt;strong&gt;walk the page tables again on the first access to every page&lt;/strong&gt; the process touches, resulting in increased latency and reduced performance. This is particularly problematic in systems with frequent context switches, as the accumulated TLB misses can significantly impact overall system performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;THE OPTIMIZATION:&lt;/strong&gt; &lt;strong&gt;Modern Processors&lt;/strong&gt; and operating systems implement various techniques to mitigate the performance impact of TLB flushes during context switches. One common approach is to use &lt;strong&gt;Address Space Identifiers (ASIDs)&lt;/strong&gt; or &lt;strong&gt;Process Context Identifiers (PCIDs)&lt;/strong&gt;, which allow the TLB to retain entries for multiple processes simultaneously. This way, when a context switch occurs, the TLB does not need to be completely flushed; instead, only entries associated with the previous process are invalidated, while entries for other processes remain valid. This significantly reduces the overhead of context switches and improves overall system performance.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on Threads vs. Processes:&lt;/strong&gt;&lt;br&gt;
It is important to note that &lt;strong&gt;Threads&lt;/strong&gt; within the same process share the same Page Table (and thus the same TLB entries). Context switching between threads is much cheaper than switching between processes because the TLB does not need to be flushed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  4 Optimization: Making the TLB Bigger (Without Hardware Changes)
&lt;/h2&gt;

&lt;p&gt;To improve TLB performance without changing the hardware, operating systems can use techniques like &lt;strong&gt;huge pages&lt;/strong&gt; to increase the effective size of TLB entries.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 The Page Size Limit: Why 4KB pages clog up the TLB
&lt;/h3&gt;

&lt;p&gt;With the default page size of 4KB, each TLB entry covers only a tiny slice of the address space. A process that accesses a large amount of memory therefore needs far more translations than the TLB can hold, resulting in frequent TLB misses and increased latency due to page table walks.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Huge Pages (2MB/1GB): Increasing the range of a single TLB entry to reduce misses
&lt;/h3&gt;

&lt;p&gt;Huge pages are larger memory pages that can be used to reduce the number of TLB entries needed to cover a given address space. By using huge pages (e.g., 2MB or 1GB), a single TLB entry can cover a much larger portion of the address space, reducing the likelihood of TLB misses and improving performance.&lt;/p&gt;
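&lt;p&gt;A quick back-of-the-envelope sketch of "TLB reach" makes the benefit concrete. The entry count below (1536) is an assumption, roughly in line with a modern L2 TLB; the page sizes are the standard x86-64 ones:&lt;/p&gt;

```python
# Hypothetical TLB with 1536 entries (an illustrative assumption).
TLB_ENTRIES = 1536

KB = 1024
MB = 1024 * KB
GB = 1024 * MB

# Reach = how much address space the TLB covers before a miss is guaranteed.
reach_4k = TLB_ENTRIES * 4 * KB   # classic 4KB pages
reach_2m = TLB_ENTRIES * 2 * MB   # 2MB huge pages
reach_1g = TLB_ENTRIES * 1 * GB   # 1GB huge pages

print(reach_4k // MB, "MB")   # 6 MB
print(reach_2m // GB, "GB")   # 3 GB
print(reach_1g // GB, "GB")   # 1536 GB
```

&lt;p&gt;With 4KB pages the entire TLB covers only a few MB of address space, so a working set of tens of GB is guaranteed to thrash it; with 2MB huge pages the same TLB covers gigabytes.&lt;/p&gt;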

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw0ydiuiax12ry93tjyi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjw0ydiuiax12ry93tjyi.png" alt="Visual comparison of TLB Reach. Box A (4KB pages) covers small area. Box B (2MB pages) covers huge area." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem with Huge Pages:&lt;/strong&gt; While huge pages can improve TLB performance, they also come with some challenges. &lt;strong&gt;Allocating large contiguous blocks of physical memory can be difficult&lt;/strong&gt;, especially in systems with fragmented memory. Additionally, huge pages can increase memory usage: if a process does not fully utilize an allocated huge page, the unused remainder is wasted (&lt;strong&gt;internal fragmentation&lt;/strong&gt;).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Real World Use Case:&lt;/strong&gt;&lt;br&gt;
Database engines like &lt;strong&gt;PostgreSQL&lt;/strong&gt; or &lt;strong&gt;Oracle&lt;/strong&gt; often manage buffer pools (cached data) that are dozens of GBs in size. Mapping 64GB of RAM using 4KB pages would require millions of TLB entries, causing constant thrashing. Using Huge Pages makes this manageable and significantly improves database throughput.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  5 Conclusion: Respecting the Translation Layer
&lt;/h2&gt;

&lt;p&gt;Virtual memory and the associated translation mechanisms are fundamental to modern computing. Understanding how virtual addresses are translated to physical addresses, the role of the TLB, and optimization techniques like huge pages is crucial for developers aiming to write efficient software.&lt;/p&gt;

</description>
      <category>computerscience</category>
      <category>programming</category>
      <category>architecture</category>
      <category>books</category>
    </item>
    <item>
      <title>What Every Programmer Should Know About Memory Part 1</title>
      <dc:creator>Hamza Hasanain</dc:creator>
      <pubDate>Fri, 21 Nov 2025 13:18:51 +0000</pubDate>
      <link>https://dev.to/hamzahassanain0/what-every-programmer-should-know-about-memory-part-1-385e</link>
      <guid>https://dev.to/hamzahassanain0/what-every-programmer-should-know-about-memory-part-1-385e</guid>
      <description>&lt;p&gt;I recently came across an interesting paper titled&lt;br&gt;
&lt;a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf" rel="noopener noreferrer"&gt;What Every Programmer Should Know About Memory&lt;/a&gt; by Ulrich Drepper. The paper dives into the structure of memory subsystems in use on modern commodity hardware,and what programs should do to achieve optimal performance by utilizing them.&lt;/p&gt;

&lt;p&gt;What I will be doing is summarizing what I (as a semi-intelligent being) have learned from reading the paper. I highly recommend reading the paper itself; as the title says, it is what every programmer should know about memory.&lt;/p&gt;

&lt;p&gt;Needless to say, some parts of the paper were quite complex for my brain. I did my best to understand everything, but I might have missed some details. If you find any mistakes, please let me know!&lt;/p&gt;

&lt;h3&gt;
  
  
  What I Will Cover Here
&lt;/h3&gt;

&lt;p&gt;I just finished reading the first 3 sections of the paper, which cover the following topics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Basic Architecture of Modern Computers&lt;/li&gt;
&lt;li&gt;Main Memory&lt;/li&gt;
&lt;li&gt;Caches&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This post will be structured around these topics, so here is my table of contents:&lt;/p&gt;

&lt;h3&gt;
  
  
  Table of Contents
&lt;/h3&gt;

&lt;p&gt;1- Introduction: The Lie of the Flat Array&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;1.1. The O(1) Myth of Pointer Access&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;1.2. The Latency Numbers (Approximate)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2- The Hardware Reality (RAM Physics)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;2.1. SRAM vs. DRAM: Why Main Memory Uses Leaky Capacitors&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;2.2. The Refresh Tax: Why Execution Stalls for Memory Maintenance&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;2.3. Address Multiplexing: Sharing Pins to Save Money (and Costing Time)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;2.4. The Latency Chain: The Precharge → RAS → CAS Protocol&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;2.5. Burst Mode: Why We Never Read Just One Byte&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3- The Caching Solution (CPU Caches)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;3.1. The Hierarchy: L1 (Brain), L2 (Buffer), L3 (Bridge)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;3.2. Spatial Locality: How Cache Lines (64 Bytes) Hide Latency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;3.3. Associativity: The Parking Lot Problem&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;3.4. Write Policies: Write-Through, Write-Back, Dirty Bits, Lazy Eviction&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;3.5. Multi-Core Complexity: MESI Protocol and False Sharing&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;4- Programmer Takeaways&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;4.1. Data Placement: Why std::vector Beats std::list&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;4.2. The Double Indirection Trap&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;4.3. The Tetris Game: Struct Packing and Alignment&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;4.4. Spatial Locality: Hot/Cold Data Splitting&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;4.5. Data Oriented Design: AoS (Array of Struct) vs. SoA (Struct of Arrays)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;4.6. Hardware Topology: NUMA &amp;amp; Context Switching&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;5- Conclusion&lt;/p&gt;




&lt;h2&gt;
  
  
  Introduction: The Lie of the Flat Array
&lt;/h2&gt;

&lt;p&gt;The Flat Memory Model (also known as the Linear Memory Model) is one of the most fundamental lies operating systems tell programmers. It is an abstraction that presents memory to your program as a single, contiguous tape of bytes, addressable from &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;2^N - 1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This abstraction is crucial for programming, as it allows us to think of memory in simple terms, using pointers and offsets to access data. However, in reality, memory is far from flat. It is a complex hierarchy of storage types, each with different speeds, sizes, and costs.&lt;/p&gt;

&lt;p&gt;You might ask the OS for a &lt;code&gt;2GB&lt;/code&gt; contiguous block for a game engine or database. The limitation: the OS might not have &lt;code&gt;2GB&lt;/code&gt; of physically contiguous RAM. It might have &lt;code&gt;2GB&lt;/code&gt; free, but scattered in &lt;code&gt;4KB&lt;/code&gt; chunks all over the physical chips.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reality&lt;/strong&gt;: The OS uses &lt;strong&gt;Virtual Memory Paging&lt;/strong&gt; to map your flat virtual addresses to scattered physical addresses. The abstraction holds, but the OS has to work hard (using Page Tables and the TLB - Translation Lookaside Buffer) to maintain the illusion. If you access memory too sporadically, you thrash the TLB, causing performance degradation.&lt;/p&gt;

&lt;p&gt;This will be discussed in more detail in upcoming posts (where we will cover Virtual Memory and NUMA support), but for now, just remember: Memory is not flat. Access times vary wildly depending on where your data resides in the memory hierarchy.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 The O(1) Myth of Pointer Access
&lt;/h3&gt;

&lt;p&gt;When talking about pointer access, most programmers assume that dereferencing a pointer is an &lt;code&gt;O(1)&lt;/code&gt; operation - meaning it takes a constant amount of time regardless of the size of the dataset or where the data is located. However, in modern systems, that constant time can vary by a factor of &lt;code&gt;1,000,000&lt;/code&gt; depending on where the data physically lives.&lt;/p&gt;

&lt;p&gt;If your data is in the L1 Cache (closest to the CPU core), the dereference is nearly unnoticeable. If it is in Main RAM, the CPU must stall and wait. If it is swapped out to the Disk, the CPU could execute &lt;code&gt;millions&lt;/code&gt; of instructions in the time it takes to fetch that one value.&lt;/p&gt;
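&lt;p&gt;You can observe this effect even from a high-level language. The sketch below walks the same array twice, once in address order and once in a shuffled order; both walks do identical work at the algorithmic level, yet the shuffled walk defeats spatial locality and hardware prefetching (the exact slowdown varies by machine and is partly masked by interpreter overhead):&lt;/p&gt;

```python
import array
import random
import timeit

N = 1_000_000
data = array.array("q", range(N))   # contiguous 8-byte integers

seq_order = list(range(N))
rnd_order = seq_order[:]
random.shuffle(rnd_order)

def walk(order):
    total = 0
    for i in order:
        total += data[i]
    return total

# Same elements, same O(n) algorithm, same result...
assert walk(seq_order) == walk(rnd_order)

# ...but different memory access patterns, hence different wall time.
t_seq = timeit.timeit(lambda: walk(seq_order), number=3)
t_rnd = timeit.timeit(lambda: walk(rnd_order), number=3)
print(f"sequential: {t_seq:.3f}s  shuffled: {t_rnd:.3f}s")
```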

&lt;h3&gt;
  
  
  1.2 The Latency Numbers (Approximate)
&lt;/h3&gt;

&lt;p&gt;To put this in perspective, let's look at the cost in CPU cycles:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Location of Data&lt;/th&gt;
&lt;th&gt;Approximate Latency&lt;/th&gt;
&lt;th&gt;Simple Analogy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;L1 Cache&lt;/td&gt;
&lt;td&gt;3−4 cycles&lt;/td&gt;
&lt;td&gt;Grabbing a pen from your desk.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L2 Cache&lt;/td&gt;
&lt;td&gt;10−12 cycles&lt;/td&gt;
&lt;td&gt;Picking a book off a nearby shelf.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L3 Cache&lt;/td&gt;
&lt;td&gt;30−70 cycles&lt;/td&gt;
&lt;td&gt;Walking to a colleague's desk.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Main RAM&lt;/td&gt;
&lt;td&gt;100−300 cycles&lt;/td&gt;
&lt;td&gt;Walking to the coffee machine down the hall.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSD/NVMe&lt;/td&gt;
&lt;td&gt;10,000+ cycles&lt;/td&gt;
&lt;td&gt;Driving to the supermarket.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HDD (Page Fault)&lt;/td&gt;
&lt;td&gt;10,000,000+ cycles&lt;/td&gt;
&lt;td&gt;Flying to the moon.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The Hardware Reality (RAM Physics)
&lt;/h2&gt;

&lt;p&gt;If you have ever come across the terms &lt;strong&gt;CPU Caches&lt;/strong&gt; and &lt;strong&gt;RAM&lt;/strong&gt; and wondered what they are, why they are different, and why we don't just use one type of memory, this section will give you a basic understanding of how modern memory systems are structured.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU Caches&lt;/strong&gt; are referred to as fast memory because they are built using &lt;strong&gt;SRAM&lt;/strong&gt; (Static RAM) technology, while the &lt;strong&gt;Main RAM&lt;/strong&gt; is built using &lt;strong&gt;DRAM&lt;/strong&gt; (Dynamic RAM) technology. The two have different trade-offs in terms of speed, cost, and density.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 SRAM vs. DRAM: Why Main Memory Uses Leaky Capacitors
&lt;/h3&gt;

&lt;p&gt;Let's start with SRAM and DRAM and see how they are built:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SRAM&lt;/strong&gt; uses a set of transistors to store each bit of data. A typical SRAM cell uses &lt;code&gt;6 transistors&lt;/code&gt; to store a single bit, forming a flip-flop circuit that can hold its state as long as power is supplied. This design allows for very fast access times (on the order of nanoseconds) because the data can be read or written directly without any additional steps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mexghbm6r3rgk3zvt4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4mexghbm6r3rgk3zvt4n.png" alt="SRAM Cell Diagram" width="755" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DRAM&lt;/strong&gt;, on the other hand, uses a single transistor and a capacitor to store each bit of data. The capacitor holds an electrical charge to represent a &lt;code&gt;1&lt;/code&gt; and no charge to represent a &lt;code&gt;0&lt;/code&gt;. However, capacitors &lt;code&gt;leak charge&lt;/code&gt; over time, so the data must be periodically refreshed (every few milliseconds) to maintain its integrity. This refresh process introduces latency and complexity but allows DRAM to be much denser and cheaper than SRAM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn13nbuuaygdvugezciov.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn13nbuuaygdvugezciov.jpeg" alt="DRAM Cell Diagram" width="554" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://passive-components.eu/capacitors-insulation-resistance/" rel="noopener noreferrer"&gt;Check this article if you want to know why real-world capacitors are not perfect insulators&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 The Refresh Tax: Why Execution Stalls for Memory Maintenance
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, the problem with DRAM is that capacitors are imperfect: they leak electrons and lose their charge over time. To prevent data loss, the MC (Memory Controller) must read every single row of memory and write it back (recharge it) before the data fades away.&lt;/p&gt;

&lt;p&gt;This maintenance operation is called a Refresh Cycle. During a refresh cycle, the memory controller temporarily &lt;strong&gt;halts&lt;/strong&gt; normal memory operations to perform the refresh. This can lead to delays in servicing memory requests from the CPU, causing stalls in execution.&lt;/p&gt;

&lt;p&gt;Keep in mind that it does not &lt;strong&gt;halt&lt;/strong&gt; the entire RAM chip at once; instead, it refreshes rows sequentially. However, the refresh operation occupies the &lt;strong&gt;Bank's&lt;/strong&gt; sense amplifiers. Therefore, if the CPU requests data from ANY row within that specific Bank (not just the row being refreshed), it must wait until the Bank becomes available again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a Bank?&lt;/strong&gt; A Bank is a subdivision of the DRAM chip that can be accessed independently. Each Bank has its own &lt;strong&gt;sense amplifiers&lt;/strong&gt; and can be refreshed or accessed separately from other Banks. This allows for some level of parallelism and reduces the impact of refresh cycles on overall memory access latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What are sense amplifiers?&lt;/strong&gt; Sense amplifiers are specialized circuits within the DRAM that detect and amplify the small voltage changes on the bit lines during read operations. They are crucial for accurately reading the data stored in the capacitors of the DRAM cells.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaivxp2keoo12dy890q4.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftaivxp2keoo12dy890q4.webp" alt="Actual DRAM" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2.3 Address Multiplexing: Sharing Pins to Save Money (and Costing Time)
&lt;/h3&gt;

&lt;p&gt;Before we dive into the latency chain, why it exists, and how it works, we need to understand why DRAM chips use Address Multiplexing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Address Multiplexing&lt;/strong&gt; is a technique used in DRAM to reduce the number of pins required on the chip package. Instead of having separate pins for each bit of the address, the address is sent in two parts: the Row Address and the Column Address. This allows the same set of pins to be reused for both parts of the address, effectively halving the number of pins needed (while simultaneously increasing the time it takes to access data).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why is this important?&lt;/strong&gt; The number of pins on a chip package directly impacts its cost and complexity. By using address multiplexing, manufacturers can produce DRAM chips that are more affordable and easier to integrate into systems.&lt;/p&gt;
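&lt;p&gt;The splitting itself is simple arithmetic. A minimal sketch, assuming a made-up geometry of 8192 columns per row:&lt;/p&gt;

```python
# Hypothetical DRAM geometry: 8192 columns per row (illustrative only).
NUM_COLS = 8192

def split_address(cell_index):
    # The quotient selects the row, the remainder selects the column.
    # These two halves are what travel over the shared pins, one after
    # the other (row part first, then column part).
    return divmod(cell_index, NUM_COLS)

def join_address(row, col):
    return row * NUM_COLS + col

row, col = split_address(1_000_000)
print(row, col)   # 122 576
assert join_address(row, col) == 1_000_000
```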

&lt;h3&gt;
  
  
  2.4 The Latency Chain: The Precharge → RAS → CAS Protocol
&lt;/h3&gt;

&lt;p&gt;To understand why memory latency exists, you have to stop thinking of RAM as a magic bucket and start thinking of it as a physical Matrix (a grid of rows and columns).&lt;/p&gt;

&lt;p&gt;To read a single byte of data, the Memory Controller cannot just say "Give me index 400". It has to manipulate the physical grid using a strict three-step protocol. This sequence is determined by the physical construction of the DRAM chip and the need to minimize the number of pins on the chip package.&lt;/p&gt;

&lt;p&gt;Imagine a massive warehouse &lt;strong&gt;(the DRAM Bank)&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Cells&lt;/strong&gt;: The data lives in millions of tiny boxes (capacitors) arranged in Rows and Columns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Row Buffer (Sense Amps)&lt;/strong&gt;: There is a single loading dock (the Row Buffer) where a full row of boxes must be placed before any individual box can be read.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Rule:&lt;/strong&gt; You cannot read a box while it is on the shelf. You must first move the &lt;strong&gt;entire&lt;/strong&gt; row of boxes to the loading dock.&lt;/p&gt;

&lt;p&gt;As we said before, when the CPU asks for a memory address, the controller breaks it down into a Row Address and a Column Address.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Precharge&lt;/strong&gt;: If there is already a row loaded in the Row Buffer, it must be precharged (written back to the shelf) before loading a new row. This step ensures that the Row Buffer is ready for the next operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: RAS (Row Address Strobe)&lt;/strong&gt;: The controller sends the Row Address to the DRAM chip, which activates the corresponding row and loads it into the Row Buffer (sense amplifiers). This step is crucial because it prepares the data for access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: CAS (Column Address Strobe)&lt;/strong&gt;: Finally, the controller sends the Column Address to select the specific byte within the loaded row. The data is then read from the Row Buffer and sent back to the CPU.&lt;/p&gt;
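&lt;p&gt;The three steps above can be captured in a toy cost model. The cycle counts below are made up for illustration (real tRP/tRCD/tCL values come from the DRAM datasheet), but the logic mirrors the protocol: a row hit pays only CAS, while a row miss pays Precharge + RAS + CAS:&lt;/p&gt;

```python
# Toy DRAM bank: one open row at a time, illustrative latencies.
COLS_PER_ROW = 1024
T_PRECHARGE, T_RAS, T_CAS = 15, 15, 15   # made-up cycle counts

open_row = None

def access(cell_index):
    global open_row
    row, _col = divmod(cell_index, COLS_PER_ROW)
    if row == open_row:
        return T_CAS                 # row hit: data already in the row buffer
    cost = T_RAS + T_CAS             # row miss: activate the new row, then select
    if open_row is not None:
        cost += T_PRECHARGE          # write the old row back to the shelf first
    open_row = row
    return cost

# Sequential accesses mostly hit the open row...
seq_cost = sum(access(i) for i in range(2048))

# ...while jumping to a new row every time pays the full chain each access.
open_row = None
strided_cost = sum(access(i * COLS_PER_ROW) for i in range(2048))
print(seq_cost, strided_cost)   # 30765 92145
```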

&lt;h3&gt;
  
  
  2.5 Burst Mode: Why We Never Read Just One Byte
&lt;/h3&gt;

&lt;p&gt;We have established that accessing a single byte from DRAM involves a multi-step process that carries a significant latency tax. If we paid that tax every time we wanted a single byte (8 bits), our computers would be extraordinarily slow. So engineers came up with a clever solution called &lt;strong&gt;Burst Mode&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Burst Mode&lt;/strong&gt; allows the memory controller to read or write multiple consecutive bytes in a single operation after the initial access. Instead of fetching just one byte, the controller fetches a block of data (typically &lt;code&gt;64&lt;/code&gt; bytes) in one go. For simplicity: if you have an array &lt;code&gt;arr[0..63]&lt;/code&gt; of &lt;code&gt;32-bit&lt;/code&gt; &lt;code&gt;(4-byte)&lt;/code&gt; integers and you request &lt;code&gt;arr[0]&lt;/code&gt;, the memory controller will fetch &lt;code&gt;arr[0]&lt;/code&gt; through &lt;code&gt;arr[15]&lt;/code&gt; in one operation, because they are all located in the same row and can be accessed sequentially.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why 64 bytes?&lt;/strong&gt; See section 3.2 for details.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Caching Solution (CPU Caches)
&lt;/h2&gt;

&lt;p&gt;Time for a quick history lesson.&lt;/p&gt;

&lt;p&gt;In the early days of computers (1940s-1970s), CPUs and RAM were about the same speed. The CPU could ask for data and get it right away, so there was no waiting. Life was simple, and there was no need for a cache because the CPU wasn't sitting around with nothing to do.&lt;/p&gt;

&lt;p&gt;But in the 1980s, things changed. CPUs started getting much, much faster every year, while RAM speed only improved a little. This created a huge speed difference, known as the &lt;strong&gt;Memory Wall.&lt;/strong&gt; The super-fast CPU now had to spend most of its time waiting for the slower RAM to deliver data, like a sports car stuck in a traffic jam.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9a6dvd5njsjz2l8uht7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn9a6dvd5njsjz2l8uht7.png" alt="The Memory Wall" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To solve this problem, engineers invented the cache. A cache is a small, extremely fast memory that lives right next to the CPU. It started with big, expensive computers in the 60s. By 1989, the Intel 486 brought a small L1 cache to personal computers. As the speed gap grew, we added a bigger, slightly slower L2 cache, and later an even bigger L3 cache for multiple CPU cores to share. The idea is to keep the most frequently used data in the fastest memory, so the CPU can keep working instead of waiting.&lt;/p&gt;

&lt;p&gt;The next few sections will explain how this cache system works.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 The Hierarchy: L1 (Brain), L2 (Buffer), L3 (Bridge)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2u3ng0u15s171ampb7y.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2u3ng0u15s171ampb7y.webp" alt="CPU Cache Hierarchy" width="602" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, if you take a look at the image, you'll notice there are 2 parts of the L1 cache: D-Cache (Data Cache) and I-Cache (Instruction Cache), while the L2 and L3 caches are unified (they store both instructions and data). The split L1 follows a &lt;strong&gt;modified Harvard Architecture&lt;/strong&gt;, which separates instructions and data so the CPU can fetch both in parallel, improving performance.&lt;/p&gt;

&lt;p&gt;The cache hierarchy is designed to balance speed, size, and cost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;L1 Cache&lt;/strong&gt;: This is the smallest and fastest cache, located directly on the CPU core. It typically ranges from &lt;code&gt;16KB&lt;/code&gt; to &lt;code&gt;64KB&lt;/code&gt; in size and has the lowest latency (around &lt;code&gt;3-4 cycles&lt;/code&gt;). The L1 cache is split into two parts: one for instructions (I-Cache) and one for data (D-Cache). Its primary role is to provide the CPU with the most frequently accessed data and instructions as quickly as possible.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;L2 Cache&lt;/strong&gt;: This cache is larger than L1, typically ranging from &lt;code&gt;256KB&lt;/code&gt; to &lt;code&gt;1MB&lt;/code&gt;, and is still located on the CPU core. It has slightly higher latency (around &lt;code&gt;10-12 cycles&lt;/code&gt;) but can store more data. The L2 cache acts as a buffer between the fast L1 cache and the slower L3 cache or main memory, holding data that is not as frequently accessed as that in L1 but still needs to be retrieved quickly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;L3 Cache&lt;/strong&gt;: This is the largest and slowest cache in the hierarchy, often ranging from a few &lt;code&gt;MBs&lt;/code&gt; to over a hundred &lt;code&gt;MBs&lt;/code&gt;. It is usually shared among multiple CPU cores and has higher latency (around &lt;code&gt;30-70 cycles&lt;/code&gt;). The L3 cache serves as a bridge between the CPU cores and the main memory, storing data that is less frequently accessed but still benefits from being cached. It only exists due to the terrifying fact that main memory is so slow compared to the CPU.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.2 Spatial Locality: How Cache Lines (64 Bytes) Hide Latency
&lt;/h3&gt;

&lt;p&gt;As we discussed in the section Burst Mode: Why We Never Read Just One Byte, when the CPU requests data from memory, it doesn't just fetch a single byte. Instead, it fetches a block of data known as a &lt;strong&gt;cache line&lt;/strong&gt;. In modern systems, a cache line is typically &lt;code&gt;64 bytes&lt;/code&gt; in size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why 64 bytes?&lt;/strong&gt; This size matches what the memory system can deliver efficiently in one burst: on a typical 64-bit (8-byte) memory bus with a DRAM burst length of 8 transfers, a single burst delivers exactly 8 x 8 = 64 bytes. By fetching data in blocks of that size, the system can take advantage of spatial locality, reducing the number of memory accesses required for sequential data access patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache Lines&lt;/strong&gt;: A cache line is the smallest unit of data that can be transferred between the CPU cache and main memory. Modern CPUs typically use a cache line size of &lt;code&gt;64 bytes&lt;/code&gt;. When the CPU requests data from memory, it fetches an entire cache line, even if only a small portion of that data is needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spatial locality&lt;/strong&gt;: Spatial locality refers to the tendency of programs to access data locations that are close to each other within a short time frame. When a program accesses a particular memory address, it is likely to access nearby addresses soon after. By fetching data in blocks (cache lines), the system can take advantage of this behavior, reducing the number of memory accesses and improving overall performance.&lt;/p&gt;
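&lt;p&gt;A small sketch of the arithmetic: which cache line an array element lands on, assuming 8-byte elements stored contiguously and 64-byte lines (typical values, but assumptions here):&lt;/p&gt;

```python
LINE_SIZE = 64   # bytes per cache line (typical on x86/ARM)
ELEM_SIZE = 8    # e.g. a 64-bit integer or a pointer

def line_of(index, base_addr=0):
    # Byte address of the element, then integer-divide by the line size.
    byte_addr = base_addr + index * ELEM_SIZE
    return byte_addr // LINE_SIZE

# Elements 0..7 share line 0: the first read pays the full RAM latency,
# and the next seven reads of that line are nearly free.
print([line_of(i) for i in range(10)])   # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
```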

&lt;h3&gt;
  
  
  3.3 Associativity: The Parking Lot Problem
&lt;/h3&gt;

&lt;p&gt;We are talking about caches, and they are fast and all that, but they are also small, so you need to have a strategy to decide where to put data when it comes into the cache, and where to find it when you need it again.&lt;/p&gt;

&lt;p&gt;Let's discuss 3 strategies for organizing data in the cache, known as &lt;strong&gt;cache associativity&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Direct-Mapped Cache&lt;/strong&gt;: In a direct-mapped cache, each block of main memory maps to exactly one location in the cache. This is like having a parking lot where each car has a designated parking spot. If two cars (memory blocks) want to park in the same spot, one has to leave (be evicted). This method is simple and fast but can lead to many conflicts if multiple frequently accessed memory blocks map to the same cache line.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fully Associative Cache&lt;/strong&gt;: In a fully associative cache, any block of main memory can be stored in any location in the cache. This is like having a parking lot where cars can park anywhere. This method minimizes conflicts but requires more complex hardware to search the entire cache for a block, which can slow down access times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Set-Associative Cache&lt;/strong&gt;: This is a compromise between direct-mapped and fully associative caches. The cache is divided into several sets, and each block of main memory maps to a specific set but can be stored in any location within that set. This is like having a parking lot divided into sections, where cars can park anywhere within their designated section. This method balances the speed of direct-mapped caches with the flexibility of fully associative caches.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
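&lt;p&gt;The "section" an address must park in is computed with simple modular arithmetic. A sketch using a geometry similar to a common L1 data cache (32KB, 8-way, 64-byte lines; the specific numbers are assumptions):&lt;/p&gt;

```python
CACHE_SIZE = 32 * 1024   # 32KB total
LINE_SIZE = 64           # bytes per line
WAYS = 8                 # lines (parking spots) per set
NUM_SETS = CACHE_SIZE // (LINE_SIZE * WAYS)   # 64 sets

def set_and_tag(addr):
    line = addr // LINE_SIZE       # which line-sized block the byte is in
    set_index = line % NUM_SETS    # the "parking section" for this block
    tag = line // NUM_SETS         # identifies the block within its set
    return set_index, tag

# Addresses exactly NUM_SETS * LINE_SIZE bytes apart all fight over the
# same set; with more than WAYS such hot blocks, they evict each other.
stride = NUM_SETS * LINE_SIZE      # 4096 bytes here
print([set_and_tag(i * stride)[0] for i in range(10)])   # all map to set 0
```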

&lt;h3&gt;
  
  
  3.4 Write Policies: Write-Through, Write-Back, Dirty Bits, Lazy Eviction
&lt;/h3&gt;

&lt;p&gt;We have established that reading from RAM is slow. Writing to RAM is just as slow.&lt;/p&gt;

&lt;p&gt;If your program executes a loop that increments a counter &lt;code&gt;i++&lt;/code&gt; one million times, and every single increment forces a write to physical RAM, your CPU will spend &lt;code&gt;99.9%&lt;/code&gt; of its time waiting for the memory bus.&lt;/p&gt;

&lt;p&gt;To solve this, hardware engineers created two main policies for handling writes: Write-Through (safe but slow) and Write-Back (complex but fast).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write-Through&lt;/strong&gt;: In a write-through cache, every time the CPU writes data to the cache, it also immediately writes that data to the main memory. This ensures that the main memory always has the most up-to-date data, which is important for data integrity. However, this approach can be slow because every write operation incurs the latency of writing to main memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write-Back&lt;/strong&gt;: In a write-back cache, when the CPU writes data to the cache, it does not immediately write that data to main memory. Instead, it marks the cache line as &lt;strong&gt;dirty,&lt;/strong&gt; indicating that it has been modified. The data is only written back to main memory when the cache line is &lt;strong&gt;evicted (replaced)&lt;/strong&gt; or when certain conditions are met &lt;strong&gt;(like a flush operation)&lt;/strong&gt;. This approach improves performance by reducing the number of write operations to main memory, but it introduces complexity in managing dirty cache lines and ensuring data consistency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Dirty Bit&lt;/strong&gt;: The dirty bit is a flag associated with each cache line that indicates whether the data in that cache line has been modified (written to) since it was loaded from main memory. If the &lt;strong&gt;dirty&lt;/strong&gt; bit is set, it means the cache line contains data that is different from what is in main memory, and it must be written back to main memory before being evicted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Eviction Policy&lt;/strong&gt;: When the cache is full and a new block of data needs to be loaded, the cache must evict (remove) an existing block to make room. In a &lt;strong&gt;write-back&lt;/strong&gt; cache, if the evicted block is marked as dirty, the cache must first write the modified data back to main memory before loading the new block. This process is known as &lt;strong&gt;lazy eviction&lt;/strong&gt; because the write-back to main memory is deferred until eviction, rather than occurring immediately on every write.&lt;/p&gt;

&lt;p&gt;As you might guess, the write-back policy is generally preferred in modern CPUs because of its performance advantages, despite the added complexity of managing dirty cache lines. This deferred write-back introduces the problem of cache coherence in multi-core systems, which we will discuss next.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.5 Multi-Core Complexity: MESI Protocol and False Sharing
&lt;/h3&gt;

&lt;p&gt;The problem with &lt;strong&gt;Lazy Eviction&lt;/strong&gt; (Write-Back) in a multi-core system is coherence. If &lt;code&gt;Core 1&lt;/code&gt; holds a dirty version of a variable and &lt;code&gt;Core 2&lt;/code&gt; tries to read it from RAM, &lt;code&gt;Core 2&lt;/code&gt; will read stale data.&lt;/p&gt;

&lt;p&gt;To solve this, hardware engineers implemented a &lt;strong&gt;Social Contract&lt;/strong&gt; between cores. They don't just talk to RAM; they talk to each other. The most common standard for this negotiation is the &lt;strong&gt;MESI&lt;/strong&gt; Protocol.&lt;/p&gt;

&lt;p&gt;However, this strict protocol has a nasty side effect called &lt;strong&gt;False Sharing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Under &lt;strong&gt;MESI&lt;/strong&gt;, every Cache Line (that &lt;code&gt;64-byte&lt;/code&gt; chunk) has a &lt;code&gt;2-bit&lt;/code&gt; state tag attached to it. These bits tell the core what rights it has over that data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Modified (M)&lt;/strong&gt;: The cache line is dirty (modified) and is the only valid copy. Other caches do not have this data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exclusive (E)&lt;/strong&gt;: The cache line is clean (not modified) and is the only valid copy. Other caches do not have this data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shared (S)&lt;/strong&gt;: The cache line is clean and may be present in other caches. Multiple caches can read this data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invalid (I)&lt;/strong&gt;: The cache line is not valid. It may have been modified by another core or is not present in this cache.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that we know the states and what they mean, let's see how two cores interact when accessing shared data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Snooping Bus&lt;/strong&gt;: All cores are connected to a shared communication channel called the snooping bus. Whenever a core wants to read or write data, it broadcasts its intention on this bus. Other cores listen (snoop) to these broadcasts and respond accordingly to maintain coherence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh4.googleusercontent.com%2Fproxy%2Foaf-VXCoS4pD2XmDrflqeUmeStrn3Vm-HxtyCLplAXXi-eqY-LOPTEpLoGEwsIha5gVZvt-yQ0cruiv_aF4Emt5lX0_49R93RT6BugCwjf9QeuBzTsCeC9gY67DjGCCtUg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh4.googleusercontent.com%2Fproxy%2Foaf-VXCoS4pD2XmDrflqeUmeStrn3Vm-HxtyCLplAXXi-eqY-LOPTEpLoGEwsIha5gVZvt-yQ0cruiv_aF4Emt5lX0_49R93RT6BugCwjf9QeuBzTsCeC9gY67DjGCCtUg" alt="The Snooping Bus" width="512" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simple example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Core 1&lt;/code&gt; wants to write to a variable &lt;code&gt;X&lt;/code&gt;. It checks its cache and finds that &lt;code&gt;X&lt;/code&gt; is in the &lt;code&gt;Shared (S)&lt;/code&gt; state. To modify it, &lt;code&gt;Core 1&lt;/code&gt; must first broadcast an &lt;strong&gt;Invalidate&lt;/strong&gt; message on the snooping bus, telling all other cores to mark their copies of &lt;code&gt;X&lt;/code&gt; as &lt;code&gt;Invalid (I)&lt;/code&gt;. Once all other cores acknowledge the invalidation, &lt;code&gt;Core 1&lt;/code&gt; can change the state of &lt;code&gt;X&lt;/code&gt; to &lt;code&gt;Modified (M)&lt;/code&gt; and proceed with the write.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that in the example we say it checks its cache and finds that &lt;code&gt;X&lt;/code&gt;... This is not entirely accurate: it does not find only &lt;code&gt;X&lt;/code&gt;, it finds the entire cache line that contains &lt;code&gt;X&lt;/code&gt;. This is where &lt;strong&gt;False Sharing&lt;/strong&gt; comes into play, and it is a performance disaster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;False Sharing&lt;/strong&gt; occurs when two or more cores modify different variables that happen to reside in the same cache line. Even though the variables are independent, the MESI protocol forces the cores to repeatedly invalidate each other's copies of that line. This is widely known as &lt;strong&gt;Cache Line Ping-Pong&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlfuyx6yf492lhdxi7lf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlfuyx6yf492lhdxi7lf.png" alt="Cash Line Ping-Pong" width="557" height="603"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Programmer Takeaways
&lt;/h2&gt;

&lt;p&gt;Here is where the real fun begins. Now that we understand how memory works under the hood, let's discuss some practical takeaways for programmers to optimize their code for better performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 Data Placement: Why std::vector Beats std::list
&lt;/h3&gt;

&lt;p&gt;This is the &lt;strong&gt;Hello World&lt;/strong&gt; of memory optimization. It teaches the fundamental rule: Linked Lists are cache poison.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; A linked list scatters nodes across the heap (0x1000, 0x8004, 0x200). The CPU cannot predict the next address, breaking the Hardware Prefetcher. You pay the full RAM latency tax for every node.&lt;/p&gt;

&lt;p&gt;In contrast, &lt;code&gt;std::vector&lt;/code&gt; stores elements contiguously in memory (0x1000, 0x1004, 0x1008). Accessing one element brings the next few into the cache line, leveraging spatial locality and prefetching. This drastically reduces cache misses and improves performance.&lt;/p&gt;

&lt;p&gt;Needless to say, prefer &lt;code&gt;std::vector&lt;/code&gt; over &lt;code&gt;std::list&lt;/code&gt; for performance-critical code unless you have a specific reason to use a linked list (like frequent insertions/deletions in the middle of the list).&lt;/p&gt;

&lt;h4&gt;
  
  
  Bad Code Example: Using std::list
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="nf"&gt;sum_list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;list&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;l&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Good Code Example: Using std::vector
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="nf"&gt;sum_vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="kt"&gt;long&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.2 The Double Indirection Trap: &lt;code&gt;std::vector&amp;lt;std::vector&amp;lt;T&amp;gt;&amp;gt;&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Developers often use &lt;code&gt;std::vector&amp;lt;std::vector&amp;lt;int&amp;gt;&amp;gt;&lt;/code&gt; for grids. This is a pointer to an array of pointers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; To access &lt;code&gt;grid[i][j]&lt;/code&gt;, the CPU must fetch grid &lt;code&gt;-&amp;gt;&lt;/code&gt; fetch pointer at &lt;code&gt;grid[i]&lt;/code&gt; (cache miss 1) &lt;code&gt;-&amp;gt;&lt;/code&gt; fetch data at &lt;code&gt;[j]&lt;/code&gt; (cache miss 2). Rows are not contiguous in physical memory.&lt;/p&gt;

&lt;p&gt;To solve this, we use a clever trick: flatten the 2D structure into a 1D vector.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bad Code Example: Using &lt;code&gt;std::vector&amp;lt;std::vector&amp;lt;T&amp;gt;&amp;gt;&lt;/code&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;
&lt;span class="cm"&gt;/*
[Row 0 Data ...] -&amp;gt; 0x1000
[Row 1 Data ...] -&amp;gt; 0x8004
[Row 2 Data ...] -&amp;gt; 0x2000
[Row 3 Data ...] -&amp;gt; 0x4008

grid = [0x1000, 0x8004, 0x2000, 0x4008]
*/&lt;/span&gt;


&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c1"&gt;// Double indirection, two cache misses&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Good Code Example: Flattening the 2D Structure
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="n"&gt;pair&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;to_2d&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// [Row 1 Data... | Row 2 Data... | Row 3 Data...] (Contiguous)&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;)];&lt;/span&gt; &lt;span class="c1"&gt;// Single access, better cache locality&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.3 The Tetris Game: Struct Packing and Alignment
&lt;/h3&gt;

&lt;p&gt;The compiler aligns data to memory boundaries. If you order your variables poorly, you create holes (padding) in your cache lines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does the compiler add padding?&lt;/strong&gt; To ensure that data types are aligned to their natural boundaries (e.g., &lt;code&gt;4-byte&lt;/code&gt; integers on &lt;code&gt;4-byte&lt;/code&gt; boundaries). Misaligned accesses can be slower or even cause hardware exceptions on some architectures.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bad Code Example: Poorly Ordered Struct
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;Bad&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// 1 byte&lt;/span&gt;
    &lt;span class="c1"&gt;// 7 bytes padding&lt;/span&gt;
    &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;// 8 bytes&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// 4 bytes&lt;/span&gt;
    &lt;span class="c1"&gt;// 4 bytes padding&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="c1"&gt;// Size: 24 bytes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Good Code Example: Well-Ordered Struct
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;Good&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;// 8 bytes&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// 4 bytes&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// 1 byte&lt;/span&gt;
    &lt;span class="c1"&gt;// 3 bytes padding&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="c1"&gt;// Size: 16 bytes (no padding between members)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.4 Spatial Locality: Hot/Cold Data Splitting
&lt;/h3&gt;

&lt;p&gt;Objects often contain data we check frequently (ID, Health) and data we rarely check (Name, Biography).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; If a struct is &lt;code&gt;200 bytes&lt;/code&gt; (mostly text strings), each struct spans about four &lt;code&gt;64-byte&lt;/code&gt; cache lines. Iterating over them fills the cache with Cold text data you aren't reading, flushing out useful data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What to do&lt;/strong&gt;: Move rare data to a separate pointer or array.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bad Code Example: Mixed Hot/Cold Data
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;User&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;              &lt;span class="c1"&gt;// HOT&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;         &lt;span class="c1"&gt;// HOT&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;  &lt;span class="c1"&gt;// COLD (Pollutes cache)&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;bio&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;       &lt;span class="c1"&gt;// COLD (Pollutes cache)&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Good Code Example: Split Hot/Cold Data
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;UserHot&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;UserCold&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;coldData&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// Pointer to cold data&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;UserCold&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="kt"&gt;char&lt;/span&gt; &lt;span class="n"&gt;bio&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.5 Data Oriented Design: AoS (Array of Struct) vs. SoA (Struct of Arrays)
&lt;/h3&gt;

&lt;p&gt;Imagine you are building a game with thousands of entities, each with position and color.&lt;/p&gt;

&lt;p&gt;How do you store them?&lt;/p&gt;

&lt;h4&gt;
  
  
  4.5.1 Array of Structs (AoS)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;Entity&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// Position&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;// Color&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem with such a design is that when you want to update positions, you load entire cache lines with color data you don't need.&lt;/p&gt;

&lt;p&gt;So here comes the alternative:&lt;/p&gt;

&lt;h4&gt;
  
  
  4.5.2 Struct of Arrays (SoA)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;Entities&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Essentially, you separate data by usage patterns. When updating positions, you only load position arrays into the cache, maximizing cache utilization and minimizing cache misses.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.6 Hardware Topology: NUMA &amp;amp; Context Switching
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;This will be covered in more detail in the next post.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Understanding how memory works at a low level is crucial for writing high-performance software. By leveraging knowledge about caches, memory hierarchies, and data locality, programmers can make informed decisions that lead to significant performance improvements.&lt;/p&gt;

&lt;p&gt;In this post, we covered the basics of modern memory systems, including the differences between SRAM and DRAM, the structure of CPU caches, and practical programming techniques to optimize memory access patterns. In the upcoming parts of this series, we will dive deeper into virtual memory, NUMA architectures, and advanced optimization strategies.&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>architecture</category>
      <category>computerscience</category>
      <category>books</category>
    </item>
    <item>
      <title>I built a Backend web Framework from Scratch in C++</title>
      <dc:creator>Hamza Hasanain</dc:creator>
      <pubDate>Sat, 30 Aug 2025 21:10:32 +0000</pubDate>
      <link>https://dev.to/hamzahassanain0/i-built-a-backend-web-framework-from-scratch-in-c-41n8</link>
      <guid>https://dev.to/hamzahassanain0/i-built-a-backend-web-framework-from-scratch-in-c-41n8</guid>
      <description>&lt;h4&gt;
  
  
  I wouldn’t go as far as calling it a framework — it’s more of a library
&lt;/h4&gt;

&lt;p&gt;I’ve been exploring some backend web frameworks lately and kept asking myself: &lt;em&gt;what do these things actually do under the hood?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To find out, I decided to dive into C++ and experiment. After some tinkering, I built a small homegrown backend web library, split into three layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Socket Library&lt;/strong&gt; – Handles raw communication between processes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP Server&lt;/strong&gt; – Parses HTTP requests, manages headers and bodies, and handles TCP streams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web Library&lt;/strong&gt; – Provides a simple framework for routing, controllers, and serving static files, similar to Express.js.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each layer is built on top of the one beneath it, so understanding the foundation is crucial.&lt;/p&gt;

&lt;p&gt;Before we dive into the layers, you can check out the GitHub repos:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Socket Library: &lt;a href="https://github.com/HamzaHassanain/hamza-socket-lib" rel="noopener noreferrer"&gt;github.com/HamzaHassanain/hamza-socket-lib&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;HTTP Server: &lt;a href="https://github.com/HamzaHassanain/hamza-http-server-lib" rel="noopener noreferrer"&gt;github.com/HamzaHassanain/hamza-http-server-lib&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Web Library: &lt;a href="https://github.com/HamzaHassanain/hamza-backend-web-library-cpp" rel="noopener noreferrer"&gt;github.com/HamzaHassanain/hamza-backend-web-library-cpp&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Example Blog App: &lt;a href="https://github.com/HamzaHassanain/simple-blog-from-scratch" rel="noopener noreferrer"&gt;github.com/HamzaHassanain/simple-blog-from-scratch&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Please note that this is just my understanding of how things work and how I implemented them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Also note that this project isn’t fully production-ready, but it’s an excellent exercise in &lt;strong&gt;understanding backend frameworks from the ground up&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding Sockets: How Processes Talk
&lt;/h2&gt;

&lt;p&gt;At the core of networking on Unix-like systems are &lt;strong&gt;file descriptors (FDs)&lt;/strong&gt; — small integers a process uses to refer to kernel-managed resources (files, pipes, or sockets). When you call something like &lt;code&gt;fflush(stdout)&lt;/code&gt; you’re asking your program’s runtime to push buffered bytes down to the FD that represents &lt;code&gt;stdout&lt;/code&gt;; what happens to those bytes next depends on what that FD is connected to (a terminal, a file, or a socket).&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;socket&lt;/strong&gt; is one of those kernel-managed resources: it’s a kernel object that your process creates with &lt;code&gt;socket(...)&lt;/code&gt; and then uses to send and receive network data. You can think of a socket as an endpoint inside your program; the socket itself is represented by an FD in your process. To tell the kernel where packets should go (or where they came from), a socket is usually &lt;strong&gt;bound&lt;/strong&gt; to a network &lt;em&gt;address&lt;/em&gt;, which is commonly expressed as three parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Address family&lt;/strong&gt; — how to interpret addresses (IPv4 or IPv6, e.g. &lt;code&gt;AF_INET&lt;/code&gt; or &lt;code&gt;AF_INET6&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IP address&lt;/strong&gt; — which host/machine on the network you mean (e.g. &lt;code&gt;127.0.0.1&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port&lt;/strong&gt; — which particular service or process on that host should receive the traffic (e.g. &lt;code&gt;80&lt;/code&gt;, &lt;code&gt;8080&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: Ports 0–1023 are reserved for well-known services like HTTP (80) or SSH (22); ports above that range are available for general use.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Socket types
&lt;/h3&gt;

&lt;p&gt;Two socket types are most relevant when writing networked servers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1) Datagram sockets (UDP)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UDP sockets are &lt;em&gt;connectionless&lt;/em&gt;: you can call &lt;code&gt;sendto()&lt;/code&gt; with any destination address (IP+port) and the kernel will attempt to deliver that single datagram.&lt;/li&gt;
&lt;li&gt;Each &lt;code&gt;recvfrom()&lt;/code&gt; or &lt;code&gt;recvmsg()&lt;/code&gt; call returns exactly one datagram (so message boundaries are preserved).&lt;/li&gt;
&lt;li&gt;There is no handshake, and the network does not guarantee delivery, ordering, or uniqueness — datagrams can be lost, duplicated, or arrive out of order.&lt;/li&gt;
&lt;li&gt;It’s common to bind a UDP socket to a port and serve many different remote peers on that single FD; the kernel provides the sender’s address on each receive so you can reply.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2) Stream sockets (TCP)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TCP sockets are &lt;em&gt;connection-oriented&lt;/em&gt;: the client and server perform a 3-way handshake to establish a connection.&lt;/li&gt;
&lt;li&gt;After the handshake the kernel exposes a reliable, ordered &lt;strong&gt;byte stream&lt;/strong&gt; to your process. TCP ensures bytes are delivered and in order (barring extreme failures), but it does not preserve packet/message boundaries; if you send two &lt;code&gt;write()&lt;/code&gt; calls on the sender, the receiver may receive them merged or split across &lt;code&gt;read()&lt;/code&gt; calls.&lt;/li&gt;
&lt;li&gt;For servers you &lt;code&gt;bind()&lt;/code&gt; and &lt;code&gt;listen()&lt;/code&gt; on a port. &lt;code&gt;accept()&lt;/code&gt; returns a brand-new FD representing the established connection; the listening FD continues to accept more connections. Each client connection has its own kernel socket object and FD.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Notes on scale &amp;amp; semantics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For UDP you can call &lt;code&gt;connect()&lt;/code&gt; on the socket to set a default peer (useful to avoid passing an address to every &lt;code&gt;sendto()&lt;/code&gt;), but this only records a default destination and filters incoming datagrams to that peer; the underlying semantics remain datagram-based and connectionless.&lt;/li&gt;
&lt;li&gt;For TCP, &lt;code&gt;accept()&lt;/code&gt; and the new FD are what you use to &lt;code&gt;read()&lt;/code&gt;/&lt;code&gt;write()&lt;/code&gt; that client's data; the listening socket never carries per-client data.&lt;/li&gt;
&lt;li&gt;Remember: “ordered bytes” (TCP) ≠ “preserved messages” — if you need discrete messages on top of TCP, implement framing (length prefix, delimiters, etc.).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Creating a Simple UDP Socket
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Creating a simple UDP socket with address in C-style (On Unix)&lt;/span&gt;
&lt;span class="cp"&gt;#include &amp;lt;arpa/inet.h&amp;gt;&lt;/span&gt;    &lt;span class="c1"&gt;// inet_pton, htons&lt;/span&gt;
&lt;span class="cp"&gt;#include &amp;lt;netinet/in.h&amp;gt;&lt;/span&gt;  &lt;span class="c1"&gt;// sockaddr_in&lt;/span&gt;
&lt;span class="cp"&gt;#include &amp;lt;string.h&amp;gt;&lt;/span&gt;      &lt;span class="c1"&gt;// memset&lt;/span&gt;
&lt;span class="cp"&gt;#include &amp;lt;sys/socket.h&amp;gt;&lt;/span&gt;  &lt;span class="c1"&gt;// socket, bind&lt;/span&gt;
&lt;span class="cp"&gt;#include &amp;lt;unistd.h&amp;gt;&lt;/span&gt;      &lt;span class="c1"&gt;// close&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;sockfd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AF_INET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SOCK_DGRAM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sockfd&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Handle error&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;sockaddr_in&lt;/span&gt; &lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;memset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sin_family&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AF_INET&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sin_port&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;htons&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;inet_pton&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AF_INET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"127.0.0.1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sin_addr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sockfd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;sockaddr&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Handle error&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Use the socket...&lt;/span&gt;

&lt;span class="c1"&gt;// Cleanup&lt;/span&gt;
&lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sockfd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Creating a Simple UDP Socket (using my socket library)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Creating a simple UDP socket with address using my library (same logic as above, wrapped in my library)&lt;/span&gt;

&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="n"&gt;hh_socket&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;socket_address&lt;/span&gt; &lt;span class="nf"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"127.0.0.1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;family&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IPV4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;socket&lt;/span&gt; &lt;span class="nf"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Protocol&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;UDP&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Cleanup is handled by destructor&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Creating a Simple TCP Server (using my socket library)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Creating a simple TCP server, that echoes back messages&lt;/span&gt;

&lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="n"&gt;hh_socket&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;socket_address&lt;/span&gt; &lt;span class="nf"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;port&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"0.0.0.0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;family&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IPV4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;socket&lt;/span&gt; &lt;span class="nf"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Protocol&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;TCP&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_ptr&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;accept&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// blocking call&lt;/span&gt;

    &lt;span class="n"&gt;data_buffer&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;receive&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// blocking call&lt;/span&gt;
    &lt;span class="n"&gt;data_buffer&lt;/span&gt; &lt;span class="n"&gt;echo_message&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Echo: "&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;echo_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;echo_message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// echo back&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Cleanup is handled by destructor&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Handling Blocking Operations
&lt;/h3&gt;

&lt;p&gt;Blocking means an I/O call (like &lt;code&gt;read()&lt;/code&gt; or &lt;code&gt;accept()&lt;/code&gt;) makes your program wait until the operation completes.&lt;/p&gt;

&lt;p&gt;When a socket call blocks, the current thread simply sits idle until the OS has data to return or the requested action completes. For servers that handle many clients, blocking on a single thread quickly becomes a bottleneck. To handle many connections efficiently, you can use:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Multithreading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Idea: handle each connection in its own thread or use a pool of worker threads.&lt;/li&gt;
&lt;li&gt;Pros: simple mental model — each handler can use blocking calls; easy to write.&lt;/li&gt;
&lt;li&gt;Cons: high memory/context-switch cost for many connections; synchronization complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. I/O multiplexing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Idea: a single thread (or a few threads) waits on many file descriptors and reacts when any become ready. Tools: &lt;code&gt;epoll&lt;/code&gt; (Linux), &lt;code&gt;IOCP&lt;/code&gt; (Windows), &lt;code&gt;kqueue&lt;/code&gt; (macOS/BSD). There is also &lt;code&gt;select&lt;/code&gt; (available on both Windows and Unix), but it scales poorly to large numbers of connections.&lt;/li&gt;
&lt;li&gt;Pros: low thread overhead; great for many concurrent connections.&lt;/li&gt;
&lt;li&gt;Cons: more complex control flow; must handle partial reads/writes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Async I/O&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Idea: submit read/write requests to the kernel and receive completion events later (no thread is blocked waiting). &lt;code&gt;io_uring&lt;/code&gt; on modern Linux is a powerful example.&lt;/li&gt;
&lt;li&gt;Pros: excellent throughput and low latency; scales well.&lt;/li&gt;
&lt;li&gt;Cons: API is more advanced; portability issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For my project, I used &lt;strong&gt;I/O multiplexing&lt;/strong&gt;, allowing a single-threaded event loop to handle hundreds of connections efficiently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// A simple epoll server (using my epoll_server class) for handling multiple sockets.&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;chat_server&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;epoll_server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;unordered_map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;usernames&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nl"&gt;protected:&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="n"&gt;on_connection_opened&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_ptr&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;send_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_buffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Enter username: "&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;on_message_received&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_ptr&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;data_buffer&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_string&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usernames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;get_fd&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;usernames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// First message is username&lt;/span&gt;
            &lt;span class="n"&gt;usernames&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;get_fd&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="n"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usernames&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;get_fd&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;" joined the chat"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Regular chat message&lt;/span&gt;
            &lt;span class="n"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usernames&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;get_fd&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;": "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;on_connection_closed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_ptr&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;override&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;usernames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;get_fd&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;usernames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;second&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;" left the chat"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;usernames&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;erase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;it&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;private&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;
    &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;data_buffer&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conn_state&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;conns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;send_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Building the HTTP Server
&lt;/h2&gt;

&lt;p&gt;TCP streams are just sequences of bytes. An HTTP request might be &lt;strong&gt;fragmented across multiple TCP packets&lt;/strong&gt;, so the server must:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reassemble the byte stream.&lt;/li&gt;
&lt;li&gt;Extract request headers.&lt;/li&gt;
&lt;li&gt;Parse the body (if present).&lt;/li&gt;
&lt;li&gt;Handle limits (max body size, header size).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The HTTP server integrates tightly with the &lt;strong&gt;socket library&lt;/strong&gt;: it extends &lt;strong&gt;hh_socket::epoll_server&lt;/strong&gt;, reusing its efficient connection handling and abstractions. This shows how layering tames complexity: the HTTP server focuses on protocol logic, while the socket layer manages the low-level networking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note that my implementation is not fully compliant with the HTTP specification; it simply provides a basic framework for handling HTTP requests.&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  High-level (use the project's parser):
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example: receive a buffer from a connection and let the project's parser assemble&lt;/span&gt;
&lt;span class="c1"&gt;// requests that may span multiple TCP reads.&lt;/span&gt;
&lt;span class="n"&gt;hh_socket&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;data_buffer&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;receive&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;hh_http&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;http_message_handler&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// `handle` returns an http_handled_data describing either a complete request&lt;/span&gt;
&lt;span class="c1"&gt;// or that more bytes are required (completed == false).&lt;/span&gt;
&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;handle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Not enough data yet - wait for the next read and call parser.handle again&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rfind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"BAD_"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Parser returned an error token (e.g. BAD_METHOD_OR_URI_OR_VERSION)&lt;/span&gt;
    &lt;span class="c1"&gt;// Application can craft an error response here&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Complete request: use result.method, result.uri, result.headers, result.body&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Low-level (manual reassembly sketch):
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Read bytes into a string buffer until we detect the header terminator \r\n\r\n&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;receive&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;to_string&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;headers_end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\r\n\r\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers_end&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;npos&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="n"&gt;header_block&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;substr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers_end&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="n"&gt;rest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;substr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headers_end&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// body (maybe partial)&lt;/span&gt;

    &lt;span class="c1"&gt;// parse request-line and headers (split on \r\n, then on ':')&lt;/span&gt;
    &lt;span class="c1"&gt;// find Content-Length (if present) to determine expected body size&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;content_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// parse from header_block if present&lt;/span&gt;

    &lt;span class="c1"&gt;// Keep reading until we have the full body&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;content_length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;rest&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;receive&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;to_string&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;substr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content_length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Now header_block contains headers and `body` contains the full payload&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  HTTP Server example (using my http_server class)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;"http_server.hpp"&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="k"&gt;namespace&lt;/span&gt; &lt;span class="n"&gt;hh_http&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// this automatically sets up the server to listen for incoming connections&lt;/span&gt;
    &lt;span class="c1"&gt;// also this handles big request body, and also, you can normally send a big response,&lt;/span&gt;
    &lt;span class="c1"&gt;// as the epoll server itself handles sending such big chunks of data&lt;/span&gt;
    &lt;span class="n"&gt;http_server&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8081&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"0.0.0.0"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Set up all the callbacks&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_request_callback&lt;/span&gt;&lt;span class="p"&gt;([](&lt;/span&gt; &lt;span class="n"&gt;http_request&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;http_response&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Handle incoming HTTP requests&lt;/span&gt;
        &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Hello, World!&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"OK"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;send&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// send the headers, and the body to the client&lt;/span&gt;
        &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_error_callback&lt;/span&gt;&lt;span class="p"&gt;([](&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;exception&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;cerr&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s"&gt;"Error: "&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;what&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;endl&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Start the server (this will block)&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Web Library Layer
&lt;/h2&gt;

&lt;p&gt;The top layer provides &lt;strong&gt;routing, controllers, and static file serving&lt;/strong&gt;, similar to Express.js. Key features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MVC-like architecture&lt;/strong&gt; – Organizes code into controllers, views, and models for better maintainability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing system&lt;/strong&gt; – Maps incoming HTTP requests to controller actions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static file serving&lt;/strong&gt; – Delivers HTML, CSS, and JavaScript assets alongside dynamic content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A simple example server:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;"web-lib.hpp"&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="c1"&gt;// Define request handlers (controller actions)&lt;/span&gt;
&lt;span class="n"&gt;hh_web&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt; &lt;span class="nf"&gt;get_users_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_ptr&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;hh_web&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;web_request&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_ptr&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;hh_web&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;web_response&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;send_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: [&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;]}"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hh_web&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;EXIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;hh_web&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt; &lt;span class="nf"&gt;create_user_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_ptr&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;hh_web&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;web_request&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;shared_ptr&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;hh_web&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;web_response&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="c1"&gt;// Extract user data from request body&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;get_body&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;send_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: true, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;User created&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;}"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hh_web&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;EXIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;make_shared&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;hh_web&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;web_server&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8080&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"0.0.0.0"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;make_shared&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;hh_web&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;web_router&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Map routes to controller actions&lt;/span&gt;
    &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;hh_web&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;web_request_handler_t&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/api/users"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;get_users_handler&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/api/users"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;create_user_handler&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// It auto-detects the static files (based on the extention, then sends it)&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;use_static&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"static"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Route with path parameters&lt;/span&gt;
    &lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/api/users/:id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt;&lt;span class="p"&gt;{[](&lt;/span&gt;&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;hh_web&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;get_path_params&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;at&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;send_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s"&gt;}"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hh_web&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;EXIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}});&lt;/span&gt;

    &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;use_router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;server&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="n"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why This Structure Matters
&lt;/h2&gt;

&lt;p&gt;Here’s what this layered design teaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sockets&lt;/strong&gt; handle raw communication and events efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP server&lt;/strong&gt; reliably reconstructs protocol-level messages from raw TCP streams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web library&lt;/strong&gt; allows developers to structure their application cleanly and add features without worrying about low-level details.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Even with only a short time spent learning backend programming, building this project made a few things clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Networking can be reduced to a few &lt;strong&gt;basic operations&lt;/strong&gt; behind clean abstractions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TCP streams need careful parsing&lt;/strong&gt;, not just reading packets.&lt;/li&gt;
&lt;li&gt;Layering responsibilities makes large systems manageable and testable.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>cpp</category>
      <category>devchallenge</category>
    </item>
  </channel>
</rss>
