DEV Community: Hamza Hasanain

The Lie of Time Management: Why Your Energy is the Real Bottleneck

Hamza Hasanain — Tue, 12 May 2026 03:53:56 +0000

I recently tried a very foolish experiment. I decided to overwhelm myself with a massive workload to force myself to learn better time management. Between juggling university coursework, optimizing cloud infrastructure for Repovive, and grinding through AWS architecture study sessions, I mapped out every hour of my day.

I failed miserably.

But from that failure, I discovered a crucial truth: Time management is an overrated concept. The problem has never been a lack of time. The problem is entirely your energy as a human being.

Your energy is your daily currency. You spend it throughout the day on decisions, complex logic, and effort until your balance is depleted. Everything around us, especially in the tech industry, seems designed to drain this balance without us even noticing.

Here is the reality of what is actually stealing your energy, why we avoid hard tasks, and how to stop the leak.

The Procrastination Myth

We often think we procrastinate because we are out of time or lack discipline. But psychological research paints a different picture. According to Dr. Tim Pychyl, author of Solving the Procrastination Puzzle, procrastination is not a time management issue; it is an emotional regulation problem.

As developers, we procrastinate to avoid the immediate negative emotions—like anxiety, frustration, or the intimidation of a blank IDE—associated with a complex task. When your energy is already low, you simply do not have the internal currency left to fight those negative feelings, so your brain seeks a quick dopamine hit elsewhere.

Silent Energy Drains for Developers

1. Context Switching and The "Quick Check"

Those five minutes you spend checking your phone, replying to a Slack ping, or glancing at a Jira board don't just waste time; they burn your mental currency.

When you are deep in a complex problem—like holding the architecture of a distributed submission pipeline in your head or debugging a custom C++ allocator—your brain builds a massive, fragile mental model. Research by Gloria Mark at UC Irvine shows that it takes an average of 23 minutes and 15 seconds to return to your original state of flow after an interruption. Tearing down and rebuilding that mental model multiple times a day is exhausting.

2. The Morning Coffee Trap

Drinking coffee the second you wake up is actively setting you up for a mid-day crash.

According to neurobiologist Dr. Andrew Huberman of Stanford University, caffeine doesn't give you energy; it blocks adenosine, the brain chemical that makes you feel sleepy. If you drink coffee immediately, the adenosine continues to build up in the background. Once the caffeine metabolizes, you are hit with a massive backlog of adenosine all at once, causing your brain to shut down in the middle of the afternoon.

Delay your coffee by 90 to 120 minutes after waking up. Let your body's natural cortisol spike clear out the morning adenosine first.

3. Mental Load and Analysis Paralysis

Giving tasks more mental weight than they deserve drains your reserves fast. Staring at the screen, agonizing over the absolute most optimal database schema or over-engineering a simple microservice, burns more energy than actually writing the code. Overthinking is a massive energy leak. Often, writing a naive implementation and refactoring it later is far cheaper on your internal battery.

How to Protect Your Currency

If energy is the real bottleneck, how do we protect it?

Audit Your Drains: Stop tracking your time and start tracking your energy. Notice when you feel sharpest and when you feel completely brain-dead. Protect your high-energy windows fiercely—schedule your hardest architecture or problem-solving work for those specific hours.
Strategic Ignorance: Turn off notifications when doing deep work. Batch your communication. Check your emails and messages only at specific, low-energy points in the day rather than leaving them open on a second monitor.
Lower the Activation Energy: To fight the emotional block of starting an intimidating ticket, break it down into absurdly small steps. Don't set out to "build the feature." Set out to "create the file" or "write the function signature." Once the initial friction is overcome, momentum takes over.

Time is just a container. What matters is the energy you bring to the time you have. Stop trying to manage the clock, and start managing your battery.

AI's Gateway Drug to Engineering

Hamza Hasanain — Wed, 29 Apr 2026 10:02:34 +0000

Right now, there's a rumor floating around: "Just spend enough time with Claude, get its vibe, and BAM! Become a software engineer."
Not gonna happen. Claude will not make you a software engineer. At best, you'll be a script kiddie with the most advanced compiler known to humanity.
Why? Why is the idea of making vibe coding into software engineering is just a myth?

1) Writing code ≠ software engineering.

Claude is an execution engine. Perfect for fast scripts or even React components. Writing code is the easiest part of software engineering, which includes parsing data with Python, being different from memory management in C++, configuring secure VPCs, designing compiler passes, and many more. What does it boil down to? System design.

2) Dreaming about context engineering.

You may have heard people talk about tough prompts teaching you "context engineering." Nope. You can't prompt for the unknown unknowns. If you ask the AI to write you a backend, it will happily create tens of megabytes of code inside one database document. With the experience, you know how bad of an idea it is, because you've encountered document size limitations and network fees. LLMs go for the path of least resistance; lacking the engineering context to guide it, you get a poorly designed system.

3) The multi-million dollar security trap.

Think you can just vibe-code an authentication flow? AI is notorious for generating code that compiles flawlessly but is riddled with invisible vulnerabilities—missing rate limits, hardcoded secrets, and string-concatenated SQL ripe for injection. When you rely on AI to build your architecture, you aren't just risking a buggy app; you are opening the door to catastrophic user data leakage. And when that data spills, regulators won't care that Claude wrote your backend. You will be staring down GDPR fines and legal nightmares easily worth millions of dollars or euros. Security architecture isn't something you can prompt-engineer after the breach happens.

4) "Learning as you go"

Learning as you go is negligence, and the most dangerous take is that hitting the ceiling with your AI prompts will allow you to learn all the necessary things, like CI/CD pipelines, infrastructure, hardware sensitivity, and optimization. Hitting the ceiling doesn't mean spending a quiet weekend learning about DevOps and such. No, it will be hit when you lose data due to a production outage and receive a four-hundred-dollar invoice from Vercel after using the AI to build a computationally-intensive backend on Vercel.

Engineers know this ceiling is coming. And that's what allows them to avoid it, design the system in such a way that it will never happen.
Do not confuse smart autocompletion with good engineering practices.

What Every Programmer Should Know About Memory Part 4

Hamza Hasanain — Wed, 07 Jan 2026 13:45:40 +0000

What Programmers Can Do: Writing Hardware-Sympathetic Code

In the previous article, we learned that memory geography matters. Now, we arrive at the finale—the most actionable part of Ulrich Drepper's paper: Section 6.

This is not about choosing a better algorithm (O(n) vs O(log n)). This is about writing code that respects how the hardware physically works. We will cover Cache Bypassing, TLB Optimization, Concurrency Pitfalls, and Code Layout.

Subsection A: Cache Optimization
- 1.1. Data Placement: std::vector vs std::list
- 1.2. The Double Indirection Trap
- 1.3. Bypassing the Cache (Non-Temporal Stores)
- 1.4. Access Patterns & Blocking (Tiling)
Subsection B: The Virtual Memory (TLB)
- 2.1. The High Cost of Translation
- 2.2. The Solution: Huge Pages
Subsection C: Data & Code Layout
- 3.1. The Tetris Game: Struct Packing
- 3.2. Hot/Cold Data Splitting
- 3.3. Struct of Arrays (SoA) vs Array of Structs (AoS)
- 3.4. Alignment Matters
- 3.5. Instruction Cache & Branch Prediction
Subsection D: Concurrency & NUMA
- 4.1. The Silent Killer: False Sharing
- 4.2. Thread Affinity (Pinning)
Subsection E: Prefetching
- 5.1. Helping the Hardware
Conclusion

Subsection A: Cache Optimization

The most significant performance cliff in modern computing is missing the L1 Cache. Accessing L1 takes ~4 cycles. Accessing RAM takes ~200+ cycles. Your goal is to keep data in L1 as long as possible (Temporal Locality) and use every byte you load (Spatial Locality).

1.1 Data Placement: std::vector Beats std::list

This is the Hello World of memory optimization. It teaches the fundamental rule: Linked Lists are cache poison.

Why: A linked list scatters nodes across the heap (0x1000, 0x8004, 0x200). The CPU cannot predict the next address, breaking the Hardware Prefetcher. You pay the full RAM latency tax for every node.

In contrast, std::vector stores elements contiguously in memory (0x1000, 0x1004, 0x1008). Accessing one element brings the next few into the cache line, leveraging spatial locality and prefetching. This drastically reduces cache misses and improves performance.

Bad Code Example: Using std::list

long long sum_list(const std::list<int>& l) {
    long long sum = 0;
    for (int val : l) sum += val;
    return sum;
}

Good Code Example: Using std::vector

long long sum_vector(const std::vector<int>& v) {
    long long sum = 0;
    for (int val : v) sum += val;
    return sum;
}

1.2 The Double Indirection Trap: `std::vector<std::vector<T>>`

Developers often use std::vector<std::vector<int>> for grids. This is a pointer to an array of pointers.

Why: To access grid[i][j], the CPU must fetch grid -> fetch pointer at grid[i] (cache miss 1) -> fetch data at [j] (cache miss 2). Rows are not contiguous in physical memory.

To solve this, we use a clever trick: flatten the 2D structure into a 1D vector.

Bad Code Example: Using `std::vector<std::vector<T>>`

std::vector<std::vector<int>> grid(rows, std::vector<int>(cols));
int value = grid[i][j]; // Double indirection, two cache misses

Good Code Example: Flattening the 2D Structure

inline int idx(int i, int j, int cols) {
    return i * cols + j;
}

// [Row 1 Data... | Row 2 Data... | Row 3 Data...] (Contiguous)
std::vector<int> grid(rows * cols);
int value = grid[idx(i, j, cols)]; // Single access, better cache locality

1.3 Bypassing the Cache (Non-Temporal Stores)

The Hidden Cost of Writing:
Normally, when you write to memory (e.g., data[i] = 0), the CPU must ensure cache coherency. Since it writes to a 64-byte cache line, it must first Read-For-Ownership (RFO). It fetches the existing 64 bytes from RAM into L1, modifies the 4 bytes you changed, and marks the line as "Modified".

The Problem (Cache Pollution):
If you are initializing a massive array (e.g., memset of 1GB), the CPU will:

Read 1GB of old data from RAM (wasting bandwidth).
Fill almost the entire L1/L2/L3 cache with this zeroed data.
Evict your application's hot data (code, stack, other variables) to make room.

This is called Cache Pollution, and it destroys performance for code running immediately after the write.

The Solution: Non-Temporal Stores (Streaming Stores)
You can instruct the CPU to use a Write-Combining Buffer (WCB) instead of the cache. You tell the CPU: "I promise I will overwrite this entire line. Don't read it. Do not pollute the cache with it. Just write it to RAM."

Code Example (Intel Intrinsics):

#include <immintrin.h>

void stream_memset(int* data, int size, int value) {
    // 1. Create a 128-bit vector filled with 'value' (4 integers)
    __m128i v = _mm_set1_epi32(value);

    // Note: ensure 'size' is a multiple of 4 integers (16 bytes)
    for (int i = 0; i < size; i += 4) {
        // 2. The Streaming Store (The Magic)
        // Writes to 16-byte aligned memory, bypassing L1/L2.
        // It tells the CPU to NOT fetch the old data (No Read-For-Ownership).
        _mm_stream_si128((__m128i*)&data[i], v);
    }

    // 3. The Fence
    // Streaming stores are "weakly ordered". This instruction
    // Forces all Write-Combining Buffers to flush to RAM immediately.
    _mm_sfence();
}

The Constraint: Memory Alignment

The specific intrinsic _mm_stream_si128 physically requires the memory address to be 16-byte aligned (divisible by 16).

If you access address 0x1000, it works (Ends in 0).
If you access address 0x1004, it Crashes (Segfault).

Using standard new or malloc does not guarantee this alignment. You must use specific allocators:

1. Modern C++ (C++17):

#include <cstdlib>
// std::aligned_alloc(alignment, size)
int* data = (int*)std::aligned_alloc(16, 1000 * sizeof(int));
std::free(data);

2. The "Intel" Way (Intrinsics):

#include <immintrin.h> 
int* data = (int*)_mm_malloc(1000 * sizeof(int), 16);
_mm_free(data); // Must use _mm_free matching _mm_malloc

3. The POSIX Way (Linux/Unix):

#include <cstdlib>
void* ptr;
if (posix_memalign(&ptr, 16, 1000 * sizeof(int)) == 0) {
    int* data = (int*)ptr;
    free(data);
}

For AVX/AVX2, use _mm256_stream_si256 which requires 32-byte alignment.

1.4 Access Patterns & Blocking (Tiling)

Hardware prefetchers are good at linear access (Row-Major), but they fail when access patterns are strided (Column-Major) or random.

Row-Major vs Column-Major:

// Fast: Row-major access (Sequential)
// All on the same page/cache line.
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        sum += matrix[i][j];
    }
}

// Slow: Column-major access (Strided)
// High Cache miss rate & TLB miss rate!
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        sum += matrix[i][j];
    }
}

The Fix: Blocking (Loop Tiling)
Divide the problem into small sub-problems that fit entirely inside the L1 Cache.

Choosing the Block Size (B):
For a square block of B x B elements, you want the working set (3 * B^2 * sizeof(element)) to fit in L1.

B ≈ sqrt( L1_Size / (3 * Element_Size) )

Example: L1 = 32KB, float = 4B -> B ≈ sqrt(32768 / 12) ≈ 52. Choose B=48 or B=32 for alignment.

The Algorithm:

Load a small B x B block of A and B into L1.
Compute all possible results for that block.
Only move to the next block when finished.

This maximizes Temporal Locality (reuse). The data goes into L1 and stays there.

Subsection B: The Virtual Memory (TLB)

This is a critical section often ignored by developers. Every time your code touches a virtual address, the CPU must translate it to a physical address using the TLB (Translation Lookaside Buffer).

2.1 The High Cost of Translation

The TLB is a tiny cache for logical-to-physical address translations. It typically has distinct levels (L1/L2) with entry counts in the dozens to hundreds (e.g., 64 L1 entries, 512 L2 entries).
Standard memory pages are 4KB. If you access 2GB of memory sequentially, you need 524,288 page table entries. Your TLB will thrash constantly.

2.2 The Solution: Huge Pages

Modern CPUs support Huge Pages (e.g., 2MB or 1GB).
Using 2MB pages for that same 2GB array reduces entries to just 1,024. The entire mapping can now fit in the L2 TLB.

Enabling Huge Pages (Linux):

# Allocate 512 hugepages of 2MB each (Total 1GB)
sysctl -w vm.nr_hugepages=512
# Verify
grep Huge /proc/meminfo

Code Example (Using mmap):

#include <sys/mman.h>

// Request a 2MB Huge Page explicitly
void* huge_data = mmap(NULL, 2 * 1024 * 1024, 
                       PROT_READ | PROT_WRITE, 
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, 
                       -1, 0);

if (huge_data == MAP_FAILED) {
    // Fallback (or check if user has privileges/OS support)
}

Note: Linux also supports **Transparent Huge Pages (THP), which tries to use huge pages automatically. However, explicit mmap or madvise gives you deterministic control.

Subsection C: Data & Code Layout

3.1 The Tetris Game: Struct Packing

The compiler aligns data to memory boundaries. If you order your variables poorly, you create holes (padding) in your cache lines.

Why does the compiler add padding? To ensure that data types are aligned to their natural boundaries (e.g., 4-byte integers on 4-byte boundaries).

Bad Code Example: Poorly Ordered Struct

struct Bad {
    char a;      // 1 byte
    // 7 bytes padding
    double c;    // 8 bytes
    int b;       // 4 bytes
    // 4 bytes padding
};
// Size: 24 bytes

Good Code Example: Well-Ordered Struct

struct Good {
    double c;    // 8 bytes
    int b;       // 4 bytes
    char a;      // 1 byte
    // 3 bytes padding
};
// Size: 16 bytes (no padding between members)

3.2 Hot/Cold Data Splitting

Objects often contain data we check frequently (ID, Health) and data we rarely check (Name, Biography).

Why: If a struct is 200 bytes (mostly text strings), only 3 structs fit in a cache line. Iterating over them fills the cache with Cold text data you aren't reading, flushing out useful data.

What to do: Move rare data to a separate pointer or array.

Bad Code Example: Mixed Hot/Cold Data

struct User {
    int id;              // HOT
    int balance;         // HOT
    char username[128];  // COLD (Pollutes cache)
};

Good Code Example: Split Hot/Cold Data

struct UserHot {
    int id;
    int balance;
    UserCold* coldData; // Pointer to cold data
};
struct UserCold {
    char username[128];
};

3.3 Struct of Arrays (SoA) vs Array of Structs (AoS)

This is a classic battle in Game Development and Data-Oriented Design.

Array of Structs (AoS) - The OOP Way:

struct Point {
    int x, y, z;
};
Point points[1000];

This is good if you always access x, y, and z together. But often, you loop over just x to do a physics calculation.
The cost: Every time you load points[i].x, you also load y and z into the cache line, wasting 66% of your bandwidth.

Struct of Arrays (SoA) - The Data-Oriented Way:

struct Points {
    int x[1000];
    int y[1000];
    int z[1000];
};

Now, x values are packed contiguously. One cache line load brings in 16 x values at once. This is also perfect for SIMD (Single Instruction Multiple Data) auto-vectorization.

3.4 Alignment Matters

CPUs love boundaries. Ideally, your data structures should start at addresses divisible by 64 (cache line size).
C++ Solution:

struct alignas(64) AlignedData {
    int critical_value;
    // ...
};

3.5 Instruction Cache & Branch Prediction

It's not just data that gets cached—instructions do too (L1i Cache). If your code jumps around unpredictably, the CPU pipeline stalls.

Branch Hints:
Modern CPUs have powerful dynamic branch predictors that often figure out patterns better than you can. However, for static branches (like error checking), you can give the compiler a hint to move cold code away from hot code.

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

void process_transaction(Transaction* t) {
    // "Cold" path: Compiler moves this assembly block to the end of the function
    if (unlikely(t == nullptr)) {
        handle_error(); 
        return;
    }

    // "Hot" path: Continues immediately in memory, keeping L1i efficient
    do_math(t);
}

Subsection D: Concurrency & NUMA

4.1 The Silent Killer: False Sharing

This is the most insidious performance bug in multithreading.
Two threads on different cores modify variables that happen to sit on the same 64-byte cache line. The cache coherence protocol (MESI) forces the line to bounce back and forth ("ping-ponging"), executing slowly.

The Fix (Padding):
Align critical shared data to 64 bytes to ensure it lives on its own island.

struct PaddedCounter {
    alignas(64) std::atomic<int> value;
    // Padding is implicit due to alignas, but explicit padding 
    // can also be used: char pad[60];
};

PaddedCounter counters[NUM_THREADS]; // Each counter is now on a separate line

Result: often a 10x-50x speedup in contended write workloads.

4.2 Thread Affinity (Pinning)

In a NUMA system, memory is local to a specific CPU socket. If the OS scheduler moves your thread to a different socket, it must access memory remotely (high latency).

The Solution: Pin the thread to a specific core (or socket).

#include <pthread.h>

void pin_thread_to_core(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

Tooling: Use numactl to bind processes: numactl --physcpubind=0-3 --membind=0 ./myapp.

Subsection E: Prefetching

5.1 Helping the Hardware

Hardware prefetchers are great at standard patterns (i++), but they struggle with pointer lookups (p = p->next).

Software Prefetching:
You can issue a non-blocking instruction to fetch a line into L1 before you need it. Use __builtin_prefetch(addr, rw, locality).

while (node) {
    // locality: 3 = heavy reuse (L1), 0 = no reuse (streaming)
    // rw: 0 = read, 1 = write
    __builtin_prefetch(node->next, 0, 3);

    do_heavy_work(node->value); 
    // By the time work is done, node->next is hopefully in L1.

    node = node->next;
}

Warning: Tuning this is hard. Prefetch too early, and you evict useful data. Prefetch too late, and it hasn't arrived. Measure everything.

Tools for Performance Engineers

Don't guess—measure.

perf (Linux): The gold standard.
- perf stat -e cycles,cache-misses,instructions ./app: Check IPC and miss rates.
- perf record -g ./app & perf report: Find exactly where cache misses happen.
valgrind (Cachegrind): valgrind --tool=cachegrind ./app. Slow, but gives deterministic cache simulation.
lscpu / hwloc: View your topology (L1 sizes, NUMA nodes).

Quick Cheat Sheet

Mechanic	Do ...	Don't ...
Containers	Prefer `std::vector` (Contiguous).	Use `std::list` (Linked Lists are cache poison).
Indirection	Flatten 2D arrays to 1D vectors.	Use `vector<vector<T>>` (Double Indirection).
Struct Packing	Order members: Largest to Smallest.	Order randomly (creates padding/holes).
Hot/Cold Data	Split rare fields into separate structs.	Pollute cache lines with unused data strings.
Data Layout	Use Struct of Arrays (SoA) for bulk processing.	Use Array of Structs (AoS) for everything.
Alignment	Align structs/arrays to 64B.	Use unaligned addresses for SIMD/Streaming.
Concurrency	Pad atomic counters to 64B.	Let threads fight over the same cache line.
Huge Pages	Use 2MB pages for >100MB arrays.	Rely on 4KB pages for massive working sets.

Conclusion

I hope this overview of Drepper's work helps you write code that the hardware loves. Happy Coding!

What Every Programmer Should Know About Memory Part 3

Hamza Hasanain — Fri, 02 Jan 2026 08:35:00 +0000

Geography Matters: NUMA Support

In the previous article What Every Programmer Should Know About Memory Part 2, we talked about Virtual Memory and how it translates the lies of the OS into physical reality. We covered page tables, the TLB, and how the hardware walks the tree to find your data.

In this article, we continue from where we left off and cover section 5 from the paper What Every Programmer Should Know About Memory by Ulrich Drepper.

Up until now, we've mostly pretended that all RAM is created equal. We assumed that if you have 16GB of RAM, accessing byte 0 is just as fast as accessing byte 15,999,999,999. In the old days of SMP (Symmetric Multi-Processing), this was true. All CPUs connected to a single memory controller via a single bus.

But as core counts exploded, that single bus became a bottleneck. The solution was to split the memory up and give each CPU its own local memory. This created NUMA (Non-Uniform Memory Access).

From SMP to NUMA: Why equality is dead
The Hardware Topology: Nodes and Interconnects
- 2.1. Local vs. Remote Memory
- 2.2. The Interconnect Penalty
OS Policies: The "First Touch" Trap
- 3.1. How Linux Allocates Memory
- 3.2. The Trap: Main Thread Initialization
- 3.3. The "Spillover" Behavior (Zone Reclaim)
Tools of the Trade
- 4.1. Analyzing with lscpu
- 4.2. The Distance Matrix (numactl)
- 4.3. Controlling Policy with numactl
- 4.4. Programming with libnuma
Conclusion

1. UMA vs. NUMA: The Death of Equality

To understand why modern servers behave the way they do, we need to look at the evolution of memory architectures.

1.1 UMA (Uniform Memory Access)

The Old Way: In the days of SMP (Symmetric Multi-Processing), we had a single memory controller and a single system bus. All CPUs connected to this bus.

What: "Uniform" means the cost to access RAM is the same for every core. Accessing address 0x0 takes 100ns for Core 0 and 100ns for Core 1.
Why it failed: The shared bus became a bottleneck. As we added more cores (2, 4, 8...), they all fought for the same bandwidth. It was like having 64 cars trying to use a single lane highway.

1.2 NUMA (Non-Uniform Memory Access)

The New Way: To solve the bottleneck, hardware architects split the memory up.

What: Instead of one giant bank of RAM, we attach a dedicated chunk of RAM to each processor socket. Each Processor + its Local RAM is called a NUMA Node.
How: The nodes are connected by a high-speed interconnect (like Intel UPI or AMD Infinity Fabric). If CPU 0 needs data from CPU 1's memory, it asks CPU 1 to fetch it and ship it over the wire.

This architecture solves the bandwidth problem (multiple highways!) but introduces a new problem: Physics.

2. The Cost of Remote Access

Now that memory is physically distributed, distance matters.

If a CPU on Node 0 needs data located in Node 0's RAM, the path is short and fast.
If a CPU on Node 0 needs data located in Node 1's RAM, the request must travel over the interconnect to Node 1, wait for Node 1's memory controller to fetch it, and ship it back.

2.1 The Latency Penalty

We often measure this cost as a "latency factor."

Local Access: 1.0 (Baseline)
Remote Access: 1.5x - 2.0x Slower

That means every cache miss that hits remote memory is twice as expensive as a local miss. In high-performance computing (HPC) or low-latency trading, this is a disaster.

2.2 Bandwidth Saturation: The Clogged Pipe

It's not just about speed; it's about capacity. The interconnect between sockets has a limited bandwidth.

If you write a program where all threads on all 64 cores are aggressively reading from Node 0's memory, you create a traffic jam. The local cores on Node 0 might get their data fine, but the remote cores on other nodes will see massive stalls as they fight for space on the interconnect.

3. OS Policies: The "First Touch" Trap

So how does the OS decide where to put your memory? If you malloc(1GB), does it go to Node 0 or Node 1?

Linux uses a policy called First-Touch Allocation.

3.1 How Linux Allocates Memory

When you call malloc(1GB), the kernel doesn't actually give you physical RAM. It gives you a promise (Virtual Memory).
The physical RAM is allocated only when you write to that page for the first time. This is called a Page Fault.

At that exact moment, the kernel looks at which CPU triggered the page fault. It says, "Ah, you are running on CPU 5, which belongs to Node 0. I will allocate this physical page from Node 0's RAM to make it fast for you."

This is normally good, but it leads to a deadly trap.

3.2 The Trap: Main Thread Initialization

This policy leads to one of the most common performance bugs in high-performance applications.

The Scenario:

You start your program. The Main Thread (running on Node 0) allocates a huge array and initializes it to zero (memset).
Because the Main Thread touched all the pages, the OS dutifully allocates 100% of the RAM on Node 0.
You spawn 64 worker threads (spread across Node 0, 1, 2, 3) to process the data in parallel.

The Result:

Threads on Node 0 are happy (Local access).
Threads on Node 1, 2, 3 are miserable. They are all being forced to fetch data remotely from Node 0.
The interconnect to Node 0 becomes saturated.
Performance scales poorly, and you wonder why adding more cores made it slower.

The Fix:
Parallel Initialization. Don't let the main thread memset everything. Have your worker threads initialize the specific chunks of data they will be working on. This ensures the physical memory pages are allocated on the local nodes where the workers live.

3.3 The "Spillover" Behavior (Zone Reclaim)

What happens if Node 0 is full? By default, if a thread on Node 0 requests memory and Node 0 is full, Linux will attempt to allocate from Node 1 rather than crashing.

This creates unpredictable latency spikes. Your application runs fast for the first 30 minutes, fills up Node 0, and suddenly slows down by 50% because new allocations are silently spilling over to Node 1. Monitoring numa_miss stats in /sys/devices/system/node/ is the only way to catch this.

4. Tools of the Trade

How do you know if you are running on a NUMA machine?

4.1 Analyzing with `lscpu`

Open your terminal and type lscpu. It reveals the truth about your hardware.

$ lscpu
...
NUMA node(s):          2
NUMA node0 CPU(s):     0-31
NUMA node1 CPU(s):     32-63

NUMA node(s): 2 -> You have 2 distinct memory banks.
NUMA node0 CPU(s): 0-31 -> If you run a thread on Core 5, its local memory is Node 0. If it accesses Node 1, it pays the penalty.

4.2 The Distance Matrix (`numactl`)

To see exactly how "remote" a node is, use numactl --hardware. The "node distances" table at the bottom is key:

node distances:
node   0   1
  0:  10  21
  1:  21  10

10: Represents local access (the baseline cost).
21: Represents the cost to cross the interconnect.

If you saw a value like 30 or 40, that would imply an even longer path (like jumping over two sockets in a 4-socket server).

4.3 Controlling Policy with `numactl`

You can override the default OS behavior using numactl.

Interleaving:
If you have a read-only lookup table that every thread accesses randomly, "First Touch" is bad (it unfairly burdens one node). Instead, you can force the OS to spread the pages round-robin across all nodes.

# Interleave memory allocation across all nodes
numactl --interleave=all ./my_application

Binding:
You can also strict-bind a process to a specific node, ensuring it never inadvertently runs on a remote core or allocates remote memory.

# Run only on Node 0's CPUs, allocate only from Node 0's RAM
numactl --cpunodebind=0 --membind=0 ./my_application

4.4 Programming with `libnuma`

Sometimes you can't control how the user runs your binary. You can enforce memory policy directly in C++ using libnuma:

#include <numa.h>

// Allocate 10MB specifically on Node 0
void* data = numa_alloc_onnode(10 * 1024 * 1024, 0);

// Or run this thread only on Node 0
numa_run_on_node(0);

Note: This requires linking with -lnuma.

5. Conclusion

Ignoring NUMA is ignoring the laws of physics in your server. As programmers, we can't change the hardware, but we can change how we behave on it.

By respecting concepts like First-Touch, understanding the Interconnect Penalty, and pinning our threads appropriately, we can stop fighting the hardware and start working with it.

In the next and final part, we will cover Section 6: What Programmers Can Do. This will be a massive deep dive into cache blocking, data layout (SoA vs AoS), and the infamous False Sharing effect.

What Every Programmer Should Know About Memory Part 2

Hamza Hasanain — Tue, 25 Nov 2025 09:13:52 +0000

Why does your pointer not point where you think it does?.

In the previous article What Every Programmer Should Know About Memory (Part 1), we covered sections 2 and 3 from the article: What Every Programmer Should Know About Memory by Ulrich Drepper. In this article, we will continue from where we left off and cover section 4 (yes, section 4 only).

The previous article explored memory hierarchies from the ground up — how DRAM hardware works, why CPU caches exist, and practical optimization techniques like cache-line awareness and data structure layout. We examined the physical reality behind the "flat array" abstraction and learned why memory access patterns matter for performance.

In this article, we continue with section 4 of Ulrich Drepper's paper, diving deep into Virtual Memory — the translation layer that gives every process its own address space while sharing physical RAM.

Prerequisites: The Basics

0.1. Paging?
0.2. More Concepts

The Illusion of Ownership: Virtual vs. Physical

1.1. The Sandbox: How the MMU makes every process believe it owns the entire RAM
1.2. The Cost of Translation: Why a single virtual address might require 4+ physical memory accesses before you even touch your data

The Page Table Walk: A Tree Structure

2.1. Why We Can't Use Flat Tables: The impossibility of a 4MB directory for every process
2.2. The Multi-Level Solution: Breaking addresses into directories (L4 → L3 → L2 → L1)
2.3. The Hardware Walker: How the processor "walks the tree" to find physical pages

The Accelerator: The TLB (Translation Look-Aside Buffer)

3.1. Caching the Address: The TLB as a tiny, ultra-fast cache specifically for address translations
3.2. TLB Thrashing: A Practical Example
3.3. The Context Switch Penalty: Why switching processes forces a TLB flush (and why it's expensive)

Optimization: Making the TLB Bigger (Without Hardware Changes)

4.1. The Page Size Limit: Why 4KB pages clog up the TLB
4.2. Huge Pages (2MB/1GB): Increasing the range of a single TLB entry to reduce misses

Conclusion: Respecting the Translation Layer

0 Prerequisites: The Basics

Before diving into the details of Virtual Memory, let's define a few key concepts that will help you understand the rest of the article.

0.1 Paging?

A page is a fixed-length contiguous block of virtual memory. In most systems, the default page size is 4KB (4096 bytes), although larger page sizes (like 2MB or 1GB) can also be used for specific applications.

Paging: is a memory management scheme that eliminates the need for contiguous allocation of physical memory. Instead, it divides virtual memory into pages and maps them to physical memory frames, allowing for more efficient use of RAM and enabling features like virtual memory.

Physical Frame: is a fixed-length block of physical memory that corresponds to a page in virtual memory. The operating system maintains a mapping between virtual pages and physical frames, allowing processes to access memory without needing to know the actual physical location of their data.

Frames VS Pages: In the context of paging, a "page" refers to a block of virtual memory, while a "frame" refers to a block of physical memory. The operating system maps virtual pages to physical frames, allowing processes to access memory without needing to know the actual physical location of their data.

0.2 More Concepts

There is much more to virtualization, but for the sake of saving time, we will see only one-line definitions of some important concepts that will help you understand the rest of the article.

Address Space: The range of memory addresses that a process can use. Each process has its own virtual address space, which is mapped to physical memory by the operating system.

Memory Management Unit (MMU): A hardware component that handles the translation of virtual addresses to physical addresses. It works in conjunction with the operating system to manage memory access and enforce protection.

Page Table: A data structure used by the operating system to keep track of the mapping between virtual pages and physical frames. Each process has its own page table.

TLB (Translation Lookaside Buffer): A small, fast cache that stores recent translations of virtual addresses to physical addresses. It helps speed up the address translation process by reducing the number of memory accesses needed.

1 The Illusion of Ownership: Virtual vs. Physical

As you probably understand by now (reading the prerequisite section of course, Paging?), virtual memory creates an illusion for each process that it has its own dedicated physical memory. In reality, the operating system manages the physical memory and allocates it to processes as needed.

Now, we will explore in a bit more detail how this illusion is created and maintained.

1.1 The Sandbox: How the MMU makes every process believe it owns the entire RAM

We know what an MMU is and what it does (see More Concepts), but how does it handle this translation? How does it know which virtual address maps to which physical address?

Let's talk about the Levels Of Translation:

Single-Level Translation: In a simple system, the MMU uses a single-level page table to map virtual addresses to physical addresses. Each entry in the page table corresponds to a virtual page and contains the physical frame number where that page is stored.

Multi-Level Translation: In more complex systems, the MMU uses a multi-level page table to reduce memory overhead. The virtual address is divided into multiple parts, each part indexing into a different level of the page table. This hierarchical structure allows for more efficient use of memory.

1.2 The Cost of Translation: Why a single virtual address might require 4+ physical memory accesses before you even touch your data

Let's discuss the trade-offs between single-level and multi-level page tables.

Single-level tables are simple and fast (one lookup) but waste a massive amount of RAM for the table itself. Multi-level tables save RAM by only allocating what is needed, but they are slower because they require multiple memory lookups to find the address.

The Math of Latency:
Imagine a single memory access takes 100ns. If you have a 4-level page table and a TLB miss, you don't just wait 100ns for your data. You wait:
100ns (L4) + 100ns (L3) + 100ns (L2) + 100ns (L1) + 100ns (Actual Data) = 500ns.
That is a 5x slowdown just for translation!

2 The Page Table Walk: A Tree Structure

The page table walk is the process by which the MMU translates a virtual address to a physical address using the page table. In a multi-level page table, this involves traversing a tree-like structure to find the correct mapping.

Think of it like a Library Index:
If you had a single flat list of every book in the world, it would be impossible to hold. Instead, we use a hierarchy:

L4: Which Floor?
L3: Which Aisle?
L2: Which Shelf?
L1: Which Book?

2.1 Why We Can't Use Flat Tables: The impossibility of a 4MB directory for every process

As we discussed before, using a flat page table for every process would require a massive amount of memory, especially for systems with large address spaces. For example, in a 32-bit system with 4KB pages, a flat page table would require 4MB of memory per process (2^20 entries * 4 bytes per entry). This is impractical for systems with many processes or limited memory resources.

2.2 The Multi-Level Solution: Breaking addresses into directories (L4 → L3 → L2 → L1)

To address the memory overhead issue, multi-level page tables break down the virtual address into multiple parts, each part indexing into a different level of the page table. This hierarchical structure allows the operating system to allocate page table entries only for used virtual pages, significantly reducing memory usage.

2.3 The Hardware Walker: How the processor "walks the tree" to find physical pages

This is the interesting part! Here, we learn how the CPU finds the physical address corresponding to a given virtual address using the multi-level page table structure.

The Hardware Walker is a component of the MMU that is responsible for traversing the multi-level page table to find the physical address corresponding to a given virtual address.

CR3 Register (or TTBR): This special CPU register holds the physical address of the root of the page table (Level 4). When a context switch occurs, the operating system updates this register to point to the page table of the new process.

When a process accesses a virtual address, the hardware walker performs the following steps:

Extract the Indices: The hardware walker extracts the indices for each level of the page table from the virtual address. For example, in a 4-level page table, it would extract indices for L4, L3, L2, and L1.
Traverse the Page Table: Starting from the root of the page table (L4), the hardware walker uses the extracted indices to navigate through each level of the page table. At each level, it reads the corresponding entry to find the address of the next level's page table.
Find the Physical Address: Once the hardware walker reaches the final level (L1), it retrieves the physical frame number from the page table entry. It then combines this frame number with the offset from the original virtual address to compute the final physical address.

3 The Accelerator: The TLB (Translation Look-Aside Buffer)

To avoid the performance hit of walking page tables for every access, processors cache the computed physical addresses in a specialized cache called the TLB.

3.1 Caching the Address: The TLB as a tiny, ultra-fast cache specifically for address translations

The TLB (Translation Lookaside Buffer) is a small, fast cache that stores recent translations of virtual addresses to physical addresses. It is designed to speed up the address translation process by reducing the number of memory accesses needed to translate a virtual address.

When a process accesses a virtual address, the MMU first checks the TLB to see if the translation for that address is already cached. If it is, the MMU can quickly retrieve the corresponding physical address from the TLB (Cache Hit), avoiding the need to walk the page table, if not, it has to walk the page table (Cache Miss).

3.2 TLB Thrashing: A Practical Example

This is where theory meets practice. If you access memory in a pattern that constantly jumps to new pages, you will cause TLB Thrashing. The TLB is small; if you touch too many pages too quickly, you evict useful entries.

Consider iterating over a large 2D array:

// Fast: Row-major access (Sequential)
// We access matrix[0][0], matrix[0][1], matrix[0][2]...
// These are all on the same page. One TLB miss per page (4096 bytes).
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        sum += matrix[i][j];
    }
}

// Slow: Column-major access (Strided)
// We access matrix[0][0], matrix[1][0], matrix[2][0]...
// Each access jumps N * sizeof(int) bytes forward.
// We likely hit a NEW page every single time. High TLB miss rate!
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        sum += matrix[i][j];
    }
}

3.3 The Context Switch Penalty: Why switching processes forces a TLB flush (and why it's expensive)

We did not discuss context switching before, so let's define it first:

Context Switching: is the process of saving the state of a currently running process and loading the state of another process to allow multiple processes to share a single CPU. This involves saving and restoring the CPU registers, program counter, and other process-specific information.

When a context switch occurs, the TLB must be flushed (cleared) because the cached translations in the TLB are specific to the virtual address space of the currently running process. If the TLB were not flushed, the new process could potentially access incorrect physical addresses based on stale TLB entries from the previous process, leading to data corruption or security vulnerabilities.

Flushing the TLB is expensive because it requires the MMU to walk the page tables again for each memory access made by the new process, resulting in increased latency and reduced performance. This is particularly problematic in systems with frequent context switches, as the overhead of flushing the TLB can significantly impact overall system performance.

THE OPTIMIZATION: Modern Processors and operating systems implement various techniques to mitigate the performance impact of TLB flushes during context switches. One common approach is to use Address Space Identifiers (ASIDs) or Process Context Identifiers (PCIDs), which allow the TLB to retain entries for multiple processes simultaneously. This way, when a context switch occurs, the TLB does not need to be completely flushed; instead, only entries associated with the previous process are invalidated, while entries for other processes remain valid. This significantly reduces the overhead of context switches and improves overall system performance.

Note on Threads vs. Processes:
It is important to note that Threads within the same process share the same Page Table (and thus the same TLB entries). Context switching between threads is much cheaper than switching between processes because the TLB does not need to be flushed.

4 Optimization: Making the TLB Bigger (Without Hardware Changes)

To improve TLB performance without changing the hardware, operating systems can use techniques like huge pages to increase the effective size of TLB entries.

4.1 The Page Size Limit: Why 4KB pages clog up the TLB

The default page size of 4KB can lead to TLB misses when a process accesses a large amount of memory, as each TLB entry only covers a small portion of the address space. This can result in frequent TLB misses and increased latency due to page table walks.

4.2 Huge Pages (2MB/1GB): Increasing the range of a single TLB entry to reduce misses

Huge pages are larger memory pages that can be used to reduce the number of TLB entries needed to cover a given address space. By using huge pages (e.g., 2MB or 1GB), a single TLB entry can cover a much larger portion of the address space, reducing the likelihood of TLB misses and improving performance.

The Problem with Huge Pages: While huge pages can improve TLB performance, they also come with some challenges. Allocating large contiguous blocks of physical memory can be difficult, especially in systems with fragmented memory. Additionally, using huge pages can lead to increased memory usage, as smaller pages may be wasted if a process does not fully utilize the allocated huge page (internal fragmentation).

Real World Use Case:
Database engines like PostgreSQL or Oracle often manage buffer pools (cached data) that are dozens of GBs in size. Mapping 64GB of RAM using 4KB pages would require millions of TLB entries, causing constant thrashing. Using Huge Pages makes this manageable and significantly improves database throughput.

5 Conclusion: Respecting the Translation Layer

Virtual memory and the associated translation mechanisms are fundamental to modern computing. Understanding how virtual addresses are translated to physical addresses, the role of the TLB, and optimization techniques like huge pages is crucial for developers aiming to write efficient software.

What Every Programmer Should Know About Memory Part 1

Hamza Hasanain — Fri, 21 Nov 2025 13:18:51 +0000

I recently came across an interesting paper titled
What Every Programmer Should Know About Memory by Ulrich Drepper. The paper dives into the structure of memory subsystems in use on modern commodity hardware,and what programs should do to achieve optimal performance by utilizing them.

What I will be doing is just summarizing what I (as a semi-intelligent being) have learned from reading the paper. I highly recommend reading the paper as the title says, what every programmer should know about memory.

Needless to say, some parts of the paper where quite complex for my brain, I did my best to understand everything, but I might have missed some details. If you find any mistakes, please let me know!

What I Will Cover Here

I just finished reading the first 3 sections of the paper, which cover the following topics:

Basic Architecture of Modern Computers
Main Memory
Caches

This post will be structured around these topics, so here is my table of contents:

1- Introduction: The Lie of the Flat Array

1.1. The O(1) Myth of Pointer Access
1.2. The Latency Numbers (Approximate)

2- The Hardware Reality (RAM Physics)

2.1. SRAM vs. DRAM: Why Main Memory Uses Leaky Capacitors
2.2. The Refresh Tax: Why Execution Stalls for Memory Maintenance
2.3. Address Multiplexing: Sharing Pins to Save Money (and Costing Time)
2.4. The Latency Chain: The Precharge → RAS → CAS Protocol
2.5. Burst Mode: Why We Never Read Just One Byte

3- The Caching Solution (CPU Caches)

3.1. The Hierarchy: L1 (Brain), L2 (Buffer), L3 (Bridge)
3.2. Spatial Locality: How Cache Lines (64 Bytes) Hide Latency
3.3. Associativity: The Parking Lot Problem
3.4. Write Policies: Write-Through, Write-Back, Dirty Bits, Lazy Eviction
3.5. Multi-Core Complexity: MESI Protocol and False Sharing

4- Programmer Takeaways

4.1. Data Placement: Why std::vector Beats std::list
4.2. The Double Indirection Trap
4.3. The Tetris Game: Struct Packing and Alignment
4.4. Spatial Locality: Hot/Cold Data Splitting
4.5. Data Oriented Design: AoS (Array of Struct) vs. SoA (Struct of Arrays)
4.6. Hardware Topology: NUMA & Context Switching

5- Conclusion

Introduction: The Lie of the Flat Array

The Flat Memory Model (also known as the Linear Memory Model) is one of the most fundamental lies operating systems tell programmers. It is an abstraction that presents memory to your program as a single, continuous, contiguous tape of bytes, addressable from 0 to 2^N - 1.

This abstraction is crucial for programming, as it allows us to think of memory in simple terms, using pointers and offsets to access data. However, in reality, memory is far from flat. It is a complex hierarchy of storage types, each with different speeds, sizes, and costs.

You might ask the OS for a 2GB contiguous block for a game engine or database. The Limitation: The OS might not have 2GB of physically contiguous RAM. It might have 2GB free, but scattered in 4KB chunks all over the physical chips.

Reality: The OS uses Virtual Memory Paging to map your flat virtual addresses to scattered physical addresses. The abstraction holds, but the OS has to work hard (using Page Tables and the TLB - Translation Lookaside Buffer) to maintain the illusion. If you access memory too sporadically, you thrash the TLB, causing performance degradation.

This will be discussed in more detail in the next posts (where we will talk about Virtual Memory and NUMA support in more details), but for now, just remember: Memory is not flat. Access times vary wildly depending on where your data resides in the memory hierarchy.

1.1 The O(1) Myth of Pointer Access

When talking about pointer access, most programmers assume that dereferencing a pointer is an O(1) operation - meaning it takes a constant amount of time regardless of the size of the dataset or where the data is located. However, in modern systems, that constant time can vary by a factor of 1,000,000 depending on where the data physically lives.

If your data is in the L1 Cache (closest to the CPU core), the dereference is nearly unnoticeable. If it is in Main RAM, the CPU must stall and wait. If it is swapped out to the Disk, the CPU could execute millions of instructions in the time it takes to fetch that one value.

1.2 The Latency Numbers (Approximate)

To put this in perspective, let's look at the cost in CPU cycles:

Location of Data	Approximate Latency	Simple Analogy
L1 Cache	3−4 cycles	Grabbing a pen from your desk.
L2 Cache	10−12 cycles	Picking a book off a nearby shelf.
L3 Cache	30−70 cycles	Walking to a colleague's desk.
Main RAM	100−300 cycles	Walking to the coffee machine down the hall.
SSD/NVMe	10,000+ cycles	Driving to the supermarket.
HDD (Page Fault)	10,000,000+ cycles	Flying to the moon.

The Hardware Reality (RAM Physics)

If you accidentally came across the terms CPU Caches, RAM, and ever wondered what they are, why they are different, and why not just use one type of memory, this section will give you a basic understanding of how modern memory systems are structured.

CPU Caches are referred to as fast memory because they are built using SRAM (Static RAM) technology, while the Main RAM is built using DRAM (Dynamic RAM) technology. The two have different trade-offs in terms of speed, cost, and density.

2.1 SRAM vs. DRAM: Why Main Memory Uses Leaky Capacitors

Let's start with the SRAM, DRAM, and see how they are built:

SRAM uses a set of transistors to store each bit of data. A typical SRAM cell uses 6 transistors to store a single bit, forming a flip-flop circuit that can hold its state as long as power is supplied. This design allows for very fast access times (on the order of nanoseconds) because the data can be read or written directly without any additional steps.

DRAM, on the other hand, uses a single transistor and a capacitor to store each bit of data. The capacitor holds an electrical charge to represent a 1 and no charge to represent a 0. However, capacitors leak charge over time, so the data must be periodically refreshed (every few milliseconds) to maintain its integrity. This refresh process introduces latency and complexity but allows DRAM to be much denser and cheaper than SRAM.

Check this article if you want to know why real-world capacitors are not perfect insulators

2.2 The Refresh Tax: Why Execution Stalls for Memory Maintenance

As mentioned earlier, The problem with DRAM is that capacitors are imperfect, they leak electrons and lose their charge over time. To prevent data loss, The solution is that the MC (Memory Controller) must read every single row of memory and write it back (recharge it) before the data fades away.

This maintenance operation is called a Refresh Cycle. During a refresh cycle, the memory controller temporarily halts normal memory operations to perform the refresh. This can lead to delays in servicing memory requests from the CPU, causing stalls in execution.

Keep in mind that, it does not halt the entire RAM chip at once, instead it refreshes rows sequentially. However, the refresh operation occupies the Bank's sense amplifiers. Therefore, if the CPU requests data from ANY row within that specific Bank (not just the row being refreshed), it must wait until the Bank becomes available again.

What is a Bank? A Bank is a subdivision of the DRAM chip that can be accessed independently. Each Bank has its own sense amplifiers and can be refreshed or accessed separately from other Banks. This allows for some level of parallelism and reduces the impact of refresh cycles on overall memory access latency.

What is sense amplifiers? Sense amplifiers are specialized circuits within the DRAM that detect and amplify the small voltage changes on the bit lines during read operations. They are crucial for accurately reading the data stored in the capacitors of the DRAM cells.

2.3 Address Multiplexing: Sharing Pins to Save Money (and Costing Time)

Before we dive into the latency chain, why it exists, and how it works, we need to understand why DRAM chips use Address Multiplexing.

Address Multiplexing is a technique used in DRAM to reduce the number of pins required on the chip package. Instead of having separate pins for each bit of the address, the address is sent in two parts: the Row Address and the Column Address. This allows the same set of pins to be reused for both parts of the address, effectively halving the number of pins needed (while simultaneously increasing the time it takes to access data).

Why is this important? The number of pins on a chip package directly impacts its cost and complexity. By using address multiplexing, manufacturers can produce DRAM chips that are more affordable and easier to integrate into systems.

2.4 The Latency Chain: The Precharge → RAS → CAS Protocol

To understand why memory latency exists, you have to stop thinking of RAM as a magic bucket and start thinking of it as a physical Matrix (a grid of rows and columns).

To read a single byte of data, the Memory Controller cannot just say Give me index 400. It has to manipulate the physical grid using a strict three-step protocol. This sequence is determined by the physical construction of the DRAM chip and the need to minimize the number of pins on the chip package.

Imagine a massive warehouse (the DRAM Bank):

The Cells: The data lives in millions of tiny boxes (capacitors) arranged in Rows and Columns.

The Row Buffer (Sense Amps): There is a loading dock (Row Buffer).

The Rule: You cannot read a box while it is on the shelf. You must first move the entire row of boxes to the loading dock.

As we said before, when the CPU asks for a memory address, the controller breaks it down into a Row Address and a Column Address.

Step 1: Precharge: If there is already a row loaded in the Row Buffer, it must be precharged (written back to the shelf) before loading a new row. This step ensures that the Row Buffer is ready for the next operation.

Step 2: RAS (Row Address Strobe): The controller sends the Row Address to the DRAM chip, which activates the corresponding row and loads it into the Row Buffer (sense amplifiers). This step is crucial because it prepares the data for access.

Step 3: CAS (Column Address Strobe): Finally, the controller sends the Column Address to select the specific byte within the loaded row. The data is then read from the Row Buffer and sent back to the CPU.

2.5 Burst Mode: Why We Never Read Just One Byte

We have established that accessing a single byte from DRAM involves a multi-step process that requires a significant amount of taxes (Latency). If we paid that tax every time we wanted a single byte (8 bits), our computers would be extraordinarily slow. So engineers came up with a clever solution called Burst Mode.

Burst Mode allows the memory controller to read or write multiple consecutive bytes in a single operation after the initial access. Instead of fetching just one byte, the controller can fetch a block of data (typically 64 bytes) in one go, that means (for the simplicity) if you have an array arr[0..63], of 32-bit integers, (4-bytes) and you request arr[0], the memory controller will fetch arr[0] to arr[15] in one operation, because they are all located in the same row and can be accessed sequentially.

Why 64 bytes? See section 3.2 for details

The Caching Solution (CPU Caches)

Time for a quick history lesson.

In the early days of computers (1940s-1970s), CPUs and RAM were about the same speed. The CPU could ask for data and get it right away, so there was no waiting. Life was simple, and there was no need for a cache because the CPU wasn't sitting around with nothing to do.

But in the 1980s, things changed. CPUs started getting much, much faster every year, while RAM speed only improved a little. This created a huge speed difference, known as the Memory Wall. The super-fast CPU now had to spend most of its time waiting for the slower RAM to deliver data, like a sports car stuck in a traffic jam.

To solve this problem, engineers invented the cache. A cache is a small, extremely fast memory that lives right next to the CPU. It started with big, expensive computers in the 60s. By 1989, the Intel 486 brought a small L1 cache to personal computers. As the speed gap grew, we added a bigger, slightly slower L2 cache, and later an even bigger L3 cache for multiple CPU cores to share. The idea is to keep the most frequently used data in the fastest memory, so the CPU can keep working instead of waiting.

The next few sections will explain how this cache system works.

3.1 The Hierarchy: L1 (Brain), L2 (Buffer), L3 (Bridge)

First, if you take a look at the image, you notice, there are 2 parts of the L1 cache: D-Cache (Data Cache) and I-Cache (Instruction Cache), but you notice that L2 and L3 caches are unified (store both instructions and data). This is because modern CPUs use a technique called Harvard Architecture for the L1 cache, which separates instructions and data to allow simultaneous access. This improves performance by allowing the CPU to fetch instructions and data in parallel.

The cache hierarchy is designed to balance speed, size, and cost:

L1 Cache: This is the smallest and fastest cache, located directly on the CPU core. It typically ranges from 16KB to 64KB in size and has the lowest latency (around 3-4 cycles). The L1 cache is split into two parts: one for instructions (I-Cache) and one for data (D-Cache). Its primary role is to provide the CPU with the most frequently accessed data and instructions as quickly as possible.
L2 Cache: This cache is larger than L1, typically ranging from 256KB to 1MB, and is still located on the CPU core. It has slightly higher latency (around 10-12 cycles) but can store more data. The L2 cache acts as a buffer between the fast L1 cache and the slower L3 cache or main memory, holding data that is not as frequently accessed as that in L1 but still needs to be retrieved quickly.
L3 Cache: This is the largest and slowest cache in the hierarchy, often ranging from MBs to GBs. It is usually shared among multiple CPU cores and has higher latency (around 30-70 cycles). The L3 cache serves as a bridge between the CPU cores and the main memory, storing data that is less frequently accessed but still benefits from being cached. It only exists due to the terrifying fact that main memory is so slow compared to the CPU.

3.2 Spatial Locality: How Cache Lines (64 Bytes) Hide Latency

As we discussed in the section Burst Mode: Why We Never Read Just One Byte, when the CPU requests data from memory, it doesn't just fetch a single byte. Instead, it fetches a block of data known as a cache line. In modern systems, a cache line is typically 64 bytes in size.

Why 64 bytes? This size is chosen because it aligns well with the cache line size of modern CPUs. By fetching data in blocks that match the cache line size, the system can take advantage of spatial locality, reducing the number of memory accesses required for sequential data access patterns.

Cache Lines: A cache line is the smallest unit of data that can be transferred between the CPU cache and main memory. Modern CPUs typically use a cache line size of 64 bytes. When the CPU requests data from memory, it fetches an entire cache line, even if only a small portion of that data is needed.

spatial locality: Spatial locality refers to the tendency of programs to access data locations that are close to each other within a short time frame. When a program accesses a particular memory address, it is likely to access nearby addresses soon after. By fetching data in blocks (cache lines), the system can take advantage of this behavior, reducing the number of memory accesses and improving overall performance.

3.3 Associativity: The Parking Lot Problem

We are talking about caches, and they are fast and all that, but they are also small, so you need to have a strategy to decide where to put data when it comes into the cache, and where to find it when you need it again.

Let's discuss 3 strategies for organizing data in the cache, known as cache associativity:

Direct-Mapped Cache: In a direct-mapped cache, each block of main memory maps to exactly one location in the cache. This is like having a parking lot where each car has a designated parking spot. If two cars (memory blocks) want to park in the same spot, one has to leave (be evicted). This method is simple and fast but can lead to many conflicts if multiple frequently accessed memory blocks map to the same cache line.
Fully Associative Cache: In a fully associative cache, any block of main memory can be stored in any location in the cache. This is like having a parking lot where cars can park anywhere. This method minimizes conflicts but requires more complex hardware to search the entire cache for a block, which can slow down access times.
Set-Associative Cache: This is a compromise between direct-mapped and fully associative caches. The cache is divided into several sets, and each block of main memory maps to a specific set but can be stored in any location within that set. This is like having a parking lot divided into sections, where cars can park anywhere within their designated section. This method balances the speed of direct-mapped caches with the flexibility of fully associative caches.

3.4 Write Policies: Write-Through, Write-Back, Dirty Bits, Lazy Eviction

We have established that reading from RAM is slow. Writing to RAM is just as slow.

If your program executes a loop that increments a counter i++ one million times, and every single increment forces a write to physical RAM, your CPU will spend 99.9% of its time waiting for the memory bus.

To solve this, hardware engineers created two main policies for handling writes: Write-Through (safe but slow) and Write-Back (complex but fast).

Write-Through: In a write-through cache, every time the CPU writes data to the cache, it also immediately writes that data to the main memory. This ensures that the main memory always has the most up-to-date data, which is important for data integrity. However, this approach can be slow because every write operation incurs the latency of writing to main memory.

Write-Back: In a write-back cache, when the CPU writes data to the cache, it does not immediately write that data to main memory. Instead, it marks the cache line as dirty, indicating that it has been modified. The data is only written back to main memory when the cache line is evicted (replaced) or when certain conditions are met (like a flush operation). This approach improves performance by reducing the number of write operations to main memory, but it introduces complexity in managing dirty cache lines and ensuring data consistency.

The Dirty Bit: The dirty bit is a flag associated with each cache line that indicates whether the data in that cache line has been modified (written to) since it was loaded from main memory. If the dirty bit is set, it means the cache line contains data that is different from what is in main memory, and it must be written back to main memory before being evicted.

The Eviction Policy: When the cache is full and a new block of data needs to be loaded, the cache must evict (remove) an existing block to make room. In a write-back cache, if the evicted block is marked as dirty, the cache must first write the modified data back to main memory before loading the new block. This process is known as lazy eviction because the write-back to main memory is deferred until eviction, rather than occurring immediately on every write.

As guessed, the write-back policy is generally preferred in modern CPUs due to its performance advantages, despite the added complexity of managing dirty cache lines and ensuring data consistency. And This introduces the problem of cache coherence in multi-core systems, which we will discuss next.

3.5 Multi-Core Complexity: MESI Protocol and False Sharing

The problem with Lazy Eviction (Write-Back) in a multi-core system is coherence. If Core 1 has a Dirty version of a variable, and Core 2 tries to read it from RAM, Core 2 will read garbage.

To solve this, hardware engineers implemented a Social Contract between cores. They don't just talk to RAM; they talk to each other. The most common standard for this negotiation is the MESI Protocol.

However, this strict protocol has a nasty side effect called False Sharing.

Under MESI, every Cache Line (that 64-byte chunk) has a 2-bit state tag attached to it. These bits tell the core what rights it has over that data.

Modified (M): The cache line is dirty (modified) and is the only valid copy. Other caches do not have this data.
Exclusive (E): The cache line is clean (not modified) and is the only valid copy. Other caches do not have this data.
Shared (S): The cache line is clean and may be present in other caches. Multiple caches can read this data.
Invalid (I): The cache line is not valid. It may have been modified by another core or is not present in this cache.

Now we know the modes, what they mean, let's now see how the 2 Cores interact when accessing shared data.

The Snooping Bus: All cores are connected to a shared communication channel called the snooping bus. Whenever a core wants to read or write data, it broadcasts its intention on this bus. Other cores listen (snoop) to these broadcasts and respond accordingly to maintain coherence.

Simple example:

Core 1 wants to write to a variable X. It checks its cache and finds that X is in the Shared (S) state. To modify it, Core 1 must first broadcast an Invalidate message on the snooping bus, telling all other cores to mark their copies of X as Invalid (I). Once all other cores acknowledge the invalidation, Core 1 can change the state of X to Modified (M) and proceed with the write.

Note in the example, we Say it checks its cache and finds that X ...... This is not totaly correct, it dones not only find X, it finds the entire cache line that contains X. This is where False Sharing comes into play, that is a performace desaster.

False Sharing occurs when two or more cores modify different variables that happen to reside in the same cache line. Even though the variables are independent, the MESI protocol forces the cores to invalidate each other's cache lines because they share the same cache line. This is known widely as Cache Line Ping-Pong.

Programmer Takeaways

Here is where the real fun begins. Now that we understand how memory works under the hood, let's discuss some practical takeaways for programmers to optimize their code for better performance.

4.1 Data Placement: Why std::vector Beats std::list

This is the Hello World of memory optimization. It teaches the fundamental rule: Linked Lists are cache poison.

Needless to say, prefer std::vector over std::list for performance-critical code unless you have a specific reason to use a linked list (like frequent insertions/deletions in the middle of the list).

Bad Code Example: Using std::list

long long sum_list(const std::list<int>& l) {
    long long sum = 0;
    for (int val : l) sum += val;
    return sum;
}

Good Code Example: Using std::vector

long long sum_vector(const std::vector<int>& v) {
    long long sum = 0;
    for (int val : v) sum += val;
    return sum;
}

4.2 The Double Indirection Trap: `std::vector<std::vector<T>>`

Developers often use std::vector<std::vector<int>> for grids. This is a pointer to an array of pointers.

Why: To access grid[i][j], the CPU must fetch grid -> fetch pointer at grid[i] (cache miss 1) -> fetch data at [j] (cache miss 2). Rows are not contiguous in physical memory.

To solve this, we use a clever trick: flatten the 2D structure into a 1D vector.

Bad Code Example: Using `std::vector<std::vector<T>>`


/*
[Row 0 Data ...] -> 0x1000
[Row 1 Data ...] -> 0x8004
[Row 2 Data ...] -> 0x2000
[Row 3 Data ...] -> 0x4008

grid = [0x1000, 0x8004, 0x2000, 0x4008]
*/


std::vector<std::vector<int>> grid(rows, std::vector<int>(cols));
int value = grid[i][j]; // Double indirection, two cache misses

Good Code Example: Flattening the 2D Structure

inline int idx(int i, int j, int cols) {
    return i * cols + j;
}
inline pair<int, int> to_2d(int index, int cols) {
    return {index / cols, index % cols};
}
// [Row 1 Data... | Row 2 Data... | Row 3 Data...] (Contiguous)
std::vector<int> grid(rows * cols);
int value = grid[idx(i, j, cols)]; // Single access, better cache locality

4.3 The Tetris Game: Struct Packing and Alignment

The compiler aligns data to memory boundaries. If you order your variables poorly, you create holes (padding) in your cache lines.

Why does the compiler add padding? To ensure that data types are aligned to their natural boundaries (e.g., 4-byte integers on 4-byte boundaries). Misaligned accesses can be slower or even cause hardware exceptions on some architectures.

Bad Code Example: Poorly Ordered Struct

struct Bad {
    char a;      // 1 byte
    // 7 bytes padding
    double c;    // 8 bytes
    int b;       // 4 bytes
    // 4 bytes padding
};
// Size: 24 bytes

Good Code Example: Well-Ordered Struct

struct Good {
    double c;    // 8 bytes
    int b;       // 4 bytes
    char a;      // 1 byte
    // 3 bytes padding
};
// Size: 16 bytes (no padding between members)

4.4 Spatial Locality: Hot/Cold Data Splitting

Objects often contain data we check frequently (ID, Health) and data we rarely check (Name, Biography).

Why: If a struct is 200 bytes (mostly text strings), only 3 structs fit in a cache line. Iterating over them fills the cache with Cold text data you aren't reading, flushing out useful data.

What to do: Move rare data to a separate pointer or array.

Bad Code Example: Mixed Hot/Cold Data

struct User {
    int id;              // HOT
    int balance;         // HOT
    char username[128];  // COLD (Pollutes cache)
    char bio[256];       // COLD (Pollutes cache)
};

Good Code Example: Split Hot/Cold Data

struct UserHot {
    int id;
    int balance;
    UserCold* coldData; // Pointer to cold data
};
struct UserCold {
    char username[128];
    char bio[256];
};

4.5 Data Oriented Design: AoS (Array of Struct) vs. SoA (Struct of Arrays)

Imagine you are building a game with thousands of entities, each with position and color.

How do you store them?

4.5.1 Array of Structs (AoS)

struct Entity {
    float x, y, z;      // Position
    float r, g, b;      // Color
};

The problem with such design is that when you want to update positions, you load entire cache lines with color data you don't need.

So here comes the alternative:

4.5.2 Struct of Arrays (SoA)

struct Entities {
    std::vector<float> x,y,z;
    std::vector<float> r,g,b;
};

Essentially, you separate data by usage patterns. When updating positions, you only load position arrays into the cache, maximizing cache utilization and minimizing cache misses.

4.6 Hardware Topology: NUMA & Context Switching

This will be covered in more details in the next post.

Conclusion

Understanding how memory works at a low level is crucial for writing high-performance software. By leveraging knowledge about caches, memory hierarchies, and data locality, programmers can make informed decisions that lead to significant performance improvements.

In this post, we covered the basics of modern memory systems, including the differences between SRAM and DRAM, the structure of CPU caches, and practical programming techniques to optimize memory access patterns. In the upcoming parts of this series, we will dive deeper into virtual memory, NUMA architectures, and advanced optimization strategies.

I built a Backend web Framework from Scratch in C++

Hamza Hasanain — Sat, 30 Aug 2025 21:10:32 +0000

wouldn’t go as far as calling it a framework — it’s more of a library

I’ve been exploring some backend web frameworks lately and kept asking myself: what do these things actually do under the hood?

To find out, I decided to dive into C++ and experiment. After some tinkering, I built a small homegrown backend web library, split into three layers:

Socket Library – Handles raw communication between processes.
HTTP Server – Parses HTTP requests, manages headers and bodies, and handles TCP streams.
Web Library – Provides a simple framework for routing, controllers, and serving static files, similar to Express.js.

Each layer is built on top of the one beneath it, so understanding the foundation is crucial.

Before we dive into the layers, you can check out the GitHub repos:

Socket Library: github.com/HamzaHassanain/hamza-socket-lib
HTTP Server: github.com/HamzaHassanain/hamza-http-server-lib
Web Library: github.com/HamzaHassanain/hamza-backend-web-library-cpp
Example Blog App: github.com/HamzaHassanain/simple-blog-from-scratch

Please note that this is just my understanding of how things work, how I implemented the stuff

Also note that this project isn’t fully production-ready, but it’s an excellent exercise in understanding backend frameworks from the ground up.

Understanding Sockets: How Processes Talk

At the core of networking on Unix-like systems are file descriptors (FDs) — small integers a process uses to refer to kernel-managed resources (files, pipes, or sockets). When you call something like fflush(stdout) you’re asking your program’s runtime to push buffered bytes down to the FD that represents stdout; what happens to those bytes next depends on what that FD is connected to (a terminal, a file, or a socket).

A socket is one of those kernel-managed resources: it’s a kernel object that your process creates with socket(...) and then uses to send and receive network data. You can think of a socket as an endpoint inside your program; the socket itself is represented by an FD in your process. To tell the kernel where packets should go (or where they came from), a socket is usually bound to a network address, which is commonly expressed as three parts:

Address family — how to interpret addresses (IPv4 or IPv6, e.g. AF_INET or AF_INET6).
IP address — which host/machine on the network you mean (e.g. 127.0.0.1).
Port — which particular service or process on that host should receive the traffic (e.g. 80, 8080).

Note: Only ports 0–1023 are reserved for well-known services like HTTP (80) or SSH (22). Ports above that are available for general use.

Socket types

Two socket types are most relevant when writing networked servers:

1) Datagram sockets (UDP)

UDP sockets are connectionless: you can call sendto() with any destination address (IP+port) and the kernel will attempt to deliver that single datagram.
Each recvfrom() or recvmsg() call returns exactly one datagram (so message boundaries are preserved).
There is no handshake, and the network does not guarantee delivery, ordering, or uniqueness — datagrams can be lost, duplicated, or arrive out of order.
It’s common to bind a UDP socket to a port and serve many different remote peers on that single FD; the kernel provides the sender’s address on each receive so you can reply.

2) Stream sockets (TCP)

TCP sockets are connection-oriented: the client and server perform a 3-way handshake to establish a connection.
After the handshake the kernel exposes a reliable, ordered byte stream to your process. TCP ensures bytes are delivered and in order (barring extreme failures), but it does not preserve packet/message boundaries; if you send two write() calls on the sender, the receiver may receive them merged or split across read() calls.
For servers you bind() and listen() on a port. accept() returns a brand-new FD representing the established connection; the listening FD continues to accept more connections. Each client connection has its own kernel socket object and FD.

Notes on scale & semantics

For UDP you can call connect() on the socket to set a default peer (useful to avoid passing addr to every sendto()), but connect() on a UDP socket only sets the default destination — the underlying semantics remain datagram and connectionless.
For TCP, accept() and the new FD are what you use to read()/write() that client's data; the listening socket never carries per-client data.
Remember: “ordered bytes” (TCP) ≠ “preserved messages” — if you need discrete messages on top of TCP, implement framing (length prefix, delimiters, etc.).

Creating a Simple UDP Socket

// Creating a simple UDP socket with address in C-style (On Unix)

int sockfd = socket(AF_INET, SOCK_DGRAM, 0);
if (sockfd < 0) {
    // Handle error
}

struct sockaddr_in addr;
memset(&addr, 0, sizeof(addr));
addr.sin_family = AF_INET;
addr.sin_port = htons(8080);
inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

if (bind(sockfd, (struct sockaddr*)&addr, sizeof(addr)) < 0) {
    // Handle error
}

// Use the socket...

// Cleanup
close(sockfd);
return 0;

Creating a Simple UDP Socket (using my socket library)

// Creating a simple UDP socket with address Using My library (same logic as above but wrapped in my library)

using namespace hh_socket;

socket_address addr( port(8080) ,  ip_address("127.0.0.1") , family(IPV4) );

socket sock(addr , Protocol::UDP);

// Cleanup is handled by destructor

Creating a Simple TCP Server (using my socket library)

// Creating a simple TCP server, that echoes back messages

using namespace hh_socket;

socket_address addr( port(8080) ,  ip_address("0.0.0.0") , family(IPV4) );

socket sock(addr , Protocol::TCP);

sock.listen();

while (true) {
    std::shared_ptr<connection> conn = sock.accept(); // blocking call

    data_buffer buf = conn->receive(); // blocking call
    data_buffer echo_message ("Echo: ");

    echo_message.append(buf);

    conn->send(echo_message); // echo back
    conn->close();
}

// Cleanup is handled by destructor

Handling Blocking Operations

Blocking means an I/O call (like read() or accept()) makes your program wait until the operation completes.

When a socket call blocks, the current thread simply sits idle until the OS has data to return or the requested action completes. For servers that handle many clients, blocking on a single thread quickly becomes a bottleneck. To handle many connections efficiently, you can use:

1. Multithreading

Idea: handle each connection in its own thread or use a pool of worker threads.
Pros: simple mental model — each handler can use blocking calls; easy to write.
Cons: high memory/context-switch cost for many connections; synchronization complexity.

2. I/O multiplexing

Idea: a single thread (or a few threads) waits on many file descriptors and reacts when any become ready. Tools: epoll (Linux), IOCP (Windows), kqueue (macOS). There is also select (Windows/Unix) it is less efficient for large numbers of connections.
Pros: low thread overhead; great for many concurrent connections.
Cons: more complex control flow; must handle partial reads/writes.

3. Async I/O

Idea: submit read/write requests to the kernel and receive completion events later (no thread is blocked waiting). io_uring on modern Linux is a powerful example.
Pros: excellent throughput and low latency; scales well.
Cons: API is more advanced; portability issues.

For my project, I used I/O multiplexing, allowing a single-threaded event loop to handle hundreds of connections efficiently.

// a simple epoll server (using my epoll_server class) for handling multiple sockets.

class chat_server : public epoll_server {
    std::unordered_map<int, std::string> usernames;

protected:
    void on_connection_opened(std::shared_ptr<connection> conn) override {
        send_message(conn, data_buffer("Enter username: "));
    }

    void on_message_received(std::shared_ptr<connection> conn, const data_buffer &db) override {
        std::string msg = db.to_string();

        if (usernames.find(conn->get_fd()) == usernames.end()) {
            // First message is username
            usernames[conn->get_fd()] = msg;
            broadcast(usernames[conn->get_fd()] + " joined the chat");
        } else {
            // Regular chat message
            broadcast(usernames[conn->get_fd()] + ": " + msg);
        }
    }

    void on_connection_closed(std::shared_ptr<connection> conn) override {
        auto it = usernames.find(conn->get_fd());
        if (it != usernames.end()) {
            broadcast(it->second + " left the chat");
            usernames.erase(it);
        }
    }

private:
    void broadcast(const std::string &message) {
        data_buffer msg(message);
        for (const auto &[fd, conn_state] : conns) {
            send_message(conn_state.conn, msg);
        }
    }
};

Building the HTTP Server

TCP streams are just sequences of bytes. An HTTP request might be fragmented across multiple TCP packets, so the server must:

Reassemble the byte stream.
Extract request headers.
Parse the body (if present).
Handle limits (max body size, header size).

The HTTP server integrates tightly with the socket library, that is, it extends the hh_socket::epoll_server functionality by reusing its efficient connection handling and abstraction. This shows how layering simplifies complexity: the HTTP server focuses on protocol logic, while sockets manage the low-level networking.

Note that my implementation is not fully compliant with the HTTP specification, it just provides a basic framework for handling HTTP requests.

High-level (use the project's parser):

// Example: receive a buffer from a connection and let the project's parser assemble
// requests that may span multiple TCP reads.
hh_socket::data_buffer buf = conn->receive();
hh_http::http_message_handler parser;

// `handle` returns an http_handled_data describing either a complete request
// or that more bytes are required (completed == false).
auto result = parser.handle(conn, buf);
if (!result.completed) {
    // Not enough data yet - wait for the next read and call parser.handle again
} else if (result.method.rfind("BAD_", 0) == 0) {
    // Parser returned an error token (e.g. BAD_METHOD_OR_URI_OR_VERSION)
    // Application can craft an error response here
} else {
    // Complete request: use result.method, result.uri, result.headers, result.body
}

Low-level (manual reassembly sketch):

// Read bytes into a string buffer until we detect the header terminator \r\n\r\n
std::string buffer = conn->receive().to_string();
size_t headers_end = buffer.find("\r\n\r\n");
if (headers_end != std::string::npos) {
    std::string header_block = buffer.substr(0, headers_end);
    std::string rest = buffer.substr(headers_end + 4); // body (maybe partial)

    // parse request-line and headers (split on \r\n, then on ':')
    // find Content-Length (if present) to determine expected body size
    size_t content_length = 0; // parse from header_block if present

    // Keep reading until we have the full body
    while (rest.size() < content_length) {
        rest += conn->receive().to_string();
    }

    std::string body = rest.substr(0, content_length);
    // Now header_block contains headers and `body` contains the full payload
}

HTTP Server example (using my http_server class)

#include "http_server.hpp"

int main() {
    using namespace hh_http;

    // this automatically sets up the server to listen for incoming connections
    // also this handles big request body, and also, you can normally send a big response,
    // as the epoll server itself handles sending such big chunks of data
    http_server server(8081, "0.0.0.0");

    // Set up all the callbacks
    server.set_request_callback([]( http_request &req , http_response &res ) {
        // Handle incoming HTTP requests
        res.set_body("Hello, World!\n");
        res.set_status(200, "OK");
        res.send(); // send the headers, and the body to the client
        res.end();
    });
    server.set_error_callback([](const std::exception &e) {
        std::cerr << "Error: " << e.what() << std::endl;
    });

    // Start the server (this will block)
    server.listen();
}

The Web Library Layer

The top layer provides routing, controllers, and static file serving, similar to Express.js. Key features:

MVC-like architecture – Organizes code into controllers, views, and models for better maintainability.
Routing system – Maps incoming HTTP requests to controller actions.
Static file serving – Delivers HTML, CSS, and JavaScript assets alongside dynamic content.

Simple Example server:

#include "web-lib.hpp"

// Define request handlers (controller actions)
hh_web::exit_code get_users_handler(
    std::shared_ptr<hh_web::web_request> req,
    std::shared_ptr<hh_web::web_response> res) {

    res->send_json("{\"users\": [\"Alice\", \"Bob\"]}");
    return hh_web::exit_code::EXIT;
}

hh_web::exit_code create_user_handler(
    std::shared_ptr<hh_web::web_request> req,
    std::shared_ptr<hh_web::web_response> res) {

    // Extract user data from request body
    std::string body = req->get_body();
    res->send_json("{\"success\": true, \"message\": \"User created\"}");
    return hh_web::exit_code::EXIT;
}

int main() {

    auto server = std::make_shared<hh_web::web_server<>>(8080, "0.0.0.0");

    auto router = std::make_shared<hh_web::web_router<>>();

    // Map routes to controller actions
    using V = std::vector<hh_web::web_request_handler_t<>>;

    router->get("/api/users", V{get_users_handler});
    router->post("/api/users", V{create_user_handler});

    // It auto-detects the static files (based on the extention, then sends it)
    server->use_static("static");

    // Route with path parameters
    router->get("/api/users/:id", V{[](auto req, auto res) -> hh_web::exit_code {
        auto params = req->get_path_params();
        std::string user_id = params.at("id");
        res->send_json("{\"user_id\": \"" + user_id + "\"}");
        return hh_web::exit_code::EXIT;
    }});

    server->use_router(router);
    server->listen();
}

Why This Structure Matters

Here’s what this layered design teaches:

Sockets handle raw communication and events efficiently.
HTTP server interprets protocol-level data reliably from TCP streams.
Web library allows developers to structure their application cleanly and add features without worrying about low-level details.

Even after a short time learning backend programming, this project clarified:

Networking can be simplified to basic operations and abstractions.
TCP streams need careful parsing, not just reading packets.
Layering responsibilities makes large systems manageable and testable.

DEV Community: Hamza Hasanain

The Lie of Time Management: Why Your Energy is the Real Bottleneck

The Procrastination Myth

Silent Energy Drains for Developers

1. Context Switching and The "Quick Check"

2. The Morning Coffee Trap

3. Mental Load and Analysis Paralysis

How to Protect Your Currency

AI's Gateway Drug to Engineering

1) Writing code ≠ software engineering.

2) Dreaming about context engineering.

3) The multi-million dollar security trap.

4) "Learning as you go"

What Every Programmer Should Know About Memory Part 4

What Programmers Can Do: Writing Hardware-Sympathetic Code

Table of Contents

Subsection A: Cache Optimization

1.1 Data Placement: std::vector Beats std::list

Bad Code Example: Using std::list

Good Code Example: Using std::vector

1.2 The Double Indirection Trap: std::vector<std::vector<T>>

Bad Code Example: Using std::vector<std::vector<T>>

Good Code Example: Flattening the 2D Structure

1.3 Bypassing the Cache (Non-Temporal Stores)

The Constraint: Memory Alignment

1.4 Access Patterns & Blocking (Tiling)

Subsection B: The Virtual Memory (TLB)

2.1 The High Cost of Translation

2.2 The Solution: Huge Pages

Subsection C: Data & Code Layout

3.1 The Tetris Game: Struct Packing

Bad Code Example: Poorly Ordered Struct

Good Code Example: Well-Ordered Struct

3.2 Hot/Cold Data Splitting

Bad Code Example: Mixed Hot/Cold Data

Good Code Example: Split Hot/Cold Data

3.3 Struct of Arrays (SoA) vs Array of Structs (AoS)

3.4 Alignment Matters

3.5 Instruction Cache & Branch Prediction

Subsection D: Concurrency & NUMA

4.1 The Silent Killer: False Sharing

4.2 Thread Affinity (Pinning)

Subsection E: Prefetching

5.1 Helping the Hardware

Tools for Performance Engineers

Quick Cheat Sheet

Conclusion

What Every Programmer Should Know About Memory Part 3

Geography Matters: NUMA Support

Table of Contents

1. UMA vs. NUMA: The Death of Equality

1.1 UMA (Uniform Memory Access)

1.2 NUMA (Non-Uniform Memory Access)

2. The Cost of Remote Access

2.1 The Latency Penalty

2.2 Bandwidth Saturation: The Clogged Pipe

3. OS Policies: The "First Touch" Trap

3.1 How Linux Allocates Memory

3.2 The Trap: Main Thread Initialization

3.3 The "Spillover" Behavior (Zone Reclaim)

4. Tools of the Trade

4.1 Analyzing with lscpu

4.2 The Distance Matrix (numactl)

4.3 Controlling Policy with numactl

4.4 Programming with libnuma

5. Conclusion

What Every Programmer Should Know About Memory Part 2

Why does your pointer not point where you think it does?.

Table of Contents

0 Prerequisites: The Basics

0.1 Paging?

0.2 More Concepts

1 The Illusion of Ownership: Virtual vs. Physical

1.1 The Sandbox: How the MMU makes every process believe it owns the entire RAM

1.2 The Cost of Translation: Why a single virtual address might require 4+ physical memory accesses before you even touch your data

2 The Page Table Walk: A Tree Structure

2.1 Why We Can't Use Flat Tables: The impossibility of a 4MB directory for every process

2.2 The Multi-Level Solution: Breaking addresses into directories (L4 → L3 → L2 → L1)

2.3 The Hardware Walker: How the processor "walks the tree" to find physical pages

3 The Accelerator: The TLB (Translation Look-Aside Buffer)

1.2 The Double Indirection Trap: `std::vector<std::vector<T>>`

Bad Code Example: Using `std::vector<std::vector<T>>`

4.1 Analyzing with `lscpu`

4.2 The Distance Matrix (`numactl`)

4.3 Controlling Policy with `numactl`

4.4 Programming with `libnuma`

4.2 The Double Indirection Trap: `std::vector<std::vector<T>>`

Bad Code Example: Using `std::vector<std::vector<T>>`