<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Subham</title>
    <description>The latest articles on DEV Community by Subham (@shadowsaurus).</description>
    <link>https://dev.to/shadowsaurus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3824116%2F6c985d70-fea7-49d7-b382-607397a0ecc2.png</url>
      <title>DEV Community: Subham</title>
      <link>https://dev.to/shadowsaurus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shadowsaurus"/>
    <language>en</language>
    <item>
      <title>Dissecting MongoDB — It Uses B-Trees Too, So Why Are Writes Actually Faster?</title>
      <dc:creator>Subham</dc:creator>
      <pubDate>Tue, 24 Mar 2026 15:15:38 +0000</pubDate>
      <link>https://dev.to/shadowsaurus/dissecting-mongodb-it-uses-b-trees-too-so-why-are-writes-actually-faster-17kh</link>
      <guid>https://dev.to/shadowsaurus/dissecting-mongodb-it-uses-b-trees-too-so-why-are-writes-actually-faster-17kh</guid>
      <description>&lt;p&gt;Last time we dissected Postgres and found that its write slowness isn't a bug — it's the direct consequence of MVCC, B-Tree indexes, and heap storage all working together to make reads fast.&lt;/p&gt;

&lt;p&gt;MongoDB is often pitched as the alternative for flexible, write-heavy workloads. But &lt;em&gt;why&lt;/em&gt; is it better at writes? What does "flexible schema" actually mean on disk? And how does sharding actually work when your data grows beyond a single machine?&lt;/p&gt;

&lt;p&gt;Let's open it up.&lt;/p&gt;




&lt;h2&gt;
  
  
  First — When Does MongoDB Actually Make Sense?
&lt;/h2&gt;

&lt;p&gt;Before internals, the scenario. Because choosing a database without understanding your workload is how you end up rewriting everything in 18 months.&lt;/p&gt;

&lt;p&gt;Consider a product catalog — something like Flipkart or Amazon. You have phones, t-shirts, furniture, and groceries. Each category has completely different attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Mobile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;phone&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prod_123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"iPhone 15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"electronics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"specs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ram"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8GB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"storage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"256GB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"battery"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3877mAh"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"variants"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"black"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;79999&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stock"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"white"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;79999&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stock"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;T-shirt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;—&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;completely&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;different&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;shape&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prod_456"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Cotton T-Shirt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"clothing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"specs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"material"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100% cotton"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"slim"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"variants"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"S"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"red"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;599&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stock"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Postgres, you'd either create 50 nullable columns, use EAV (Entity-Attribute-Value — a known nightmare), or store a JSONB blob. The last option works, but then you're essentially building MongoDB inside Postgres.&lt;/p&gt;

&lt;p&gt;MongoDB wins here because the document model is native — no schema migration when you add a new product category, no ALTER TABLE on a 50M row table at 2 AM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where MongoDB loses:&lt;/strong&gt; Anything needing strong ACID across multiple documents — orders, payments, financial ledgers. Multi-document transactions exist in MongoDB (since v4.0), but they're slower and less battle-tested than Postgres transactions. Use the right tool.&lt;/p&gt;
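
&lt;p&gt;For scale, here's roughly what a multi-document transaction looks like with the Node.js driver. The collection names are made up for illustration; the shape of the API is the point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Minimal sketch: both writes commit together or not at all.
// "orders" and "inventory" are hypothetical collections.
const session = client.startSession();
try {
  await session.withTransaction(async function () {
    await db.collection('orders').insertOne(
      { userId: 42, total: 799 }, { session }
    );
    await db.collection('inventory').updateOne(
      { sku: 'tshirt-red-s' }, { $inc: { stock: -1 } }, { session }
    );
  });
} finally {
  await session.endSession();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;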




&lt;h2&gt;
  
  
  The Storage Engine — WiredTiger
&lt;/h2&gt;

&lt;p&gt;Before version 3.2, MongoDB's default storage engine was MMAPv1 — memory-mapped files, directly. Simple, but it had brutal limitations: collection-level locking (one write blocks all reads on that collection), no compression, poor write performance.&lt;/p&gt;

&lt;p&gt;In 2014, MongoDB acquired WiredTiger. Everything changed.&lt;/p&gt;

&lt;h3&gt;
  
  
  What WiredTiger Actually Is
&lt;/h3&gt;

&lt;p&gt;Here's the thing most articles get wrong — &lt;strong&gt;WiredTiger stores data in B-Trees on disk&lt;/strong&gt;, not LSM trees. The files look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/var/lib/mongodb/
├── collection-0-1234567890.wt    ← your users collection, B-Tree file
├── collection-2-1234567890.wt    ← another collection
├── index-1-1234567890.wt         ← index file
├── WiredTigerLog.0000000001      ← journal (WAL equivalent)
└── WiredTiger.wt                 ← metadata
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So if both Postgres and WiredTiger use B-Trees on disk, why are MongoDB writes faster?&lt;/p&gt;

&lt;p&gt;Because &lt;strong&gt;WiredTiger uses an aggressive in-memory buffer with in-place updates and automatic MVCC cleanup&lt;/strong&gt; — the combination that Postgres doesn't have.&lt;/p&gt;

&lt;p&gt;Three specific differences:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. MVCC location.&lt;/strong&gt; Postgres keeps dead tuple versions on disk pages until VACUUM cleans them. WiredTiger keeps old versions in RAM only, for the duration of active transactions. Once the transactions that still need an old version finish, it is dropped from memory automatically. No VACUUM needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Compression.&lt;/strong&gt; WiredTiger compresses every page (Snappy by default, Zstd optionally). Postgres has no page-level compression by default. Less data on disk = less I/O.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Cache size.&lt;/strong&gt; WiredTiger's cache defaults to the larger of 50% of (RAM minus 1GB) or 256MB. Postgres &lt;code&gt;shared_buffers&lt;/code&gt; defaults to 128MB — criminally small for production. More cache = more cache hits = less disk I/O.&lt;/p&gt;
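
&lt;p&gt;If you want to see the cache at work, a rough sketch from the mongo shell (the stat names below are WiredTiger's own and can shift slightly between versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Inspect WiredTiger cache usage via serverStatus.
const cache = db.serverStatus().wiredTiger.cache;
print("configured:", cache["maximum bytes configured"] / 1024 / 1024, "MB");
print("in use:    ", cache["bytes currently in the cache"] / 1024 / 1024, "MB");
print("dirty:     ", cache["tracked dirty bytes in the cache"] / 1024 / 1024, "MB");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;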

&lt;h3&gt;
  
  
  What a Document Looks Like on Disk
&lt;/h3&gt;

&lt;p&gt;MongoDB stores documents in BSON — Binary JSON. Compact, typed, fast to parse.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ name: "Bob", age: 25 }

→ BSON bytes:
\x1c\x00\x00\x00        (document size: 28 bytes)
\x02                    (type: string)
name\x00                (field name)
\x04\x00\x00\x00Bob\x00 (string value)
\x10                    (type: int32)
age\x00                 (field name)
\x19\x00\x00\x00        (value: 25)
\x00                    (document end)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each document has an &lt;code&gt;_id&lt;/code&gt; field — ObjectId by default (12-byte unique identifier). This becomes the primary key and the default index.&lt;/p&gt;
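
&lt;p&gt;A quick way to see this from the shell (values will differ on your machine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// ObjectId packs a 4-byte timestamp, 5 random bytes, and a 3-byte counter.
const id = ObjectId();
id.getTimestamp();   // creation time, recovered from the first 4 bytes

// Because _id encodes time, sorting by it gives rough insertion order.
db.users.find().sort({ _id: -1 }).limit(5);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;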




&lt;h2&gt;
  
  
  The Write Path — From Your Code to Durable Storage
&lt;/h2&gt;

&lt;p&gt;When you run &lt;code&gt;db.users.insertOne({name: "Bob"})&lt;/code&gt;, here's everything that actually happens:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — Wire Protocol
&lt;/h3&gt;

&lt;p&gt;Your driver serializes the document to BSON and sends it over TCP port 27017 using MongoDB's binary wire protocol (OP_MSG). Along with the document, it sends a &lt;strong&gt;Write Concern&lt;/strong&gt; — this single parameter controls the entire durability guarantee.&lt;/p&gt;

&lt;p&gt;More on Write Concern at the end of this section. First, what happens on the server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2 — WiredTiger Cache (RAM)
&lt;/h3&gt;

&lt;p&gt;The write lands in WiredTiger's in-memory cache first — no disk I/O yet. The relevant B-Tree page is found (or created), the document is written in-place on that page, and the page is marked &lt;strong&gt;dirty&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If the document grew in size (new fields added) and doesn't fit on its current page, WiredTiger moves it to a new page — a B-Tree page split. This is the "document padding" strategy: WiredTiger leaves extra space after documents to accommodate small in-place updates without splits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3 — Journal Write (Crash Safety)
&lt;/h3&gt;

&lt;p&gt;Simultaneously, the change is written to the &lt;strong&gt;journal&lt;/strong&gt; — WiredTiger's WAL equivalent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/var/lib/mongodb/journal/WiredTigerLog.0000000001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sequential append. Fast. Every 100ms, the journal is flushed to disk (configurable via &lt;code&gt;storage.journal.commitIntervalMs&lt;/code&gt;). This means in the default configuration, a crash can lose up to 100ms of writes.&lt;/p&gt;

&lt;p&gt;If you set &lt;code&gt;j: true&lt;/code&gt; in Write Concern, MongoDB waits for the journal flush before acknowledging — safer, slower.&lt;/p&gt;
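
&lt;p&gt;Setting it per operation looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Wait for the journal flush on the primary before acknowledging.
db.users.insertOne(
  { name: "Bob" },
  { writeConcern: { w: 1, j: true } }
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;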

&lt;h3&gt;
  
  
  Step 4 — Oplog Entry (Replication)
&lt;/h3&gt;

&lt;p&gt;If you're running a replica set (and in production you always are), the write is also recorded in the &lt;strong&gt;oplog&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// local.oplog.rs — a special capped collection&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;op&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;i&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                    &lt;span class="c1"&gt;// i=insert, u=update, d=delete&lt;/span&gt;
  &lt;span class="nx"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;myapp.users&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;// namespace&lt;/span&gt;
  &lt;span class="nx"&gt;o&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;// the document&lt;/span&gt;
  &lt;span class="nx"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1234567890&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// logical clock&lt;/span&gt;
  &lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;                        &lt;span class="c1"&gt;// oplog version&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Secondaries tail this collection continuously — like &lt;code&gt;tail -f&lt;/code&gt; on a log file — and apply each entry to their own WiredTiger instance. This is the entire replication mechanism.&lt;/p&gt;
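
&lt;p&gt;You can peek at the oplog yourself. A minimal sketch, run from the shell against any replica set member:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Most recent oplog entry. $natural order = insertion order in a capped collection.
db.getSiblingDB("local").oplog.rs
  .find()
  .sort({ $natural: -1 })
  .limit(1)
  .pretty();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;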

&lt;h3&gt;
  
  
  Step 5 — Checkpoint (Durable to Disk)
&lt;/h3&gt;

&lt;p&gt;Every 60 seconds (or, in older versions, after roughly 2GB of journal data), WiredTiger runs a &lt;strong&gt;checkpoint&lt;/strong&gt; — all dirty pages in the cache are flushed to their respective &lt;code&gt;.wt&lt;/code&gt; B-Tree files on disk. After a checkpoint, the journal up to that point can be safely discarded.&lt;/p&gt;

&lt;p&gt;The full write path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;insertOne({name: "Bob"})
    │
    ▼
BSON serialize → TCP (port 27017)
    │
    ▼
WiredTiger cache → dirty page marked
    │
    ├──► Journal append (sequential, every 100ms flush)
    │
    └──► Oplog entry (local.oplog.rs) → secondaries tail this
              │
              ▼  every 60s
         Checkpoint → .wt files on disk (durable)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Write Concern — The Durability Dial
&lt;/h3&gt;

&lt;p&gt;This is the most important knob in MongoDB writes. Most developers leave it at the default and don't think about it — which is fine until something breaks.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Write Concern&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;w: 0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fire and forget — no ack&lt;/td&gt;
&lt;td&gt;Complete data loss possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;w: 1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Primary RAM — default&lt;/td&gt;
&lt;td&gt;100ms window: crash → data loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;w: 1, j: true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Primary journal flushed&lt;/td&gt;
&lt;td&gt;Single node crash safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;w: majority&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Majority nodes RAM&lt;/td&gt;
&lt;td&gt;Network partition safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;w: majority, j: true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Majority nodes journal&lt;/td&gt;
&lt;td&gt;Safest. Use for financial data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The common mistake: running production with &lt;code&gt;w: 1&lt;/code&gt; (default) and assuming the data is safe. Kill the primary 50ms after a write — that write is gone. For anything that matters, use &lt;code&gt;w: majority&lt;/code&gt;.&lt;/p&gt;
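
&lt;p&gt;If you're on MongoDB 4.4 or newer, you can set a cluster-wide default once instead of repeating it on every call. A sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Make w: "majority" the default write concern for the whole deployment.
db.adminCommand({
  setDefaultRWConcern: 1,
  defaultWriteConcern: { w: "majority" }
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;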




&lt;h2&gt;
  
  
  Replication — How Replica Sets Actually Work
&lt;/h2&gt;

&lt;p&gt;MongoDB's replication unit is the &lt;strong&gt;Replica Set&lt;/strong&gt; — a group of mongod processes that all hold the same data.&lt;/p&gt;

&lt;p&gt;Minimum setup: &lt;strong&gt;1 Primary + 2 Secondaries&lt;/strong&gt;. Why three? Because elections require a majority — with only two nodes, if one goes down, the survivor alone isn't a majority, so writes stop.&lt;/p&gt;
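
&lt;p&gt;Bootstrapping that minimum setup from the shell looks roughly like this (hostnames are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Run once against the member that should start as primary.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1:27017" },
    { _id: 1, host: "mongo2:27017" },
    { _id: 2, host: "mongo3:27017" }
  ]
});

rs.status();   // verify one PRIMARY and two SECONDARY members
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;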

&lt;h3&gt;
  
  
  The Oplog Tailing Mechanism
&lt;/h3&gt;

&lt;p&gt;Secondary nodes maintain a long-polling cursor on the primary's &lt;code&gt;local.oplog.rs&lt;/code&gt;. Every new oplog entry triggers the secondary to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the entry&lt;/li&gt;
&lt;li&gt;Apply the operation to its own WiredTiger instance&lt;/li&gt;
&lt;li&gt;Update its &lt;code&gt;lastAppliedTimestamp&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Send a heartbeat back to primary with this timestamp&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The primary uses these timestamps to determine &lt;strong&gt;replication lag&lt;/strong&gt; — how far behind each secondary is. With &lt;code&gt;w: majority&lt;/code&gt; write concern, the primary waits until enough secondaries have confirmed applying the write before acknowledging to the client.&lt;/p&gt;
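
&lt;p&gt;You can check lag from the shell at any time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Prints how many seconds each secondary is behind the primary.
rs.printSecondaryReplicationInfo();

// The raw numbers live in rs.status(): each member reports its optimeDate,
// so lag = primary optimeDate minus secondary optimeDate.
rs.status();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;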

&lt;h3&gt;
  
  
  Elections — What Happens When Primary Goes Down
&lt;/h3&gt;

&lt;p&gt;The primary sends heartbeats every 2 seconds. If a secondary doesn't hear from the primary for 10 seconds (the default election timeout), it calls an election.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Secondary times out (10s no heartbeat)
2. Increments its term number → becomes CANDIDATE
3. Sends RequestVote to all members: "my oplog is at ts:X, vote for me?"
4. Members check: is candidate's oplog as fresh as mine?
   → Yes → vote granted
   → No → vote refused
5. Candidate gets majority → becomes PRIMARY
6. Others step down to SECONDARY
7. Writes resume (typically 10-30s total downtime)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The key insight:&lt;/strong&gt; The candidate with the most up-to-date oplog wins. This prevents electing a secondary that missed recent writes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production implication:&lt;/strong&gt; Your application must handle 10-30 seconds of write unavailability during elections. Use &lt;code&gt;retryWrites: true&lt;/code&gt; in your connection string — MongoDB drivers automatically retry eligible operations after a new primary is elected.&lt;/p&gt;
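
&lt;p&gt;A sketch of what that connection string looks like with the Node.js driver (hostnames and replica set name are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const { MongoClient } = require("mongodb");

// Retryable writes + majority write concern, set on the connection string.
const uri = "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/myapp" +
            "?replicaSet=rs0&amp;amp;retryWrites=true&amp;amp;w=majority";
const client = new MongoClient(uri);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;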

&lt;h3&gt;
  
  
  Read Preference — Where Do Reads Go?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;primary&lt;/code&gt; (default)&lt;/td&gt;
&lt;td&gt;Only primary&lt;/td&gt;
&lt;td&gt;Always fresh data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;primaryPreferred&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Primary, fallback secondary&lt;/td&gt;
&lt;td&gt;Good general default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;secondary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Only secondaries&lt;/td&gt;
&lt;td&gt;Read scaling, stale data ok&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;secondaryPreferred&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Secondary, fallback primary&lt;/td&gt;
&lt;td&gt;Read-heavy workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nearest&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lowest latency node&lt;/td&gt;
&lt;td&gt;Multi-region, geo-distributed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
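
&lt;p&gt;Read preference can be set on the connection string or per query. A quick sketch from the shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Per query: route this read to a secondary if one is available.
db.users.find({ country: "IN" }).readPref("secondaryPreferred");

// Per connection: add readPreference=secondaryPreferred to the URI instead.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;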

&lt;p&gt;The gotcha with secondary reads: &lt;strong&gt;replication lag&lt;/strong&gt;. If your secondary is 500ms behind and a user writes then immediately reads, they won't see their own write. This is the &lt;strong&gt;read-your-own-writes problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Fix: use &lt;strong&gt;causal consistency&lt;/strong&gt; at the session level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startSession&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;causalConsistency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;users&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;insertOne&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="c1"&gt;// This read is guaranteed to see the above write&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;users&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;findOne&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Sharding — Scaling Beyond One Machine
&lt;/h2&gt;

&lt;p&gt;Replication gives you &lt;strong&gt;high availability&lt;/strong&gt; — same data on multiple nodes. Sharding gives you &lt;strong&gt;horizontal scale&lt;/strong&gt; — different data on different nodes.&lt;/p&gt;

&lt;p&gt;These are different problems. Both matter in production. In a sharded cluster, each shard is itself a replica set — so you get both.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Three Components
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;mongos (router):&lt;/strong&gt; This is what your application connects to. It's stateless — holds no data, just routes queries. Always run 2+ behind a load balancer for HA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config Servers:&lt;/strong&gt; A dedicated replica set (3 nodes) that stores the cluster's metadata — which chunk of data lives on which shard, shard membership, balancer status. The cluster's brain. If the config servers go down, chunk splits and migrations stop, and mongos can only route queries using whatever metadata it has cached.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shards:&lt;/strong&gt; Where your actual data lives. Each shard is a full replica set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your App
    │
    ▼
mongos × 2 (stateless routers)
    │         ↕ chunk map
    │    Config Server RS (× 3)
    │
    ├──► Shard 1 RS  (users: country A-H)
    ├──► Shard 2 RS  (users: country I-R)
    └──► Shard 3 RS  (users: country S-Z)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Chunks — How Data Is Divided
&lt;/h3&gt;

&lt;p&gt;MongoDB divides a collection into &lt;strong&gt;chunks&lt;/strong&gt; — each chunk represents a contiguous range of shard key values. Default chunk size is 128MB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Enable sharding on database&lt;/span&gt;
&lt;span class="nx"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enableSharding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;myapp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Shard the users collection by country&lt;/span&gt;
&lt;span class="nx"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shardCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;myapp.users&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now MongoDB splits the &lt;code&gt;country&lt;/code&gt; value space into chunks and distributes them across shards. When a chunk grows beyond 128MB, it splits automatically. The &lt;strong&gt;balancer&lt;/strong&gt; — a background process — migrates chunks between shards to keep distribution even.&lt;/p&gt;
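
&lt;p&gt;To see how chunks are currently distributed, a rough sketch from the shell (the config metadata layout varies a bit across versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Full picture: shards, balancer state, per-collection chunk distribution.
sh.status();

// Or count how many chunks each shard currently owns.
db.getSiblingDB("config").chunks.aggregate([
  { $group: { _id: "$shard", chunks: { $sum: 1 } } }
]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;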

&lt;p&gt;Chunk migration has a real cost: data is physically copied over the network from one shard to another. Schedule maintenance windows or disable the balancer during peak hours:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setBalancerState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// peak hours&lt;/span&gt;
&lt;span class="nx"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setBalancerState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;// off-peak&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Shard Key — The Most Important Decision You'll Make
&lt;/h3&gt;

&lt;p&gt;The shard key determines which shard a document lives on. Choose wrong and your cluster becomes useless. Changing it later is painful (though MongoDB 5.0+ supports online resharding).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hotspot problem — ObjectId as shard key:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ObjectId is monotonically increasing (contains a timestamp). Every new document gets an ObjectId larger than the previous one. All new inserts land in the "last" chunk. That chunk lives on one shard. Result: one shard at 100% write load, others idle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// BAD — ObjectId is monotonically increasing&lt;/span&gt;
&lt;span class="nx"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shardCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;myapp.orders&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// GOOD — hash distributes uniformly&lt;/span&gt;
&lt;span class="nx"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shardCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;myapp.orders&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hashed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Hashed shard key&lt;/strong&gt; solves write hotspots — the hash of ObjectId is uniformly distributed. But you lose range queries — &lt;code&gt;WHERE createdAt BETWEEN x AND y&lt;/code&gt; becomes a scatter-gather across all shards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compound shard key&lt;/strong&gt; balances both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// country gives locality, userId gives cardinality&lt;/span&gt;
&lt;span class="nx"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shardCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;myapp.users&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Queries filtering by &lt;code&gt;country&lt;/code&gt; hit one shard. High cardinality (millions of user IDs) means many chunks = good distribution.&lt;/p&gt;

&lt;p&gt;The three criteria for a good shard key:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;High cardinality&lt;/strong&gt; — enough distinct values to create many chunks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-monotonic&lt;/strong&gt; — inserts spread across chunks, no hotspot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query alignment&lt;/strong&gt; — your most common queries include the shard key&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Targeted vs Scatter-Gather Queries
&lt;/h3&gt;

&lt;p&gt;This is where shard key choice shows up in your latency metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Targeted query&lt;/strong&gt; — shard key present in filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// country is the shard key → mongos knows exactly which shard&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;IN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;// → hits Shard 1 only → fast&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Scatter-gather query&lt;/strong&gt; — no shard key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// name is not the shard key → mongos has no idea&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;// → broadcast to ALL shards → wait for all responses → merge&lt;/span&gt;
&lt;span class="c1"&gt;// → latency = slowest shard's response time&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scatter-gather gets worse with &lt;code&gt;sort&lt;/code&gt; + &lt;code&gt;limit&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({}).&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// Each shard returns its top 10&lt;/span&gt;
&lt;span class="c1"&gt;// mongos merges 30 documents, returns final 10&lt;/span&gt;
&lt;span class="c1"&gt;// N shards × limit documents fetched, only limit returned&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check if your query is scatter-gather:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;executionStats&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// Look for: "SHARD_MERGE" in winningPlan → scatter-gather&lt;/span&gt;
&lt;span class="c1"&gt;// Look for: "SINGLE_SHARD" → targeted&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Production Gotchas
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Jumbo chunks&lt;/strong&gt; — A chunk becomes "jumbo" when all documents in it have the same shard key value, so it can't split further. The balancer can't move jumbo chunks. They pile up on one shard. Fix: choose a higher-cardinality shard key upfront. If you're already here — manual chunk management required.&lt;/p&gt;
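
&lt;p&gt;A quick way to check whether you're already in jumbo territory (assuming the config metadata shape of recent versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Chunks the balancer has flagged as jumbo. It refuses to move these.
db.getSiblingDB("config").chunks.find({ jumbo: true });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;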

&lt;p&gt;&lt;strong&gt;Transactions across shards&lt;/strong&gt; — Multi-document transactions that touch multiple shards require a 2-phase commit internally. Slow and lock-heavy. Design your data so related documents live on the same shard (use the same shard key value for related documents).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;$lookup across sharded collections&lt;/strong&gt; — MongoDB's JOIN equivalent (&lt;code&gt;$lookup&lt;/code&gt;) across two sharded collections requires pulling data across shards — extremely expensive. In a sharded cluster, favor embedding over referencing. This is MongoDB's design philosophy anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Too-early sharding&lt;/strong&gt; — Sharding adds operational complexity. A well-tuned single replica set handles a lot. Don't shard until you've genuinely exhausted vertical scaling and read replicas. A common rule: shard when your dataset exceeds what comfortably fits on your biggest available instance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Here's the complete MongoDB architecture from a single write to a sharded, replicated cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insertOne&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;writeConcern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;w&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;majority&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;span class="nx"&gt;mongos&lt;/span&gt; &lt;span class="nx"&gt;router&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="nx"&gt;map&lt;/span&gt; &lt;span class="nf"&gt;lookup &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Config&lt;/span&gt; &lt;span class="nx"&gt;Servers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="nx"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;IN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;Shard&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;span class="nx"&gt;Shard&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="nx"&gt;Primary&lt;/span&gt; &lt;span class="nx"&gt;node&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="nx"&gt;WiredTiger&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;dirty&lt;/span&gt; &lt;span class="nf"&gt;page &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fast&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="nx"&gt;Journal&lt;/span&gt; &lt;span class="nx"&gt;append&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;WiredTigerLog &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crash&lt;/span&gt; &lt;span class="nx"&gt;safety&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="nx"&gt;Oplog&lt;/span&gt; &lt;span class="nx"&gt;entry&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;oplog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rs&lt;/span&gt;
              &lt;span class="err"&gt;│&lt;/span&gt;
              &lt;span class="err"&gt;├──────────────────►&lt;/span&gt; &lt;span class="nx"&gt;Secondary&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
              &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="nx"&gt;oplog&lt;/span&gt; &lt;span class="nx"&gt;tailing&lt;/span&gt;    &lt;span class="nx"&gt;apply&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;ack&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;primary&lt;/span&gt;
              &lt;span class="err"&gt;│&lt;/span&gt;
              &lt;span class="err"&gt;└──────────────────►&lt;/span&gt; &lt;span class="nx"&gt;Secondary&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
                  &lt;span class="nx"&gt;oplog&lt;/span&gt; &lt;span class="nx"&gt;tailing&lt;/span&gt;    &lt;span class="nx"&gt;apply&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;ack&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;primary&lt;/span&gt;
                        &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="nx"&gt;Primary&lt;/span&gt; &lt;span class="nx"&gt;waits&lt;/span&gt; &lt;span class="err"&gt;◄──────┘&lt;/span&gt;
    &lt;span class="nx"&gt;majority&lt;/span&gt; &lt;span class="nf"&gt;ack &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="nx"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;mongos&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="nx"&gt;acknowledged&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The chain:&lt;/strong&gt; WiredTiger (how data is stored) → Journal (crash safety) → Oplog (replication fuel) → Replica Set (HA) → Sharding (horizontal scale). Each layer builds on the one below it.&lt;/p&gt;




&lt;h2&gt;
  
  
  MongoDB vs Postgres — When to Actually Use Each
&lt;/h2&gt;

&lt;p&gt;After going through both internals:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use MongoDB when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your data structure varies significantly across documents (product catalog, CMS, user profiles)&lt;/li&gt;
&lt;li&gt;You need horizontal write scale from the start (sharding)&lt;/li&gt;
&lt;li&gt;You're storing hierarchical data that maps naturally to documents&lt;/li&gt;
&lt;li&gt;Schema flexibility is genuinely required, not just convenient&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Postgres when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need strong ACID across multiple related entities (orders, payments, inventory)&lt;/li&gt;
&lt;li&gt;Your data is relational — joins are frequent and important&lt;/li&gt;
&lt;li&gt;You need complex aggregations and window functions&lt;/li&gt;
&lt;li&gt;Consistency guarantees matter more than write throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use both when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MongoDB as primary store + Elasticsearch for full-text search&lt;/li&gt;
&lt;li&gt;MongoDB for flexible catalog data + Postgres for transactional order data&lt;/li&gt;
&lt;li&gt;This is not over-engineering — it's using the right tool for each problem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mistake isn't choosing MongoDB or Postgres. The mistake is choosing one before you understand what your data actually looks like and how you'll query it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in the series — we go fully write-heavy. Cassandra and the LSM Tree: why it can absorb millions of writes per second, what compaction actually does, and the read trade-offs you're silently accepting when you choose it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>mongodb</category>
      <category>architecture</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building My Own S3: A Deep Dive into Distributed Storage Systems</title>
      <dc:creator>Subham</dc:creator>
      <pubDate>Sun, 15 Mar 2026 09:21:41 +0000</pubDate>
      <link>https://dev.to/shadowsaurus/building-my-own-s3-a-deep-dive-into-distributed-storage-systems-1l7p</link>
      <guid>https://dev.to/shadowsaurus/building-my-own-s3-a-deep-dive-into-distributed-storage-systems-1l7p</guid>
      <description>&lt;p&gt;Ever wondered how systems like Amazon S3 or MinIO actually work under the hood? I spent the last 2 months building &lt;strong&gt;Lilio&lt;/strong&gt; — a distributed object storage system from scratch in Go. No frameworks, no shortcuts, just pure distributed systems engineering.&lt;/p&gt;

&lt;p&gt;In this post, I'll walk you through the architecture, the problems I faced, the tradeoffs I made, and some benchmarks. Let's dive in.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Lilio?
&lt;/h2&gt;

&lt;p&gt;Lilio is an S3-inspired distributed object storage system that handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chunking&lt;/strong&gt;: Splitting large files into smaller pieces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication&lt;/strong&gt;: Storing chunks on multiple nodes for fault tolerance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent Hashing&lt;/strong&gt;: Distributing data evenly across nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quorum Reads/Writes&lt;/strong&gt;: Ensuring consistency in a distributed environment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encryption&lt;/strong&gt;: AES-256-GCM at the bucket level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pluggable Backends&lt;/strong&gt;: Support for local storage, S3, Google Drive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt;: Prometheus metrics out of the box
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                         Lilio                               │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐         │
│  │ Chunking│──│ Hashing │──│  Store  │──│ Encrypt │         │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘         │
│       │                          │                          │
│       ▼                          ▼                          │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Storage Backends (Pluggable)           │    │
│  │   ┌───────┐    ┌───────┐    ┌───────┐               │    │
│  │   │ Local │    │  S3   │    │ GDrive│               │    │
│  │   └───────┘    └───────┘    └───────┘               │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  High-Level Flow
&lt;/h3&gt;

&lt;p&gt;When you upload a file to Lilio, here's what happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. File comes in (e.g., 100MB video)
2. Split into chunks (1MB each = 100 chunks)
3. For each chunk:
   a. Generate chunk ID
   b. Hash to find target nodes (consistent hashing)
   c. Replicate to N nodes (quorum write)
   d. Encrypt if bucket has encryption enabled
4. Store metadata (chunk locations, checksums)
5. Return success when write quorum achieved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Component Breakdown
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────────────────────┐
│                        Lilio Engine                            │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│   ┌──────────────────┐     ┌──────────────────┐                │
│   │   HTTP API       │     │   CLI            │                │
│   │   (server.go)    │     │   (cli/)         │                │
│   └────────┬─────────┘     └────────┬─────────┘                │
│            │                        │                          │
│            └────────────┬───────────┘                          │
│                         ▼                                      │
│   ┌─────────────────────────────────────────────────────────┐  │
│   │                    Core Engine                          │  │
│   │                    (lilio.go)                           │  │
│   │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐        │  │
│   │  │  Streaming  │ │   Quorum    │ │  Encryption │        │  │
│   │  │  Chunker    │ │   Manager   │ │  (AES-GCM)  │        │  │
│   │  └─────────────┘ └─────────────┘ └─────────────┘        │  │
│   └─────────────────────────────────────────────────────────┘  │
│                         │                                      │
│            ┌────────────┼────────────┐                         │
│            ▼            ▼            ▼                         │
│   ┌─────────────┐ ┌─────────────┐ ┌─────────────┐              │
│   │  Metadata   │ │  Hash Ring  │ │  Backends   │              │
│   │  Store      │ │  (consist.) │ │  Registry   │              │
│   └─────────────┘ └─────────────┘ └─────────────┘              │
│                                                                │
└────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Deep Dive: The Interesting Parts
&lt;/h2&gt;

&lt;p&gt;Here are the 5 most interesting technical challenges I solved:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Consistent Hashing — Why Simple Modulo Fails
&lt;/h3&gt;

&lt;p&gt;My first implementation used simple modulo hashing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// DON'T DO THIS&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;getNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;totalNodes&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sum256&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkID&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;totalNodes&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The problem?&lt;/strong&gt; When I added a 5th node to my 4-node cluster, approximately 80% of my data needed to relocate. In production, that's hours of downtime and massive network load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution: Consistent Hashing with Virtual Nodes&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;HashRing&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ring&lt;/span&gt;           &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;uint32&lt;/span&gt;           &lt;span class="c"&gt;// Sorted positions&lt;/span&gt;
    &lt;span class="n"&gt;positionToNode&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;uint32&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;  &lt;span class="c"&gt;// Position -&amp;gt; Node name&lt;/span&gt;
    &lt;span class="n"&gt;virtualNodes&lt;/span&gt;   &lt;span class="kt"&gt;int&lt;/span&gt;                &lt;span class="c"&gt;// 150 per physical node&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;HashRing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;AddNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodeName&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;virtualNodes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s#%d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ring&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ring&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;positionToNode&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nodeName&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ring&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ring&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ring&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when I add a node, only ~20% of data moves. That's a 4x improvement.&lt;/p&gt;
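
&lt;p&gt;For completeness, here's the lookup side: a binary search for the first virtual node clockwise from the key's hash, wrapping around to the start of the ring. This is a sketch that reuses the same hash helper as AddNode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;func (hr *HashRing) GetNode(chunkID string) string {
    if len(hr.ring) == 0 {
        return ""
    }
    h := hash(chunkID) // same hash helper used in AddNode
    // First ring position at or after the key's hash
    i := sort.Search(len(hr.ring), func(j int) bool {
        return hr.ring[j] &amp;gt;= h
    })
    if i == len(hr.ring) {
        i = 0 // past the last position: wrap around to the start
    }
    return hr.positionToNode[hr.ring[i]]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
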

&lt;p&gt;&lt;strong&gt;But I hit another bug.&lt;/strong&gt; When two virtual nodes hashed to the same position, my collision handling was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// WRONG: Causes clustering&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;positionToNode&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;  &lt;span class="c"&gt;// Sequential increment&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With linear probing, colliding virtual nodes end up parked at adjacent positions like 5000, 5001, 5002. They cluster together instead of spreading out, which defeats the purpose of consistent hashing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix: Rehash on collision&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// CORRECT: Maintains random distribution&lt;/span&gt;
&lt;span class="n"&gt;retryCount&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;positionToNode&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;retryCount&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;retryKey&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s#%d#retry%d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retryCount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;// New random position&lt;/span&gt;
    &lt;span class="n"&gt;retryCount&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. Streaming I/O — Memory Matters
&lt;/h3&gt;

&lt;p&gt;My first upload implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// DON'T DO THIS&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;uploadFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c"&gt;// 1GB in RAM&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;splitIntoChunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c"&gt;// Another 1GB&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;encrypt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                 &lt;span class="c"&gt;// More copies&lt;/span&gt;
        &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c"&gt;// Result: 1GB file = 3GB RAM usage. Server crashes at 500MB.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix: Stream chunk by chunk&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ChunkReader&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;reader&lt;/span&gt;    &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reader&lt;/span&gt;
    &lt;span class="n"&gt;chunkSize&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;buffer&lt;/span&gt;    &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;     &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChunkReader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;NextChunk&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadFull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EOF&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EOF&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;cr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;cr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I process one chunk at a time. Memory usage: constant 2MB regardless of file size.&lt;/p&gt;
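
&lt;p&gt;The upload path then becomes a plain loop over NextChunk. Here's a usage sketch, with encrypt and store standing in for the real per-chunk pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;cr := &amp;amp;ChunkReader{
    reader:    file,
    chunkSize: 1 &amp;lt;&amp;lt; 20, // 1MB chunks
    buffer:    make([]byte, 1&amp;lt;&amp;lt;20),
}
for {
    chunk, _, err := cr.NextChunk()
    if err == io.EOF {
        break // stream finished
    }
    encrypt(chunk) // same placeholder steps as before,
    store(chunk)   // but now one ~1MB chunk at a time
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
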

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Size&lt;/th&gt;
&lt;th&gt;Before (Buffered)&lt;/th&gt;
&lt;th&gt;After (Streaming)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10 MB&lt;/td&gt;
&lt;td&gt;~30 MB RAM&lt;/td&gt;
&lt;td&gt;~2 MB RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 MB&lt;/td&gt;
&lt;td&gt;~300 MB RAM&lt;/td&gt;
&lt;td&gt;~2 MB RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 GB&lt;/td&gt;
&lt;td&gt;~3 GB RAM (crash)&lt;/td&gt;
&lt;td&gt;~2 MB RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 GB&lt;/td&gt;
&lt;td&gt;💀&lt;/td&gt;
&lt;td&gt;~2 MB RAM ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  3. Quorum — The Heart of Distributed Consensus
&lt;/h3&gt;

&lt;p&gt;In a distributed system, nodes can fail, be slow, or have stale data. Quorum ensures consistency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The formula:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;N = Total replicas
W = Write quorum (minimum nodes for successful write)
R = Read quorum (minimum nodes to read from)

Rule: W + R &amp;gt; N (guarantees overlap)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example with N=3, W=2, R=2:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WRITE (storing chunk):
┌─────────┐  ┌─────────┐  ┌─────────┐
│ Node A  │  │ Node B  │  │ Node C  │
│  v2 ✅  │  │  v2 ✅  │   │  (slow) │
└─────────┘  └─────────┘  └─────────┘
     ↑            ↑
   Write       Write

W=2 achieved → Write successful!
(Node C will catch up eventually)


READ (later):
┌─────────┐  ┌─────────┐  ┌─────────┐
│ Node A  │  │ Node B  │  │ Node C  │
│  v2     │  │  v2     │  │  v1     │ ← stale!
└─────────┘  └─────────┘  └─────────┘
     ↑            ↑            ↑
   Read        Read         Read

R=2 → Got v2 from A and B → Return v2 ✅
Background: Repair Node C with v2 🔧
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Lilio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;retrieveChunkWithQuorum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkInfo&lt;/span&gt; &lt;span class="n"&gt;ChunkInfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;ChunkResponse&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;

    &lt;span class="c"&gt;// Read from ALL nodes in parallel&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeName&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;chunkInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StorageNodes&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RetrieveChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;checksum&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;calculateChecksum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;responses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChunkResponse&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Checksum&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;checksum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Valid&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;checksum&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;chunkInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Checksum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;}(&lt;/span&gt;&lt;span class="n"&gt;nodeName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c"&gt;// Check quorum&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Quorum&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"read quorum failed"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Find valid responses, trigger repair for stale ones&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;validData&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;staleNodes&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Valid&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;validData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;staleNodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;staleNodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NodeName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Self-healing: fix stale nodes in background&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;staleNodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readRepair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;staleNodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;validData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
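
&lt;p&gt;The write path is the mirror image: fan the chunk out to every replica in parallel, count acknowledgements, and declare success once W of them confirm. A condensed sketch; field and helper names such as backendFor are illustrative rather than the exact Lilio code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;func (s *Lilio) storeChunkWithQuorum(chunkID string, data []byte, nodes []string) error {
    var (
        mu        sync.Mutex
        wg        sync.WaitGroup
        succeeded int
    )

    for _, node := range nodes {
        wg.Add(1)
        go func(name string) {
            defer wg.Done()
            // backendFor(name) stands in for resolving a node name to its StorageBackend
            if err := s.backendFor(name).StoreChunk(chunkID, data); err == nil {
                mu.Lock()
                succeeded++
                mu.Unlock()
            }
        }(node)
    }
    wg.Wait()

    if succeeded &amp;lt; s.Quorum.W {
        return fmt.Errorf("write quorum failed: %d of %d required acks", succeeded, s.Quorum.W)
    }
    return nil // slow or failed replicas catch up later via read repair
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
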






&lt;h3&gt;
  
  
  4. Pluggable Everything — Interface-Driven Design
&lt;/h3&gt;

&lt;p&gt;I wanted Lilio to work with different storage backends and metadata stores without code changes. Go interfaces made this clean:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage Backend Interface:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;StorageBackend&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;StoreChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="n"&gt;RetrieveChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;DeleteChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="n"&gt;ListChunks&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;BackendInfo&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Implementations&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;LocalBackend&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;   &lt;span class="c"&gt;// Local filesystem&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;S3Backend&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;      &lt;span class="c"&gt;// Amazon S3&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;GDriveBackend&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c"&gt;// Google Drive&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Metadata Store Interface:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;MetadataStore&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;CreateBucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="n"&gt;GetBucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;BucketMetadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;SaveObjectMetadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ObjectMetadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="n"&gt;GetObjectMetadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ObjectMetadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ListObjects&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Implementations&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;LocalStore&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;   &lt;span class="c"&gt;// JSON files&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;EtcdStore&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;    &lt;span class="c"&gt;// Distributed etcd&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;MemoryStore&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c"&gt;// In-memory (testing)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Config-driven selection:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"etcd"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"etcd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"endpoints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"node1:2379"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node2:2379"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node3:2379"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"storages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"local-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"local"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/data/1"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3-backup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"bucket"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-bucket"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Switch from local development to distributed production? Just change the config. Zero code changes.&lt;/p&gt;
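
&lt;p&gt;Behind that config there's nothing magical, just a small factory that returns the right implementation behind the interface. A sketch with assumed constructor and config field names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Constructor and config field names below are illustrative, not the exact Lilio API.
func newMetadataStore(cfg MetadataConfig) (MetadataStore, error) {
    switch cfg.Type {
    case "etcd":
        return NewEtcdStore(cfg.Etcd.Endpoints)
    case "local":
        return NewLocalStore(cfg.Path)
    case "memory":
        return NewMemoryStore(), nil
    default:
        return nil, fmt.Errorf("unknown metadata store type: %q", cfg.Type)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Everything above this factory only ever talks to the MetadataStore and StorageBackend interfaces, which is what makes the swap free.&lt;/p&gt;
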




&lt;h3&gt;
  
  
  5. Observability with Prometheus Metrics
&lt;/h3&gt;

&lt;p&gt;One lesson I learned: &lt;strong&gt;metrics aren't optional&lt;/strong&gt;. Adding them later is painful. I instrumented from day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metrics Interface:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Collector&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;RecordPutObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sizeBytes&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;RecordQuorumWrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodesAttempted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodesSucceeded&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;RecordReadRepair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;RecordBackendHealth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;healthy&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;// ... 15+ metrics total&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
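
&lt;p&gt;One concrete Collector can simply wrap the official Prometheus client. A simplified sketch (only the put-object metrics are shown; the real implementation covers the rest of the interface):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

type PrometheusCollector struct {
    objectsTotal *prometheus.CounterVec
    objectSize   *prometheus.HistogramVec
    duration     *prometheus.HistogramVec
}

func NewPrometheusCollector() *PrometheusCollector {
    return &amp;amp;PrometheusCollector{
        objectsTotal: promauto.NewCounterVec(prometheus.CounterOpts{
            Name: "lilio_objects_total", Help: "Object operations by bucket",
        }, []string{"bucket", "operation"}),
        objectSize: promauto.NewHistogramVec(prometheus.HistogramOpts{
            Name:    "lilio_object_size_bytes",
            Help:    "Uploaded object sizes",
            Buckets: prometheus.ExponentialBuckets(1024, 4, 10),
        }, []string{"bucket"}),
        duration: promauto.NewHistogramVec(prometheus.HistogramOpts{
            Name: "lilio_request_duration_seconds", Help: "Request latency",
        }, []string{"operation"}),
    }
}

func (c *PrometheusCollector) RecordPutObject(bucket string, sizeBytes int64, d time.Duration) {
    c.objectsTotal.WithLabelValues(bucket, "put").Inc()
    c.objectSize.WithLabelValues(bucket).Observe(float64(sizeBytes))
    c.duration.WithLabelValues("put").Observe(d.Seconds())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
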



&lt;p&gt;&lt;strong&gt;What I track:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;lilio_objects_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"photos"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"put"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;lilio_object_size_bytes&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"photos"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lilio_request_duration_seconds&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"put"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lilio_quorum_write_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;lilio_read_repairs_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;node&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"backend-1"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;lilio_backend_health&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;node&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"backend-1"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gauge:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;down&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example PromQL queries:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Average upload speed
rate(lilio_object_size_bytes_sum[5m]) / rate(lilio_object_size_bytes_count[5m])

# Quorum success rate
sum(rate(lilio_quorum_write_total{status="success"}[5m]))
/
sum(rate(lilio_quorum_write_total[5m]))

# P95 latency over the last 5 minutes
histogram_quantile(0.95, sum(rate(lilio_request_duration_seconds_bucket[5m])) by (le))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yaml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prom/prometheus&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9090:9090"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/grafana&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:3000"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./grafana/dashboards:/etc/grafana/provisioning/dashboards&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pre-built dashboard included with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request rates per bucket&lt;/li&gt;
&lt;li&gt;P95/P99 latencies&lt;/li&gt;
&lt;li&gt;Quorum success rates&lt;/li&gt;
&lt;li&gt;Backend health status&lt;/li&gt;
&lt;li&gt;Read repair activity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; When I ran load tests and saw P99 latency spike at 10MB files, metrics helped me identify the bottleneck in under 5 minutes. No metrics = flying blind.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tradeoffs I Made
&lt;/h2&gt;

&lt;p&gt;Every system has tradeoffs. Here's my CAP theorem positioning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        Consistency
            /\
           /  \
          /    \
         / Lilio\
        /   (CP) \
       /          \
      /____________\
 Availability    Partition
                 Tolerance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lilio prioritizes Consistency + Partition Tolerance (CP):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tradeoff&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Consistency vs Availability&lt;/td&gt;
&lt;td&gt;Consistency&lt;/td&gt;
&lt;td&gt;Quorum writes fail if W nodes unavailable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory vs Speed&lt;/td&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;Streaming is slower but handles any file size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity vs Features&lt;/td&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;Pluggable backends add code but enable flexibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunk Size&lt;/td&gt;
&lt;td&gt;1MB default&lt;/td&gt;
&lt;td&gt;Balance between metadata overhead and parallelism&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Chunk Size Analysis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Small chunks (64KB):
  ✅ Better parallelism
  ✅ Faster recovery (less data to re-replicate)
  ❌ More metadata overhead
  ❌ More network round trips

Large chunks (64MB):
  ✅ Less metadata
  ✅ Fewer network calls
  ❌ Slower recovery
  ❌ More memory per operation

Sweet spot: 1MB - 16MB depending on use case
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
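
&lt;p&gt;To make that concrete with the 3× replication used here: a 1 GB file split into 1 MB chunks is 1,024 chunks and roughly 3,072 chunk placements to track in metadata. At 64 KB chunks that balloons to 16,384 chunks (about 49,000 placements), while at 64 MB chunks it shrinks to just 16 chunks (48 placements), at the cost of moving 64 MB at a time whenever a replica has to be rebuilt.&lt;/p&gt;
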






&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;Tested on: &lt;strong&gt;MacBook Air M1, 8GB RAM, 3 in-memory storage backends&lt;/strong&gt;&lt;br&gt;
Configuration: &lt;strong&gt;N=3, W=2, R=2 (majority quorum)&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Upload Performance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Size&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Memory Usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 KB&lt;/td&gt;
&lt;td&gt;0.32ms&lt;/td&gt;
&lt;td&gt;3.16 MB/s&lt;/td&gt;
&lt;td&gt;1.0 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 KB&lt;/td&gt;
&lt;td&gt;0.70ms&lt;/td&gt;
&lt;td&gt;145.83 MB/s&lt;/td&gt;
&lt;td&gt;1.1 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 MB&lt;/td&gt;
&lt;td&gt;2.63ms&lt;/td&gt;
&lt;td&gt;398.04 MB/s&lt;/td&gt;
&lt;td&gt;2.1 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 MB&lt;/td&gt;
&lt;td&gt;25ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;419.35 MB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;11.6 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 MB&lt;/td&gt;
&lt;td&gt;270ms&lt;/td&gt;
&lt;td&gt;387.68 MB/s&lt;/td&gt;
&lt;td&gt;106.1 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; Memory scales at ~1.06× file size in this benchmark. Streaming avoids the 2-3× blow-up of the naive buffered approach, but the buffers for chunking and encryption still grow with the file. The sweet spot is 1-10MB files, where we hit &lt;strong&gt;peak throughput of ~420 MB/s&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Small File Problem
&lt;/h3&gt;

&lt;p&gt;During benchmarking, I discovered an interesting bottleneck:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Size&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;What's happening?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 KB&lt;/td&gt;
&lt;td&gt;3.16 MB/s&lt;/td&gt;
&lt;td&gt;Quorum overhead dominates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 KB&lt;/td&gt;
&lt;td&gt;145.83 MB/s&lt;/td&gt;
&lt;td&gt;Still overhead-heavy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 MB&lt;/td&gt;
&lt;td&gt;398.04 MB/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~125× faster than 1KB!&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For tiny files, the cost of coordinating with W=2 nodes (network calls, checksums, metadata updates) overshadows the actual data transfer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix?&lt;/strong&gt; Implement a fast path for files &amp;lt;10KB that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skips chunking (store as single chunk)&lt;/li&gt;
&lt;li&gt;Batches metadata updates&lt;/li&gt;
&lt;li&gt;Uses simpler replication logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Estimated improvement: &lt;strong&gt;10-100× throughput for small files&lt;/strong&gt;.&lt;/p&gt;
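
&lt;p&gt;Purely as an illustration of the idea (none of this exists in the codebase yet), the entry point could branch on size before the chunking pipeline kicks in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;const smallFileThreshold = 10 * 1024 // 10KB

// Hypothetical sketch: putObjectStreaming and replicateSmallObject are placeholder names.
func (s *Lilio) putObject(bucket, key string, r io.Reader, size int64) error {
    if size &amp;gt; 0 &amp;amp;&amp;amp; size &amp;lt; smallFileThreshold {
        // Fast path: read the tiny object whole, store it as a single chunk,
        // and write metadata once instead of per chunk.
        data, err := io.ReadAll(r)
        if err != nil {
            return err
        }
        return s.replicateSmallObject(bucket, key, data)
    }
    // Normal path: stream, chunk, encrypt, quorum-replicate chunk by chunk.
    return s.putObjectStreaming(bucket, key, r)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
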
&lt;h3&gt;
  
  
  Concurrent Performance (Where It Shines)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Sequential&lt;/th&gt;
&lt;th&gt;10 Concurrent&lt;/th&gt;
&lt;th&gt;Scaling Factor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1MB uploads&lt;/td&gt;
&lt;td&gt;398 MB/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,014 MB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.5×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The system scales well with concurrency: ten parallel uploads reach 2.5× the sequential throughput on a single machine, validating the parallel replication architecture.&lt;/p&gt;
&lt;h3&gt;
  
  
  Download Performance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Size&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 KB&lt;/td&gt;
&lt;td&gt;0.14ms&lt;/td&gt;
&lt;td&gt;7.41 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 KB&lt;/td&gt;
&lt;td&gt;0.47ms&lt;/td&gt;
&lt;td&gt;217.79 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 MB&lt;/td&gt;
&lt;td&gt;2.66ms&lt;/td&gt;
&lt;td&gt;394.74 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 MB&lt;/td&gt;
&lt;td&gt;485ms&lt;/td&gt;
&lt;td&gt;216.15 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For small and mid-sized files, reads are &lt;strong&gt;faster than writes&lt;/strong&gt; since there's no replication to coordinate on the write path; we only need matching responses from R=2 replicas.&lt;/p&gt;
&lt;h3&gt;
  
  
  Quorum Impact (Surprising Results)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Throughput (1MB)&lt;/th&gt;
&lt;th&gt;Expected Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;N=3, W=1, R=1&lt;/td&gt;
&lt;td&gt;❌ &lt;strong&gt;Rejected&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;W+R must be &amp;gt; N&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;N=3, W=2, R=2&lt;/td&gt;
&lt;td&gt;185.82 MB/s&lt;/td&gt;
&lt;td&gt;Baseline (majority)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;N=3, W=3, R=3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;264.95 MB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Should be slower!&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Wait, W=3 is faster than W=2?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my test environment (no network latency), waiting for all 3 nodes is actually &lt;em&gt;simpler&lt;/em&gt; than checking "did we hit quorum of 2 yet?". The goroutine coordination overhead is lower.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In production&lt;/strong&gt; with real network delays, W=2 would be faster since it doesn't wait for the slowest node. This shows why &lt;strong&gt;real-world testing matters&lt;/strong&gt; — local benchmarks can mislead you.&lt;/p&gt;


&lt;h2&gt;
  
  
  Production Architecture
&lt;/h2&gt;

&lt;p&gt;For production, here's the recommended setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                    Load Balancer (NGINX/ALB)                │
└─────────────────────────┬───────────────────────────────────┘
                          │
    ┌─────────────────────┼─────────────────────┐
    │                     │                     │
    ▼                     ▼                     ▼
┌────────┐           ┌────────┐           ┌────────┐
│Lilio 1 │           │Lilio 2 │           │Lilio 3 │
│  AZ-1  │           │  AZ-2  │           │  AZ-3  │
└───┬────┘           └───┬────┘           └───┬────┘
    │                    │                    │
    └────────────────────┼────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                    etcd Cluster (3 nodes)                   │
│               Distributed Metadata + Consensus              │
└─────────────────────────────────────────────────────────────┘
                         │
    ┌────────────────────┼────────────────────┐
    │                    │                    │
    ▼                    ▼                    ▼
┌────────┐          ┌────────┐          ┌────────┐
│Local   │          │  S3    │          │ GDrive │
│Storage │          │(backup)│          │(backup)│
└────────┘          └────────┘          └────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why etcd for metadata?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raft consensus:&lt;/strong&gt; Battle-tested distributed coordination (same algorithm as Consul, CockroachDB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong consistency:&lt;/strong&gt; Linearizable reads/writes for metadata&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production proven:&lt;/strong&gt; Powers Kubernetes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch API:&lt;/strong&gt; Get notified when metadata changes (enables caching)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-node:&lt;/strong&gt; Survives node failures with quorum&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alternative I considered:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL:&lt;/strong&gt; Good for single-region, but complex to distribute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consul:&lt;/strong&gt; Similar to etcd, good choice too&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB:&lt;/strong&gt; AWS-only, vendor lock-in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local JSON:&lt;/strong&gt; Works for dev, not for production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Chose etcd because it's &lt;strong&gt;designed for exactly this use case&lt;/strong&gt; — distributed metadata with strong consistency.&lt;/p&gt;
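
&lt;p&gt;To give a feel for the fit: object metadata is just a JSON value under a key prefix, and etcd's clientv3 API covers the whole store. A minimal sketch (the key layout and struct fields are assumptions, not Lilio's exact schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;import (
    "context"
    "encoding/json"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func saveObjectMetadata(cli *clientv3.Client, meta *ObjectMetadata) error {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    // Assumed key layout: /lilio/objects/&amp;lt;bucket&amp;gt;/&amp;lt;key&amp;gt;
    key := fmt.Sprintf("/lilio/objects/%s/%s", meta.Bucket, meta.Key)
    val, err := json.Marshal(meta)
    if err != nil {
        return err
    }
    _, err = cli.Put(ctx, key, string(val)) // linearizable write through Raft
    return err
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
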




&lt;h2&gt;
  
  
  Limitations &amp;amp; What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;Being honest about limitations is important. Here's what's not perfect:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Small File Performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: 3.16 MB/s for 1KB files&lt;/li&gt;
&lt;li&gt;Issue: Quorum coordination overhead dominates&lt;/li&gt;
&lt;li&gt;Fix: Fast path for &amp;lt;10KB files (planned)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Memory Not Truly Constant&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I initially thought streaming would keep memory flat&lt;/li&gt;
&lt;li&gt;Reality: Memory scales at 1.06× file size&lt;/li&gt;
&lt;li&gt;Why: Still need buffers for chunks, encryption, checksums&lt;/li&gt;
&lt;li&gt;It's still efficient (prevents 2-3× overhead of naive approach)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Test vs Production&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;My benchmarks show W=3 faster than W=2 (no network latency)&lt;/li&gt;
&lt;li&gt;Production would flip this (network delays matter)&lt;/li&gt;
&lt;li&gt;Learning: Always test in environment similar to production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. No Rollback on Partial Writes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If 1 of 2 nodes fails mid-write, the successful chunk stays&lt;/li&gt;
&lt;li&gt;This leaves the system in inconsistent state until retry&lt;/li&gt;
&lt;li&gt;Fix: Implement two-phase commit or rollback logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Version-Based Conflict Resolution Incomplete&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ChunkInfo has a Version field but it's not used for conflict resolution&lt;/li&gt;
&lt;li&gt;Read repair uses checksums, not timestamps&lt;/li&gt;
&lt;li&gt;Fix: Implement proper last-write-wins with version comparison&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't bugs — they're &lt;strong&gt;priorities&lt;/strong&gt;. For a portfolio project, getting 80% working beats 100% planning.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with interfaces&lt;/strong&gt;: I refactored twice before learning this. Design interfaces first, implement later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Streaming isn't optional&lt;/strong&gt;: Memory will bite you. Always assume files can be larger than RAM.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distributed ≠ Complicated&lt;/strong&gt;: Consistent hashing and quorum are surprisingly implementable. Don't be scared.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test failure modes&lt;/strong&gt;: Kill nodes randomly. Corrupt data. See what breaks. That's where bugs hide.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metrics from day one&lt;/strong&gt;: Adding Prometheus later is painful. Instrument everything upfront.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Performance Optimizations (from benchmarking):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Small file fast path (10-100× improvement for &amp;lt;10KB files)&lt;/li&gt;
&lt;li&gt;[ ] Early quorum exit (20-40% latency reduction when W &amp;lt; N)&lt;/li&gt;
&lt;li&gt;[ ] Buffer pooling (30% memory reduction for high-throughput scenarios)&lt;/li&gt;
&lt;li&gt;[ ] Parallel chunk upload (2-3× faster for large files)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Erasure coding (Reed-Solomon) for storage efficiency&lt;/li&gt;
&lt;li&gt;[ ] Multi-region replication&lt;/li&gt;
&lt;li&gt;[ ] Garbage collection for orphaned chunks&lt;/li&gt;
&lt;li&gt;[ ] Web UI dashboard&lt;/li&gt;
&lt;li&gt;[ ] Kubernetes operator&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/subhammahanty235/lilio
&lt;span class="nb"&gt;cd &lt;/span&gt;lilio
go build &lt;span class="nt"&gt;-o&lt;/span&gt; lilio ./cmd/lilio

&lt;span class="c"&gt;# Start server&lt;/span&gt;
./lilio server

&lt;span class="c"&gt;# Create bucket and upload&lt;/span&gt;
./lilio bucket create photos
./lilio put vacation.jpg photos/vacation.jpg

&lt;span class="c"&gt;# Download&lt;/span&gt;
./lilio get photos/vacation.jpg downloaded.jpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Want to see it in action with metrics?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start infrastructure (etcd + Prometheus + Grafana)&lt;/span&gt;
docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# Run with etcd metadata store&lt;/span&gt;
./lilio server &lt;span class="nt"&gt;--metadata&lt;/span&gt; etcd

&lt;span class="c"&gt;# Open Grafana dashboard&lt;/span&gt;
open http://localhost:3000
&lt;span class="c"&gt;# Login: admin/admin&lt;/span&gt;
&lt;span class="c"&gt;# Navigate to pre-configured Lilio dashboard&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building a distributed storage system taught me more about systems design than any course or book. The concepts — consistent hashing, quorum, streaming I/O, interface-driven design — are applicable everywhere.&lt;/p&gt;

&lt;p&gt;If you're learning Go or distributed systems, I highly recommend building something like this. Start simple, add complexity incrementally, and don't be afraid to break things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/subhammahanty235/lilio" rel="noopener noreferrer"&gt;github.com/subhammahanty235/lilio&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Questions? Feedback? Found a bug? Open an issue or reach out!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you found this useful, consider:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⭐ &lt;strong&gt;Starring the repo&lt;/strong&gt; on GitHub&lt;/li&gt;
&lt;li&gt;📢 &lt;strong&gt;Sharing this post&lt;/strong&gt; with someone learning distributed systems&lt;/li&gt;
&lt;li&gt;💬 &lt;strong&gt;Leaving feedback&lt;/strong&gt; — what worked? what confused you?&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;#go&lt;/code&gt; &lt;code&gt;#golang&lt;/code&gt; &lt;code&gt;#distributedsystems&lt;/code&gt; &lt;code&gt;#systemdesign&lt;/code&gt; &lt;code&gt;#backend&lt;/code&gt; &lt;code&gt;#storage&lt;/code&gt; &lt;code&gt;#programming&lt;/code&gt; &lt;code&gt;#s3&lt;/code&gt; &lt;code&gt;#objectstorage&lt;/code&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>backend</category>
      <category>systemdesign</category>
      <category>go</category>
    </item>
    <item>
      <title>Dissecting PostgreSQL — Why Being Read-Optimized Comes at the Cost of Write Speed</title>
      <dc:creator>Subham</dc:creator>
      <pubDate>Sat, 14 Mar 2026 19:59:28 +0000</pubDate>
      <link>https://dev.to/shadowsaurus/dissecting-postgresql-why-being-read-optimized-comes-at-the-cost-of-write-speed-5016</link>
      <guid>https://dev.to/shadowsaurus/dissecting-postgresql-why-being-read-optimized-comes-at-the-cost-of-write-speed-5016</guid>
      <description>&lt;p&gt;You've probably heard this before — &lt;em&gt;"Postgres is great but not ideal for write-heavy workloads."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But &lt;em&gt;why&lt;/em&gt;? What is actually happening under the hood that makes writes slow? And how is Cassandra or any LSM-tree database fundamentally different?&lt;/p&gt;

&lt;p&gt;This post is a deep dive into Postgres internals — storage layout, MVCC, WAL, B-Tree indexes, and how all of these together create a write bottleneck by design. Not as a flaw, but as a deliberate tradeoff.&lt;/p&gt;

&lt;p&gt;Let's dissect it.&lt;/p&gt;




&lt;h2&gt;
  
  
  First — Where Does Postgres Actually Store Your Data?
&lt;/h2&gt;

&lt;p&gt;Before we talk about writes, you need to understand where the data lives physically. No magic here.&lt;/p&gt;

&lt;p&gt;Your &lt;code&gt;users&lt;/code&gt; table is literally a binary file on disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/var/lib/postgresql/16/main/base/16384/24601
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. A plain file. You can &lt;code&gt;hexdump&lt;/code&gt; it and see your users' names and emails in raw bytes.&lt;/p&gt;

&lt;p&gt;Run this to find yours:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;data_directory&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- /var/lib/postgresql/16/main&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_database&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;datname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'myapp'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- 16384&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;relfilenode&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_class&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'users'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- 24601&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;$PGDATA&lt;/code&gt; directory structure looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$PGDATA/
├── base/
│   └── 16384/          ← your database (identified by OID)
│       ├── 24601        ← users table (heap file)
│       ├── 24601_fsm    ← free space map
│       ├── 24601_vm     ← visibility map
│       └── 24605        ← users_pkey index
├── pg_wal/              ← write-ahead log segments
└── postgresql.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The 8KB Page — Storage's Atomic Unit
&lt;/h3&gt;

&lt;p&gt;Postgres doesn't read individual rows from disk. It reads &lt;strong&gt;pages&lt;/strong&gt; — 8KB blocks. Every table file is a sequence of these pages.&lt;/p&gt;

&lt;p&gt;A single page looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────┐
│ Page Header (24 bytes)          │  ← LSN, checksum, flags
├─────────────────────────────────┤
│ Item Pointers (4 bytes each)    │  ← offsets pointing to tuples below
├─────────────────────────────────┤
│                                 │
│         free space              │
│                                 │
├─────────────────────────────────┤
│ Tuple 3 │ Tuple 2 │ Tuple 1     │  ← actual row data, grows upward
├─────────────────────────────────┤
│ Special space                   │  ← used by indexes
└─────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Item pointers grow downward, tuples grow upward. When they meet — page is full. New rows go to a new page.&lt;/p&gt;
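
&lt;p&gt;If you want to see these numbers on a real page, the &lt;code&gt;pageinspect&lt;/code&gt; contrib extension exposes the page header directly. A quick sketch, assuming the &lt;code&gt;users&lt;/code&gt; table from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Header of page 0 of the users table
SELECT lower, upper, special, pagesize
FROM page_header(get_raw_page('users', 0));
-- "lower" ends the item pointer array, "upper" starts the tuple data;
-- the gap between them is the free space in the diagram above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;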

&lt;h3&gt;
  
  
  What a Tuple (Row) Looks Like on Disk
&lt;/h3&gt;

&lt;p&gt;Every row has a 23-byte header you never see:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;t_xmin&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4B&lt;/td&gt;
&lt;td&gt;Transaction ID that inserted this row&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;t_xmax&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4B&lt;/td&gt;
&lt;td&gt;Transaction ID that deleted/updated this row (0 = still alive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;t_ctid&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6B&lt;/td&gt;
&lt;td&gt;Physical location — &lt;code&gt;(page_number, item_index)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;t_infomask&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2B&lt;/td&gt;
&lt;td&gt;Flags — is it committed? frozen? has nulls?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;t_hoff&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1B&lt;/td&gt;
&lt;td&gt;Header size (grows if there's a null bitmap)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;null bitmap&lt;/td&gt;
&lt;td&gt;varies&lt;/td&gt;
&lt;td&gt;Which columns are NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;column data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;varies&lt;/td&gt;
&lt;td&gt;Your actual data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That &lt;code&gt;t_xmin&lt;/code&gt; and &lt;code&gt;t_xmax&lt;/code&gt; — remember these. They are the key to understanding everything that follows.&lt;/p&gt;
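
&lt;p&gt;You don't need a hex editor to see these fields. &lt;code&gt;ctid&lt;/code&gt;, &lt;code&gt;xmin&lt;/code&gt; and &lt;code&gt;xmax&lt;/code&gt; are exposed as hidden system columns on every table; you just have to name them explicitly (the output below is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT ctid, xmin, xmax, id, name
FROM users
LIMIT 2;

--  ctid  | xmin | xmax | id | name
-- -------+------+------+----+-------
--  (0,1) |   98 |    0 |  1 | Alice    ← page 0, item 1; xmax = 0 means "still alive"
--  (0,2) |   98 |    0 |  2 | Carol
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;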




&lt;h2&gt;
  
  
  The Write Path — Where Things Get Expensive
&lt;/h2&gt;

&lt;p&gt;Now let's trace what actually happens when you run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Bob'&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 1 — Write to WAL First
&lt;/h3&gt;

&lt;p&gt;Before touching any data page, Postgres writes your change to the &lt;strong&gt;Write-Ahead Log&lt;/strong&gt; (&lt;code&gt;pg_wal/&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pg_wal/000000010000000000000001   ← 16MB binary segment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is for crash safety — if the server dies mid-write, Postgres replays the WAL from the last checkpoint and recovers. Every write goes here first, then to the actual heap.&lt;/p&gt;

&lt;p&gt;So already, one logical write = two physical writes.&lt;/p&gt;

&lt;p&gt;WAL is sequential — always appending. That part is fast. The problem comes next.&lt;/p&gt;
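
&lt;p&gt;You can watch this from SQL: the WAL insert position (the LSN) only ever moves forward. A small sketch, assuming PostgreSQL 14+ for the &lt;code&gt;pg_stat_wal&lt;/code&gt; view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Current position in the WAL stream, and which 16MB segment it falls in
SELECT pg_current_wal_insert_lsn();
SELECT pg_walfile_name(pg_current_wal_insert_lsn());

-- Totals since the last stats reset: every write shows up here first
SELECT wal_records, wal_bytes
FROM pg_stat_wal;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;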

&lt;h3&gt;
  
  
  Step 2 — Find the Right Page (Random I/O)
&lt;/h3&gt;

&lt;p&gt;Postgres needs to find the page containing the row with &lt;code&gt;id = 5&lt;/code&gt;. It checks the &lt;strong&gt;shared buffer pool&lt;/strong&gt; first (RAM cache). If the page is there — great, cache hit. If not, it reads the 8KB page from disk.&lt;/p&gt;

&lt;p&gt;This is a &lt;strong&gt;random I/O&lt;/strong&gt; — the disk head jumps to wherever that page lives. On spinning disks, random I/O is painfully slow (on the order of ~100 random reads per second, versus hundreds of MB/s of sequential throughput). Even on SSDs, random reads are more expensive than sequential ones.&lt;/p&gt;
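
&lt;p&gt;You can see how often you actually pay for that disk trip. Per-table buffer statistics record hits against the shared buffer pool versus blocks read from disk (a rough signal; bulk scans and restarts skew the counters):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       round(heap_blks_hit * 100.0 /
             nullif(heap_blks_hit + heap_blks_read, 0), 2) AS cache_hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;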

&lt;h3&gt;
  
  
  Step 3 — MVCC: The Old Row is NOT Deleted
&lt;/h3&gt;

&lt;p&gt;This is where Postgres gets genuinely interesting — and where the write overhead accumulates.&lt;/p&gt;

&lt;p&gt;When you &lt;code&gt;UPDATE&lt;/code&gt; a row, Postgres does &lt;strong&gt;not&lt;/strong&gt; modify it in place. Instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The old row's &lt;code&gt;t_xmax&lt;/code&gt; is set to the current transaction ID — marking it as "deleted by this transaction"&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;brand new row version&lt;/strong&gt; is inserted on the page, with &lt;code&gt;t_xmin&lt;/code&gt; = current transaction ID&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So after your UPDATE, the page has &lt;strong&gt;two versions&lt;/strong&gt; of the row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────┐
│  Tuple (old): t_xmin=100, t_xmax=205 ← dead    │
│  name = 'Alice'                                │
├────────────────────────────────────────────────┤
│  Tuple (new): t_xmin=205, t_xmax=0   ← alive   │
│  name = 'Bob'                                  │
└────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;strong&gt;MVCC — Multi-Version Concurrency Control&lt;/strong&gt;. The big benefit: readers and writers don't block each other. A &lt;code&gt;SELECT&lt;/code&gt; running at transaction 204 will still see &lt;code&gt;'Alice'&lt;/code&gt; because the new version isn't visible to it yet. No locks needed for reads.&lt;/p&gt;

&lt;p&gt;The cost: every UPDATE is secretly an INSERT + a mark operation. Dead tuples pile up on disk.&lt;/p&gt;
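
&lt;p&gt;A minimal sketch to watch this happen, reusing the hidden system columns from earlier (the transaction IDs and ctids shown are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;
SELECT txid_current();                              -- say it returns 205

SELECT ctid, xmin, xmax, name FROM users WHERE id = 5;
--  (0,3) | 100 | 0 | Alice

UPDATE users SET name = 'Bob' WHERE id = 5;

SELECT ctid, xmin, xmax, name FROM users WHERE id = 5;
--  (0,7) | 205 | 0 | Bob      ← a new tuple at a new slot, xmin = our txid
COMMIT;

-- The old tuple at (0,3) is still on the page with xmax = 205.
-- It is simply invisible to new snapshots until VACUUM reclaims it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;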

&lt;h3&gt;
  
  
  Step 4 — Update Every Index
&lt;/h3&gt;

&lt;p&gt;If your &lt;code&gt;users&lt;/code&gt; table has 5 indexes (primary key, email, created_at, status, name), then a single UPDATE that changes any indexed column forces a new entry in &lt;em&gt;every&lt;/em&gt; index on the table: the new row version lives at a new physical location, so each index must point to it.&lt;/p&gt;

&lt;p&gt;Each index is a separate B-Tree file on disk. Each index update is another random I/O.&lt;/p&gt;

&lt;p&gt;One &lt;code&gt;UPDATE&lt;/code&gt; → 1 WAL write + 1 heap page read + 1 heap page write + N index page writes.&lt;/p&gt;

&lt;p&gt;This is write amplification in action.&lt;/p&gt;
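
&lt;p&gt;There is one escape hatch: if no indexed column changed and the new version fits on the same page, Postgres performs a HOT (heap-only tuple) update and skips the index writes entirely. Worth checking how often your workload gets that discount:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(n_tup_hot_upd * 100.0 / nullif(n_tup_upd, 0), 2) AS hot_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC;
-- Low hot_pct on a busy table = most updates are paying the full N-index price
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;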

&lt;h3&gt;
  
  
  Step 5 — VACUUM Cleans Up the Mess (Later)
&lt;/h3&gt;

&lt;p&gt;Dead tuples don't disappear. They sit on disk pages, taking up space, making full table scans slower. &lt;code&gt;autovacuum&lt;/code&gt; runs periodically in the background to reclaim them.&lt;/p&gt;

&lt;p&gt;But VACUUM itself consumes I/O and CPU. And if autovacuum can't keep up with your write rate — table bloat happens. Pages fill up with dead tuples. Sequential scans slow down. Index bloat increases.&lt;/p&gt;
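
&lt;p&gt;A quick way to check whether autovacuum is keeping up on your busiest tables (the &lt;code&gt;users&lt;/code&gt; table here is just the running example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT relname,
       n_dead_tup,
       last_autovacuum,
       autovacuum_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Or clean up by hand and watch what gets reclaimed
VACUUM (VERBOSE) users;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;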

&lt;p&gt;Here's the full write path visualized:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE query
    │
    ▼
WAL Buffer (RAM)
    │
    ▼ flush
WAL on disk (sequential write — fast)
    │
    ▼
Shared Buffer Pool → find/load heap page (random I/O — slow if miss)
    │
    ├──► Mark old tuple t_xmax = txn_id     (dead tuple, stays on disk)
    │
    └──► Insert new tuple on same/new page  (MVCC)
              │
              └──► Update all N index B-Trees  (N × random I/O)
                        │
                        ▼
              VACUUM (later, async) cleans dead tuples
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The B-Tree Problem
&lt;/h2&gt;

&lt;p&gt;Postgres uses &lt;strong&gt;B-Tree&lt;/strong&gt; as the default index structure. B-Trees are brilliant for reads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;WHERE id = 5&lt;/code&gt; → O(log n) traversal from root to leaf&lt;/li&gt;
&lt;li&gt;Range queries (&lt;code&gt;WHERE created_at BETWEEN ...&lt;/code&gt;) → traverse to start, scan leaf nodes&lt;/li&gt;
&lt;li&gt;Sorted results (&lt;code&gt;ORDER BY id&lt;/code&gt;) → free, B-Tree is already sorted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But B-Trees have a fundamental write problem: &lt;strong&gt;updates are random I/O&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When you insert a new value, the B-Tree must find the correct leaf node and insert there. If the leaf is full, it splits — and the split propagates upward. Each node is a page on disk. Inserting into a B-Tree means jumping to a (potentially uncached) page, modifying it, writing it back.&lt;/p&gt;

&lt;p&gt;At scale — millions of writes per second — this random I/O becomes the bottleneck.&lt;/p&gt;
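
&lt;p&gt;With the same &lt;code&gt;pageinspect&lt;/code&gt; extension from earlier you can look at the tree itself: its depth, and how full an individual page is (index name assumed to be &lt;code&gt;users_pkey&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Tree depth: every lookup reads one page per level
SELECT root, level FROM bt_metap('users_pkey');

-- How full is one leaf page? (block 0 is the metapage, so start at 1)
SELECT live_items, dead_items, avg_item_size, free_size
FROM bt_page_stats('users_pkey', 1);
-- When free_size runs out, the next insert into this leaf forces a page split
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;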

&lt;p&gt;Compare this to what write-optimized databases use.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why LSM Trees (Cassandra, RocksDB) Are Different
&lt;/h2&gt;

&lt;p&gt;LSM = &lt;strong&gt;Log-Structured Merge Tree&lt;/strong&gt;. The core insight: sequential writes to disk are an order of magnitude faster than random writes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Spinning disk:  sequential ~500 MB/s  vs  random ~1 MB/s  (500x difference!)
SSD:            sequential ~3 GB/s    vs  random ~200 MB/s (15x difference!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LSM trees exploit this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write arrives
    │
    ▼
MemTable (RAM, sorted) ← microseconds, no disk I/O
    │
    ▼ when full, flush
SSTable on disk (immutable, sequential write) ← fast!
    │
    ▼ background
Compaction — merge SSTables, remove old versions ← sequential, amortized
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No random I/O on the write path. No index updates. No MVCC dead tuples. Writes just go to RAM, then get flushed sequentially to disk.&lt;/p&gt;

&lt;p&gt;The tradeoff: &lt;strong&gt;reads are slower&lt;/strong&gt;. To read a value, you might need to check multiple SSTables (the value could be in any of them before compaction). Bloom filters help, but LSM reads are fundamentally more expensive than a B-Tree lookup.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Tradeoff, Visualized
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;PostgreSQL (B-Tree)&lt;/th&gt;
&lt;th&gt;Cassandra (LSM Tree)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write path&lt;/td&gt;
&lt;td&gt;Random I/O + MVCC + index updates&lt;/td&gt;
&lt;td&gt;Sequential write to RAM → disk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read path&lt;/td&gt;
&lt;td&gt;B-Tree O(log n), very fast&lt;/td&gt;
&lt;td&gt;Check MemTable + multiple SSTables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Updates&lt;/td&gt;
&lt;td&gt;New row version + dead tuple&lt;/td&gt;
&lt;td&gt;Append new version, compaction later&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk I/O pattern&lt;/td&gt;
&lt;td&gt;Random&lt;/td&gt;
&lt;td&gt;Sequential&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Garbage collection&lt;/td&gt;
&lt;td&gt;VACUUM (explicit)&lt;/td&gt;
&lt;td&gt;Compaction (background)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transactions&lt;/td&gt;
&lt;td&gt;Full ACID&lt;/td&gt;
&lt;td&gt;Eventual consistency (tunable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex queries&lt;/td&gt;
&lt;td&gt;Excellent (JOINs, aggregations)&lt;/td&gt;
&lt;td&gt;Very limited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Neither is better. They solve different problems.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Does Postgres Write Performance Actually Hurt?
&lt;/h2&gt;

&lt;p&gt;Postgres handles typical web app write loads just fine. The bottleneck shows up when:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. High UPDATE rate on wide tables with many indexes&lt;/strong&gt;&lt;br&gt;
Every UPDATE touches N index files. A table with 8 indexes and 10,000 updates/second = potentially 80,000 random I/O operations per second.&lt;/p&gt;
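
&lt;p&gt;Worth auditing whether all of those indexes earn their keep, since each one is written on every non-HOT update whether or not it is ever read (table name assumed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'users'
ORDER BY idx_scan;
-- idx_scan = 0 → an index you pay for on every write and never read from
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;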

&lt;p&gt;&lt;strong&gt;2. autovacuum can't keep up&lt;/strong&gt;&lt;br&gt;
Dead tuples accumulate faster than VACUUM can reclaim them. Table bloat increases. Full scans read more pages. The problem compounds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check table bloat&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;nullif&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dead_pct&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Checkpoint I/O spikes&lt;/strong&gt;&lt;br&gt;
Every &lt;code&gt;checkpoint_timeout&lt;/code&gt; (default 5 min), Postgres flushes all dirty pages to disk. This creates a spike. Tune &lt;code&gt;checkpoint_completion_target = 0.9&lt;/code&gt; to spread the I/O.&lt;/p&gt;
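
&lt;p&gt;One way to tell whether checkpoints are hurting you (view layout assumed for PostgreSQL 16 and earlier; newer versions move these counters to &lt;code&gt;pg_stat_checkpointer&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT checkpoints_timed,     -- triggered by the timer (fine)
       checkpoints_req,       -- forced because WAL filled up (not fine)
       buffers_checkpoint     -- dirty pages flushed by checkpoints
FROM pg_stat_bgwriter;
-- A high checkpoints_req count usually means max_wal_size is too small,
-- producing exactly these I/O spikes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;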

&lt;p&gt;&lt;strong&gt;4. Transaction ID Wraparound&lt;/strong&gt;&lt;br&gt;
Every transaction gets a 32-bit ID. At ~2 billion transactions, wraparound happens — Postgres can no longer tell old from new. It enters emergency autovacuum. This is a real production incident waiting to happen if you're not monitoring it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Monitor wraparound risk&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;datname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datfrozenxid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;txn_age&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_database&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;txn_age&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- If age &amp;gt; 1.5 billion — start worrying&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  How to Squeeze More Write Performance Out of Postgres
&lt;/h2&gt;

&lt;p&gt;If you need Postgres for write-heavy workloads, here's what actually helps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Tune autovacuum aggressively for busy tables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Default &lt;code&gt;autovacuum_vacuum_scale_factor = 0.2&lt;/code&gt; means VACUUM only triggers after 20% of rows are dead. For a 10M row table that's 2M dead tuples before cleanup. Tighten it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_vacuum_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;-- trigger at 1%&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_analyze_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;005&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Batch writes, avoid single-row inserts&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Bad: 1000 round trips, 1000 WAL flushes&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(...);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(...);&lt;/span&gt;
&lt;span class="c1"&gt;-- ... × 1000&lt;/span&gt;

&lt;span class="c1"&gt;-- Good: 1 round trip, 1 WAL flush&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(...),&lt;/span&gt; &lt;span class="p"&gt;(...),&lt;/span&gt; &lt;span class="p"&gt;(...);&lt;/span&gt; &lt;span class="c1"&gt;-- × 1000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Use &lt;code&gt;COPY&lt;/code&gt; for bulk loads&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;COPY&lt;/code&gt; is 10-50x faster than &lt;code&gt;INSERT&lt;/code&gt; for bulk data. It bypasses a lot of overhead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'/tmp/events.csv'&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Partial indexes to reduce index maintenance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of indexing everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Don't index all rows&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Only index what you actually query&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fewer index entries = faster updates on non-active rows.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture in One Picture
&lt;/h2&gt;

&lt;p&gt;Here's the full Postgres architecture, from your query to the disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your App
    │  SQL string over TCP (port 5432)
    ▼
Postmaster ──► fork() → Backend Process (1 per connection)
                              │
                    ┌─────────┼──────────┐
                    ▼         ▼          ▼
              Buffer Pool   Locks     WAL Buffer
              (shared RAM)           (shared RAM)
                    │                    │
                    ▼                    ▼
               BgWriter             WAL Writer
               Autovacuum           Checkpointer
                    │                    │
                    ▼                    ▼
              Heap files (base/)     pg_wal/ segments
              Index files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every query goes through: Parse → Analyze → Rewrite → Plan → Execute.&lt;/p&gt;

&lt;p&gt;The planner is where performance is won or lost — it estimates the cost of every possible execution plan and picks the cheapest. Feed it bad statistics (stale &lt;code&gt;pg_statistic&lt;/code&gt;) and it picks wrong plans.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Always run after bulk loads&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- See what plan the planner chose&lt;/span&gt;
&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'bob@example.com'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Postgres is read-optimized not by accident — it's the consequence of specific, deliberate design choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MVCC&lt;/strong&gt; → readers never block writers (great for concurrency, bad for write amplification)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;B-Tree indexes&lt;/strong&gt; → fast O(log n) reads, expensive random-I/O writes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heap storage&lt;/strong&gt; → rows stored in 8KB pages by insertion order, not update-friendly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WAL&lt;/strong&gt; → crash safety, but doubles write I/O&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VACUUM&lt;/strong&gt; → background GC for MVCC dead tuples, adds overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These same choices are what make Postgres excellent at complex SQL queries, strong consistency, and concurrent reads. You can't get fast reads and fast writes from the same design without tradeoffs.&lt;/p&gt;

&lt;p&gt;If you need extreme write throughput: Cassandra, ScyllaDB, or ClickHouse (for analytics). If you need correctness, complex queries, and strong consistency: Postgres is still one of the best databases ever built.&lt;/p&gt;

&lt;p&gt;The trick is knowing &lt;em&gt;which tradeoff you're making&lt;/em&gt; — and this post was about making sure you know exactly why.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up in this series — we flip sides completely. We'll dissect a write-heavy database from the inside: how LSM trees actually work, what happens during compaction, and why Cassandra can eat millions of writes per second without breaking a sweat.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>postgres</category>
      <category>watercooler</category>
    </item>
  </channel>
</rss>
