<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Subham</title>
    <description>The latest articles on DEV Community by Subham (@shadowsaurus).</description>
    <link>https://dev.to/shadowsaurus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3824116%2F6c985d70-fea7-49d7-b382-607397a0ecc2.png</url>
      <title>DEV Community: Subham</title>
      <link>https://dev.to/shadowsaurus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shadowsaurus"/>
    <language>en</language>
    <item>
      <title>Dissecting MongoDB — It Uses B-Trees Too, So Why Are Writes Actually Faster?</title>
      <dc:creator>Subham</dc:creator>
      <pubDate>Tue, 24 Mar 2026 15:15:38 +0000</pubDate>
      <link>https://dev.to/shadowsaurus/dissecting-mongodb-it-uses-b-trees-too-so-why-are-writes-actually-faster-17kh</link>
      <guid>https://dev.to/shadowsaurus/dissecting-mongodb-it-uses-b-trees-too-so-why-are-writes-actually-faster-17kh</guid>
      <description>&lt;p&gt;Last time we dissected Postgres and found that its write slowness isn't a bug — it's the direct consequence of MVCC, B-Tree indexes, and heap storage all working together to make reads fast.&lt;/p&gt;

&lt;p&gt;MongoDB is often pitched as the alternative for flexible, write-heavy workloads. But &lt;em&gt;why&lt;/em&gt; is it better at writes? What does "flexible schema" actually mean on disk? And how does sharding actually work when your data grows beyond a single machine?&lt;/p&gt;

&lt;p&gt;Let's open it up.&lt;/p&gt;




&lt;h2&gt;
  
  
  First — When Does MongoDB Actually Make Sense?
&lt;/h2&gt;

&lt;p&gt;Before internals, the scenario. Because choosing a database without understanding your workload is how you end up rewriting everything in 18 months.&lt;/p&gt;

&lt;p&gt;Consider a product catalog — something like Flipkart or Amazon. You have phones, t-shirts, furniture, and groceries. Each category has completely different attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Mobile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;phone&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prod_123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"iPhone 15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"electronics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"specs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ram"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8GB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"storage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"256GB"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"battery"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3877mAh"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"variants"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"black"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;79999&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stock"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"white"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;79999&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stock"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;T-shirt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;—&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;completely&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;different&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;shape&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prod_456"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Cotton T-Shirt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"category"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"clothing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"specs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"material"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100% cotton"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"slim"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"variants"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"S"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"red"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;599&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"stock"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Postgres, you'd either create 50 nullable columns, use EAV (Entity-Attribute-Value — a known nightmare), or store a JSONB blob. The last option works, but then you're essentially building MongoDB inside Postgres.&lt;/p&gt;

&lt;p&gt;MongoDB wins here because the document model is native — no schema migration when you add a new product category, no ALTER TABLE on a 50M row table at 2 AM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where MongoDB loses:&lt;/strong&gt; Anything needing strong ACID across multiple documents — orders, payments, financial ledgers. Multi-document transactions exist in MongoDB (since v4.0), but they're slower and less battle-tested than Postgres transactions. Use the right tool.&lt;/p&gt;
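
&lt;p&gt;For scale, here's roughly what a multi-document transaction looks like with the Node.js driver. The collection names are made up for illustration; the shape of the API is the point:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Minimal sketch: both writes commit together or not at all.
// "orders" and "inventory" are hypothetical collections.
const session = client.startSession();
try {
  await session.withTransaction(async function () {
    await db.collection('orders').insertOne(
      { userId: 42, total: 799 }, { session }
    );
    await db.collection('inventory').updateOne(
      { sku: 'tshirt-red-s' }, { $inc: { stock: -1 } }, { session }
    );
  });
} finally {
  await session.endSession();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;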




&lt;h2&gt;
  
  
  The Storage Engine — WiredTiger
&lt;/h2&gt;

&lt;p&gt;Before version 3.2, MongoDB's default storage engine was MMAPv1 — memory-mapped files, directly. Simple, but it had brutal limitations: collection-level locking (one write blocks all reads on that collection), no compression, poor write performance.&lt;/p&gt;

&lt;p&gt;In 2014, MongoDB acquired WiredTiger. Everything changed.&lt;/p&gt;

&lt;h3&gt;
  
  
  What WiredTiger Actually Is
&lt;/h3&gt;

&lt;p&gt;Here's the thing most articles get wrong — &lt;strong&gt;WiredTiger stores data in B-Trees on disk&lt;/strong&gt;, not LSM trees. The files look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/var/lib/mongodb/
├── collection-0-1234567890.wt    ← your users collection, B-Tree file
├── collection-2-1234567890.wt    ← another collection
├── index-1-1234567890.wt         ← index file
├── WiredTigerLog.0000000001      ← journal (WAL equivalent)
└── WiredTiger.wt                 ← metadata
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So if both Postgres and WiredTiger use B-Trees on disk, why are MongoDB writes faster?&lt;/p&gt;

&lt;p&gt;Because &lt;strong&gt;WiredTiger uses an aggressive in-memory buffer with in-place updates and automatic MVCC cleanup&lt;/strong&gt; — the combination that Postgres doesn't have.&lt;/p&gt;

&lt;p&gt;Three specific differences:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. MVCC location.&lt;/strong&gt; Postgres keeps dead tuple versions on disk pages until VACUUM cleans them. WiredTiger keeps old versions in RAM only, for the duration of active transactions. Once the transactions that still need an old version finish, it is dropped from memory automatically. No VACUUM needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Compression.&lt;/strong&gt; WiredTiger compresses every page (Snappy by default, Zstd optionally). Postgres has no page-level compression by default. Less data on disk = less I/O.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Cache size.&lt;/strong&gt; WiredTiger's cache defaults to the larger of 50% of (RAM minus 1GB) or 256MB. Postgres &lt;code&gt;shared_buffers&lt;/code&gt; defaults to 128MB — criminally small for production. More cache = more cache hits = less disk I/O.&lt;/p&gt;
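
&lt;p&gt;If you want to see the cache at work, a rough sketch from the mongo shell (the stat names below are WiredTiger's own and can shift slightly between versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Inspect WiredTiger cache usage via serverStatus.
const cache = db.serverStatus().wiredTiger.cache;
print("configured:", cache["maximum bytes configured"] / 1024 / 1024, "MB");
print("in use:    ", cache["bytes currently in the cache"] / 1024 / 1024, "MB");
print("dirty:     ", cache["tracked dirty bytes in the cache"] / 1024 / 1024, "MB");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;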

&lt;h3&gt;
  
  
  What a Document Looks Like on Disk
&lt;/h3&gt;

&lt;p&gt;MongoDB stores documents in BSON — Binary JSON. Compact, typed, fast to parse.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ name: "Bob", age: 25 }

→ BSON bytes:
\x1c\x00\x00\x00        (document size: 28 bytes)
\x02                    (type: string)
name\x00                (field name)
\x04\x00\x00\x00Bob\x00 (string value)
\x10                    (type: int32)
age\x00                 (field name)
\x19\x00\x00\x00        (value: 25)
\x00                    (document end)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each document has an &lt;code&gt;_id&lt;/code&gt; field — ObjectId by default (12-byte unique identifier). This becomes the primary key and the default index.&lt;/p&gt;
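
&lt;p&gt;A quick way to see this from the shell (values will differ on your machine):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// ObjectId packs a 4-byte timestamp, 5 random bytes, and a 3-byte counter.
const id = ObjectId();
id.getTimestamp();   // creation time, recovered from the first 4 bytes

// Because _id encodes time, sorting by it gives rough insertion order.
db.users.find().sort({ _id: -1 }).limit(5);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;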




&lt;h2&gt;
  
  
  The Write Path — From Your Code to Durable Storage
&lt;/h2&gt;

&lt;p&gt;When you run &lt;code&gt;db.users.insertOne({name: "Bob"})&lt;/code&gt;, here's everything that actually happens:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — Wire Protocol
&lt;/h3&gt;

&lt;p&gt;Your driver serializes the document to BSON and sends it over TCP port 27017 using MongoDB's binary wire protocol (OP_MSG). Along with the document, it sends a &lt;strong&gt;Write Concern&lt;/strong&gt; — this single parameter controls the entire durability guarantee.&lt;/p&gt;

&lt;p&gt;More on Write Concern at the end of this section. First, what happens on the server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2 — WiredTiger Cache (RAM)
&lt;/h3&gt;

&lt;p&gt;The write lands in WiredTiger's in-memory cache first — no disk I/O yet. The relevant B-Tree page is found (or created), the document is written in-place on that page, and the page is marked &lt;strong&gt;dirty&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If the document grew in size (new fields added) and doesn't fit on its current page, WiredTiger moves it to a new page — a B-Tree page split. This is the "document padding" strategy: WiredTiger leaves extra space after documents to accommodate small in-place updates without splits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3 — Journal Write (Crash Safety)
&lt;/h3&gt;

&lt;p&gt;Simultaneously, the change is written to the &lt;strong&gt;journal&lt;/strong&gt; — WiredTiger's WAL equivalent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/var/lib/mongodb/journal/WiredTigerLog.0000000001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sequential append. Fast. Every 100ms, the journal is flushed to disk (configurable via &lt;code&gt;storage.journal.commitIntervalMs&lt;/code&gt;). This means in the default configuration, a crash can lose up to 100ms of writes.&lt;/p&gt;

&lt;p&gt;If you set &lt;code&gt;j: true&lt;/code&gt; in Write Concern, MongoDB waits for the journal flush before acknowledging — safer, slower.&lt;/p&gt;
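
&lt;p&gt;Setting it per operation looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Wait for the journal flush on the primary before acknowledging.
db.users.insertOne(
  { name: "Bob" },
  { writeConcern: { w: 1, j: true } }
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;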

&lt;h3&gt;
  
  
  Step 4 — Oplog Entry (Replication)
&lt;/h3&gt;

&lt;p&gt;If you're running a replica set (and in production you always are), the write is also recorded in the &lt;strong&gt;oplog&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// local.oplog.rs — a special capped collection&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;op&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;i&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                    &lt;span class="c1"&gt;// i=insert, u=update, d=delete&lt;/span&gt;
  &lt;span class="nx"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;myapp.users&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;// namespace&lt;/span&gt;
  &lt;span class="nx"&gt;o&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;ObjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;  &lt;span class="c1"&gt;// the document&lt;/span&gt;
  &lt;span class="nx"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;Timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1234567890&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// logical clock&lt;/span&gt;
  &lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;                        &lt;span class="c1"&gt;// oplog version&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Secondaries tail this collection continuously — like &lt;code&gt;tail -f&lt;/code&gt; on a log file — and apply each entry to their own WiredTiger instance. This is the entire replication mechanism.&lt;/p&gt;
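
&lt;p&gt;You can peek at the oplog yourself. A minimal sketch, run from the shell against any replica set member:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Most recent oplog entry. $natural order = insertion order in a capped collection.
db.getSiblingDB("local").oplog.rs
  .find()
  .sort({ $natural: -1 })
  .limit(1)
  .pretty();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;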

&lt;h3&gt;
  
  
  Step 5 — Checkpoint (Durable to Disk)
&lt;/h3&gt;

&lt;p&gt;Every 60 seconds (or, in older versions, after roughly 2GB of journal data), WiredTiger runs a &lt;strong&gt;checkpoint&lt;/strong&gt; — all dirty pages in the cache are flushed to their respective &lt;code&gt;.wt&lt;/code&gt; B-Tree files on disk. After a checkpoint, the journal up to that point can be safely discarded.&lt;/p&gt;

&lt;p&gt;The full write path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;insertOne({name: "Bob"})
    │
    ▼
BSON serialize → TCP (port 27017)
    │
    ▼
WiredTiger cache → dirty page marked
    │
    ├──► Journal append (sequential, every 100ms flush)
    │
    └──► Oplog entry (local.oplog.rs) → secondaries tail this
              │
              ▼  every 60s
         Checkpoint → .wt files on disk (durable)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Write Concern — The Durability Dial
&lt;/h3&gt;

&lt;p&gt;This is the most important knob in MongoDB writes. Most developers leave it at the default and don't think about it — which is fine until something breaks.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Write Concern&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;th&gt;Risk&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;w: 0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fire and forget — no ack&lt;/td&gt;
&lt;td&gt;Complete data loss possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;w: 1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Primary RAM — default&lt;/td&gt;
&lt;td&gt;100ms window: crash → data loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;w: 1, j: true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Primary journal flushed&lt;/td&gt;
&lt;td&gt;Single node crash safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;w: majority&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Majority nodes RAM&lt;/td&gt;
&lt;td&gt;Network partition safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;w: majority, j: true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Majority nodes journal&lt;/td&gt;
&lt;td&gt;Safest. Use for financial data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The common mistake: running production with &lt;code&gt;w: 1&lt;/code&gt; (default) and assuming the data is safe. Kill the primary 50ms after a write — that write is gone. For anything that matters, use &lt;code&gt;w: majority&lt;/code&gt;.&lt;/p&gt;
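
&lt;p&gt;If you're on MongoDB 4.4 or newer, you can set a cluster-wide default once instead of repeating it on every call. A sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Make w: "majority" the default write concern for the whole deployment.
db.adminCommand({
  setDefaultRWConcern: 1,
  defaultWriteConcern: { w: "majority" }
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;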




&lt;h2&gt;
  
  
  Replication — How Replica Sets Actually Work
&lt;/h2&gt;

&lt;p&gt;MongoDB's replication unit is the &lt;strong&gt;Replica Set&lt;/strong&gt; — a group of mongod processes that all hold the same data.&lt;/p&gt;

&lt;p&gt;Minimum setup: &lt;strong&gt;1 Primary + 2 Secondaries&lt;/strong&gt;. Why three? Because elections require a majority — with only two nodes, if one goes down, the survivor alone isn't a majority, so writes stop.&lt;/p&gt;
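
&lt;p&gt;Bootstrapping that minimum setup from the shell looks roughly like this (hostnames are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Run once against the member that should start as primary.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo1:27017" },
    { _id: 1, host: "mongo2:27017" },
    { _id: 2, host: "mongo3:27017" }
  ]
});

rs.status();   // verify one PRIMARY and two SECONDARY members
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;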

&lt;h3&gt;
  
  
  The Oplog Tailing Mechanism
&lt;/h3&gt;

&lt;p&gt;Secondary nodes maintain a long-polling cursor on the primary's &lt;code&gt;local.oplog.rs&lt;/code&gt;. Every new oplog entry triggers the secondary to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read the entry&lt;/li&gt;
&lt;li&gt;Apply the operation to its own WiredTiger instance&lt;/li&gt;
&lt;li&gt;Update its &lt;code&gt;lastAppliedTimestamp&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Send a heartbeat back to primary with this timestamp&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The primary uses these timestamps to determine &lt;strong&gt;replication lag&lt;/strong&gt; — how far behind each secondary is. With &lt;code&gt;w: majority&lt;/code&gt; write concern, the primary waits until enough secondaries have confirmed applying the write before acknowledging to the client.&lt;/p&gt;
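
&lt;p&gt;You can check lag from the shell at any time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Prints how many seconds each secondary is behind the primary.
rs.printSecondaryReplicationInfo();

// The raw numbers live in rs.status(): each member reports its optimeDate,
// so lag = primary optimeDate minus secondary optimeDate.
rs.status();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;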

&lt;h3&gt;
  
  
  Elections — What Happens When Primary Goes Down
&lt;/h3&gt;

&lt;p&gt;The primary sends heartbeats every 2 seconds. If a secondary doesn't hear from the primary for 10 seconds (the default election timeout), it calls an election.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Secondary times out (10s no heartbeat)
2. Increments its term number → becomes CANDIDATE
3. Sends RequestVote to all members: "my oplog is at ts:X, vote for me?"
4. Members check: is candidate's oplog as fresh as mine?
   → Yes → vote granted
   → No → vote refused
5. Candidate gets majority → becomes PRIMARY
6. Others step down to SECONDARY
7. Writes resume (typically 10-30s total downtime)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The key insight:&lt;/strong&gt; The candidate with the most up-to-date oplog wins. This prevents electing a secondary that missed recent writes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production implication:&lt;/strong&gt; Your application must handle 10-30 seconds of write unavailability during elections. Use &lt;code&gt;retryWrites: true&lt;/code&gt; in your connection string — MongoDB drivers automatically retry eligible operations after a new primary is elected.&lt;/p&gt;
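
&lt;p&gt;A sketch of what that connection string looks like with the Node.js driver (hostnames and replica set name are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const { MongoClient } = require("mongodb");

// Retryable writes + majority write concern, set on the connection string.
const uri = "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/myapp" +
            "?replicaSet=rs0&amp;amp;retryWrites=true&amp;amp;w=majority";
const client = new MongoClient(uri);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;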

&lt;h3&gt;
  
  
  Read Preference — Where Do Reads Go?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;primary&lt;/code&gt; (default)&lt;/td&gt;
&lt;td&gt;Only primary&lt;/td&gt;
&lt;td&gt;Always fresh data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;primaryPreferred&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Primary, fallback secondary&lt;/td&gt;
&lt;td&gt;Good general default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;secondary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Only secondaries&lt;/td&gt;
&lt;td&gt;Read scaling, stale data ok&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;secondaryPreferred&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Secondary, fallback primary&lt;/td&gt;
&lt;td&gt;Read-heavy workloads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nearest&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lowest latency node&lt;/td&gt;
&lt;td&gt;Multi-region, geo-distributed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
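
&lt;p&gt;Read preference can be set on the connection string or per query. A quick sketch from the shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Per query: route this read to a secondary if one is available.
db.users.find({ country: "IN" }).readPref("secondaryPreferred");

// Per connection: add readPreference=secondaryPreferred to the URI instead.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;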

&lt;p&gt;The gotcha with secondary reads: &lt;strong&gt;replication lag&lt;/strong&gt;. If your secondary is 500ms behind and a user writes then immediately reads, they won't see their own write. This is the &lt;strong&gt;read-your-own-writes problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Fix: use &lt;strong&gt;causal consistency&lt;/strong&gt; at the session level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startSession&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;causalConsistency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;users&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;insertOne&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="c1"&gt;// This read is guaranteed to see the above write&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;users&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;findOne&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;session&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Sharding — Scaling Beyond One Machine
&lt;/h2&gt;

&lt;p&gt;Replication gives you &lt;strong&gt;high availability&lt;/strong&gt; — same data on multiple nodes. Sharding gives you &lt;strong&gt;horizontal scale&lt;/strong&gt; — different data on different nodes.&lt;/p&gt;

&lt;p&gt;These are different problems. Both matter in production. In a sharded cluster, each shard is itself a replica set — so you get both.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Three Components
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;mongos (router):&lt;/strong&gt; This is what your application connects to. It's stateless — holds no data, just routes queries. Always run 2+ behind a load balancer for HA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Config Servers:&lt;/strong&gt; A dedicated replica set (3 nodes) that stores the cluster's metadata — which chunk of data lives on which shard, shard membership, balancer status. The cluster's brain. If the config servers go down, chunk splits and migrations stop, and mongos can only route queries using whatever metadata it has cached.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shards:&lt;/strong&gt; Where your actual data lives. Each shard is a full replica set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your App
    │
    ▼
mongos × 2 (stateless routers)
    │         ↕ chunk map
    │    Config Server RS (× 3)
    │
    ├──► Shard 1 RS  (users: country A-H)
    ├──► Shard 2 RS  (users: country I-R)
    └──► Shard 3 RS  (users: country S-Z)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Chunks — How Data Is Divided
&lt;/h3&gt;

&lt;p&gt;MongoDB divides a collection into &lt;strong&gt;chunks&lt;/strong&gt; — each chunk represents a contiguous range of shard key values. Default chunk size is 128MB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Enable sharding on database&lt;/span&gt;
&lt;span class="nx"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enableSharding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;myapp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// Shard the users collection by country&lt;/span&gt;
&lt;span class="nx"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shardCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;myapp.users&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now MongoDB splits the &lt;code&gt;country&lt;/code&gt; value space into chunks and distributes them across shards. When a chunk grows beyond 128MB, it splits automatically. The &lt;strong&gt;balancer&lt;/strong&gt; — a background process — migrates chunks between shards to keep distribution even.&lt;/p&gt;
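
&lt;p&gt;To see how chunks are currently distributed, a rough sketch from the shell (the config metadata layout varies a bit across versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Full picture: shards, balancer state, per-collection chunk distribution.
sh.status();

// Or count how many chunks each shard currently owns.
db.getSiblingDB("config").chunks.aggregate([
  { $group: { _id: "$shard", chunks: { $sum: 1 } } }
]);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;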

&lt;p&gt;Chunk migration has a real cost: data is physically copied over the network from one shard to another. Schedule maintenance windows or disable the balancer during peak hours:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setBalancerState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// peak hours&lt;/span&gt;
&lt;span class="nx"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setBalancerState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;// off-peak&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Shard Key — The Most Important Decision You'll Make
&lt;/h3&gt;

&lt;p&gt;The shard key determines which shard a document lives on. Choose wrong and your cluster becomes useless. Changing it later is painful (though MongoDB 5.0+ supports online resharding).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hotspot problem — ObjectId as shard key:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ObjectId is monotonically increasing (contains a timestamp). Every new document gets an ObjectId larger than the previous one. All new inserts land in the "last" chunk. That chunk lives on one shard. Result: one shard at 100% write load, others idle.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// BAD — ObjectId is monotonically increasing&lt;/span&gt;
&lt;span class="nx"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shardCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;myapp.orders&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// GOOD — hash distributes uniformly&lt;/span&gt;
&lt;span class="nx"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shardCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;myapp.orders&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;hashed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Hashed shard key&lt;/strong&gt; solves write hotspots — the hash of ObjectId is uniformly distributed. But you lose range queries — &lt;code&gt;WHERE createdAt BETWEEN x AND y&lt;/code&gt; becomes a scatter-gather across all shards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compound shard key&lt;/strong&gt; balances both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// country gives locality, userId gives cardinality&lt;/span&gt;
&lt;span class="nx"&gt;sh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shardCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;myapp.users&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Queries filtering by &lt;code&gt;country&lt;/code&gt; hit one shard. High cardinality (millions of user IDs) means many chunks = good distribution.&lt;/p&gt;

&lt;p&gt;The three criteria for a good shard key:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;High cardinality&lt;/strong&gt; — enough distinct values to create many chunks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-monotonic&lt;/strong&gt; — inserts spread across chunks, no hotspot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query alignment&lt;/strong&gt; — your most common queries include the shard key&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Targeted vs Scatter-Gather Queries
&lt;/h3&gt;

&lt;p&gt;This is where shard key choice shows up in your latency metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Targeted query&lt;/strong&gt; — shard key present in filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// country is the shard key → mongos knows exactly which shard&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;IN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;// → hits Shard 1 only → fast&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Scatter-gather query&lt;/strong&gt; — no shard key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// name is not the shard key → mongos has no idea&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="c1"&gt;// → broadcast to ALL shards → wait for all responses → merge&lt;/span&gt;
&lt;span class="c1"&gt;// → latency = slowest shard's response time&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scatter-gather gets worse with &lt;code&gt;sort&lt;/code&gt; + &lt;code&gt;limit&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({}).&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// Each shard returns its top 10&lt;/span&gt;
&lt;span class="c1"&gt;// mongos merges 30 documents, returns final 10&lt;/span&gt;
&lt;span class="c1"&gt;// N shards × limit documents fetched, only limit returned&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check if your query is scatter-gather:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;executionStats&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;// Look for: "SHARD_MERGE" in winningPlan → scatter-gather&lt;/span&gt;
&lt;span class="c1"&gt;// Look for: "SINGLE_SHARD" → targeted&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Production Gotchas
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Jumbo chunks&lt;/strong&gt; — A chunk becomes "jumbo" when all documents in it have the same shard key value, so it can't split further. The balancer can't move jumbo chunks. They pile up on one shard. Fix: choose a higher-cardinality shard key upfront. If you're already here — manual chunk management required.&lt;/p&gt;
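
&lt;p&gt;A quick way to check whether you're already in jumbo territory (assuming the config metadata shape of recent versions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Chunks the balancer has flagged as jumbo. It refuses to move these.
db.getSiblingDB("config").chunks.find({ jumbo: true });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;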

&lt;p&gt;&lt;strong&gt;Transactions across shards&lt;/strong&gt; — Multi-document transactions that touch multiple shards require a 2-phase commit internally. Slow and lock-heavy. Design your data so related documents live on the same shard (use the same shard key value for related documents).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;$lookup across sharded collections&lt;/strong&gt; — MongoDB's JOIN equivalent (&lt;code&gt;$lookup&lt;/code&gt;) across two sharded collections requires pulling data across shards — extremely expensive. In a sharded cluster, favor embedding over referencing. This is MongoDB's design philosophy anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Too-early sharding&lt;/strong&gt; — Sharding adds operational complexity. A well-tuned single replica set handles a lot. Don't shard until you've genuinely exhausted vertical scaling and read replicas. A common rule: shard when your dataset exceeds what comfortably fits on your biggest available instance.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Here's the complete MongoDB architecture from a single write to a sharded, replicated cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insertOne&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Bob&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;writeConcern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;w&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;majority&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;span class="nx"&gt;mongos&lt;/span&gt; &lt;span class="nx"&gt;router&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="nx"&gt;map&lt;/span&gt; &lt;span class="nf"&gt;lookup &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;Config&lt;/span&gt; &lt;span class="nx"&gt;Servers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;│&lt;/span&gt;  &lt;span class="nx"&gt;country&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;IN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;Shard&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="err"&gt;▼&lt;/span&gt;
&lt;span class="nx"&gt;Shard&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="nx"&gt;Primary&lt;/span&gt; &lt;span class="nx"&gt;node&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="nx"&gt;WiredTiger&lt;/span&gt; &lt;span class="nx"&gt;cache&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;dirty&lt;/span&gt; &lt;span class="nf"&gt;page &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;in&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nx"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fast&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="nx"&gt;Journal&lt;/span&gt; &lt;span class="nx"&gt;append&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nc"&gt;WiredTigerLog &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;crash&lt;/span&gt; &lt;span class="nx"&gt;safety&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="nx"&gt;Oplog&lt;/span&gt; &lt;span class="nx"&gt;entry&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;local&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;oplog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rs&lt;/span&gt;
              &lt;span class="err"&gt;│&lt;/span&gt;
              &lt;span class="err"&gt;├──────────────────►&lt;/span&gt; &lt;span class="nx"&gt;Secondary&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
              &lt;span class="err"&gt;│&lt;/span&gt;   &lt;span class="nx"&gt;oplog&lt;/span&gt; &lt;span class="nx"&gt;tailing&lt;/span&gt;    &lt;span class="nx"&gt;apply&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;ack&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;primary&lt;/span&gt;
              &lt;span class="err"&gt;│&lt;/span&gt;
              &lt;span class="err"&gt;└──────────────────►&lt;/span&gt; &lt;span class="nx"&gt;Secondary&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
                  &lt;span class="nx"&gt;oplog&lt;/span&gt; &lt;span class="nx"&gt;tailing&lt;/span&gt;    &lt;span class="nx"&gt;apply&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;ack&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;primary&lt;/span&gt;
                        &lt;span class="err"&gt;│&lt;/span&gt;
    &lt;span class="nx"&gt;Primary&lt;/span&gt; &lt;span class="nx"&gt;waits&lt;/span&gt; &lt;span class="err"&gt;◄──────┘&lt;/span&gt;
    &lt;span class="nx"&gt;majority&lt;/span&gt; &lt;span class="nf"&gt;ack &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="nx"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;mongos&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="nx"&gt;acknowledged&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The chain:&lt;/strong&gt; WiredTiger (how data is stored) → Journal (crash safety) → Oplog (replication fuel) → Replica Set (HA) → Sharding (horizontal scale). Each layer builds on the one below it.&lt;/p&gt;




&lt;h2&gt;
  
  
  MongoDB vs Postgres — When to Actually Use Each
&lt;/h2&gt;

&lt;p&gt;After going through both internals:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use MongoDB when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your data structure varies significantly across documents (product catalog, CMS, user profiles)&lt;/li&gt;
&lt;li&gt;You need horizontal write scale from the start (sharding)&lt;/li&gt;
&lt;li&gt;You're storing hierarchical data that maps naturally to documents&lt;/li&gt;
&lt;li&gt;Schema flexibility is genuinely required, not just convenient&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Postgres when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need strong ACID across multiple related entities (orders, payments, inventory)&lt;/li&gt;
&lt;li&gt;Your data is relational — joins are frequent and important&lt;/li&gt;
&lt;li&gt;You need complex aggregations and window functions&lt;/li&gt;
&lt;li&gt;Consistency guarantees matter more than write throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use both when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MongoDB as primary store + Elasticsearch for full-text search&lt;/li&gt;
&lt;li&gt;MongoDB for flexible catalog data + Postgres for transactional order data&lt;/li&gt;
&lt;li&gt;This is not over-engineering — it's using the right tool for each problem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The mistake isn't choosing MongoDB or Postgres. The mistake is choosing one before you understand what your data actually looks like and how you'll query it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next in the series — we go fully write-heavy. Cassandra and the LSM Tree: why it can absorb millions of writes per second, what compaction actually does, and the read trade-offs you're silently accepting when you choose it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>mongodb</category>
      <category>architecture</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building My Own S3: A Deep Dive into Distributed Storage Systems</title>
      <dc:creator>Subham</dc:creator>
      <pubDate>Sun, 15 Mar 2026 09:21:41 +0000</pubDate>
      <link>https://dev.to/shadowsaurus/building-my-own-s3-a-deep-dive-into-distributed-storage-systems-1l7p</link>
      <guid>https://dev.to/shadowsaurus/building-my-own-s3-a-deep-dive-into-distributed-storage-systems-1l7p</guid>
      <description>&lt;p&gt;Ever wondered how systems like Amazon S3 or MinIO actually work under the hood? I spent the last 2 months building &lt;strong&gt;Lilio&lt;/strong&gt; — a distributed object storage system from scratch in Go. No frameworks, no shortcuts, just pure distributed systems engineering.&lt;/p&gt;

&lt;p&gt;In this post, I'll walk you through the architecture, the problems I faced, the tradeoffs I made, and some benchmarks. Let's dive in.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Lilio?
&lt;/h2&gt;

&lt;p&gt;Lilio is an S3-inspired distributed object storage system that handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chunking&lt;/strong&gt;: Splitting large files into smaller pieces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replication&lt;/strong&gt;: Storing chunks on multiple nodes for fault tolerance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent Hashing&lt;/strong&gt;: Distributing data evenly across nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quorum Reads/Writes&lt;/strong&gt;: Ensuring consistency in a distributed environment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encryption&lt;/strong&gt;: AES-256-GCM at the bucket level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pluggable Backends&lt;/strong&gt;: Support for local storage, S3, Google Drive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt;: Prometheus metrics out of the box
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                         Lilio                               │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐         │
│  │ Chunking│──│ Hashing │──│  Store  │──│ Encrypt │         │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘         │
│       │                          │                          │
│       ▼                          ▼                          │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Storage Backends (Pluggable)           │    │
│  │   ┌───────┐    ┌───────┐    ┌───────┐               │    │
│  │   │ Local │    │  S3   │    │ GDrive│               │    │
│  │   └───────┘    └───────┘    └───────┘               │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  High-Level Flow
&lt;/h3&gt;

&lt;p&gt;When you upload a file to Lilio, here's what happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. File comes in (e.g., 100MB video)
2. Split into chunks (1MB each = 100 chunks)
3. For each chunk:
   a. Generate chunk ID
   b. Hash to find target nodes (consistent hashing)
   c. Replicate to N nodes (quorum write)
   d. Encrypt if bucket has encryption enabled
4. Store metadata (chunk locations, checksums)
5. Return success when write quorum achieved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Component Breakdown
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────────────────────┐
│                        Lilio Engine                            │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│   ┌──────────────────┐     ┌──────────────────┐                │
│   │   HTTP API       │     │   CLI            │                │
│   │   (server.go)    │     │   (cli/)         │                │
│   └────────┬─────────┘     └────────┬─────────┘                │
│            │                        │                          │
│            └────────────┬───────────┘                          │
│                         ▼                                      │
│   ┌─────────────────────────────────────────────────────────┐  │
│   │                    Core Engine                          │  │
│   │                    (lilio.go)                           │  │
│   │  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐        │  │
│   │  │  Streaming  │ │   Quorum    │ │  Encryption │        │  │
│   │  │  Chunker    │ │   Manager   │ │  (AES-GCM)  │        │  │
│   │  └─────────────┘ └─────────────┘ └─────────────┘        │  │
│   └─────────────────────────────────────────────────────────┘  │
│                         │                                      │
│            ┌────────────┼────────────┐                         │
│            ▼            ▼            ▼                         │
│   ┌─────────────┐ ┌─────────────┐ ┌─────────────┐              │
│   │  Metadata   │ │  Hash Ring  │ │  Backends   │              │
│   │  Store      │ │  (consist.) │ │  Registry   │              │
│   └─────────────┘ └─────────────┘ └─────────────┘              │
│                                                                │
└────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Deep Dive: The Interesting Parts
&lt;/h2&gt;

&lt;p&gt;Here are the 5 most interesting technical challenges I solved:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Consistent Hashing — Why Simple Modulo Fails
&lt;/h3&gt;

&lt;p&gt;My first implementation used simple modulo hashing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// DON'T DO THIS&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;getNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;totalNodes&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sum256&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkID&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;totalNodes&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The problem?&lt;/strong&gt; When I added a 5th node to my 4-node cluster, approximately 80% of my data needed to relocate. In production, that's hours of downtime and massive network load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution: Consistent Hashing with Virtual Nodes&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;HashRing&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ring&lt;/span&gt;           &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;uint32&lt;/span&gt;           &lt;span class="c"&gt;// Sorted positions&lt;/span&gt;
    &lt;span class="n"&gt;positionToNode&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;uint32&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;  &lt;span class="c"&gt;// Position -&amp;gt; Node name&lt;/span&gt;
    &lt;span class="n"&gt;virtualNodes&lt;/span&gt;   &lt;span class="kt"&gt;int&lt;/span&gt;                &lt;span class="c"&gt;// 150 per physical node&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;HashRing&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;AddNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodeName&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;virtualNodes&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s#%d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ring&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ring&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;positionToNode&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nodeName&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ring&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ring&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ring&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when I add a node, only ~20% of data moves. That's a 4x improvement.&lt;/p&gt;
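
&lt;p&gt;For completeness, here's the lookup side: a binary search for the first virtual node clockwise from the key's hash, wrapping around to the start of the ring. This is a sketch that reuses the same hash helper as AddNode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;func (hr *HashRing) GetNode(chunkID string) string {
    if len(hr.ring) == 0 {
        return ""
    }
    h := hash(chunkID) // same hash helper used in AddNode
    // First ring position at or after the key's hash
    i := sort.Search(len(hr.ring), func(j int) bool {
        return hr.ring[j] &amp;gt;= h
    })
    if i == len(hr.ring) {
        i = 0 // past the last position: wrap around to the start
    }
    return hr.positionToNode[hr.ring[i]]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
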

&lt;p&gt;&lt;strong&gt;But I hit another bug.&lt;/strong&gt; When two virtual nodes hashed to the same position, my collision handling was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// WRONG: Causes clustering&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;positionToNode&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;  &lt;span class="c"&gt;// Sequential increment&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With linear probing, colliding virtual nodes end up parked at adjacent positions like 5000, 5001, 5002. They cluster together instead of spreading out, which defeats the purpose of consistent hashing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix: Rehash on collision&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// CORRECT: Maintains random distribution&lt;/span&gt;
&lt;span class="n"&gt;retryCount&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;positionToNode&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;position&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;retryCount&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;retryKey&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s#%d#retry%d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;retryCount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;position&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retryKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c"&gt;// New random position&lt;/span&gt;
    &lt;span class="n"&gt;retryCount&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. Streaming I/O — Memory Matters
&lt;/h3&gt;

&lt;p&gt;My first upload implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// DON'T DO THIS&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;uploadFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c"&gt;// 1GB in RAM&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;splitIntoChunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c"&gt;// Another 1GB&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;encrypt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                 &lt;span class="c"&gt;// More copies&lt;/span&gt;
        &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c"&gt;// Result: 1GB file = 3GB RAM usage. Server crashes at 500MB.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix: Stream chunk by chunk&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ChunkReader&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;reader&lt;/span&gt;    &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reader&lt;/span&gt;
    &lt;span class="n"&gt;chunkSize&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;buffer&lt;/span&gt;    &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;     &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChunkReader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;NextChunk&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadFull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EOF&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EOF&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;cr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;cr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cr&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I process one chunk at a time. Memory usage: constant 2MB regardless of file size.&lt;/p&gt;
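
&lt;p&gt;The upload path then becomes a plain loop over NextChunk. Here's a usage sketch, with encrypt and store standing in for the real per-chunk pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;cr := &amp;amp;ChunkReader{
    reader:    file,
    chunkSize: 1 &amp;lt;&amp;lt; 20, // 1MB chunks
    buffer:    make([]byte, 1&amp;lt;&amp;lt;20),
}
for {
    chunk, _, err := cr.NextChunk()
    if err == io.EOF {
        break // stream finished
    }
    encrypt(chunk) // same placeholder steps as before,
    store(chunk)   // but now one ~1MB chunk at a time
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
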

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Size&lt;/th&gt;
&lt;th&gt;Before (Buffered)&lt;/th&gt;
&lt;th&gt;After (Streaming)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10 MB&lt;/td&gt;
&lt;td&gt;~30 MB RAM&lt;/td&gt;
&lt;td&gt;~2 MB RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 MB&lt;/td&gt;
&lt;td&gt;~300 MB RAM&lt;/td&gt;
&lt;td&gt;~2 MB RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 GB&lt;/td&gt;
&lt;td&gt;~3 GB RAM (crash)&lt;/td&gt;
&lt;td&gt;~2 MB RAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 GB&lt;/td&gt;
&lt;td&gt;💀&lt;/td&gt;
&lt;td&gt;~2 MB RAM ✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  3. Quorum — The Heart of Distributed Consensus
&lt;/h3&gt;

&lt;p&gt;In a distributed system, nodes can fail, be slow, or have stale data. Quorum ensures consistency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The formula:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;N = Total replicas
W = Write quorum (minimum nodes for successful write)
R = Read quorum (minimum nodes to read from)

Rule: W + R &amp;gt; N (guarantees overlap)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example with N=3, W=2, R=2:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WRITE (storing chunk):
┌─────────┐  ┌─────────┐  ┌─────────┐
│ Node A  │  │ Node B  │  │ Node C  │
│  v2 ✅  │  │  v2 ✅  │   │  (slow) │
└─────────┘  └─────────┘  └─────────┘
     ↑            ↑
   Write       Write

W=2 achieved → Write successful!
(Node C will catch up eventually)


READ (later):
┌─────────┐  ┌─────────┐  ┌─────────┐
│ Node A  │  │ Node B  │  │ Node C  │
│  v2     │  │  v2     │  │  v1     │ ← stale!
└─────────┘  └─────────┘  └─────────┘
     ↑            ↑            ↑
   Read        Read         Read

R=2 → Got v2 from A and B → Return v2 ✅
Background: Repair Node C with v2 🔧
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Lilio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;retrieveChunkWithQuorum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkInfo&lt;/span&gt; &lt;span class="n"&gt;ChunkInfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;ChunkResponse&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;

    &lt;span class="c"&gt;// Read from ALL nodes in parallel&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodeName&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;chunkInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StorageNodes&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RetrieveChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;checksum&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;calculateChecksum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;responses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ChunkResponse&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;Data&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Checksum&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;checksum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;Valid&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;    &lt;span class="n"&gt;checksum&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;chunkInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Checksum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="p"&gt;}(&lt;/span&gt;&lt;span class="n"&gt;nodeName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c"&gt;// Check quorum&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Quorum&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;R&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"read quorum failed"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Find valid responses, trigger repair for stale ones&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;validData&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;staleNodes&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Valid&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;validData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Data&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;staleNodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;staleNodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NodeName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Self-healing: fix stale nodes in background&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;staleNodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readRepair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkInfo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChunkID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;validData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;staleNodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;validData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
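
&lt;p&gt;The write path is the mirror image: fan the chunk out to every replica in parallel, count acknowledgements, and declare success once W of them confirm. A condensed sketch; field and helper names such as backendFor are illustrative rather than the exact Lilio code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;func (s *Lilio) storeChunkWithQuorum(chunkID string, data []byte, nodes []string) error {
    var (
        mu        sync.Mutex
        wg        sync.WaitGroup
        succeeded int
    )

    for _, node := range nodes {
        wg.Add(1)
        go func(name string) {
            defer wg.Done()
            // backendFor(name) stands in for resolving a node name to its StorageBackend
            if err := s.backendFor(name).StoreChunk(chunkID, data); err == nil {
                mu.Lock()
                succeeded++
                mu.Unlock()
            }
        }(node)
    }
    wg.Wait()

    if succeeded &amp;lt; s.Quorum.W {
        return fmt.Errorf("write quorum failed: %d of %d required acks", succeeded, s.Quorum.W)
    }
    return nil // slow or failed replicas catch up later via read repair
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
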






&lt;h3&gt;
  
  
  4. Pluggable Everything — Interface-Driven Design
&lt;/h3&gt;

&lt;p&gt;I wanted Lilio to work with different storage backends and metadata stores without code changes. Go interfaces made this clean:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage Backend Interface:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;StorageBackend&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;StoreChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="n"&gt;RetrieveChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;DeleteChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkID&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="n"&gt;ListChunks&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;Info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;BackendInfo&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Implementations&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;LocalBackend&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;   &lt;span class="c"&gt;// Local filesystem&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;S3Backend&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;      &lt;span class="c"&gt;// Amazon S3&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;GDriveBackend&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c"&gt;// Google Drive&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Metadata Store Interface:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;MetadataStore&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;CreateBucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="n"&gt;GetBucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;BucketMetadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;SaveObjectMetadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meta&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ObjectMetadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;
    &lt;span class="n"&gt;GetObjectMetadata&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ObjectMetadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ListObjects&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;// Implementations&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;LocalStore&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;   &lt;span class="c"&gt;// JSON files&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;EtcdStore&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;    &lt;span class="c"&gt;// Distributed etcd&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;MemoryStore&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c"&gt;// In-memory (testing)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Config-driven selection:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"etcd"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"etcd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"endpoints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"node1:2379"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node2:2379"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node3:2379"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"storages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"local-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"local"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/data/1"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3-backup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"bucket"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-bucket"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Switch from local development to distributed production? Just change the config. Zero code changes.&lt;/p&gt;
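
&lt;p&gt;Behind that config there's nothing magical, just a small factory that returns the right implementation behind the interface. A sketch with assumed constructor and config field names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;// Constructor and config field names below are illustrative, not the exact Lilio API.
func newMetadataStore(cfg MetadataConfig) (MetadataStore, error) {
    switch cfg.Type {
    case "etcd":
        return NewEtcdStore(cfg.Etcd.Endpoints)
    case "local":
        return NewLocalStore(cfg.Path)
    case "memory":
        return NewMemoryStore(), nil
    default:
        return nil, fmt.Errorf("unknown metadata store type: %q", cfg.Type)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Everything above this factory only ever talks to the MetadataStore and StorageBackend interfaces, which is what makes the swap free.&lt;/p&gt;
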




&lt;h3&gt;
  
  
  5. Observability with Prometheus Metrics
&lt;/h3&gt;

&lt;p&gt;One lesson I learned: &lt;strong&gt;metrics aren't optional&lt;/strong&gt;. Adding them later is painful. I instrumented from day one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metrics Interface:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Collector&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;RecordPutObject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sizeBytes&lt;/span&gt; &lt;span class="kt"&gt;int64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Duration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;RecordQuorumWrite&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodesAttempted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodesSucceeded&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;RecordReadRepair&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;RecordBackendHealth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;healthy&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c"&gt;// ... 15+ metrics total&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
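
&lt;p&gt;One concrete Collector can simply wrap the official Prometheus client. A simplified sketch (only the put-object metrics are shown; the real implementation covers the rest of the interface):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

type PrometheusCollector struct {
    objectsTotal *prometheus.CounterVec
    objectSize   *prometheus.HistogramVec
    duration     *prometheus.HistogramVec
}

func NewPrometheusCollector() *PrometheusCollector {
    return &amp;amp;PrometheusCollector{
        objectsTotal: promauto.NewCounterVec(prometheus.CounterOpts{
            Name: "lilio_objects_total", Help: "Object operations by bucket",
        }, []string{"bucket", "operation"}),
        objectSize: promauto.NewHistogramVec(prometheus.HistogramOpts{
            Name:    "lilio_object_size_bytes",
            Help:    "Uploaded object sizes",
            Buckets: prometheus.ExponentialBuckets(1024, 4, 10),
        }, []string{"bucket"}),
        duration: promauto.NewHistogramVec(prometheus.HistogramOpts{
            Name: "lilio_request_duration_seconds", Help: "Request latency",
        }, []string{"operation"}),
    }
}

func (c *PrometheusCollector) RecordPutObject(bucket string, sizeBytes int64, d time.Duration) {
    c.objectsTotal.WithLabelValues(bucket, "put").Inc()
    c.objectSize.WithLabelValues(bucket).Observe(float64(sizeBytes))
    c.duration.WithLabelValues("put").Observe(d.Seconds())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
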



&lt;p&gt;&lt;strong&gt;What I track:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;lilio_objects_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"photos"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"put"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;lilio_object_size_bytes&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"photos"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lilio_request_duration_seconds&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"put"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;histogram&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;lilio_quorum_write_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;lilio_read_repairs_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;node&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"backend-1"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;lilio_backend_health&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;node&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"backend-1"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gauge:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;up&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;down&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example PromQL queries:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Average upload speed
rate(lilio_object_size_bytes_sum[5m]) / rate(lilio_object_size_bytes_count[5m])

# Quorum success rate
sum(rate(lilio_quorum_write_total{status="success"}[5m]))
/
sum(rate(lilio_quorum_write_total[5m]))

# P95 latency over the last 5 minutes
histogram_quantile(0.95, sum(rate(lilio_request_duration_seconds_bucket[5m])) by (le))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yaml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prom/prometheus&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9090:9090"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="na"&gt;grafana&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;grafana/grafana&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000:3000"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./grafana/dashboards:/etc/grafana/provisioning/dashboards&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pre-built dashboard included with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request rates per bucket&lt;/li&gt;
&lt;li&gt;P95/P99 latencies&lt;/li&gt;
&lt;li&gt;Quorum success rates&lt;/li&gt;
&lt;li&gt;Backend health status&lt;/li&gt;
&lt;li&gt;Read repair activity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; When I ran load tests and saw P99 latency spike at 10MB files, metrics helped me identify the bottleneck in under 5 minutes. No metrics = flying blind.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tradeoffs I Made
&lt;/h2&gt;

&lt;p&gt;Every system has tradeoffs. Here's my CAP theorem positioning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        Consistency
            /\
           /  \
          /    \
         / Lilio\
        /   (CP) \
       /          \
      /____________\
 Availability    Partition
                 Tolerance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Lilio prioritizes Consistency + Partition Tolerance (CP):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tradeoff&lt;/th&gt;
&lt;th&gt;Choice&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Consistency vs Availability&lt;/td&gt;
&lt;td&gt;Consistency&lt;/td&gt;
&lt;td&gt;Quorum writes fail if W nodes unavailable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory vs Speed&lt;/td&gt;
&lt;td&gt;Memory&lt;/td&gt;
&lt;td&gt;Streaming is slower but handles any file size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity vs Features&lt;/td&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;Pluggable backends add code but enable flexibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunk Size&lt;/td&gt;
&lt;td&gt;1MB default&lt;/td&gt;
&lt;td&gt;Balance between metadata overhead and parallelism&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Chunk Size Analysis:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Small chunks (64KB):
  ✅ Better parallelism
  ✅ Faster recovery (less data to re-replicate)
  ❌ More metadata overhead
  ❌ More network round trips

Large chunks (64MB):
  ✅ Less metadata
  ✅ Fewer network calls
  ❌ Slower recovery
  ❌ More memory per operation

Sweet spot: 1MB - 16MB depending on use case
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
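
&lt;p&gt;To make that concrete with the 3× replication used here: a 1 GB file split into 1 MB chunks is 1,024 chunks and roughly 3,072 chunk placements to track in metadata. At 64 KB chunks that balloons to 16,384 chunks (about 49,000 placements), while at 64 MB chunks it shrinks to just 16 chunks (48 placements), at the cost of moving 64 MB at a time whenever a replica has to be rebuilt.&lt;/p&gt;
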






&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;Tested on: &lt;strong&gt;MacBook Air M1, 8GB RAM, 3 in-memory storage backends&lt;/strong&gt;&lt;br&gt;
Configuration: &lt;strong&gt;N=3, W=2, R=2 (majority quorum)&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Upload Performance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Size&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;Memory Usage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 KB&lt;/td&gt;
&lt;td&gt;0.32ms&lt;/td&gt;
&lt;td&gt;3.16 MB/s&lt;/td&gt;
&lt;td&gt;1.0 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 KB&lt;/td&gt;
&lt;td&gt;0.70ms&lt;/td&gt;
&lt;td&gt;145.83 MB/s&lt;/td&gt;
&lt;td&gt;1.1 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 MB&lt;/td&gt;
&lt;td&gt;2.63ms&lt;/td&gt;
&lt;td&gt;398.04 MB/s&lt;/td&gt;
&lt;td&gt;2.1 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 MB&lt;/td&gt;
&lt;td&gt;25ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;419.35 MB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;11.6 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 MB&lt;/td&gt;
&lt;td&gt;270ms&lt;/td&gt;
&lt;td&gt;387.68 MB/s&lt;/td&gt;
&lt;td&gt;106.1 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key Insight:&lt;/strong&gt; Memory scales at ~1.06× file size in this benchmark. Streaming avoids the 2-3× blow-up of the naive buffered approach, but the buffers for chunking and encryption still grow with the file. The sweet spot is 1-10MB files, where we hit &lt;strong&gt;peak throughput of ~420 MB/s&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Small File Problem
&lt;/h3&gt;

&lt;p&gt;During benchmarking, I discovered an interesting bottleneck:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Size&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;th&gt;What's happening?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 KB&lt;/td&gt;
&lt;td&gt;3.16 MB/s&lt;/td&gt;
&lt;td&gt;Quorum overhead dominates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 KB&lt;/td&gt;
&lt;td&gt;145.83 MB/s&lt;/td&gt;
&lt;td&gt;Still overhead-heavy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 MB&lt;/td&gt;
&lt;td&gt;398.04 MB/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~125× faster than 1KB!&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For tiny files, the cost of coordinating with W=2 nodes (network calls, checksums, metadata updates) overshadows the actual data transfer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix?&lt;/strong&gt; Implement a fast path for files &amp;lt;10KB that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skips chunking (store as single chunk)&lt;/li&gt;
&lt;li&gt;Batches metadata updates&lt;/li&gt;
&lt;li&gt;Uses simpler replication logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Estimated improvement: &lt;strong&gt;10-100× throughput for small files&lt;/strong&gt;.&lt;/p&gt;
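
&lt;p&gt;Purely as an illustration of the idea (none of this exists in the codebase yet), the entry point could branch on size before the chunking pipeline kicks in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;const smallFileThreshold = 10 * 1024 // 10KB

// Hypothetical sketch: putObjectStreaming and replicateSmallObject are placeholder names.
func (s *Lilio) putObject(bucket, key string, r io.Reader, size int64) error {
    if size &amp;gt; 0 &amp;amp;&amp;amp; size &amp;lt; smallFileThreshold {
        // Fast path: read the tiny object whole, store it as a single chunk,
        // and write metadata once instead of per chunk.
        data, err := io.ReadAll(r)
        if err != nil {
            return err
        }
        return s.replicateSmallObject(bucket, key, data)
    }
    // Normal path: stream, chunk, encrypt, quorum-replicate chunk by chunk.
    return s.putObjectStreaming(bucket, key, r)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
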
&lt;h3&gt;
  
  
  Concurrent Performance (Where It Shines)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Sequential&lt;/th&gt;
&lt;th&gt;10 Concurrent&lt;/th&gt;
&lt;th&gt;Scaling Factor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1MB uploads&lt;/td&gt;
&lt;td&gt;398 MB/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,014 MB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.5×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The system scales well with concurrency: ten parallel uploads reach 2.5× the sequential throughput on a single machine, validating the parallel replication architecture.&lt;/p&gt;
&lt;h3&gt;
  
  
  Download Performance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File Size&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Throughput&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 KB&lt;/td&gt;
&lt;td&gt;0.14ms&lt;/td&gt;
&lt;td&gt;7.41 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 KB&lt;/td&gt;
&lt;td&gt;0.47ms&lt;/td&gt;
&lt;td&gt;217.79 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 MB&lt;/td&gt;
&lt;td&gt;2.66ms&lt;/td&gt;
&lt;td&gt;394.74 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 MB&lt;/td&gt;
&lt;td&gt;485ms&lt;/td&gt;
&lt;td&gt;216.15 MB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For small and mid-sized files, reads are &lt;strong&gt;faster than writes&lt;/strong&gt; since there's no replication to coordinate on the write path; we only need matching responses from R=2 replicas.&lt;/p&gt;
&lt;h3&gt;
  
  
  Quorum Impact (Surprising Results)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Throughput (1MB)&lt;/th&gt;
&lt;th&gt;Expected Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;N=3, W=1, R=1&lt;/td&gt;
&lt;td&gt;❌ &lt;strong&gt;Rejected&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;W+R must be &amp;gt; N&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;N=3, W=2, R=2&lt;/td&gt;
&lt;td&gt;185.82 MB/s&lt;/td&gt;
&lt;td&gt;Baseline (majority)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;N=3, W=3, R=3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;264.95 MB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Should be slower!&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Wait, W=3 is faster than W=2?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In my test environment (no network latency), waiting for all 3 nodes is actually &lt;em&gt;simpler&lt;/em&gt; than checking "did we hit quorum of 2 yet?". The goroutine coordination overhead is lower.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In production&lt;/strong&gt; with real network delays, W=2 would be faster since it doesn't wait for the slowest node. This shows why &lt;strong&gt;real-world testing matters&lt;/strong&gt; — local benchmarks can mislead you.&lt;/p&gt;


&lt;h2&gt;
  
  
  Production Architecture
&lt;/h2&gt;

&lt;p&gt;For production, here's the recommended setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                    Load Balancer (NGINX/ALB)                │
└─────────────────────────┬───────────────────────────────────┘
                          │
    ┌─────────────────────┼─────────────────────┐
    │                     │                     │
    ▼                     ▼                     ▼
┌────────┐           ┌────────┐           ┌────────┐
│Lilio 1 │           │Lilio 2 │           │Lilio 3 │
│  AZ-1  │           │  AZ-2  │           │  AZ-3  │
└───┬────┘           └───┬────┘           └───┬────┘
    │                    │                    │
    └────────────────────┼────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│                    etcd Cluster (3 nodes)                   │
│               Distributed Metadata + Consensus              │
└─────────────────────────────────────────────────────────────┘
                         │
    ┌────────────────────┼────────────────────┐
    │                    │                    │
    ▼                    ▼                    ▼
┌────────┐          ┌────────┐          ┌────────┐
│Local   │          │  S3    │          │ GDrive │
│Storage │          │(backup)│          │(backup)│
└────────┘          └────────┘          └────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why etcd for metadata?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raft consensus:&lt;/strong&gt; Battle-tested distributed coordination (same algorithm as Consul, CockroachDB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong consistency:&lt;/strong&gt; Linearizable reads/writes for metadata&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production proven:&lt;/strong&gt; Powers Kubernetes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch API:&lt;/strong&gt; Get notified when metadata changes (enables caching)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-node:&lt;/strong&gt; Survives node failures with quorum&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alternative I considered:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL:&lt;/strong&gt; Good for single-region, but complex to distribute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consul:&lt;/strong&gt; Similar to etcd, good choice too&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DynamoDB:&lt;/strong&gt; AWS-only, vendor lock-in&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local JSON:&lt;/strong&gt; Works for dev, not for production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Chose etcd because it's &lt;strong&gt;designed for exactly this use case&lt;/strong&gt; — distributed metadata with strong consistency.&lt;/p&gt;
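
&lt;p&gt;To give a feel for the fit: object metadata is just a JSON value under a key prefix, and etcd's clientv3 API covers the whole store. A minimal sketch (the key layout and struct fields are assumptions, not Lilio's exact schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;import (
    "context"
    "encoding/json"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func saveObjectMetadata(cli *clientv3.Client, meta *ObjectMetadata) error {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    // Assumed key layout: /lilio/objects/&amp;lt;bucket&amp;gt;/&amp;lt;key&amp;gt;
    key := fmt.Sprintf("/lilio/objects/%s/%s", meta.Bucket, meta.Key)
    val, err := json.Marshal(meta)
    if err != nil {
        return err
    }
    _, err = cli.Put(ctx, key, string(val)) // linearizable write through Raft
    return err
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
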




&lt;h2&gt;
  
  
  Limitations &amp;amp; What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;Being honest about limitations is important. Here's what's not perfect:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Small File Performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: 3.16 MB/s for 1KB files&lt;/li&gt;
&lt;li&gt;Issue: Quorum coordination overhead dominates&lt;/li&gt;
&lt;li&gt;Fix: Fast path for &amp;lt;10KB files (planned)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Memory Not Truly Constant&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I initially thought streaming would keep memory flat&lt;/li&gt;
&lt;li&gt;Reality: Memory scales at 1.06× file size&lt;/li&gt;
&lt;li&gt;Why: Still need buffers for chunks, encryption, checksums&lt;/li&gt;
&lt;li&gt;It's still efficient (prevents 2-3× overhead of naive approach)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Test vs Production&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;My benchmarks show W=3 faster than W=2 (no network latency)&lt;/li&gt;
&lt;li&gt;Production would flip this (network delays matter)&lt;/li&gt;
&lt;li&gt;Learning: Always test in environment similar to production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. No Rollback on Partial Writes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If 1 of 2 nodes fails mid-write, the successful chunk stays&lt;/li&gt;
&lt;li&gt;This leaves the system in inconsistent state until retry&lt;/li&gt;
&lt;li&gt;Fix: Implement two-phase commit or rollback logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Version-Based Conflict Resolution Incomplete&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ChunkInfo has a Version field but it's not used for conflict resolution&lt;/li&gt;
&lt;li&gt;Read repair uses checksums, not timestamps&lt;/li&gt;
&lt;li&gt;Fix: Implement proper last-write-wins with version comparison&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't bugs — they're &lt;strong&gt;priorities&lt;/strong&gt;. For a portfolio project, getting 80% working beats 100% planning.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with interfaces&lt;/strong&gt;: I refactored twice before learning this. Design interfaces first, implement later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Streaming isn't optional&lt;/strong&gt;: Memory will bite you. Always assume files can be larger than RAM.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distributed ≠ Complicated&lt;/strong&gt;: Consistent hashing and quorum are surprisingly implementable. Don't be scared.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test failure modes&lt;/strong&gt;: Kill nodes randomly. Corrupt data. See what breaks. That's where bugs hide.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Metrics from day one&lt;/strong&gt;: Adding Prometheus later is painful. Instrument everything upfront.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Performance Optimizations (from benchmarking):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Small file fast path (10-100× improvement for &amp;lt;10KB files)&lt;/li&gt;
&lt;li&gt;[ ] Early quorum exit (20-40% latency reduction when W &amp;lt; N)&lt;/li&gt;
&lt;li&gt;[ ] Buffer pooling (30% memory reduction for high-throughput scenarios)&lt;/li&gt;
&lt;li&gt;[ ] Parallel chunk upload (2-3× faster for large files)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Erasure coding (Reed-Solomon) for storage efficiency&lt;/li&gt;
&lt;li&gt;[ ] Multi-region replication&lt;/li&gt;
&lt;li&gt;[ ] Garbage collection for orphaned chunks&lt;/li&gt;
&lt;li&gt;[ ] Web UI dashboard&lt;/li&gt;
&lt;li&gt;[ ] Kubernetes operator&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/subhammahanty235/lilio
&lt;span class="nb"&gt;cd &lt;/span&gt;lilio
go build &lt;span class="nt"&gt;-o&lt;/span&gt; lilio ./cmd/lilio

&lt;span class="c"&gt;# Start server&lt;/span&gt;
./lilio server

&lt;span class="c"&gt;# Create bucket and upload&lt;/span&gt;
./lilio bucket create photos
./lilio put vacation.jpg photos/vacation.jpg

&lt;span class="c"&gt;# Download&lt;/span&gt;
./lilio get photos/vacation.jpg downloaded.jpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Want to see it in action with metrics?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start infrastructure (etcd + Prometheus + Grafana)&lt;/span&gt;
docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# Run with etcd metadata store&lt;/span&gt;
./lilio server &lt;span class="nt"&gt;--metadata&lt;/span&gt; etcd

&lt;span class="c"&gt;# Open Grafana dashboard&lt;/span&gt;
open http://localhost:3000
&lt;span class="c"&gt;# Login: admin/admin&lt;/span&gt;
&lt;span class="c"&gt;# Navigate to pre-configured Lilio dashboard&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building a distributed storage system taught me more about systems design than any course or book. The concepts — consistent hashing, quorum, streaming I/O, interface-driven design — are applicable everywhere.&lt;/p&gt;

&lt;p&gt;If you're learning Go or distributed systems, I highly recommend building something like this. Start simple, add complexity incrementally, and don't be afraid to break things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/subhammahanty235/lilio" rel="noopener noreferrer"&gt;github.com/subhammahanty235/lilio&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Questions? Feedback? Found a bug? Open an issue or reach out!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you found this useful, consider:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⭐ &lt;strong&gt;Starring the repo&lt;/strong&gt; on GitHub&lt;/li&gt;
&lt;li&gt;📢 &lt;strong&gt;Sharing this post&lt;/strong&gt; with someone learning distributed systems&lt;/li&gt;
&lt;li&gt;💬 &lt;strong&gt;Leaving feedback&lt;/strong&gt; — what worked? what confused you?&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; &lt;code&gt;#go&lt;/code&gt; &lt;code&gt;#golang&lt;/code&gt; &lt;code&gt;#distributedsystems&lt;/code&gt; &lt;code&gt;#systemdesign&lt;/code&gt; &lt;code&gt;#backend&lt;/code&gt; &lt;code&gt;#storage&lt;/code&gt; &lt;code&gt;#programming&lt;/code&gt; &lt;code&gt;#s3&lt;/code&gt; &lt;code&gt;#objectstorage&lt;/code&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>backend</category>
      <category>systemdesign</category>
      <category>go</category>
    </item>
    <item>
      <title>Dissecting PostgreSQL — Why Being Read-Optimized Comes at the Cost of Write Speed</title>
      <dc:creator>Subham</dc:creator>
      <pubDate>Sat, 14 Mar 2026 19:59:28 +0000</pubDate>
      <link>https://dev.to/shadowsaurus/dissecting-postgresql-why-being-read-optimized-comes-at-the-cost-of-write-speed-5016</link>
      <guid>https://dev.to/shadowsaurus/dissecting-postgresql-why-being-read-optimized-comes-at-the-cost-of-write-speed-5016</guid>
      <description>&lt;p&gt;You've probably heard this before — &lt;em&gt;"Postgres is great but not ideal for write-heavy workloads."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But &lt;em&gt;why&lt;/em&gt;? What is actually happening under the hood that makes writes slow? And how is Cassandra or any LSM-tree database fundamentally different?&lt;/p&gt;

&lt;p&gt;This post is a deep dive into Postgres internals — storage layout, MVCC, WAL, B-Tree indexes, and how all of these together create a write bottleneck by design. Not as a flaw, but as a deliberate tradeoff.&lt;/p&gt;

&lt;p&gt;Let's dissect it.&lt;/p&gt;




&lt;h2&gt;
  
  
  First — Where Does Postgres Actually Store Your Data?
&lt;/h2&gt;

&lt;p&gt;Before we talk about writes, you need to understand where the data lives physically. No magic here.&lt;/p&gt;

&lt;p&gt;Your &lt;code&gt;users&lt;/code&gt; table is literally a binary file on disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/var/lib/postgresql/16/main/base/16384/24601
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. A plain file. You can &lt;code&gt;hexdump&lt;/code&gt; it and see your users' names and emails in raw bytes.&lt;/p&gt;

&lt;p&gt;Run this to find yours:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;data_directory&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- /var/lib/postgresql/16/main&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_database&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;datname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'myapp'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- 16384&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;relfilenode&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_class&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'users'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- 24601&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;$PGDATA&lt;/code&gt; directory structure looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$PGDATA/
├── base/
│   └── 16384/          ← your database (identified by OID)
│       ├── 24601        ← users table (heap file)
│       ├── 24601_fsm    ← free space map
│       ├── 24601_vm     ← visibility map
│       └── 24605        ← users_pkey index
├── pg_wal/              ← write-ahead log segments
└── postgresql.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The 8KB Page — Storage's Atomic Unit
&lt;/h3&gt;

&lt;p&gt;Postgres doesn't read individual rows from disk. It reads &lt;strong&gt;pages&lt;/strong&gt; — 8KB blocks. Every table file is a sequence of these pages.&lt;/p&gt;

&lt;p&gt;A single page looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────┐
│ Page Header (24 bytes)          │  ← LSN, checksum, flags
├─────────────────────────────────┤
│ Item Pointers (4 bytes each)    │  ← offsets pointing to tuples below
├─────────────────────────────────┤
│                                 │
│         free space              │
│                                 │
├─────────────────────────────────┤
│ Tuple 3 │ Tuple 2 │ Tuple 1     │  ← actual row data, grows upward
├─────────────────────────────────┤
│ Special space                   │  ← used by indexes
└─────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Item pointers grow downward, tuples grow upward. When they meet — page is full. New rows go to a new page.&lt;/p&gt;
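
&lt;p&gt;If you want to see these numbers on a real page, the &lt;code&gt;pageinspect&lt;/code&gt; contrib extension exposes the page header directly. A quick sketch, assuming the &lt;code&gt;users&lt;/code&gt; table from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Header of page 0 of the users table
SELECT lower, upper, special, pagesize
FROM page_header(get_raw_page('users', 0));
-- "lower" ends the item pointer array, "upper" starts the tuple data;
-- the gap between them is the free space in the diagram above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;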

&lt;h3&gt;
  
  
  What a Tuple (Row) Looks Like on Disk
&lt;/h3&gt;

&lt;p&gt;Every row has a 23-byte header you never see:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;t_xmin&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4B&lt;/td&gt;
&lt;td&gt;Transaction ID that inserted this row&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;t_xmax&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4B&lt;/td&gt;
&lt;td&gt;Transaction ID that deleted/updated this row (0 = still alive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;t_ctid&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;6B&lt;/td&gt;
&lt;td&gt;Physical location — &lt;code&gt;(page_number, item_index)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;t_infomask&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2B&lt;/td&gt;
&lt;td&gt;Flags — is it committed? frozen? has nulls?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;t_hoff&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1B&lt;/td&gt;
&lt;td&gt;Header size (grows if there's a null bitmap)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;null bitmap&lt;/td&gt;
&lt;td&gt;varies&lt;/td&gt;
&lt;td&gt;Which columns are NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;column data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;varies&lt;/td&gt;
&lt;td&gt;Your actual data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That &lt;code&gt;t_xmin&lt;/code&gt; and &lt;code&gt;t_xmax&lt;/code&gt; — remember these. They are the key to understanding everything that follows.&lt;/p&gt;
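
&lt;p&gt;You don't need a hex editor to see these fields. &lt;code&gt;ctid&lt;/code&gt;, &lt;code&gt;xmin&lt;/code&gt; and &lt;code&gt;xmax&lt;/code&gt; are exposed as hidden system columns on every table; you just have to name them explicitly (the output below is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT ctid, xmin, xmax, id, name
FROM users
LIMIT 2;

--  ctid  | xmin | xmax | id | name
-- -------+------+------+----+-------
--  (0,1) |   98 |    0 |  1 | Alice    ← page 0, item 1; xmax = 0 means "still alive"
--  (0,2) |   98 |    0 |  2 | Carol
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;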




&lt;h2&gt;
  
  
  The Write Path — Where Things Get Expensive
&lt;/h2&gt;

&lt;p&gt;Now let's trace what actually happens when you run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Bob'&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 1 — Write to WAL First
&lt;/h3&gt;

&lt;p&gt;Before touching any data page, Postgres writes your change to the &lt;strong&gt;Write-Ahead Log&lt;/strong&gt; (&lt;code&gt;pg_wal/&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pg_wal/000000010000000000000001   ← 16MB binary segment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is for crash safety — if the server dies mid-write, Postgres replays the WAL from the last checkpoint and recovers. Every write goes here first, then to the actual heap.&lt;/p&gt;

&lt;p&gt;So already, one logical write = two physical writes.&lt;/p&gt;

&lt;p&gt;WAL is sequential — always appending. That part is fast. The problem comes next.&lt;/p&gt;
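
&lt;p&gt;You can watch this from SQL: the WAL insert position (the LSN) only ever moves forward. A small sketch, assuming PostgreSQL 14+ for the &lt;code&gt;pg_stat_wal&lt;/code&gt; view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Current position in the WAL stream, and which 16MB segment it falls in
SELECT pg_current_wal_insert_lsn();
SELECT pg_walfile_name(pg_current_wal_insert_lsn());

-- Totals since the last stats reset: every write shows up here first
SELECT wal_records, wal_bytes
FROM pg_stat_wal;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;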

&lt;h3&gt;
  
  
  Step 2 — Find the Right Page (Random I/O)
&lt;/h3&gt;

&lt;p&gt;Postgres needs to find the page containing the row with &lt;code&gt;id = 5&lt;/code&gt;. It checks the &lt;strong&gt;shared buffer pool&lt;/strong&gt; first (RAM cache). If the page is there — great, cache hit. If not, it reads the 8KB page from disk.&lt;/p&gt;

&lt;p&gt;This is a &lt;strong&gt;random I/O&lt;/strong&gt; — the disk head jumps to wherever that page lives. On spinning disks, random I/O is painfully slow (on the order of ~100 random reads per second, versus hundreds of MB/s of sequential throughput). Even on SSDs, random reads are more expensive than sequential ones.&lt;/p&gt;
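
&lt;p&gt;You can see how often you actually pay for that disk trip. Per-table buffer statistics record hits against the shared buffer pool versus blocks read from disk (a rough signal; bulk scans and restarts skew the counters):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT relname,
       heap_blks_hit,
       heap_blks_read,
       round(heap_blks_hit * 100.0 /
             nullif(heap_blks_hit + heap_blks_read, 0), 2) AS cache_hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;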

&lt;h3&gt;
  
  
  Step 3 — MVCC: The Old Row is NOT Deleted
&lt;/h3&gt;

&lt;p&gt;This is where Postgres gets genuinely interesting — and where the write overhead accumulates.&lt;/p&gt;

&lt;p&gt;When you &lt;code&gt;UPDATE&lt;/code&gt; a row, Postgres does &lt;strong&gt;not&lt;/strong&gt; modify it in place. Instead:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The old row's &lt;code&gt;t_xmax&lt;/code&gt; is set to the current transaction ID — marking it as "deleted by this transaction"&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;brand new row version&lt;/strong&gt; is inserted on the page, with &lt;code&gt;t_xmin&lt;/code&gt; = current transaction ID&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So after your UPDATE, the page has &lt;strong&gt;two versions&lt;/strong&gt; of the row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────┐
│  Tuple (old): t_xmin=100, t_xmax=205 ← dead    │
│  name = 'Alice'                                │
├────────────────────────────────────────────────┤
│  Tuple (new): t_xmin=205, t_xmax=0   ← alive   │
│  name = 'Bob'                                  │
└────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is &lt;strong&gt;MVCC — Multi-Version Concurrency Control&lt;/strong&gt;. The big benefit: readers and writers don't block each other. A &lt;code&gt;SELECT&lt;/code&gt; running at transaction 204 will still see &lt;code&gt;'Alice'&lt;/code&gt; because the new version isn't visible to it yet. No locks needed for reads.&lt;/p&gt;

&lt;p&gt;The cost: every UPDATE is secretly an INSERT + a mark operation. Dead tuples pile up on disk.&lt;/p&gt;
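
&lt;p&gt;A minimal sketch to watch this happen, reusing the hidden system columns from earlier (the transaction IDs and ctids shown are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;
SELECT txid_current();                              -- say it returns 205

SELECT ctid, xmin, xmax, name FROM users WHERE id = 5;
--  (0,3) | 100 | 0 | Alice

UPDATE users SET name = 'Bob' WHERE id = 5;

SELECT ctid, xmin, xmax, name FROM users WHERE id = 5;
--  (0,7) | 205 | 0 | Bob      ← a new tuple at a new slot, xmin = our txid
COMMIT;

-- The old tuple at (0,3) is still on the page with xmax = 205.
-- It is simply invisible to new snapshots until VACUUM reclaims it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;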

&lt;h3&gt;
  
  
  Step 4 — Update Every Index
&lt;/h3&gt;

&lt;p&gt;If your &lt;code&gt;users&lt;/code&gt; table has 5 indexes (primary key, email, created_at, status, name), then a single UPDATE that changes any indexed column forces a new entry in &lt;em&gt;every&lt;/em&gt; index on the table: the new row version lives at a new physical location, so each index must point to it.&lt;/p&gt;

&lt;p&gt;Each index is a separate B-Tree file on disk. Each index update is another random I/O.&lt;/p&gt;

&lt;p&gt;One &lt;code&gt;UPDATE&lt;/code&gt; → 1 WAL write + 1 heap page read + 1 heap page write + N index page writes.&lt;/p&gt;

&lt;p&gt;This is write amplification in action.&lt;/p&gt;
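
&lt;p&gt;There is one escape hatch: if no indexed column changed and the new version fits on the same page, Postgres performs a HOT (heap-only tuple) update and skips the index writes entirely. Worth checking how often your workload gets that discount:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT relname,
       n_tup_upd,
       n_tup_hot_upd,
       round(n_tup_hot_upd * 100.0 / nullif(n_tup_upd, 0), 2) AS hot_pct
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC;
-- Low hot_pct on a busy table = most updates are paying the full N-index price
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;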

&lt;h3&gt;
  
  
  Step 5 — VACUUM Cleans Up the Mess (Later)
&lt;/h3&gt;

&lt;p&gt;Dead tuples don't disappear. They sit on disk pages, taking up space, making full table scans slower. &lt;code&gt;autovacuum&lt;/code&gt; runs periodically in the background to reclaim them.&lt;/p&gt;

&lt;p&gt;But VACUUM itself consumes I/O and CPU. And if autovacuum can't keep up with your write rate — table bloat happens. Pages fill up with dead tuples. Sequential scans slow down. Index bloat increases.&lt;/p&gt;
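
&lt;p&gt;A quick way to check whether autovacuum is keeping up on your busiest tables (the &lt;code&gt;users&lt;/code&gt; table here is just the running example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT relname,
       n_dead_tup,
       last_autovacuum,
       autovacuum_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Or clean up by hand and watch what gets reclaimed
VACUUM (VERBOSE) users;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;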

&lt;p&gt;Here's the full write path visualized:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE query
    │
    ▼
WAL Buffer (RAM)
    │
    ▼ flush
WAL on disk (sequential write — fast)
    │
    ▼
Shared Buffer Pool → find/load heap page (random I/O — slow if miss)
    │
    ├──► Mark old tuple t_xmax = txn_id     (dead tuple, stays on disk)
    │
    └──► Insert new tuple on same/new page  (MVCC)
              │
              └──► Update all N index B-Trees  (N × random I/O)
                        │
                        ▼
              VACUUM (later, async) cleans dead tuples
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The B-Tree Problem
&lt;/h2&gt;

&lt;p&gt;Postgres uses &lt;strong&gt;B-Tree&lt;/strong&gt; as the default index structure. B-Trees are brilliant for reads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;WHERE id = 5&lt;/code&gt; → O(log n) traversal from root to leaf&lt;/li&gt;
&lt;li&gt;Range queries (&lt;code&gt;WHERE created_at BETWEEN ...&lt;/code&gt;) → traverse to start, scan leaf nodes&lt;/li&gt;
&lt;li&gt;Sorted results (&lt;code&gt;ORDER BY id&lt;/code&gt;) → free, B-Tree is already sorted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But B-Trees have a fundamental write problem: &lt;strong&gt;updates are random I/O&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When you insert a new value, the B-Tree must find the correct leaf node and insert there. If the leaf is full, it splits — and the split propagates upward. Each node is a page on disk. Inserting into a B-Tree means jumping to a (potentially uncached) page, modifying it, writing it back.&lt;/p&gt;

&lt;p&gt;At scale — millions of writes per second — this random I/O becomes the bottleneck.&lt;/p&gt;
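
&lt;p&gt;With the same &lt;code&gt;pageinspect&lt;/code&gt; extension from earlier you can look at the tree itself: its depth, and how full an individual page is (index name assumed to be &lt;code&gt;users_pkey&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Tree depth: every lookup reads one page per level
SELECT root, level FROM bt_metap('users_pkey');

-- How full is one leaf page? (block 0 is the metapage, so start at 1)
SELECT live_items, dead_items, avg_item_size, free_size
FROM bt_page_stats('users_pkey', 1);
-- When free_size runs out, the next insert into this leaf forces a page split
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;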

&lt;p&gt;Compare this to what write-optimized databases use.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why LSM Trees (Cassandra, RocksDB) Are Different
&lt;/h2&gt;

&lt;p&gt;LSM = &lt;strong&gt;Log-Structured Merge Tree&lt;/strong&gt;. The core insight: sequential writes to disk are an order of magnitude faster than random writes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Spinning disk:  sequential ~500 MB/s  vs  random ~1 MB/s  (500x difference!)
SSD:            sequential ~3 GB/s    vs  random ~200 MB/s (15x difference!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LSM trees exploit this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write arrives
    │
    ▼
MemTable (RAM, sorted) ← microseconds, no disk I/O
    │
    ▼ when full, flush
SSTable on disk (immutable, sequential write) ← fast!
    │
    ▼ background
Compaction — merge SSTables, remove old versions ← sequential, amortized
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No random I/O on the write path. No index updates. No MVCC dead tuples. Writes just go to RAM, then get flushed sequentially to disk.&lt;/p&gt;

&lt;p&gt;The tradeoff: &lt;strong&gt;reads are slower&lt;/strong&gt;. To read a value, you might need to check multiple SSTables (the value could be in any of them before compaction). Bloom filters help, but LSM reads are fundamentally more expensive than a B-Tree lookup.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Tradeoff, Visualized
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;PostgreSQL (B-Tree)&lt;/th&gt;
&lt;th&gt;Cassandra (LSM Tree)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Write path&lt;/td&gt;
&lt;td&gt;Random I/O + MVCC + index updates&lt;/td&gt;
&lt;td&gt;Sequential write to RAM → disk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read path&lt;/td&gt;
&lt;td&gt;B-Tree O(log n), very fast&lt;/td&gt;
&lt;td&gt;Check MemTable + multiple SSTables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Updates&lt;/td&gt;
&lt;td&gt;New row version + dead tuple&lt;/td&gt;
&lt;td&gt;Append new version, compaction later&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disk I/O pattern&lt;/td&gt;
&lt;td&gt;Random&lt;/td&gt;
&lt;td&gt;Sequential&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Garbage collection&lt;/td&gt;
&lt;td&gt;VACUUM (explicit)&lt;/td&gt;
&lt;td&gt;Compaction (background)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transactions&lt;/td&gt;
&lt;td&gt;Full ACID&lt;/td&gt;
&lt;td&gt;Eventual consistency (tunable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex queries&lt;/td&gt;
&lt;td&gt;Excellent (JOINs, aggregations)&lt;/td&gt;
&lt;td&gt;Very limited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Neither is better. They solve different problems.&lt;/p&gt;




&lt;h2&gt;
  
  
  When Does Postgres Write Performance Actually Hurt?
&lt;/h2&gt;

&lt;p&gt;Postgres handles typical web app write loads just fine. The bottleneck shows up when:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. High UPDATE rate on wide tables with many indexes&lt;/strong&gt;&lt;br&gt;
Every UPDATE touches N index files. A table with 8 indexes and 10,000 updates/second = potentially 80,000 random I/O operations per second.&lt;/p&gt;
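
&lt;p&gt;Worth auditing whether all of those indexes earn their keep, since each one is written on every non-HOT update whether or not it is ever read (table name assumed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'users'
ORDER BY idx_scan;
-- idx_scan = 0 → an index you pay for on every write and never read from
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;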

&lt;p&gt;&lt;strong&gt;2. autovacuum can't keep up&lt;/strong&gt;&lt;br&gt;
Dead tuples accumulate faster than VACUUM can reclaim them. Table bloat increases. Full scans read more pages. The problem compounds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check table bloat&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;relname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_live_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;nullif&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_live_tup&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dead_pct&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_tables&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n_dead_tup&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Checkpoint I/O spikes&lt;/strong&gt;&lt;br&gt;
Every &lt;code&gt;checkpoint_timeout&lt;/code&gt; (default 5 min), Postgres flushes all dirty pages to disk. This creates a spike. Tune &lt;code&gt;checkpoint_completion_target = 0.9&lt;/code&gt; to spread the I/O.&lt;/p&gt;
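
&lt;p&gt;One way to tell whether checkpoints are hurting you (view layout assumed for PostgreSQL 16 and earlier; newer versions move these counters to &lt;code&gt;pg_stat_checkpointer&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT checkpoints_timed,     -- triggered by the timer (fine)
       checkpoints_req,       -- forced because WAL filled up (not fine)
       buffers_checkpoint     -- dirty pages flushed by checkpoints
FROM pg_stat_bgwriter;
-- A high checkpoints_req count usually means max_wal_size is too small,
-- producing exactly these I/O spikes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;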

&lt;p&gt;&lt;strong&gt;4. Transaction ID Wraparound&lt;/strong&gt;&lt;br&gt;
Every transaction gets a 32-bit ID. At ~2 billion transactions, wraparound happens — Postgres can no longer tell old from new. It enters emergency autovacuum. This is a real production incident waiting to happen if you're not monitoring it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Monitor wraparound risk&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;datname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datfrozenxid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;txn_age&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_database&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;txn_age&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- If age &amp;gt; 1.5 billion — start worrying&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  How to Squeeze More Write Performance Out of Postgres
&lt;/h2&gt;

&lt;p&gt;If you need Postgres for write-heavy workloads, here's what actually helps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Tune autovacuum aggressively for busy tables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Default &lt;code&gt;autovacuum_vacuum_scale_factor = 0.2&lt;/code&gt; means VACUUM only triggers after 20% of rows are dead. For a 10M row table that's 2M dead tuples before cleanup. Tighten it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_vacuum_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;-- trigger at 1%&lt;/span&gt;
  &lt;span class="n"&gt;autovacuum_analyze_scale_factor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;005&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Batch writes, avoid single-row inserts&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Bad: 1000 round trips, 1000 WAL flushes&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(...);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(...);&lt;/span&gt;
&lt;span class="c1"&gt;-- ... × 1000&lt;/span&gt;

&lt;span class="c1"&gt;-- Good: 1 round trip, 1 WAL flush&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(...),&lt;/span&gt; &lt;span class="p"&gt;(...),&lt;/span&gt; &lt;span class="p"&gt;(...);&lt;/span&gt; &lt;span class="c1"&gt;-- × 1000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Use &lt;code&gt;COPY&lt;/code&gt; for bulk loads&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;COPY&lt;/code&gt; is 10-50x faster than &lt;code&gt;INSERT&lt;/code&gt; for bulk data. It bypasses a lot of overhead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'/tmp/events.csv'&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Partial indexes to reduce index maintenance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of indexing everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Don't index all rows&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Only index what you actually query&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fewer index entries = faster updates on non-active rows.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture in One Picture
&lt;/h2&gt;

&lt;p&gt;Here's the full Postgres architecture, from your query to the disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your App
    │  SQL string over TCP (port 5432)
    ▼
Postmaster ──► fork() → Backend Process (1 per connection)
                              │
                    ┌─────────┼──────────┐
                    ▼         ▼          ▼
              Buffer Pool   Locks     WAL Buffer
              (shared RAM)           (shared RAM)
                    │                    │
                    ▼                    ▼
               BgWriter             WAL Writer
               Autovacuum           Checkpointer
                    │                    │
                    ▼                    ▼
              Heap files (base/)     pg_wal/ segments
              Index files
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every query goes through: Parse → Analyze → Rewrite → Plan → Execute.&lt;/p&gt;

&lt;p&gt;The planner is where performance is won or lost — it estimates the cost of every possible execution plan and picks the cheapest. Feed it bad statistics (stale &lt;code&gt;pg_statistic&lt;/code&gt;) and it picks wrong plans.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Always run after bulk loads&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- See what plan the planner chose&lt;/span&gt;
&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BUFFERS&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'bob@example.com'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Postgres is read-optimized not by accident — it's the consequence of specific, deliberate design choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MVCC&lt;/strong&gt; → readers never block writers (great for concurrency, bad for write amplification)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;B-Tree indexes&lt;/strong&gt; → fast O(log n) reads, expensive random-I/O writes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heap storage&lt;/strong&gt; → rows stored in 8KB pages by insertion order, not update-friendly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WAL&lt;/strong&gt; → crash safety, but doubles write I/O&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VACUUM&lt;/strong&gt; → background GC for MVCC dead tuples, adds overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These same choices are what make Postgres excellent at complex SQL queries, strong consistency, and concurrent reads. You can't get fast reads and fast writes from the same design without tradeoffs.&lt;/p&gt;

&lt;p&gt;If you need extreme write throughput: Cassandra, ScyllaDB, or ClickHouse (for analytics). If you need correctness, complex queries, and strong consistency: Postgres is still one of the best databases ever built.&lt;/p&gt;

&lt;p&gt;The trick is knowing &lt;em&gt;which tradeoff you're making&lt;/em&gt; — and this post was about making sure you know exactly why.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Next up in this series — we flip sides completely. We'll dissect a write-heavy database from the inside: how LSM trees actually work, what happens during compaction, and why Cassandra can eat millions of writes per second without breaking a sweat.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>postgres</category>
      <category>watercooler</category>
    </item>
  </channel>
</rss>
