Last time we dissected Postgres and found that its write slowness isn't a bug — it's the direct consequence of MVCC, B-Tree indexes, and heap storage all working together to make reads fast.
MongoDB is often pitched as the alternative for flexible, write-heavy workloads. But why is it better at writes? What does "flexible schema" actually mean on disk? And how does sharding actually work when your data grows beyond a single machine?
Let's open it up.
First — When Does MongoDB Actually Make Sense?
Before internals, the scenario. Because choosing a database without understanding your workload is how you end up rewriting everything in 18 months.
Consider a product catalog — something like Flipkart or Amazon. You have phones, t-shirts, furniture, and groceries. Each category has completely different attributes:
// Mobile phone
{
"_id": "prod_123",
"name": "iPhone 15",
"category": "electronics",
"specs": {
"ram": "8GB",
"storage": "256GB",
"battery": "3877mAh"
},
"variants": [
{ "color": "black", "price": 79999, "stock": 45 },
{ "color": "white", "price": 79999, "stock": 12 }
]
}
// T-shirt — completely different shape
{
"_id": "prod_456",
"name": "Cotton T-Shirt",
"category": "clothing",
"specs": {
"material": "100% cotton",
"fit": "slim"
},
"variants": [
{ "size": "S", "color": "red", "price": 599, "stock": 200 }
]
}
In Postgres, you'd either create 50 nullable columns, use EAV (Entity-Attribute-Value — a known nightmare), or store a JSONB blob. The last option works, but then you're essentially building MongoDB inside Postgres.
MongoDB wins here because the document model is native — no schema migration when you add a new product category, no ALTER TABLE on a 50M row table at 2 AM.
Where MongoDB loses: Anything needing strong ACID across multiple documents — orders, payments, financial ledgers. Multi-document transactions exist in MongoDB (since v4.0) but they're slower and less battle-tested than Postgres. Use the right tool.
The Storage Engine — WiredTiger
MongoDB before version 3.2 used MMAPv1 — memory-mapped files directly. Simple, but had brutal limitations: collection-level locking (one write blocks all reads on that collection), no compression, poor write performance.
In 2014, MongoDB acquired WiredTiger. Everything changed.
What WiredTiger Actually Is
Here's the thing most articles get wrong — WiredTiger stores data in B-Trees on disk, not LSM trees. The files look like this:
/var/lib/mongodb/
├── collection-0-1234567890.wt ← your users collection, B-Tree file
├── collection-2-1234567890.wt ← another collection
├── index-1-1234567890.wt ← index file
├── WiredTigerLog.0000000001 ← journal (WAL equivalent)
└── WiredTiger.wt ← metadata
So if both Postgres and WiredTiger use B-Trees on disk, why are MongoDB writes faster?
Because WiredTiger uses an aggressive in-memory buffer with in-place updates and automatic MVCC cleanup — the combination that Postgres doesn't have.
Three specific differences:
1. MVCC location. Postgres keeps dead tuple versions on disk pages until VACUUM cleans them. WiredTiger keeps old versions in RAM only, for the duration of active transactions. Transaction commits → old version gone from memory automatically. No VACUUM needed.
2. Compression. WiredTiger compresses every page (Snappy by default, Zstd optionally). Postgres has no page-level compression by default. Less data on disk = less I/O.
3. Cache size. WiredTiger defaults to 50% of RAM. Postgres shared_buffers defaults to 128MB — criminally small for production. More cache = more cache hits = less disk I/O.
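The first difference, MVCC cleanup in RAM, can be sketched as a toy version store (plain JavaScript, nothing like WiredTiger's real data structures; just the lifecycle):

```javascript
// Toy model of WiredTiger-style MVCC: old versions live in memory only
// and are reclaimed inline as transactions finish -- no VACUUM pass.
class VersionStore {
  constructor() {
    this.data = new Map();       // key -> version chain, newest first
    this.activeTxns = new Set(); // ids of open transactions
  }
  begin(txn) { this.activeTxns.add(txn); }
  write(key, txn, value) {
    const chain = this.data.get(key) || [];
    chain.unshift({ txn, value });
    this.data.set(key, chain);
  }
  commit(txn) {
    this.activeTxns.delete(txn);
    const oldest = Math.min(...this.activeTxns, Infinity);
    for (const [key, chain] of this.data) {
      // Keep the newest version, plus whatever the oldest open txn can see.
      const keepIdx = chain.findIndex(v => v.txn <= oldest);
      this.data.set(key, chain.slice(0, Math.max(keepIdx + 1, 1)));
    }
  }
  versions(key) { return (this.data.get(key) || []).length; }
}
```

The point is structural: reclamation happens at commit time, inline, so there is no background vacuum process to fall behind.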
What a Document Looks Like on Disk
MongoDB stores documents in BSON — Binary JSON. Compact, typed, fast to parse.
{ name: "Bob", age: 25 }
→ BSON bytes:
\x1c\x00\x00\x00 (document size: 28 bytes)
\x02 (type: string)
name\x00 (field name)
\x04\x00\x00\x00Bob\x00 (string value)
\x10 (type: int32)
age\x00 (field name)
\x19\x00\x00\x00 (value: 25)
\x00 (document end)
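You can reproduce this layout by hand with Node's Buffer. A sketch covering only the two element types used above (the real bson library handles every type):

```javascript
// Hand-rolled BSON encoding of { name: <string>, age: <int32> }.
// Layout: int32 total size, elements, trailing 0x00.
function encodeNameAge(name, age) {
  const nameBytes = Buffer.from(name, 'utf8');
  const parts = [];
  // String element: type 0x02, cstring field name, int32 length, bytes, NUL.
  parts.push(Buffer.from([0x02]), Buffer.from('name\0', 'utf8'));
  const strLen = Buffer.alloc(4);
  strLen.writeInt32LE(nameBytes.length + 1); // +1 for the trailing NUL
  parts.push(strLen, nameBytes, Buffer.from([0x00]));
  // Int32 element: type 0x10, cstring field name, little-endian value.
  const ageBuf = Buffer.alloc(4);
  ageBuf.writeInt32LE(age);
  parts.push(Buffer.from([0x10]), Buffer.from('age\0', 'utf8'), ageBuf);
  const body = Buffer.concat(parts);
  const size = Buffer.alloc(4);
  size.writeInt32LE(4 + body.length + 1); // size prefix + body + end byte
  return Buffer.concat([size, body, Buffer.from([0x00])]);
}
```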
Each document has an _id field — ObjectId by default (12-byte unique identifier). This becomes the primary key and the default index.
The Write Path — From Your Code to Durable Storage
When you run db.users.insertOne({name: "Bob"}), here's everything that actually happens:
Step 1 — Wire Protocol
Your driver serializes the document to BSON and sends it over TCP port 27017 using MongoDB's binary wire protocol (OP_MSG). Along with the document, it sends a Write Concern — this single parameter controls the entire durability guarantee.
More on Write Concern at the end of this section. First, what happens on the server.
Step 2 — WiredTiger Cache (RAM)
The write lands in WiredTiger's in-memory cache first — no disk I/O yet. The relevant B-Tree page is found (or created), the update is attached to that page in memory, and the page is marked dirty.
If a page fills up (documents grew or new ones arrived), WiredTiger splits it: a standard B-Tree page split. One detail many articles get wrong: "document padding" was an MMAPv1 strategy. WiredTiger doesn't pad documents. Updates accumulate in memory, and checkpoints write dirty pages copy-on-write to new locations on disk; pages are never overwritten in place.
Step 3 — Journal Write (Crash Safety)
Simultaneously, the change is written to the journal — WiredTiger's WAL equivalent:
/var/lib/mongodb/journal/WiredTigerLog.0000000001
Sequential append. Fast. Every 100ms, the journal is flushed to disk (configurable via storage.journal.commitIntervalMs). This means in the default configuration, a crash can lose up to 100ms of writes.
If you set j: true in Write Concern, MongoDB waits for the journal flush before acknowledging — safer, slower.
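The loss window can be modeled with a toy group-commit function: given write times, the flush interval, and a crash time, which writes survive? (Illustrative only; real flush scheduling is more involved.)

```javascript
// Toy model of journal group commit: the journal is flushed to disk at
// every multiple of `intervalMs`. A crash at `crashTimeMs` loses every
// write after the last completed flush.
function durableAfterCrash(writeTimesMs, intervalMs, crashTimeMs) {
  const lastFlush = Math.floor(crashTimeMs / intervalMs) * intervalMs;
  return writeTimesMs.filter(t => t <= lastFlush);
}
```

With the default 100ms interval, a write at t=150ms and a crash at t=180ms means that write never reached the journal.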
Step 4 — Oplog Entry (Replication)
If you're running a replica set (and in production you always are), the write is also recorded in the oplog:
// local.oplog.rs — a special capped collection
{
op: "i", // i=insert, u=update, d=delete
ns: "myapp.users", // namespace
o: { _id: ObjectId("..."), name: "Bob" }, // the document
ts: Timestamp(1234567890, 1), // logical clock
v: 2 // oplog version
}
Secondaries tail this collection continuously — like tail -f on a log file — and apply each entry to their own WiredTiger instance. This is the entire replication mechanism.
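A minimal sketch of that apply loop, using a Map as a stand-in for the secondary's storage engine (field names loosely follow the oplog format above; real apply logic handles many more op types):

```javascript
// Apply oplog entries to a local store. In the real oplog, `o` carries
// the document or update fields and `o2` identifies the target of an update.
function applyOplog(store, entries) {
  for (const e of entries) {
    if (e.op === 'i') store.set(e.o._id, e.o);
    else if (e.op === 'u') store.set(e.o2._id, { ...store.get(e.o2._id), ...e.o });
    else if (e.op === 'd') store.delete(e.o._id);
  }
  return store;
}
```

Because entries are applied in timestamp order, replaying the same oplog on any node converges to the same state — which is also how initial sync catch-up works.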
Step 5 — Checkpoint (Durable to Disk)
Every 60 seconds (or once 2GB of journal data accumulates), WiredTiger runs a checkpoint — all dirty pages in the cache are flushed to their respective .wt B-Tree files on disk. After a checkpoint, the journal up to that point can be safely discarded.
The full write path:
insertOne({name: "Bob"})
│
▼
BSON serialize → TCP (port 27017)
│
▼
WiredTiger cache → dirty page marked
│
├──► Journal append (sequential, every 100ms flush)
│
└──► Oplog entry (local.oplog.rs) → secondaries tail this
│
▼ every 60s
Checkpoint → .wt files on disk (durable)
Write Concern — The Durability Dial
This is the most important knob in MongoDB writes. Most developers leave it at the default and don't think about it — which is fine until something breaks.
| Write Concern | What it means | Risk |
|---|---|---|
| w: 0 | Fire and forget — no ack | Complete data loss possible |
| w: 1 | Primary RAM — default | 100ms window: crash → data loss |
| w: 1, j: true | Primary journal flushed | Single node crash safe |
| w: majority | Majority nodes RAM | Network partition safe |
| w: majority, j: true | Majority nodes journal | Safest. Use for financial data |
The common mistake: running production with w: 1 (default) and assuming the data is safe. Kill the primary 50ms after a write — that write is gone. For anything that matters, use w: majority.
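A sketch with the official Node.js driver (the rs0 replica set name and localhost URI are placeholders for your own deployment):

```javascript
const { MongoClient } = require('mongodb');

// Client-wide default: majority + journaled. Survives a primary crash
// and a network partition, at the cost of write latency.
const client = new MongoClient('mongodb://localhost:27017/?replicaSet=rs0', {
  writeConcern: { w: 'majority', j: true },
});

// Per-operation override for low-value, high-volume writes:
// client.db('myapp').collection('metrics')
//   .insertOne(doc, { writeConcern: { w: 1 } });
```

Write concern can also be set in the connection string itself (w=majority&journal=true), which keeps the policy out of application code.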
Replication — How Replica Sets Actually Work
MongoDB's replication unit is the Replica Set — a group of mongod processes that all hold the same data.
Minimum setup: 1 Primary + 2 Secondaries. Why three? Because elections require a majority. With only two nodes, if one goes down, the survivor holds just one vote out of two, which is not a majority, so writes stop.
The Oplog Tailing Mechanism
Secondary nodes maintain a long-polling cursor on the primary's local.oplog.rs. Every new oplog entry triggers the secondary to:
- Read the entry
- Apply the operation to its own WiredTiger instance
- Update its lastAppliedTimestamp
- Send a heartbeat back to the primary with this timestamp
The primary uses these timestamps to determine replication lag — how far behind each secondary is. With w: majority write concern, the primary waits until enough secondaries have confirmed applying the write before acknowledging to the client.
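The majority check itself is simple. A sketch with timestamps as plain numbers (the real implementation compares hybrid logical clock timestamps):

```javascript
// Is a write at timestamp `ts` majority-committed? Count the primary
// plus every secondary whose last-applied timestamp has reached `ts`,
// and compare against the replica set majority.
function isMajorityCommitted(ts, secondaryAppliedTs, totalNodes) {
  const acks = 1 + secondaryAppliedTs.filter(t => t >= ts).length; // +1 = primary
  return acks >= Math.floor(totalNodes / 2) + 1;
}
```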
Elections — What Happens When Primary Goes Down
Primary sends heartbeats every 2 seconds. If a secondary doesn't hear from primary for 10 seconds — it calls an election.
1. Secondary times out (10s no heartbeat)
2. Increments its term number → becomes CANDIDATE
3. Sends RequestVote to all members: "my oplog is at ts:X, vote for me?"
4. Members check: is candidate's oplog as fresh as mine?
→ Yes → vote granted
→ No → vote refused
5. Candidate gets majority → becomes PRIMARY
6. Others step down to SECONDARY
7. Writes resume (typically 10-30s total downtime)
The key insight: The candidate with the most up-to-date oplog wins. This prevents electing a secondary that missed recent writes.
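A toy model of that vote rule (real elections also involve member priorities, vetoes, and more careful term bookkeeping):

```javascript
// A voter grants its vote only if the candidate's term is not stale and
// the candidate's oplog is at least as fresh as the voter's own.
function grantVote(voter, candidate) {
  return candidate.term >= voter.term &&
         candidate.lastOplogTs >= voter.lastOplogTs;
}

function winsElection(candidate, voters) {
  // Candidate votes for itself; needs a majority of all members.
  const votes = 1 + voters.filter(v => grantVote(v, candidate)).length;
  const members = voters.length + 1;
  return votes >= Math.floor(members / 2) + 1;
}
```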
Production implication: Your application must handle 10-30 seconds of write unavailability during elections. Use retryWrites: true in your connection string — MongoDB drivers automatically retry eligible operations after a new primary is elected.
Read Preference — Where Do Reads Go?
| Mode | Behavior | Use case |
|---|---|---|
| primary (default) | Only primary | Always fresh data |
| primaryPreferred | Primary, fallback secondary | Good general default |
| secondary | Only secondaries | Read scaling, stale data ok |
| secondaryPreferred | Secondary, fallback primary | Read-heavy workloads |
| nearest | Lowest latency node | Multi-region, geo-distributed |
The gotcha with secondary reads: replication lag. If your secondary is 500ms behind and a user writes then immediately reads, they won't see their own write. This is the read-your-own-writes problem.
Fix: use causal consistency at the session level:
const session = client.startSession({ causalConsistency: true });
await db.collection('users').insertOne({ name: 'Bob' }, { session });
// This read is guaranteed to see the above write
await db.collection('users').findOne({ name: 'Bob' }, { session });
Sharding — Scaling Beyond One Machine
Replication gives you high availability — same data on multiple nodes. Sharding gives you horizontal scale — different data on different nodes.
These are different problems. Both matter in production. In a sharded cluster, each shard is itself a replica set — so you get both.
The Three Components
mongos (router): This is what your application connects to. It's stateless — holds no data, just routes queries. Always run 2+ behind a load balancer for HA.
Config Servers: A dedicated replica set (3 nodes) that stores the cluster's metadata — which chunk of data lives on which shard, shard membership, balancer status. The cluster's brain. If the config server replica set loses its primary, the cluster keeps serving reads and writes from cached routing metadata, but chunk splits, migrations, and other metadata changes stop until it recovers.
Shards: Where your actual data lives. Each shard is a full replica set.
Your App
│
▼
mongos × 2 (stateless routers)
│ ↕ chunk map
│ Config Server RS (× 3)
│
├──► Shard 1 RS (users: country A-H)
├──► Shard 2 RS (users: country I-R)
└──► Shard 3 RS (users: country S-Z)
Chunks — How Data Is Divided
MongoDB divides a collection into chunks — each chunk represents a contiguous range of shard key values. Default chunk size is 128MB.
// Enable sharding on database
sh.enableSharding("myapp")
// Shard the users collection by country
sh.shardCollection("myapp.users", { country: 1 })
Now MongoDB splits the country value space into chunks and distributes them across shards. When a chunk grows beyond 128MB, it splits automatically. The balancer — a background process — migrates chunks between shards to keep distribution even.
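Routing against the chunk map can be sketched as a range lookup (toy key ranges mirroring the country-based diagram above; the real map lives on the config servers and is cached by mongos):

```javascript
// Each chunk owns the half-open shard key range [min, max) and names
// the shard it currently lives on.
const chunkMap = [
  { min: 'A', max: 'I', shard: 'shard1' },
  { min: 'I', max: 'S', shard: 'shard2' },
  { min: 'S', max: '\uffff', shard: 'shard3' }, // \uffff as a toy "max key"
];

function routeByRange(chunks, key) {
  const chunk = chunks.find(c => key >= c.min && key < c.max);
  return chunk ? chunk.shard : null;
}
```

When the balancer moves a chunk, only its shard field changes; the key ranges stay stable until a split.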
Chunk migration has a real cost: data is physically copied over the network from one shard to another. Schedule maintenance windows or disable the balancer during peak hours:
sh.setBalancerState(false) // peak hours
sh.setBalancerState(true) // off-peak
Shard Key — The Most Important Decision You'll Make
The shard key determines which shard a document lives on. Choose wrong and your cluster becomes useless. Changing it later is painful (though MongoDB 5.0+ supports online resharding).
The hotspot problem — ObjectId as shard key:
ObjectId is monotonically increasing (contains a timestamp). Every new document gets an ObjectId larger than the previous one. All new inserts land in the "last" chunk. That chunk lives on one shard. Result: one shard at 100% write load, others idle.
// BAD — ObjectId is monotonically increasing
sh.shardCollection("myapp.orders", { _id: 1 })
// GOOD — hash distributes uniformly
sh.shardCollection("myapp.orders", { _id: "hashed" })
A hashed shard key solves write hotspots: the hash of an ObjectId is uniformly distributed. But you lose range queries. Finding all orders created between two dates becomes a scatter-gather across all shards, because adjacent _id values hash to unrelated chunks.
Compound shard key balances both:
// country gives locality, userId gives cardinality
sh.shardCollection("myapp.users", { country: 1, userId: 1 })
Queries filtering by country hit one shard. High cardinality (millions of user IDs) means many chunks = good distribution.
The three criteria for a good shard key:
- High cardinality — enough distinct values to create many chunks
- Non-monotonic — inserts spread across chunks, no hotspot
- Query alignment — your most common queries include the shard key
Targeted vs Scatter-Gather Queries
This is where shard key choice shows up in your latency metrics.
Targeted query — shard key present in filter:
// country is the shard key → mongos knows exactly which shard
db.users.find({ country: "IN", name: "Bob" })
// → hits Shard 1 only → fast
Scatter-gather query — no shard key:
// name is not the shard key → mongos has no idea
db.users.find({ name: "Bob" })
// → broadcast to ALL shards → wait for all responses → merge
// → latency = slowest shard's response time
Scatter-gather gets worse with sort + limit:
db.users.find({}).sort({ createdAt: -1 }).limit(10)
// Each shard returns its top 10
// mongos merges 30 documents, returns final 10
// N shards × limit documents fetched, only limit returned
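The merge step mongos performs for that query can be sketched as (toy documents with only a createdAt field):

```javascript
// Each shard returns its own sorted top n; mongos merges the N*n
// candidates and keeps the global top n.
function mergeTopN(perShardResults, n) {
  return perShardResults
    .flat()
    .sort((a, b) => b.createdAt - a.createdAt) // newest first
    .slice(0, n);
}
```

The cost shows up in the fan-out: with 30 shards and limit(10), mongos handles 300 candidate documents to return 10.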
Check if your query is scatter-gather:
db.users.find({ name: "Bob" }).explain("executionStats")
// Look for: "SHARD_MERGE" in winningPlan → scatter-gather
// Look for: "SINGLE_SHARD" → targeted
Production Gotchas
Jumbo chunks — A chunk becomes "jumbo" when all documents in it have the same shard key value, so it can't split further. The balancer can't move jumbo chunks. They pile up on one shard. Fix: choose a higher-cardinality shard key upfront. If you're already here — manual chunk management required.
Transactions across shards — Multi-document transactions that touch multiple shards require a 2-phase commit internally. Slow and lock-heavy. Design your data so related documents live on the same shard (use the same shard key value for related documents).
$lookup across sharded collections — MongoDB's JOIN equivalent ($lookup) across two sharded collections requires pulling data across shards — extremely expensive. In a sharded cluster, favor embedding over referencing. This is MongoDB's design philosophy anyway.
Too-early sharding — Sharding adds operational complexity. A well-tuned single replica set handles a lot. Don't shard until you've genuinely exhausted vertical scaling and read replicas. A common rule: shard when your dataset exceeds what comfortably fits on your biggest available instance.
Putting It All Together
Here's the complete MongoDB architecture from a single write to a sharded, replicated cluster:
db.users.insertOne({ name: "Bob" }, { writeConcern: { w: "majority" } })
│
▼
mongos router
│ chunk map lookup (Config Servers)
│ country: "IN" → Shard 1
▼
Shard 1 — Primary node
├── WiredTiger cache → dirty page (in-memory, fast)
├── Journal append → WiredTigerLog (crash safety)
└── Oplog entry → local.oplog.rs
│
├──────────────────► Secondary 1
│ oplog tailing apply + ack to primary
│
└──────────────────► Secondary 2
oplog tailing apply + ack to primary
│
Primary waits ◄──────┘
majority ack (2/3 nodes) → mongos → client acknowledged
The chain: WiredTiger (how data is stored) → Journal (crash safety) → Oplog (replication fuel) → Replica Set (HA) → Sharding (horizontal scale). Each layer builds on the one below it.
MongoDB vs Postgres — When to Actually Use Each
After going through both internals:
Use MongoDB when:
- Your data structure varies significantly across documents (product catalog, CMS, user profiles)
- You need horizontal write scale from the start (sharding)
- You're storing hierarchical data that maps naturally to documents
- Schema flexibility is genuinely required, not just convenient
Use Postgres when:
- You need strong ACID across multiple related entities (orders, payments, inventory)
- Your data is relational — joins are frequent and important
- You need complex aggregations and window functions
- Consistency guarantees matter more than write throughput
Use both when:
- MongoDB as primary store + Elasticsearch for full-text search
- MongoDB for flexible catalog data + Postgres for transactional order data
- This is not over-engineering — it's using the right tool for each problem
The mistake isn't choosing MongoDB or Postgres. The mistake is choosing one before you understand what your data actually looks like and how you'll query it.
Next in the series — we go fully write-heavy. Cassandra and the LSM Tree: why it can absorb millions of writes per second, what compaction actually does, and the read trade-offs you're silently accepting when you choose it.