You've probably heard this before — "Postgres is great but not ideal for write-heavy workloads."
But why? What is actually happening under the hood that makes writes slow? And how is Cassandra or any LSM-tree database fundamentally different?
This post is a deep dive into Postgres internals — storage layout, MVCC, WAL, B-Tree indexes, and how all of these together create a write bottleneck by design. Not as a flaw, but as a deliberate tradeoff.
Let's dissect it.
First — Where Does Postgres Actually Store Your Data?
Before we talk about writes, you need to understand where the data lives physically. No magic here.
Your users table is literally a binary file on disk:
/var/lib/postgresql/16/main/base/16384/24601
That's it. A plain file. You can hexdump it and see your users' names and emails in raw bytes.
Run this to find yours:
SHOW data_directory;
-- /var/lib/postgresql/16/main
SELECT oid FROM pg_database WHERE datname = 'myapp';
-- 16384
SELECT relfilenode FROM pg_class WHERE relname = 'users';
-- 24601
The $PGDATA directory structure looks like this:
$PGDATA/
├── base/
│ └── 16384/ ← your database (identified by OID)
│ ├── 24601 ← users table (heap file)
│ ├── 24601_fsm ← free space map
│ ├── 24601_vm ← visibility map
│ └── 24605 ← users_pkey index
├── pg_wal/ ← write-ahead log segments
└── postgresql.conf
The 8KB Page — Storage's Atomic Unit
Postgres doesn't read individual rows from disk. It reads pages — 8KB blocks. Every table file is a sequence of these pages.
A single page looks like this:
┌─────────────────────────────────┐
│ Page Header (24 bytes) │ ← LSN, checksum, flags
├─────────────────────────────────┤
│ Item Pointers (4 bytes each) │ ← offsets pointing to tuples below
├─────────────────────────────────┤
│ │
│ free space │
│ │
├─────────────────────────────────┤
│ Tuple 3 │ Tuple 2 │ Tuple 1 │ ← actual row data, grows upward
├─────────────────────────────────┤
│ Special space │ ← used by indexes
└─────────────────────────────────┘
Item pointers grow downward, tuples grow upward. When they meet — page is full. New rows go to a new page.
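The converging regions are easy to model. Here's a toy sketch (plain Python, not Postgres code) of filling one 8KB page with fixed-size rows, using the 24-byte page header and 4-byte item pointers described above; the 100-byte row size is a hypothetical example:

```python
# Toy model of filling one 8KB heap page: each row costs a 4-byte
# item pointer (growing down from the header) plus its tuple bytes
# (growing up from the end of the page).
PAGE_SIZE = 8192
PAGE_HEADER = 24
ITEM_POINTER = 4

def tuples_that_fit(tuple_size):
    """Count rows until the pointer array and tuple data would meet."""
    free = PAGE_SIZE - PAGE_HEADER
    count = 0
    while free >= ITEM_POINTER + tuple_size:
        free -= ITEM_POINTER + tuple_size   # one pointer down, one tuple up
        count += 1
    return count

print(tuples_that_fit(100))   # → 78
```

So roughly 78 rows of ~100 bytes fit per page before a new page is needed.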
What a Tuple (Row) Looks Like on Disk
Every row has a 23-byte header you never see:
| Field | Size | Purpose |
|---|---|---|
| t_xmin | 4B | Transaction ID that inserted this row |
| t_xmax | 4B | Transaction ID that deleted/updated this row (0 = still alive) |
| t_cid | 4B | Command ID within the inserting/deleting transaction |
| t_ctid | 6B | Physical location — (page_number, item_index) |
| t_infomask2 | 2B | Attribute count plus flag bits |
| t_infomask | 2B | Flags — is it committed? frozen? has nulls? |
| t_hoff | 1B | Header size (grows if there's a null bitmap) |
| null bitmap | varies | Which columns are NULL |
| column data | varies | Your actual data |
That t_xmin and t_xmax — remember these. They are the key to understanding everything that follows.
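As a sanity check, the fixed header fields (matching the HeapTupleHeaderData struct in Postgres's htup_details.h) do sum to exactly 23 bytes:

```python
# Fixed tuple-header fields and their sizes in bytes:
HEADER_FIELDS = {
    "t_xmin": 4,       # inserting transaction ID
    "t_xmax": 4,       # deleting/updating transaction ID (0 = alive)
    "t_cid": 4,        # command ID within the transaction (unions with t_xvac)
    "t_ctid": 6,       # (page, item) pointer to this or a newer row version
    "t_infomask2": 2,  # number of attributes plus flag bits
    "t_infomask": 2,   # visibility / commit-status flags
    "t_hoff": 1,       # offset where user data starts
}
print(sum(HEADER_FIELDS.values()))   # → 23
```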
The Write Path — Where Things Get Expensive
Now let's trace what actually happens when you run:
UPDATE users SET name = 'Bob' WHERE id = 5;
Step 1 — Write to WAL First
Before touching any data page, Postgres writes your change to the Write-Ahead Log (pg_wal/).
pg_wal/000000010000000000000001 ← 16MB binary segment
This is for crash safety — if the server dies mid-write, Postgres replays the WAL from the last checkpoint and recovers. Every write goes here first, then to the actual heap.
So already, one logical write = two physical writes.
WAL is sequential — always appending. That part is fast. The problem comes next.
Step 2 — Find the Right Page (Random I/O)
Postgres needs to find the page containing the row with id = 5. It checks the shared buffer pool first (RAM cache). If the page is there — great, cache hit. If not, it reads the 8KB page from disk.
This is a random I/O — the disk head jumps to wherever that page lives. On spinning disks, random I/O is painfully slow (on the order of 100-200 IOPS, versus ~150 MB/s of sequential throughput). Even on SSDs, random reads are more expensive than sequential.
Step 3 — MVCC: The Old Row is NOT Deleted
This is where Postgres gets genuinely interesting — and where the write overhead accumulates.
When you UPDATE a row, Postgres does not modify it in place. Instead:
- The old row's `t_xmax` is set to the current transaction ID — marking it as "deleted by this transaction"
- A brand new row version is inserted on the page, with `t_xmin` = current transaction ID
So after your UPDATE, the page has two versions of the row:
┌────────────────────────────────────────────────┐
│ Tuple (old): t_xmin=100, t_xmax=205 ← dead │
│ name = 'Alice' │
├────────────────────────────────────────────────┤
│ Tuple (new): t_xmin=205, t_xmax=0 ← alive │
│ name = 'Bob' │
└────────────────────────────────────────────────┘
This is MVCC — Multi-Version Concurrency Control. The big benefit: readers and writers don't block each other. A SELECT running at transaction 204 will still see 'Alice' because the new version isn't visible to it yet. No locks needed for reads.
The cost: every UPDATE is secretly an INSERT + a mark operation. Dead tuples pile up on disk.
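The xmin/xmax bookkeeping can be sketched in a few lines of Python. This toy visibility check only models the transaction-ID comparison from the example above; real Postgres snapshots also consult commit status and in-progress transaction lists:

```python
# Toy MVCC visibility check: a reader at snapshot transaction `snap_xid`
# sees a tuple version only if it was inserted by a transaction at or
# before the snapshot and not yet deleted by one.
def visible(tup, snap_xid):
    inserted = tup["t_xmin"] <= snap_xid
    deleted = tup["t_xmax"] != 0 and tup["t_xmax"] <= snap_xid
    return inserted and not deleted

old = {"t_xmin": 100, "t_xmax": 205, "name": "Alice"}  # dead to newer txns
new = {"t_xmin": 205, "t_xmax": 0,   "name": "Bob"}    # current version

# Transaction 204 started before the UPDATE (txn 205):
print([t["name"] for t in (old, new) if visible(t, 204)])  # → ['Alice']
# Transaction 210 sees the new version:
print([t["name"] for t in (old, new) if visible(t, 210)])  # → ['Bob']
```

Both versions coexist on the page; which one a reader "sees" depends purely on its snapshot.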
Step 4 — Update Every Index
If your users table has 5 indexes (primary key, email, created_at, status, name), then a single UPDATE that changes any indexed column must insert a new entry into every index, not just the one on the changed column, because the new row version lives at a new physical location. The exception is a HOT (Heap-Only Tuple) update: when no indexed column changed and the new version fits on the same page, Postgres skips index maintenance entirely.
Each index is a separate B-Tree file on disk. Each index update is another random I/O.
One UPDATE → 1 WAL write + 1 heap page read + 1 heap page write + N index page writes.
This is write amplification in action.
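Using the accounting above, the amplification is easy to quantify. This sketch ignores index WAL records and free-space-map/visibility-map pages, and assumes the update is not HOT-optimized:

```python
# Back-of-the-envelope write amplification for one UPDATE:
# 1 WAL write + 1 heap page write + one index page write per index.
def physical_writes(n_indexes, updates_per_sec):
    writes_per_update = 1 + 1 + n_indexes   # WAL + heap page + indexes
    return writes_per_update * updates_per_sec

# 5 indexes at 10,000 updates/sec:
print(physical_writes(5, 10_000))   # → 70000
```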
Step 5 — VACUUM Cleans Up the Mess (Later)
Dead tuples don't disappear. They sit on disk pages, taking up space, making full table scans slower. autovacuum runs periodically in the background to reclaim them.
But VACUUM itself consumes I/O and CPU. And if autovacuum can't keep up with your write rate — table bloat happens. Pages fill up with dead tuples. Sequential scans slow down. Index bloat increases.
Here's the full write path visualized:
UPDATE query
│
▼
WAL Buffer (RAM)
│
▼ flush
WAL on disk (sequential write — fast)
│
▼
Shared Buffer Pool → find/load heap page (random I/O — slow if miss)
│
├──► Mark old tuple t_xmax = txn_id (dead tuple, stays on disk)
│
└──► Insert new tuple on same/new page (MVCC)
│
└──► Update all N index B-Trees (N × random I/O)
│
▼
VACUUM (later, async) cleans dead tuples
The B-Tree Problem
Postgres uses B-Tree as the default index structure. B-Trees are brilliant for reads:
- `WHERE id = 5` → O(log n) traversal from root to leaf
- Range queries (`WHERE created_at BETWEEN ...`) → traverse to start, scan leaf nodes
- Sorted results (`ORDER BY id`) → free, B-Tree is already sorted
But B-Trees have a fundamental write problem: updates are random I/O.
When you insert a new value, the B-Tree must find the correct leaf node and insert there. If the leaf is full, it splits — and the split propagates upward. Each node is a page on disk. Inserting into a B-Tree means jumping to a (potentially uncached) page, modifying it, writing it back.
At scale — millions of writes per second — this random I/O becomes the bottleneck.
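The leaf-split behavior can be sketched with sorted Python lists standing in for pages. A capacity of 4 keys is artificially tiny (a real 8KB leaf holds hundreds of entries), but the mechanics are the same: every insert must first locate its owning leaf, which in a real tree means touching a potentially uncached page:

```python
import bisect

CAPACITY = 4
leaves = [[]]   # sorted leaves; leaf i owns keys up to its max key

def insert(key):
    # Locate the leaf that owns this key (a "random" page access):
    i = 0
    while i < len(leaves) - 1 and leaves[i] and key > leaves[i][-1]:
        i += 1
    bisect.insort(leaves[i], key)
    if len(leaves[i]) > CAPACITY:          # leaf overflow -> split in half
        mid = len(leaves[i]) // 2
        leaves[i:i + 1] = [leaves[i][:mid], leaves[i][mid:]]

for k in [50, 10, 90, 30, 70, 20, 60]:
    insert(k)
print(leaves)   # → [[10, 20, 30], [50, 60, 70, 90]]
```

Note how inserting 70 forced a split; in a real B-Tree that split also writes a new entry into the parent page, potentially cascading upward.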
Compare this to what write-optimized databases use.
Why LSM Trees (Cassandra, RocksDB) Are Different
LSM = Log-Structured Merge Tree. The core insight: sequential writes to disk are an order of magnitude faster than random writes.
Spinning disk: sequential ~150 MB/s vs random ~1 MB/s (100x+ difference!)
SSD: sequential ~3 GB/s vs small random reads ~200 MB/s (15x difference!)
(Exact figures vary by device, but the sequential advantage holds everywhere.)
LSM trees exploit this:
Write arrives
│
▼
MemTable (RAM, sorted) ← microseconds, no disk I/O
│
▼ when full, flush
SSTable on disk (immutable, sequential write) ← fast!
│
▼ background
Compaction — merge SSTables, remove old versions ← sequential, amortized
No random I/O on the write path. No index updates. No MVCC dead tuples. Writes just go to RAM, then get flushed sequentially to disk.
The tradeoff: reads are slower. To read a value, you might need to check multiple SSTables (the value could be in any of them before compaction). Bloom filters help, but LSM reads are fundamentally more expensive than a B-Tree lookup.
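Here's a toy LSM store showing both sides of that tradeoff: writes only touch RAM until a sequential flush, while reads may have to probe several SSTables. Real engines add Bloom filters and background compaction on top of this; the two-entry memtable limit is just for demonstration:

```python
MEMTABLE_LIMIT = 2
memtable = {}
sstables = []   # newest last; each is an immutable sorted mapping

def put(key, value):
    memtable[key] = value                  # write hits only RAM
    if len(memtable) >= MEMTABLE_LIMIT:
        # Flush: one sequential write of a sorted, immutable SSTable.
        sstables.append(dict(sorted(memtable.items())))
        memtable.clear()

def get(key):
    if key in memtable:                    # check RAM first
        return memtable[key]
    for table in reversed(sstables):       # then SSTables, newest first
        if key in table:
            return table[key]
    return None

put("user:5", "Alice")
put("user:7", "Carol")   # memtable full -> flush SSTable #1
put("user:5", "Bob")     # "update" is just an append of a new version
put("user:9", "Dave")    # flush SSTable #2
print(get("user:5"), len(sstables))   # → Bob 2
```

The stale 'Alice' version still sits in SSTable #1; compaction would merge the tables and drop it, playing the same role VACUUM plays for dead tuples.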
The Core Tradeoff, Visualized
| | PostgreSQL (B-Tree) | Cassandra (LSM Tree) |
|---|---|---|
| Write path | Random I/O + MVCC + index updates | Sequential write to RAM → disk |
| Read path | B-Tree O(log n), very fast | Check MemTable + multiple SSTables |
| Updates | New row version + dead tuple | Append new version, compaction later |
| Disk I/O pattern | Random | Sequential |
| Garbage collection | VACUUM (explicit) | Compaction (background) |
| Transactions | Full ACID | Eventual consistency (tunable) |
| Complex queries | Excellent (JOINs, aggregations) | Very limited |
Neither is better. They solve different problems.
When Does Postgres Write Performance Actually Hurt?
Postgres handles typical web app write loads just fine. The bottleneck shows up when:
1. High UPDATE rate on wide tables with many indexes
Every UPDATE touches N index files. A table with 8 indexes and 10,000 updates/second = potentially 80,000 random I/O operations per second.
2. autovacuum can't keep up
Dead tuples accumulate faster than VACUUM can reclaim them. Table bloat increases. Full scans read more pages. The problem compounds.
-- Check table bloat
SELECT relname, n_dead_tup, n_live_tup,
round(n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
3. Checkpoint I/O spikes
Every checkpoint_timeout (default 5 min), Postgres flushes all dirty pages to disk. This creates a spike. Keep checkpoint_completion_target at 0.9 (the default since Postgres 14) so the flush is spread across most of the interval instead of hitting all at once.
4. Transaction ID Wraparound
Every transaction gets a 32-bit ID, so Postgres can only compare ages within a window of about 2 billion transactions. If a table's oldest unfrozen XID approaches that limit, Postgres can no longer tell old from new: it launches aggressive anti-wraparound autovacuum, and in the worst case refuses new writes entirely. This is a real production incident waiting to happen if you're not monitoring it.
-- Monitor wraparound risk
SELECT datname, age(datfrozenxid) as txn_age
FROM pg_database
ORDER BY txn_age DESC;
-- If age > 1.5 billion — start worrying
How to Squeeze More Write Performance Out of Postgres
If you need Postgres for write-heavy workloads, here's what actually helps:
1. Tune autovacuum aggressively for busy tables
Default autovacuum_vacuum_scale_factor = 0.2 means VACUUM only triggers after 20% of rows are dead. For a 10M row table that's 2M dead tuples before cleanup. Tighten it:
ALTER TABLE events SET (
autovacuum_vacuum_scale_factor = 0.01, -- trigger at 1%
autovacuum_analyze_scale_factor = 0.005
);
2. Batch writes, avoid single-row inserts
-- Bad: 1000 round trips, 1000 WAL flushes
INSERT INTO events VALUES (...);
INSERT INTO events VALUES (...);
-- ... × 1000
-- Good: 1 round trip, 1 WAL flush
INSERT INTO events VALUES (...), (...), (...); -- × 1000
3. Use COPY for bulk loads
COPY is 10-50x faster than INSERT for bulk data. It bypasses a lot of overhead:
COPY events FROM '/tmp/events.csv' WITH (FORMAT csv);
4. Partial indexes to reduce index maintenance
Instead of indexing everything:
-- Don't index all rows
CREATE INDEX ON orders (user_id);
-- Only index what you actually query
CREATE INDEX ON orders (user_id) WHERE status = 'active';
Fewer index entries = faster updates on non-active rows.
The Architecture in One Picture
Here's the full Postgres architecture, from your query to the disk:
Your App
│ SQL string over TCP (port 5432)
▼
Postmaster ──► fork() → Backend Process (1 per connection)
│
┌─────────┼──────────┐
▼ ▼ ▼
Buffer Pool Locks WAL Buffer
(shared RAM) (shared RAM)
│ │
▼ ▼
BgWriter WAL Writer
Autovacuum Checkpointer
│ │
▼ ▼
Heap files (base/) pg_wal/ segments
Index files
Every query goes through: Parse → Analyze → Rewrite → Plan → Execute.
The planner is where performance is won or lost — it estimates the cost of every possible execution plan and picks the cheapest. Feed it bad statistics (stale pg_statistic) and it picks wrong plans.
-- Always run after bulk loads
ANALYZE users;
-- See what plan the planner chose
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE email = 'bob@example.com';
Summary
Postgres is read-optimized not by accident — it's the consequence of specific, deliberate design choices:
- MVCC → readers never block writers (great for concurrency, bad for write amplification)
- B-Tree indexes → fast O(log n) reads, expensive random-I/O writes
- Heap storage → rows stored in 8KB pages by insertion order, not update-friendly
- WAL → crash safety, but doubles write I/O
- VACUUM → background GC for MVCC dead tuples, adds overhead
These same choices are what make Postgres excellent at complex SQL queries, strong consistency, and concurrent reads. You can't have both without tradeoffs.
If you need extreme write throughput: Cassandra, ScyllaDB, or ClickHouse (for analytics). If you need correctness, complex queries, and strong consistency: Postgres is still one of the best databases ever built.
The trick is knowing which tradeoff you're making — and this post was about making sure you know exactly why.
Next up in this series — we flip sides completely. We'll dissect a write-heavy database from the inside: how LSM trees actually work, what happens during compaction, and why Cassandra can eat millions of writes per second without breaking a sweat.