You've probably heard this before — "Postgres is great but not ideal for write-heavy workloads."
But why? What is actually happening under the hood that makes writes slow? And how is Cassandra or any LSM-tree database fundamentally different?
This post is a deep dive into Postgres internals — storage layout, MVCC, WAL, B-Tree indexes, and how all of these together create a write bottleneck by design. Not as a flaw, but as a deliberate tradeoff.
Let's dissect it.
First — Where Does Postgres Actually Store Your Data?
Before we talk about writes, you need to understand where the data lives physically. No magic here.
Your users table is literally a binary file on disk:
/var/lib/postgresql/16/main/base/16384/24601
That's it. A plain file. You can hexdump it and see your users' names and emails in raw bytes.
Run this to find yours:
SHOW data_directory;
-- /var/lib/postgresql/16/main
SELECT oid FROM pg_database WHERE datname = 'myapp';
-- 16384
SELECT relfilenode FROM pg_class WHERE relname = 'users';
-- 24601
The $PGDATA directory structure looks like this:
$PGDATA/
├── base/
│ └── 16384/ ← your database (identified by OID)
│ ├── 24601 ← users table (heap file)
│ ├── 24601_fsm ← free space map
│ ├── 24601_vm ← visibility map
│ └── 24605 ← users_pkey index
├── pg_wal/ ← write-ahead log segments
└── postgresql.conf
The 8KB Page — Storage's Atomic Unit
Postgres doesn't read individual rows from disk. It reads pages — 8KB blocks. Every table file is a sequence of these pages.
A single page looks like this:
┌─────────────────────────────────┐
│ Page Header (24 bytes) │ ← LSN, checksum, flags
├─────────────────────────────────┤
│ Item Pointers (4 bytes each) │ ← offsets pointing to tuples below
├─────────────────────────────────┤
│ │
│ free space │
│ │
├─────────────────────────────────┤
│ Tuple 3 │ Tuple 2 │ Tuple 1 │ ← actual row data, grows upward
├─────────────────────────────────┤
│ Special space │ ← used by indexes
└─────────────────────────────────┘
Item pointers grow downward, tuples grow upward. When they meet — page is full. New rows go to a new page.
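The converging regions are easy to model. Here's a toy sketch (plain Python, not Postgres code) of filling one 8KB page with fixed-size rows, using the 24-byte page header and 4-byte item pointers described above; the 100-byte row size is a hypothetical example:

```python
# Toy model of filling one 8KB heap page: each row costs a 4-byte
# item pointer (growing down from the header) plus its tuple bytes
# (growing up from the end of the page).
PAGE_SIZE = 8192
PAGE_HEADER = 24
ITEM_POINTER = 4

def tuples_that_fit(tuple_size):
    """Count rows until the pointer array and tuple data would meet."""
    free = PAGE_SIZE - PAGE_HEADER
    count = 0
    while free >= ITEM_POINTER + tuple_size:
        free -= ITEM_POINTER + tuple_size   # one pointer down, one tuple up
        count += 1
    return count

print(tuples_that_fit(100))   # → 78
```

So roughly 78 rows of ~100 bytes fit per page before a new page is needed.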
What a Tuple (Row) Looks Like on Disk
Every row has a 23-byte header you never see:
| Field | Size | Purpose |
|---|---|---|
| t_xmin | 4B | Transaction ID that inserted this row |
| t_xmax | 4B | Transaction ID that deleted/updated this row (0 = still alive) |
| t_cid | 4B | Command ID within the inserting/deleting transaction |
| t_ctid | 6B | Physical location — (page_number, item_index) |
| t_infomask2 | 2B | Attribute count plus flag bits |
| t_infomask | 2B | Flags — is it committed? frozen? has nulls? |
| t_hoff | 1B | Header size (grows if there's a null bitmap) |
| null bitmap | varies | Which columns are NULL |
| column data | varies | Your actual data |
That t_xmin and t_xmax — remember these. They are the key to understanding everything that follows.
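As a sanity check, the fixed header fields (matching the HeapTupleHeaderData struct in Postgres's htup_details.h) do sum to exactly 23 bytes:

```python
# Fixed tuple-header fields and their sizes in bytes:
HEADER_FIELDS = {
    "t_xmin": 4,       # inserting transaction ID
    "t_xmax": 4,       # deleting/updating transaction ID (0 = alive)
    "t_cid": 4,        # command ID within the transaction (unions with t_xvac)
    "t_ctid": 6,       # (page, item) pointer to this or a newer row version
    "t_infomask2": 2,  # number of attributes plus flag bits
    "t_infomask": 2,   # visibility / commit-status flags
    "t_hoff": 1,       # offset where user data starts
}
print(sum(HEADER_FIELDS.values()))   # → 23
```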
The Write Path — Where Things Get Expensive
Now let's trace what actually happens when you run:
UPDATE users SET name = 'Bob' WHERE id = 5;
Step 1 — Write to WAL First
Before touching any data page, Postgres writes your change to the Write-Ahead Log (pg_wal/).
pg_wal/000000010000000000000001 ← 16MB binary segment
This is for crash safety — if the server dies mid-write, Postgres replays the WAL from the last checkpoint and recovers. Every write goes here first, then to the actual heap.
So already, one logical write = two physical writes.
WAL is sequential — always appending. That part is fast. The problem comes next.
Step 2 — Find the Right Page (Random I/O)
Postgres needs to find the page containing the row with id = 5. It checks the shared buffer pool first (RAM cache). If the page is there — great, cache hit. If not, it reads the 8KB page from disk.
This is a random I/O — the disk head jumps to wherever that page lives. On spinning disks, random I/O is painfully slow (on the order of 100-200 IOPS, versus ~150 MB/s of sequential throughput). Even on SSDs, random reads are more expensive than sequential.
Step 3 — MVCC: The Old Row is NOT Deleted
This is where Postgres gets genuinely interesting — and where the write overhead accumulates.
When you UPDATE a row, Postgres does not modify it in place. Instead:
- The old row's `t_xmax` is set to the current transaction ID — marking it as "deleted by this transaction"
- A brand new row version is inserted on the page, with `t_xmin` = current transaction ID
So after your UPDATE, the page has two versions of the row:
┌────────────────────────────────────────────────┐
│ Tuple (old): t_xmin=100, t_xmax=205 ← dead │
│ name = 'Alice' │
├────────────────────────────────────────────────┤
│ Tuple (new): t_xmin=205, t_xmax=0 ← alive │
│ name = 'Bob' │
└────────────────────────────────────────────────┘
This is MVCC — Multi-Version Concurrency Control. The big benefit: readers and writers don't block each other. A SELECT running at transaction 204 will still see 'Alice' because the new version isn't visible to it yet. No locks needed for reads.
The cost: every UPDATE is secretly an INSERT + a mark operation. Dead tuples pile up on disk.
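The xmin/xmax bookkeeping can be sketched in a few lines of Python. This toy visibility check only models the transaction-ID comparison from the example above; real Postgres snapshots also consult commit status and in-progress transaction lists:

```python
# Toy MVCC visibility check: a reader at snapshot transaction `snap_xid`
# sees a tuple version only if it was inserted by a transaction at or
# before the snapshot and not yet deleted by one.
def visible(tup, snap_xid):
    inserted = tup["t_xmin"] <= snap_xid
    deleted = tup["t_xmax"] != 0 and tup["t_xmax"] <= snap_xid
    return inserted and not deleted

old = {"t_xmin": 100, "t_xmax": 205, "name": "Alice"}  # dead to newer txns
new = {"t_xmin": 205, "t_xmax": 0,   "name": "Bob"}    # current version

# Transaction 204 started before the UPDATE (txn 205):
print([t["name"] for t in (old, new) if visible(t, 204)])  # → ['Alice']
# Transaction 210 sees the new version:
print([t["name"] for t in (old, new) if visible(t, 210)])  # → ['Bob']
```

Both versions coexist on the page; which one a reader "sees" depends purely on its snapshot.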
Step 4 — Update Every Index
If your users table has 5 indexes (primary key, email, created_at, status, name), then a single UPDATE that changes any indexed column must insert a new entry into every index, not just the one on the changed column, because the new row version lives at a new physical location. The exception is a HOT (Heap-Only Tuple) update: when no indexed column changed and the new version fits on the same page, Postgres skips index maintenance entirely.
Each index is a separate B-Tree file on disk. Each index update is another random I/O.
One UPDATE → 1 WAL write + 1 heap page read + 1 heap page write + N index page writes.
This is write amplification in action.
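Using the accounting above, the amplification is easy to quantify. This sketch ignores index WAL records and free-space-map/visibility-map pages, and assumes the update is not HOT-optimized:

```python
# Back-of-the-envelope write amplification for one UPDATE:
# 1 WAL write + 1 heap page write + one index page write per index.
def physical_writes(n_indexes, updates_per_sec):
    writes_per_update = 1 + 1 + n_indexes   # WAL + heap page + indexes
    return writes_per_update * updates_per_sec

# 5 indexes at 10,000 updates/sec:
print(physical_writes(5, 10_000))   # → 70000
```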
Step 5 — VACUUM Cleans Up the Mess (Later)
Dead tuples don't disappear. They sit on disk pages, taking up space, making full table scans slower. autovacuum runs periodically in the background to reclaim them.
But VACUUM itself consumes I/O and CPU. And if autovacuum can't keep up with your write rate — table bloat happens. Pages fill up with dead tuples. Sequential scans slow down. Index bloat increases.
Here's the full write path visualized:
UPDATE query
│
▼
WAL Buffer (RAM)
│
▼ flush
WAL on disk (sequential write — fast)
│
▼
Shared Buffer Pool → find/load heap page (random I/O — slow if miss)
│
├──► Mark old tuple t_xmax = txn_id (dead tuple, stays on disk)
│
└──► Insert new tuple on same/new page (MVCC)
│
└──► Update all N index B-Trees (N × random I/O)
│
▼
VACUUM (later, async) cleans dead tuples
The B-Tree Problem
Postgres uses B-Tree as the default index structure. B-Trees are brilliant for reads:
- `WHERE id = 5` → O(log n) traversal from root to leaf
- Range queries (`WHERE created_at BETWEEN ...`) → traverse to start, scan leaf nodes
- Sorted results (`ORDER BY id`) → free, B-Tree is already sorted
But B-Trees have a fundamental write problem: updates are random I/O.
When you insert a new value, the B-Tree must find the correct leaf node and insert there. If the leaf is full, it splits — and the split propagates upward. Each node is a page on disk. Inserting into a B-Tree means jumping to a (potentially uncached) page, modifying it, writing it back.
At scale — millions of writes per second — this random I/O becomes the bottleneck.
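The leaf-split behavior can be sketched with sorted Python lists standing in for pages. A capacity of 4 keys is artificially tiny (a real 8KB leaf holds hundreds of entries), but the mechanics are the same: every insert must first locate its owning leaf, which in a real tree means touching a potentially uncached page:

```python
import bisect

CAPACITY = 4
leaves = [[]]   # sorted leaves; leaf i owns keys up to its max key

def insert(key):
    # Locate the leaf that owns this key (a "random" page access):
    i = 0
    while i < len(leaves) - 1 and leaves[i] and key > leaves[i][-1]:
        i += 1
    bisect.insort(leaves[i], key)
    if len(leaves[i]) > CAPACITY:          # leaf overflow -> split in half
        mid = len(leaves[i]) // 2
        leaves[i:i + 1] = [leaves[i][:mid], leaves[i][mid:]]

for k in [50, 10, 90, 30, 70, 20, 60]:
    insert(k)
print(leaves)   # → [[10, 20, 30], [50, 60, 70, 90]]
```

Note how inserting 70 forced a split; in a real B-Tree that split also writes a new entry into the parent page, potentially cascading upward.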
Compare this to what write-optimized databases use.
Why LSM Trees (Cassandra, RocksDB) Are Different
LSM = Log-Structured Merge Tree. The core insight: sequential writes to disk are an order of magnitude faster than random writes.
Spinning disk: sequential ~150 MB/s vs random ~1 MB/s (100x+ difference!)
SSD: sequential ~3 GB/s vs small random reads ~200 MB/s (15x difference!)
(Exact figures vary by device, but the sequential advantage holds everywhere.)
LSM trees exploit this:
Write arrives
│
▼
MemTable (RAM, sorted) ← microseconds, no disk I/O
│
▼ when full, flush
SSTable on disk (immutable, sequential write) ← fast!
│
▼ background
Compaction — merge SSTables, remove old versions ← sequential, amortized
No random I/O on the write path. No index updates. No MVCC dead tuples. Writes just go to RAM, then get flushed sequentially to disk.
The tradeoff: reads are slower. To read a value, you might need to check multiple SSTables (the value could be in any of them before compaction). Bloom filters help, but LSM reads are fundamentally more expensive than a B-Tree lookup.
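Here's a toy LSM store showing both sides of that tradeoff: writes only touch RAM until a sequential flush, while reads may have to probe several SSTables. Real engines add Bloom filters and background compaction on top of this; the two-entry memtable limit is just for demonstration:

```python
MEMTABLE_LIMIT = 2
memtable = {}
sstables = []   # newest last; each is an immutable sorted mapping

def put(key, value):
    memtable[key] = value                  # write hits only RAM
    if len(memtable) >= MEMTABLE_LIMIT:
        # Flush: one sequential write of a sorted, immutable SSTable.
        sstables.append(dict(sorted(memtable.items())))
        memtable.clear()

def get(key):
    if key in memtable:                    # check RAM first
        return memtable[key]
    for table in reversed(sstables):       # then SSTables, newest first
        if key in table:
            return table[key]
    return None

put("user:5", "Alice")
put("user:7", "Carol")   # memtable full -> flush SSTable #1
put("user:5", "Bob")     # "update" is just an append of a new version
put("user:9", "Dave")    # flush SSTable #2
print(get("user:5"), len(sstables))   # → Bob 2
```

The stale 'Alice' version still sits in SSTable #1; compaction would merge the tables and drop it, playing the same role VACUUM plays for dead tuples.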
The Core Tradeoff, Visualized
| | PostgreSQL (B-Tree) | Cassandra (LSM Tree) |
|---|---|---|
| Write path | Random I/O + MVCC + index updates | Sequential write to RAM → disk |
| Read path | B-Tree O(log n), very fast | Check MemTable + multiple SSTables |
| Updates | New row version + dead tuple | Append new version, compaction later |
| Disk I/O pattern | Random | Sequential |
| Garbage collection | VACUUM (explicit) | Compaction (background) |
| Transactions | Full ACID | Eventual consistency (tunable) |
| Complex queries | Excellent (JOINs, aggregations) | Very limited |
Neither is better. They solve different problems.
When Does Postgres Write Performance Actually Hurt?
Postgres handles typical web app write loads just fine. The bottleneck shows up when:
1. High UPDATE rate on wide tables with many indexes
Every UPDATE touches N index files. A table with 8 indexes and 10,000 updates/second = potentially 80,000 random I/O operations per second.
2. autovacuum can't keep up
Dead tuples accumulate faster than VACUUM can reclaim them. Table bloat increases. Full scans read more pages. The problem compounds.
-- Check table bloat
SELECT relname, n_dead_tup, n_live_tup,
round(n_dead_tup * 100.0 / nullif(n_live_tup + n_dead_tup, 0), 2) AS dead_pct
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
3. Checkpoint I/O spikes
Every checkpoint_timeout (default 5 min), Postgres flushes all dirty pages to disk. This creates a spike. Keep checkpoint_completion_target at 0.9 (the default since Postgres 14) so the flush is spread across most of the interval instead of hitting all at once.
4. Transaction ID Wraparound
Every transaction gets a 32-bit ID, so Postgres can only compare ages within a window of about 2 billion transactions. If a table's oldest unfrozen XID approaches that limit, Postgres can no longer tell old from new: it launches aggressive anti-wraparound autovacuum, and in the worst case refuses new writes entirely. This is a real production incident waiting to happen if you're not monitoring it.
-- Monitor wraparound risk
SELECT datname, age(datfrozenxid) as txn_age
FROM pg_database
ORDER BY txn_age DESC;
-- If age > 1.5 billion — start worrying
How to Squeeze More Write Performance Out of Postgres
If you need Postgres for write-heavy workloads, here's what actually helps:
1. Tune autovacuum aggressively for busy tables
Default autovacuum_vacuum_scale_factor = 0.2 means VACUUM only triggers after 20% of rows are dead. For a 10M row table that's 2M dead tuples before cleanup. Tighten it:
ALTER TABLE events SET (
autovacuum_vacuum_scale_factor = 0.01, -- trigger at 1%
autovacuum_analyze_scale_factor = 0.005
);
2. Batch writes, avoid single-row inserts
-- Bad: 1000 round trips, 1000 WAL flushes
INSERT INTO events VALUES (...);
INSERT INTO events VALUES (...);
-- ... × 1000
-- Good: 1 round trip, 1 WAL flush
INSERT INTO events VALUES (...), (...), (...); -- × 1000
3. Use COPY for bulk loads
COPY is 10-50x faster than INSERT for bulk data. It bypasses a lot of overhead:
COPY events FROM '/tmp/events.csv' WITH (FORMAT csv);
4. Partial indexes to reduce index maintenance
Instead of indexing everything:
-- Don't index all rows
CREATE INDEX ON orders (user_id);
-- Only index what you actually query
CREATE INDEX ON orders (user_id) WHERE status = 'active';
Fewer index entries = faster updates on non-active rows.
The Architecture in One Picture
Here's the full Postgres architecture, from your query to the disk:
Your App
│ SQL string over TCP (port 5432)
▼
Postmaster ──► fork() → Backend Process (1 per connection)
│
┌─────────┼──────────┐
▼ ▼ ▼
Buffer Pool Locks WAL Buffer
(shared RAM) (shared RAM)
│ │
▼ ▼
BgWriter WAL Writer
Autovacuum Checkpointer
│ │
▼ ▼
Heap files (base/) pg_wal/ segments
Index files
Every query goes through: Parse → Analyze → Rewrite → Plan → Execute.
The planner is where performance is won or lost — it estimates the cost of every possible execution plan and picks the cheapest. Feed it bad statistics (stale pg_statistic) and it picks wrong plans.
-- Always run after bulk loads
ANALYZE users;
-- See what plan the planner chose
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM users WHERE email = 'bob@example.com';
Summary
Postgres is read-optimized not by accident — it's the consequence of specific, deliberate design choices:
- MVCC → readers never block writers (great for concurrency, bad for write amplification)
- B-Tree indexes → fast O(log n) reads, expensive random-I/O writes
- Heap storage → rows stored in 8KB pages by insertion order, not update-friendly
- WAL → crash safety, but doubles write I/O
- VACUUM → background GC for MVCC dead tuples, adds overhead
These same choices are what make Postgres excellent at complex SQL queries, strong consistency, and concurrent reads. You can't have both without tradeoffs.
If you need extreme write throughput: Cassandra, ScyllaDB, or ClickHouse (for analytics). If you need correctness, complex queries, and strong consistency: Postgres is still one of the best databases ever built.
The trick is knowing which tradeoff you're making — and this post was about making sure you know exactly why.
Next up in this series — we flip sides completely. We'll dissect a write-heavy database from the inside: how LSM trees actually work, what happens during compaction, and why Cassandra can eat millions of writes per second without breaking a sweat.