You run a SELECT that takes 400ms to scan a table. Halfway through, another connection runs an UPDATE on a row you haven't reached yet. What should your query see — the old value or the new one? And should your read have blocked that write, or the write blocked your read?
Multi-Version Concurrency Control (MVCC) is the answer most production databases settled on: neither one waits. Postgres, MySQL's InnoDB engine, Oracle, and SQL Server's snapshot mode all keep more than one copy of a row at a time. A reader gets a consistent point-in-time view; a writer creates a new version alongside the old one. The two coexist instead of fighting over a lock.
The problem MVCC is solving
Without versioning, a database has two blunt options. It can take a read lock so writers wait until every reader finishes — correct, but readers and writers now serialize, and a long analytical query can stall writes for its entire duration. Or it can let reads see in-progress writes, which gives you dirty reads: you return a value that the writing transaction later rolls back, so you reported data that never officially existed.
MVCC sidesteps both. The core rule is the one you'll see repeated everywhere: readers don't block writers, and writers don't block readers. When a transaction modifies a row, the database doesn't overwrite the bytes in place. It writes a new version and leaves the old one in place until no running transaction can still need it. Every transaction reads against a snapshot — a definition of which versions are visible to it — so your 400ms scan sees one coherent state of the table even while other connections are busy changing it.
Writers still block other writers on the same row. Two transactions can't both update row 17 concurrently and pretend they didn't see each other; one waits. But that's the only contention MVCC doesn't remove, and it's the one you usually can't avoid anyway.
How Postgres actually stores the versions
Postgres is the cleanest system to reason about because the versioning lives directly in the table heap. Every row version (a tuple) carries two hidden system columns: xmin, the ID of the transaction that created it, and xmax, the ID of the transaction that deleted or superseded it. You can see them:
SELECT xmin, xmax, * FROM accounts WHERE id = 17;
Visibility is a comparison. A tuple is visible to your transaction if its xmin committed before your snapshot was taken and its xmax is either empty or belongs to a transaction your snapshot doesn't consider committed. An UPDATE is mechanically an insert plus a stamp: Postgres sets the old tuple's xmax to the updating transaction's ID and inserts a brand-new tuple with a fresh xmin. A DELETE just stamps xmax and inserts nothing.
The consequence is the thing that surprises people: deleted and updated rows don't free space immediately. The old versions — dead tuples — sit in the table until cleanup runs. That cleanup is VACUUM, which reclaims space from tuples no remaining transaction can see, and autovacuum runs it for you in the background. This is why a table you only ever DELETE from can keep growing on disk, and why a heavily-updated table develops bloat that slows sequential scans.
Postgres transaction IDs are 32-bit, so there are only about 4 billion of them before they wrap around.
VACUUMalso "freezes" old tuples to mark them as visible-to-everyone before that wraparound can make old data look like it's from the future. If autovacuum can't keep up — a long-idle transaction holding back cleanup is the classic cause — Postgres will eventually stop accepting writes to protect itself. Monitorage(datfrozenxid)on busy databases; don't wait to discover this at 3am.
InnoDB and Oracle reach the same outcome by a different route. Instead of keeping old versions in the main table, they keep the current row in place and store enough undo information in a separate area (the undo log in InnoDB, rollback/undo segments in Oracle) to reconstruct older versions on demand. The trade-off flips: the main table stays compact, but a long-running read has to walk undo records to rebuild the snapshot it needs, and an undo log that fills up because of an old open transaction is InnoDB's version of the same operational headache.
Snapshot isolation and the edge it hides
MVCC gives you snapshots cheaply, but when the snapshot is taken determines what anomalies you can still hit. That's the isolation level.
In Postgres, the default is Read Committed: every individual statement gets a fresh snapshot. So two SELECTs in the same transaction can return different data if another transaction committed in between. Repeatable Read takes one snapshot at the transaction's first statement and holds it — Postgres implements this as true snapshot isolation, so the whole transaction sees a frozen view.
Snapshot isolation feels like it should be airtight, and it nearly is, which is what makes its failure mode sneaky: write skew. Two transactions each read an overlapping set of rows, each checks a condition that's still true in its own snapshot, and each writes a different row. Neither sees the other's write because both snapshots predate it, so a constraint that depends on the combination — "at least one doctor must stay on call" — gets violated even though each transaction looked correct in isolation.
If write skew can break a business rule, Postgres's Serializable level (Serializable Snapshot Isolation, SSI) detects the dangerous read/write dependency at commit time and aborts one transaction with a serialization failure. You pay for it with retry logic: serializable transactions can fail and must be re-run by your application, so wrap them in a retry loop rather than assuming they always commit.
The practical takeaway is to know your database's default. Read Committed is fine for most CRUD work. The moment you're reading a value, making a decision, and writing based on it — balance checks, inventory decrements, allocation logic — you need to think about whether a concurrent transaction could invalidate that decision, and reach for a higher isolation level or an explicit SELECT ... FOR UPDATE lock.
MVCC isn't free — bloat, vacuum tuning, and undo-log pressure are the rent you pay for it. But the alternative, a database where every analytics query freezes your writes, is worse for almost every workload. Understanding the version mechanics underneath turns "why is this table 40GB when it has 2 million rows" from a mystery into a vacuum-tuning task.
Originally published at pickuma.com. Subscribe to the RSS or follow @pickuma.bsky.social for new reviews.
Top comments (0)