Write-Ahead Logging: How Databases Survive a Power Cut

#webdev #tutorial

A database commits a transaction, returns OK, and a half-second later someone trips over the power cord. The machine is dead. When it boots back up, the row you just inserted is still there. That is not luck, and it is not magic. It is write-ahead logging doing the one job it exists to do: making a promise survive a crash.

The naive way to store data is to write it straight into the data file at the right offset. The problem is that a single logical change often touches several disk pages — an index entry here, a row there, a free-space map update somewhere else. If the power dies after page one and before page three, you are left with a data file that is internally inconsistent: an index that points at a row that was never written. There is no way to tell, on reboot, whether that file is whole or torn. You have lost the ability to trust your own storage.

The log-first rule

Write-ahead logging fixes this by inverting the order of operations. Before any change is applied to the actual data pages, the database first writes a description of that change to a separate, append-only file: the log. Only after that log record is safely on disk does the database touch the real data — and crucially, it can defer touching the real data for a long time.

The rule is in the name. The log is written ahead of the data. A transaction is considered durable the moment its commit record reaches stable storage in the log, not when the data pages are updated. This is the D in ACID — durability — and the log is where it lives.
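To make the ordering concrete, here is a minimal sketch of the log-first rule in Python. The TinyWAL class, its JSON record layout, and the in-memory dict standing in for data pages are all invented for illustration; a real engine logs binary page-level records and is far more careful about errors.

```python
import json
import os


class TinyWAL:
    """Toy key-value store that logs every change before applying it."""

    def __init__(self, log_path):
        self.log = open(log_path, "ab")  # append-only log file
        self.data = {}                   # stands in for the real data pages

    def put(self, key, value):
        record = json.dumps({"op": "put", "key": key, "value": value})
        self.log.write(record.encode("utf-8") + b"\n")
        self.log.flush()             # Python buffer -> OS page cache
        os.fsync(self.log.fileno())  # OS page cache -> storage device
        # Only now, with the record on disk, do we touch the "data pages".
        self.data[key] = value

    def close(self):
        self.log.close()
```

The point of the sketch is the order inside put(): describe the change, force it to disk, and only then apply it. Swap those steps and a crash between them leaves a change in the data with no record of it in the log.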

The payoff shows up at recovery time. After a crash, the database reads the log forward from the last checkpoint. Any logged change that might not have made it into the data files is reapplied from its log record. This is the redo pass. Any transaction that was still in flight when the lights went out (it has log records but no matching commit) has its changes rolled back. This is the undo pass. The canonical formulation of this redo/undo dance is the ARIES algorithm, which redoes every logged change first and only then undoes the uncommitted ones. Many production databases are variations on its themes, though not all run both passes: PostgreSQL's recovery is redo-only, and it relies on its multiversion visibility rules to hide the work of transactions that never committed.
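The two passes can be sketched in a few lines of Python. This is a toy under stated assumptions, not any engine's actual recovery code: the record layout (type, tx, key, before, after) is hypothetical, pages is just a dict, and the redo pass reapplies every logged update before the undo pass reverts the uncommitted ones, which is the ARIES ordering.

```python
def recover(log, pages):
    """Bring `pages` (a dict) to a consistent state using `log` (a list of records)."""
    committed = {r["tx"] for r in log if r["type"] == "commit"}

    # Redo pass: walk forward and reapply every logged update.
    # Reapplying an after-image twice is harmless, so this is safe to repeat.
    for r in log:
        if r["type"] == "update":
            pages[r["key"]] = r["after"]

    # Undo pass: walk backward and revert updates whose transaction never committed.
    for r in reversed(log):
        if r["type"] == "update" and r["tx"] not in committed:
            if r["before"] is None:
                pages.pop(r["key"], None)  # the key did not exist before
            else:
                pages[r["key"]] = r["before"]
    return pages


log = [
    {"type": "update", "tx": 1, "key": "a", "before": None, "after": 1},
    {"type": "commit", "tx": 1},
    {"type": "update", "tx": 2, "key": "a", "before": 1, "after": 2},
    {"type": "update", "tx": 2, "key": "b", "before": None, "after": 9},
    # crash here: transaction 2 never logged a commit
]
print(recover(log, {}))  # {'a': 1}
```

The result is the same whether the data files were empty at crash time or already contained some of transaction 2's uncommitted writes, which is exactly the property recovery needs.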

Why is replaying the log safe when writing the data directly was not? Because the log is append-only and each record is self-contained. You are never half-updating a structure; you are reading a sequence of "this happened, then this happened" entries and applying them in order. An append never overwrites a record that is already valid, so a crash can damage only the record being written at the tail of the log, never the history before it.
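One detail worth sketching: the last record in the log may itself be half-written when the power goes. A common way to cope is to frame each record with its length and a checksum, and to treat the first record that fails the check as the end of the log. The framing below (two little-endian 32-bit integers, then the payload) is an assumption made for this example, not the on-disk format of PostgreSQL or SQLite.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def encode_record(payload: bytes) -> bytes:
    """Prefix a payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def read_valid_records(buf: bytes):
    """Yield payloads in order, stopping at the first torn or corrupt record."""
    offset = 0
    while offset + HEADER.size <= len(buf):
        length, crc = HEADER.unpack_from(buf, offset)
        start = offset + HEADER.size
        payload = buf[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn tail: everything before this point is still trustworthy
        yield payload
        offset = start + length
```

Because records are only ever appended, stopping at the first bad one loses at most the write that was in progress, and that write was never acknowledged as committed.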

The entire guarantee rests on one system call: fsync. When the database writes a commit record, it must force that data out of the OS page cache and onto the physical disk before telling the client the commit succeeded. Some consumer SSDs and cheap USB drives lie: they acknowledge the flush while the data is still sitting in a volatile on-device cache. On those drives, write-ahead logging cannot protect you, because the log record you think is durable evaporates with the power. This is why "my database lost data after a power cut" often turns out to be a hardware-honesty problem rather than a database bug.
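In Python the flush looks roughly like the sketch below. The helper names flush_to_device and durable_append are made up for this example. One platform wrinkle is included because it bears on the "drives lie" problem: on macOS, a plain fsync does not ask the drive to empty its own cache, and the F_FULLFSYNC fcntl command is the stronger request. Treat this as an illustration of the call sequence, not a guarantee about any particular device.

```python
import os
import sys

try:
    import fcntl  # POSIX only; not available on Windows
except ImportError:
    fcntl = None


def flush_to_device(fd):
    """Ask the OS to push a file's data to stable storage."""
    if sys.platform == "darwin" and fcntl is not None and hasattr(fcntl, "F_FULLFSYNC"):
        # On macOS, F_FULLFSYNC also asks the drive to flush its own cache.
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
    else:
        os.fsync(fd)


def durable_append(path, data):
    """Append bytes to a file and return only after the flush call succeeds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may write fewer bytes than asked
            view = view[written:]
        flush_to_device(fd)
    finally:
        os.close(fd)
```

A commit path would call something like durable_append for the commit record and reply to the client only after it returns. If the drive acknowledges the flush without performing it, no code at this level can tell.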

What this looks like in PostgreSQL and SQLite

The concept is universal, but the two databases most developers actually touch implement it in instructively different ways.

PostgreSQL keeps its WAL as a stream of segment files (16 MB each by default) under pg_wal/. Every change generates a WAL record stamped with a Log Sequence Number (LSN), a monotonically increasing byte position in the log. Periodically the database runs a checkpoint: it flushes all the dirty data pages that the log has been describing out to the main data files, then records that the log up to a certain LSN is now fully reflected on disk. Segments before that point are no longer needed for crash recovery and can be recycled, unless archiving or replication still needs them. The synchronous_commit setting controls how aggressively commits wait for the WAL flush. Turn it off and you trade a window of durability for throughput: a crash can lose the most recent commits but does not leave the database inconsistent, which makes it a legitimate choice for data you can afford to lose.
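Since an LSN is just a byte position, a little arithmetic shows how it relates to the segment files. PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash (the high and low 32 bits), and segment file names are 24 hex digits: timeline, then the segment number split in two. The Python helpers below are my own sketch of that layout, assuming the default 16 MB segment size and timeline 1; on a real server, the built-in pg_walfile_name() function is the authority, and it may treat an LSN that sits exactly on a segment boundary differently.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default segment size, in bytes


def parse_lsn(text: str) -> int:
    """Turn an LSN such as '0/16B3D80' into an absolute byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_file_name(lsn: int, timeline: int = 1,
                  segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Name of the segment file that contains the byte at `lsn`."""
    segments_per_id = 0x100000000 // segment_size
    segno = lsn // segment_size
    return "%08X%08X%08X" % (timeline,
                             segno // segments_per_id,
                             segno % segments_per_id)


print(wal_file_name(parse_lsn("0/16B3D80")))  # 000000010000000000000001

# Subtracting two LSNs gives the number of WAL bytes written between them.
print(parse_lsn("0/2000000") - parse_lsn("0/1000000"))  # 16777216
```

That subtraction is the basis of most WAL monitoring: the distance between the current write position and a replica's or checkpoint's position is how far behind it is, in bytes.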

SQLite ships with WAL mode as an opt-in, switched on with PRAGMA journal_mode=WAL;. By default SQLite uses a rollback journal instead, which works the other way around: it copies the original pages out before overwriting them, so it can put them back after a crash. WAL mode flips this: new changes go to a -wal sidecar file and the main database stays untouched until a checkpoint folds them in. The practical reason to switch is concurrency. In WAL mode, readers do not block the writer and the writer does not block readers, because each reader sees a consistent snapshot (the main file plus whatever had already been committed to the WAL when its read began) while new writes are appended to the log behind it. There is still only one writer at a time. By default SQLite checkpoints automatically once the WAL reaches 1000 pages (the wal_autocheckpoint setting), though you can trigger a checkpoint yourself.
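The reader/writer behaviour is easy to try with Python's built-in sqlite3 module. The sketch below uses two connections to a throwaway database in a temporary directory; the table name and values are arbitrary. Connections are opened with isolation_level=None so that the BEGIN and COMMIT statements are the ones written out explicitly.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

writer = sqlite3.connect(path, isolation_level=None)
mode = writer.execute("PRAGMA journal_mode=WAL").fetchone()[0]  # "wal" on success
writer.execute("CREATE TABLE t (x INTEGER)")
writer.execute("INSERT INTO t VALUES (1)")

reader = sqlite3.connect(path, isolation_level=None)

# The reader opens a transaction; its first SELECT pins the snapshot it will see.
reader.execute("BEGIN")
before = reader.execute("SELECT count(*) FROM t").fetchone()[0]

# The writer commits a new row while the reader's transaction is still open.
writer.execute("BEGIN IMMEDIATE")
writer.execute("INSERT INTO t VALUES (2)")
writer.execute("COMMIT")

# Same read transaction: still the old snapshot, and nobody had to wait.
during = reader.execute("SELECT count(*) FROM t").fetchone()[0]
reader.execute("COMMIT")

# A fresh read sees the newly committed row.
after = reader.execute("SELECT count(*) FROM t").fetchone()[0]

print(mode, before, during, after)  # wal 1 1 2
reader.close()
writer.close()
```

The write goes through even though a read transaction is open, and the reader's view does not change until it starts a new transaction. That is the snapshot behaviour described above.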

The shared idea across both: writes are cheap and sequential because they go to the log; the expensive, random-access work of updating the real data structures is batched up and done later, in bulk, when it is convenient.

If you want to see this rather than read about it, point a debugger or a hex viewer at a SQLite database in WAL mode and watch the -wal file grow as you insert rows. Do not expect it to shrink after an ordinary checkpoint: SQLite normally keeps the file and starts overwriting it from the beginning. It drops to zero bytes if you run PRAGMA wal_checkpoint(TRUNCATE);, and it is removed when the last connection closes cleanly. It turns an abstract concept into something you can poke.
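If a hex viewer feels like too much ceremony, a few lines of Python do the same job. This sketch records the size of the -wal file after each insert, then runs a checkpoint. Two choices are mine rather than SQLite defaults: automatic checkpoints are switched off so nothing resets the file mid-demo, and the checkpoint uses TRUNCATE mode so the file visibly drops to zero bytes.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "watch.db")
wal_path = path + "-wal"  # SQLite names the sidecar <database>-wal

conn = sqlite3.connect(path, isolation_level=None)  # each statement commits on its own
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA wal_autocheckpoint=0")  # no automatic checkpoints during the demo
conn.execute("CREATE TABLE t (x TEXT)")

sizes = []
for i in range(5):
    conn.execute("INSERT INTO t VALUES (?)", ("row %d" % i,))
    sizes.append(os.path.getsize(wal_path))  # grows with every commit

# Fold the logged pages into the main file and truncate the log.
busy = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()[0]
size_after = os.path.getsize(wal_path)
rows = conn.execute("SELECT count(*) FROM t").fetchone()[0]

print(sizes)       # strictly increasing byte counts
print(size_after)  # 0
conn.close()
```

All five rows are still there after the checkpoint; they have simply moved from the log into the main database file.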

There is a cost to all this, and it is worth naming. Every committed change is written at least twice: once to the log, once to the data file at checkpoint time. This is write amplification, and it is the price of durability. Databases claw some of it back with group commit, batching the fsyncs of several concurrent transactions into a single disk flush, so ten commits arriving at once might cost one physical sync rather than ten. The log is sequential and the batching is generous, which is why the overhead is usually a rounding error against the safety it buys.
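Group commit is simple enough to sketch. The GroupCommitLog class below is hypothetical and single-threaded: callers queue their commit records, and one flush() writes the whole batch and pays for a single fsync. A real database does this with concurrent sessions waiting on a shared flush, but the accounting is the same.

```python
import os


class GroupCommitLog:
    """Toy log that batches several commit records into one fsync."""

    def __init__(self, path):
        self.file = open(path, "ab")
        self.pending = []    # commit records waiting for the next flush
        self.sync_count = 0  # how many physical syncs we have paid for

    def submit(self, record: bytes):
        """Queue a commit record. The caller must not be acknowledged yet."""
        self.pending.append(record)

    def flush(self):
        """Write every queued record, sync once, and return how many were made durable."""
        if not self.pending:
            return 0
        self.file.write(b"".join(r + b"\n" for r in self.pending))
        self.file.flush()
        os.fsync(self.file.fileno())
        self.sync_count += 1
        acknowledged = len(self.pending)
        self.pending.clear()
        return acknowledged

    def close(self):
        self.file.close()
```

Submit ten records and call flush() once, and sync_count ends at 1 rather than 10. The trade is latency: each commit waits for the batch it rides in, so no client is told "OK" before its record is on disk.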

The mental model to keep: the log is the source of truth about what happened, and the data files are a cache of where things currently stand that can always be rebuilt by replaying the log. Get that backwards and crash recovery stops making sense. Get it right and the power cord becomes a non-event.

Originally published at pickuma.com.