A few years ago I shipped a nightly batch job that wrote aggregated metrics to a file. It worked for months. Then one night the box ran out of disk halfway through a write, the process died, and the next morning I opened a half-written file full of garbage bytes. The downstream dashboard happily loaded it and showed numbers that were quietly wrong for two days before anyone noticed.
That was the day I actually understood why every serious database on the planet writes to a log before it touches the real data. Postgres does it. SQLite does it. Kafka is basically a log with a fan club. The pattern has a boring name: the Write-Ahead Log, or WAL.
So I did what I always do when I want to stop being scared of something: I built the toy version. No database, no C extensions, no libraries. Just Python and a file. Here is what I learned.
The one rule
The Write-Ahead Log has exactly one rule, and the whole thing falls out of it:
Before you change the real data, write what you are about to do to an append-only log, and make sure that log is safely on disk.
That is it. "Safely on disk" is doing a lot of work in that sentence, and we will get to it. But the core idea is: the log is the source of truth. The actual data file is just a cache of the log that you can always rebuild. If the machine dies, you replay the log and you are back exactly where you were.
The magic comes from two properties:
- Appends are simpler than in-place edits. You never go back and modify an old record, so a crash can only ever leave a torn record at the very end of the file. Everything before it is intact.
- You can detect the torn record with a checksum and just throw it away.
Let me show you what I mean with code.
A record format that survives a crash
The first thing I needed was a way to write records so I could tell a complete record from a half-written one. Here is the format I landed on for each entry:
[ 4 bytes: length ][ 4 bytes: crc32 ][ N bytes: payload ]
Length first so the reader knows how many bytes to grab. A CRC32 checksum so the reader can verify the payload was fully and correctly written. Then the payload itself.
import struct, zlib, os, json
HEADER = struct.Struct(">II") # length, crc32, big-endian
def encode_record(payload: bytes) -> bytes:
crc = zlib.crc32(payload) & 0xFFFFFFFF
return HEADER.pack(len(payload), crc) + payload
Writing a record is then just: encode it, append it, and flush it to disk.
class WAL:
def __init__(self, path):
self.path = path
# append + binary, create if missing
self.f = open(path, "ab")
def append(self, payload: bytes):
self.f.write(encode_record(payload))
self.f.flush() # push out of Python's buffer
os.fsync(self.f.fileno()) # push out of the OS buffer onto the disk
return self
fsync is the whole point
That os.fsync line is the single most important line in this entire article, and it is the one almost everyone forgets.
When you call f.write(), the bytes go into Python's buffer. When you call f.flush(), they go into the operating system's buffer. Neither of those is the disk. The OS is allowed to sit on those bytes for a while before actually writing them to physical storage, because writing to disk is slow and batching is fast.
That is great for performance and catastrophic for durability. If the power cuts after flush() but before the OS got around to the real write, your data is gone, and you had no idea it was still in flight. os.fsync() is you telling the kernel: "no, really, I will wait right here until this is physically on the disk."
This is the trade every database makes. fsync on every record is the safe, slow setting. Batching many records per fsync is the fast, slightly-riskier setting. Postgres exposes this as synchronous_commit. Now you know what that knob actually does.
Reading it back, and surviving the torn tail
Now the fun part. Reading the log means walking the file record by record. If we hit a record whose length runs past the end of the file, or whose checksum does not match, we know that is the torn record from a crash, and we stop. Everything before it is good.
def read_all(path):
records = []
with open(path, "rb") as f:
data = f.read()
offset = 0
while offset + HEADER.size <= len(data):
length, crc = HEADER.unpack(data[offset:offset + HEADER.size])
start = offset + HEADER.size
end = start + length
if end > len(data):
break # truncated payload, torn write
payload = data[start:end]
if (zlib.crc32(payload) & 0xFFFFFFFF) != crc:
break # corrupt record, stop here
records.append(payload)
offset = end
return records
Notice how forgiving this is. It does not raise, it does not panic. It just reads as far as it safely can and returns clean records. That is the recovery story for free, and it only works because we never edit old records in place.
Turning the log into an actual key-value store
A log of raw bytes is not very useful on its own. Let me put a tiny key-value store on top of it so you can see how a real system uses this. Every set and delete becomes a log entry first, then updates an in-memory dictionary.
class KVStore:
def __init__(self, path):
self.wal = WAL(path)
self.data = {}
self._recover(path)
def _recover(self, path):
for raw in read_all(path):
op = json.loads(raw)
if op["t"] == "set":
self.data[op["k"]] = op["v"]
elif op["t"] == "del":
self.data.pop(op["k"], None)
def set(self, key, value):
self.wal.append(json.dumps({"t": "set", "k": key, "v": value}).encode())
self.data[key] = value
def delete(self, key):
self.wal.append(json.dumps({"t": "del", "k": key}).encode())
self.data.pop(key, None)
def get(self, key):
return self.data.get(key)
The order inside set matters and it is deliberate: log first, memory second. If the process dies between the two lines, recovery still replays the log entry and the memory state ends up correct. If you did it the other way around, you could acknowledge a write that never made it to durable storage. That is the exact bug that bit me with my batch job, just dressed up nicely.
Here is the whole thing working, including a simulated crash:
# first run
kv = KVStore("store.wal")
kv.set("balance", 100)
kv.set("balance", 250)
kv.delete("temp")
# pretend the process is killed right here, no clean shutdown
# brand new process, same file
kv2 = KVStore("store.wal")
print(kv2.get("balance")) # 250, replayed straight from the log
No clean shutdown, no flush-on-exit, no try/finally. We just open the file again and replay. The state is exactly what it should be because every change was durable the moment it was acknowledged.
Compaction: the log cannot grow forever
There is an obvious problem. If I set the same key a million times, the log has a million entries but the store has one value. Replaying a giant log on startup gets slow, and the file eats disk.
Real systems fix this with compaction (or checkpointing). The idea is simple: write the current state out as a fresh, minimal log, then atomically swap it in.
def compact(self, path):
tmp = path + ".compact"
with open(tmp, "wb") as f:
for k, v in self.data.items():
payload = json.dumps({"t": "set", "k": k, "v": v}).encode()
f.write(encode_record(payload))
f.flush()
os.fsync(f.fileno())
os.replace(tmp, path) # atomic rename, the safe swap
os.replace is the hero here. On every major OS, renaming a file over another is atomic: a reader either sees the entire old file or the entire new file, never a mix. So even if you crash mid-compaction, you either have the old complete log or the new complete log. You can never lose data in the gap. This is the same trick people use for "write to temp file, then rename" config saves, and it is one of the most useful patterns in all of systems programming.
What building this actually taught me
I have used Postgres for a decade. I have run Kafka in production. I thought I understood durability. I did not, not really, until I wrote those forty lines and watched a "crashed" process come back with perfect state.
A few things clicked that no blog post had ever made stick:
- fsync is the line between "I saved it" and "I hope I saved it." Performance tuning in databases is almost always a negotiation about how often you are allowed to skip it.
- Append-only is not a limitation, it is the feature. It is the reason a crash can only ever damage the tail, and the reason recovery is trivial instead of terrifying.
- Atomic rename is a load-bearing wall under more of your stack than you realize.
The toy version is never the real thing. Production WALs deal with log rotation, segment files, group commit, partial-page writes, and a dozen other things I cheerfully ignored. But the next time I read about synchronous_commit, or watch Postgres replay its WAL on startup, or reason about what Kafka actually guarantees, I am not looking at a black box anymore. I am looking at the forty-line version with more careful engineers and more edge cases.
That is the whole reason I keep doing these builds. The fastest way I know to stop being intimidated by a piece of infrastructure is to build the smallest honest version of it with my own hands.
So that is my pitch to you: pick the one piece of your stack that feels like magic and build the toy version this weekend. A WAL. A cache. A rate limiter. You will read the real docs differently forever after.
If you build a tiny WAL of your own, reply and tell me what surprised you. And if you want more of these from-scratch builds, follow me here. I do one whenever I get curious enough about something to stop trusting it and start understanding it.
Top comments (0)