💾 Durability in Databases: How OS Cache & WAL Keep Your Data Safe

#postgres #database #programming #beginners

When we talk about ACID in databases, Durability is often the most misunderstood property.
It sounds simple:

“Once a transaction is committed, it must survive crashes.”

But the devil is in the details — especially when the OS cache and Write-Ahead Logging (WAL) are involved.

In this post, we’ll unpack how durability actually works under the hood, and why your data might not be as safe as you think until it’s flushed in the right way.

1️⃣ The Promise of Durability

Durability means:

If your database tells you “✅ Commit successful,” it’s making a contract with you: that transaction will still exist after a restart, power loss, or kernel panic.
This isn’t about just writing to disk — it’s about ensuring that write survives even catastrophic failures.

The challenge? Modern systems don’t write directly to disk the instant you call write().

2️⃣ The OS Page Cache: Friend & Foe

When your database writes to a file, it goes through the Operating System’s page cache:

write() system call → data goes to RAM (OS page cache).
The OS writes those dirty pages to disk later, on its own schedule (background writeback can lag by seconds).
Until then, your “committed” data is sitting in volatile memory.

💡 The problem:
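If the machine loses power or the kernel crashes before that flush, the data is gone, violating durability. (If only the database process crashes, the data survives, because the page cache belongs to the kernel.)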

3️⃣ Forcing the Data Down: fsync()

Databases can’t just trust the OS to flush when it’s convenient. That’s why they use fsync() (or similar calls):

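write(fd, data, size); // returns once the data is in the OS page cache (check the return value for short writes)
fsync(fd);             // blocks until the OS reports the data on stable storage (a nonzero return means it may not be)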

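fsync() tells the OS: flush this file’s data from the page cache to actual storage, and don’t return until it’s there.
It’s much slower than a plain write, but some form of explicit sync (fsync(), fdatasync(), or opening the file with O_SYNC/O_DSYNC) is the only way to get a durability guarantee from the OS.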
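
Here is a rough Python sketch of the same two steps. Python adds one more layer (its own user-space buffer), so flush() has to come before os.fsync(). The helper name durable_write and the file name are made up for illustration.

```python
import os
import tempfile


def durable_write(path: str, payload: bytes) -> None:
    """Write payload to path and return only after the kernel reports it flushed."""
    with open(path, "wb") as f:
        f.write(payload)      # lands in Python's user-space buffer
        f.flush()             # hands the bytes to the OS page cache (the write() step)
        os.fsync(f.fileno())  # asks the kernel to push them to stable storage


if __name__ == "__main__":
    # Hypothetical file in a throwaway directory.
    target = os.path.join(tempfile.mkdtemp(), "record.bin")
    durable_write(target, b"balance=100\n")
    with open(target, "rb") as f:
        print(f.read())
```

One caveat worth knowing: for a newly created file, filesystems such as those on Linux may also need an fsync() on the containing directory before the file's name itself is guaranteed to survive a crash. The sketch leaves that out.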

4️⃣ WAL to the Rescue

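A single commit can touch pages scattered all over the data files. If we fsync() the main database file on every commit, each commit pays for a burst of random I/O, and performance tanks.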
Enter Write-Ahead Logging (WAL) — the durability superhero.

WAL principle:

Before changing data files, append the change to a sequential log file (the WAL).
Flush the WAL to disk with fsync().
Acknowledge the commit to the client.
Apply the change to the main DB file later, in bulk.

This way:

Commit is fast because WAL writes are sequential, and one fsync() of the log covers everything the transaction changed.
Recovery after a crash is possible by replaying the WAL.
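
To make the commit path concrete, here is a toy log in Python. It is only a sketch: the class name TinyWal, the one-JSON-record-per-line format, and the pending list are all assumptions made for this example, not how any real database lays out its WAL.

```python
import json
import os
import tempfile


class TinyWal:
    """Toy write-ahead log: one JSON record per line, fsync before acknowledging."""

    def __init__(self, path: str) -> None:
        self.path = path
        # Append mode keeps every write at the end of the file (sequential I/O).
        self.log = open(path, "ab")
        # Changes that are logged but not yet applied to the main data file.
        self.pending = []

    def commit(self, key: str, value: str) -> str:
        record = json.dumps({"key": key, "value": value}).encode() + b"\n"
        self.log.write(record)       # append the change to the log
        self.log.flush()             # user-space buffer -> OS page cache
        os.fsync(self.log.fileno())  # OS page cache -> stable storage
        self.pending.append((key, value))  # the data file is updated later
        return "COMMIT OK"           # acknowledge only after the fsync


    def close(self) -> None:
        self.log.close()


if __name__ == "__main__":
    wal = TinyWal(os.path.join(tempfile.mkdtemp(), "wal.log"))
    print(wal.commit("alice", "100"))
    wal.close()
```

The ordering inside commit() is the whole point: the acknowledgement is returned only after os.fsync() has come back, and nothing in the main data file has been touched yet.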

5️⃣ How a Durable Commit Actually Works

Here’s a typical commit flow in a WAL-enabled database:

[Start transaction]
       ↓
Write WAL entry → OS page cache
       ↓
fsync(WAL) → SSD/HDD
       ↓
Send COMMIT OK to client
       ↓
(Background) Apply WAL changes to DB file
       ↓
(Checkpoint) fsync(DB file), then recycle old WAL
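
The last two steps of that flow can be sketched as a checkpoint function. This is a deliberately simplified model: the data file is a single JSON document, the log holds one JSON record per line, and the function name checkpoint plus the write-to-temp-then-os.replace() trick are choices made for this example. Real engines update fixed-size pages in place instead.

```python
import json
import os
import tempfile


def checkpoint(wal_path: str, db_path: str) -> int:
    """Fold logged changes into the data file, then empty the log.

    Returns the number of log records applied.
    """
    # Load the current data file, if there is one.
    state = {}
    if os.path.exists(db_path):
        with open(db_path, "r", encoding="utf-8") as f:
            state = json.load(f)

    # Apply every logged change in order.
    applied = 0
    with open(wal_path, "r", encoding="utf-8") as wal:
        for line in wal:
            record = json.loads(line)
            state[record["key"]] = record["value"]
            applied += 1

    # Write the new data file next to the old one and fsync it before swapping it in.
    tmp_path = db_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, db_path)

    # Only after the data file is safe is the log emptied.
    with open(wal_path, "w", encoding="utf-8") as wal:
        wal.flush()
        os.fsync(wal.fileno())
    return applied


if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    wal_file = os.path.join(folder, "wal.log")
    db_file = os.path.join(folder, "db.json")
    with open(wal_file, "w", encoding="utf-8") as f:
        f.write('{"key": "alice", "value": "100"}\n')
    print(checkpoint(wal_file, db_file))
```

The log is truncated last on purpose. If the machine dies anywhere before that point, the log still holds every change and can simply be replayed.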

6️⃣ Crash Scenarios

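Crash before WAL fsync → Transaction lost, but the client never received COMMIT OK, so no promise was broken.
Crash after WAL fsync but before DB file update → Recovery replays WAL → Transaction preserved.
Crash after both fsyncs → Data is already in the DB file; recovery has nothing left to redo for this transaction.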
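
The first two scenarios come down to one question at recovery time: is the record at the tail of the log complete? Here is one way that check might look, assuming a made-up record format (4-byte length, 4-byte CRC32, then the payload). The names encode_record and replay are hypothetical, and the format is not taken from any specific database.

```python
import struct
import zlib

# Assumed header layout: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def encode_record(payload: bytes) -> bytes:
    """Frame a payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(log: bytes) -> list:
    """Return every intact payload, stopping at the first torn or corrupt record."""
    records = []
    offset = 0
    while offset + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        payload = log[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # incomplete tail: this transaction never became durable
        records.append(payload)
        offset = start + length
    return records


if __name__ == "__main__":
    log = encode_record(b"set a=1") + encode_record(b"set b=2")
    torn = log[:-3]  # simulate a crash that cut the last record short
    print(replay(log))
    print(replay(torn))
```

Replaying the full log yields both records, while the torn copy yields only the first. The half-written record is treated as if its transaction never happened, which matches the "crash before WAL fsync" case.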

7️⃣ Durability Trade-offs

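Skip fsync: Some DBs let you turn it off for speed (PRAGMA synchronous=OFF in SQLite, fsync = off in PostgreSQL). Fast, but an OS crash or power loss can lose recent commits or even corrupt the database.
Hardware lies: Some drives with volatile write caches acknowledge a flush before the data has actually left the cache. Enterprise setups disable these caches or protect them with a battery.
Checkpoints: Periodically the database flushes the changed pages to the main DB files, after which the older WAL is no longer needed and can be recycled. This keeps the log from growing without bound and keeps recovery time short.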
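
For a feel of what these knobs look like in practice, here is a small sketch using Python's built-in sqlite3 module. The file name demo.db is arbitrary; the pragmas themselves are standard SQLite, where PRAGMA synchronous reports 0 for OFF and 2 for FULL.

```python
import os
import sqlite3
import tempfile

# Throwaway database file (WAL mode needs a real file, not an in-memory DB).
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(db_path)

# Switch to write-ahead logging; the pragma returns the mode now in effect.
journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# Trade safety for speed: SQLite hands data to the OS without syncing.
conn.execute("PRAGMA synchronous=OFF")
sync_off = conn.execute("PRAGMA synchronous").fetchone()[0]

# Go back to syncing on every commit.
conn.execute("PRAGMA synchronous=FULL")
sync_full = conn.execute("PRAGMA synchronous").fetchone()[0]

conn.close()
print(journal_mode, sync_off, sync_full)
```

SQLite also has an in-between setting, synchronous=NORMAL. In WAL mode it keeps the database consistent but can lose the most recent commits on power loss. Whether that is acceptable depends on how much recent data you can afford to lose.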

8️⃣ TL;DR

OS page cache makes writes fast but not durable until fsync() is called.
WAL ensures crash recovery by recording changes before applying them to main data files.
True durability = WAL entry safely flushed before commit acknowledgement.

Final Thought:
Durability isn’t just a “database feature” — it’s an end-to-end chain involving your app, the OS, the filesystem, the storage controller, and sometimes even hardware firmware.
Break any link in that chain, and your “permanent” data might just vanish.