Anapeksha Mukherjee

Posted on Jun 9

Write-ahead logs: what fsync actually means and why it matters

#database #rust #distributedsystems #architecture

write() returned OK. Your data did not make it to disk.

There is a line of code that almost every developer has written and trusted completely.

file.write(data)

It returns. No error. You move on.

What actually happened is that the operating system accepted your data into a buffer in memory, marked some pages as dirty, and returned control to your program. The data has not touched the disk yet. The kernel will flush it eventually, in batches, when it decides the time is right.

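If the machine loses power or the kernel panics between your write and that eventual flush, your data is gone. (If only your process crashes, data already handed to the kernel survives, because the page cache belongs to the kernel. Data still sitting in a user-space buffer does not.) The write call succeeded. The data did not survive.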

This is not a bug. It is how operating systems work. Buffered writes are one of the most significant performance optimizations in the entire I/O stack. The kernel batches small writes into larger sequential flushes, coalesces writes to the same blocks, and avoids saturating the disk with every individual write call. For most workloads, this is exactly what you want.

For a database, it is a disaster waiting to happen.

The lie at the heart of write()

When you call write() on a file descriptor, the data is copied from user space into the kernel's page cache. The kernel does not write it directly to storage. It marks the affected pages as dirty and returns success to the caller. Background writeback later picks up the dirty pages and writes them out lazily, in batches, to optimize write throughput.

So when your code does this:

use std::fs::File;
use std::io::Write;

let mut file = File::create("data.bin")?;
file.write_all(&payload)?;
// payload is in kernel buffer, not on disk

The write succeeded in the kernel's accounting. It has not reached persistent storage.

The gap between "write returned OK" and "data is on disk" is the window where a crash causes data loss. For an application that just saved a user's document, this might mean a few seconds of lost work. For a database that just told a client their write succeeded, it means a durability lie.

What fsync actually does

fsync() transfers all modified in-core data of the file referred to by the file descriptor to the disk device so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.

That blocking is the important part. fsync does not return until the device reports that the data is on non-volatile storage. Your program waits. The disk works. When fsync returns successfully, you have a guarantee, as long as the drive honors cache flushes instead of acknowledging them early.

use std::fs::File;
use std::io::Write;

let mut file = File::create("data.bin")?;
file.write_all(&payload)?;
file.sync_all()?; // blocks until disk confirms
// now you can tell the client OK

The cost is real. Compared to a buffered write, fsync is slow. A modern NVMe SSD can handle tens of thousands of fsync operations per second under ideal conditions, but random synchronous writes at scale add up fast. Every database that takes durability seriously has to decide where to put this cost and how to amortize it.
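One common way to amortize the cost is to let a single fsync cover several writes. The sketch below is only an illustration: the function names `write_sync_each` and `write_sync_batch` are made up for this example, and the timings it returns will vary widely by hardware and filesystem, so no numbers are claimed here.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;
use std::time::{Duration, Instant};

/// Writes each record and fsyncs after every one.
pub fn write_sync_each(path: &Path, records: &[&[u8]]) -> io::Result<Duration> {
    let mut file = File::create(path)?;
    let start = Instant::now();
    for record in records {
        file.write_all(record)?;
        file.sync_all()?; // one disk round trip per record
    }
    Ok(start.elapsed())
}

/// Writes all records, then fsyncs once for the whole batch.
pub fn write_sync_batch(path: &Path, records: &[&[u8]]) -> io::Result<Duration> {
    let mut file = File::create(path)?;
    let start = Instant::now();
    for record in records {
        file.write_all(record)?;
    }
    file.sync_all()?; // one disk round trip for the batch
    Ok(start.elapsed())
}
```

Both functions leave the same bytes on disk. The difference is that the batched version cannot acknowledge any record until the single fsync at the end has returned.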

There is also one subtlety worth knowing: fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that, an explicit fsync() on a file descriptor for the directory is also needed. This matters for atomic file replacement. If you write to a temp file and rename it over the old one, you need to fsync the directory too, or the rename may not survive a crash even if the file contents did.

Enter the write-ahead log

The naive solution to the durability problem is to fsync every write. Write the data, fsync, return OK to the client. That is correct but painful because every client write now pays the full disk latency cost synchronously.

The write-ahead log is the solution that most serious databases have converged on.

The name Write-Ahead Log says it all. It is a log that gets written before any risky change actually touches the database files.

Instead of writing changed data directly to its final location on disk, the database first appends a record of the change to a sequential log file, fsyncs that log entry, and only then acknowledges the write to the client. The actual data structures are updated separately, in the background, without being on the critical path for the client's write.

The write path looks like this:

client: SET config:env production

1. append entry to WAL:
   [seq:001][term:1][SET config:env production][checksum]

2. fsync WAL segment

3. update in-memory state:
   map.insert("config:env", "production")

4. return OK to client

The disk write is to the WAL, not to the primary data structure. And WAL writes are sequential appends, which are significantly faster than random writes to arbitrary locations in a data file.

If we follow this procedure, we do not need to flush data pages to disk on every transaction commit, because we know that in the event of a crash we will be able to recover the database using the log: any changes that have not been applied to the data pages can be redone from the WAL records.
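The four steps of the write path can be sketched in a few dozen lines. This is a minimal illustration, not Vaylix's code: the `Store` type, the frame layout (sequence, length, payload, checksum, with no term field), and the FNV-1a checksum are all assumptions made for the example. It also skips recovery, so `open` always starts at sequence 1.

```rust
use std::collections::HashMap;
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// FNV-1a, a stand-in checksum for this sketch (real WALs usually use a CRC).
fn fnv1a(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5;
    for b in bytes {
        hash ^= u32::from(*b);
        hash = hash.wrapping_mul(0x0100_0193);
    }
    hash
}

/// Hypothetical frame: [seq: u64][len: u32][payload][checksum: u32], little-endian.
fn encode_entry(seq: u64, payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(16 + payload.len());
    buf.extend_from_slice(&seq.to_le_bytes());
    buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    buf.extend_from_slice(payload);
    let checksum = fnv1a(&buf);
    buf.extend_from_slice(&checksum.to_le_bytes());
    buf
}

pub struct Store {
    wal: File,
    next_seq: u64,
    map: HashMap<String, String>,
}

impl Store {
    /// Opens the WAL in append mode. Recovery is out of scope for this sketch.
    pub fn open(wal_path: &Path) -> io::Result<Store> {
        let wal = OpenOptions::new().create(true).append(true).open(wal_path)?;
        Ok(Store { wal, next_seq: 1, map: HashMap::new() })
    }

    /// Returns the sequence number only after the entry has been fsynced.
    pub fn set(&mut self, key: &str, value: &str) -> io::Result<u64> {
        let seq = self.next_seq;
        let payload = format!("SET {} {}", key, value);
        self.wal.write_all(&encode_entry(seq, payload.as_bytes()))?; // 1. append to WAL
        self.wal.sync_all()?; // 2. fsync the WAL
        self.map.insert(key.to_string(), value.to_string()); // 3. update in-memory state
        self.next_seq += 1;
        Ok(seq) // 4. the caller may now tell the client OK
    }

    pub fn get(&self, key: &str) -> Option<&str> {
        self.map.get(key).map(String::as_str)
    }
}
```

If `sync_all` returns an error, `set` returns before touching the map, so the in-memory state never gets ahead of what the log can replay.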

What crash recovery actually looks like

When the process starts after a crash, it cannot trust its in-memory state because that state is gone. It cannot trust its primary data structures entirely because they may reflect partial writes. What it can trust is the WAL, because every entry in the WAL was fsynced before the client was told OK.

Recovery is replay:

startup:
  1. load last known good snapshot (if any)
  2. find WAL segments that postdate the snapshot
  3. replay each entry in strict sequence order
  4. skip entries past the last committed sequence
  5. in-memory state is now consistent with last committed write

This is deterministic. Given the same WAL, recovery always produces the same state. There is no ambiguity about what happened before the crash.
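The replay steps above reduce to a short loop. The sketch below assumes entries have already been decoded and validated; the `Entry` shape and the `recover` signature are hypothetical, and the snapshot is modeled as a plain map.

```rust
use std::collections::BTreeMap;

/// A decoded WAL entry (hypothetical shape for this sketch).
pub struct Entry {
    pub seq: u64,
    pub key: String,
    pub value: String,
}

/// Rebuilds state from a snapshot plus the WAL entries that postdate it.
/// `entries` must be in sequence order.
pub fn recover(
    snapshot: BTreeMap<String, String>,
    snapshot_seq: u64,
    entries: &[Entry],
    last_committed: u64,
) -> BTreeMap<String, String> {
    let mut state = snapshot;
    for entry in entries {
        if entry.seq <= snapshot_seq {
            continue; // already reflected in the snapshot
        }
        if entry.seq > last_committed {
            break; // past the last committed sequence: do not apply
        }
        state.insert(entry.key.clone(), entry.value.clone());
    }
    state
}
```

The function has no hidden inputs, which is where the determinism comes from: the same snapshot and the same entries always yield the same map.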

The sequence numbers matter. If a WAL segment has a gap, the entries after the gap are suspect. If entries are out of order, replay cannot be trusted. If a checksum fails on a WAL entry, that entry is corrupt and recovery should fail closed rather than apply a corrupted change.

In Vaylix, startup recovery fails closed on any of these conditions:

gap in sequence numbers → startup error, actionable message
out of order entries    → startup error
checksum mismatch       → startup error
unsupported format      → startup error

Failing closed is the correct choice. A database that recovers silently from a corrupt WAL and presents a subtly wrong state is more dangerous than one that refuses to start and tells you exactly what is wrong.
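A sequence check that fails closed might look like the following. This is a sketch under assumptions: the `RecoveryError` type and `check_sequence` function are invented for illustration and cover only the gap and ordering conditions, not checksums or format versions.

```rust
use std::fmt;

#[derive(Debug, PartialEq)]
pub enum RecoveryError {
    Gap { expected: u64, found: u64 },
    OutOfOrder { previous: u64, found: u64 },
}

impl fmt::Display for RecoveryError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RecoveryError::Gap { expected, found } => {
                write!(f, "WAL gap: expected sequence {}, found {}", expected, found)
            }
            RecoveryError::OutOfOrder { previous, found } => {
                write!(f, "WAL out of order: sequence {} follows {}", found, previous)
            }
        }
    }
}

/// Verifies that `seqs` continues directly from `snapshot_seq` with no gaps
/// or reordering. Returns the last sequence seen, or the first problem found.
pub fn check_sequence(snapshot_seq: u64, seqs: &[u64]) -> Result<u64, RecoveryError> {
    let mut previous = snapshot_seq;
    for &found in seqs {
        if found <= previous {
            return Err(RecoveryError::OutOfOrder { previous, found });
        }
        if found != previous + 1 {
            return Err(RecoveryError::Gap { expected: previous + 1, found });
        }
        previous = found;
    }
    Ok(previous)
}
```

Returning the first error instead of skipping the bad entry is the fail-closed part: the caller gets a message naming the exact sequence number to investigate and nothing is applied past it.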

The snapshot problem

WAL segments cannot grow forever. If a process has run for months and handled millions of writes, replaying the entire WAL history on every restart is not practical.

The solution is snapshots. Periodically, the database serializes its entire current state to disk as a point-in-time snapshot. After a successful snapshot, WAL segments that predate it can be discarded. Recovery becomes: load the snapshot, then replay only the WAL entries that came after it.

But snapshots introduce their own failure modes.

What if the process crashes while writing the snapshot? A half-written snapshot file is worse than no snapshot at all, because you might load it and think you have valid state when you have garbage.

The solution is the atomic rename pattern:

1. serialize current state
2. write to a temporary file: snapshot.tmp
3. fsync the temporary file
4. fsync the parent directory
5. atomically rename snapshot.tmp → snapshot
6. fsync the parent directory again
7. write manifest pointing to new snapshot
8. fsync manifest
9. prune old WAL segments

The rename is atomic on POSIX filesystems, as long as the temporary file and the target live on the same filesystem. Either the old snapshot is there or the new one is. There is no state where a half-written file is the active snapshot.

The directory fsyncs around the rename are easy to forget and important. Without them, the rename itself may not survive a crash even if the file contents are fine.
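Steps 2 through 6 of the pattern translate fairly directly into Rust's standard library. The sketch below makes a few assumptions: `write_snapshot` and `sync_dir` are hypothetical names, the manifest and WAL pruning steps are left out, and opening a directory with `File::open` in order to fsync it works on Unix-like systems but is not portable to Windows.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// fsyncs a directory so that entries created or renamed inside it are durable.
/// Assumes a Unix-like system, where a directory can be opened read-only.
fn sync_dir(dir: &Path) -> io::Result<()> {
    File::open(dir)?.sync_all()
}

/// Replaces `dir/snapshot` with `bytes` using the temp-file-and-rename pattern.
pub fn write_snapshot(dir: &Path, bytes: &[u8]) -> io::Result<()> {
    let tmp = dir.join("snapshot.tmp");
    let dst = dir.join("snapshot");

    let mut file = File::create(&tmp)?; // write to a temporary file
    file.write_all(bytes)?;
    file.sync_all()?; // fsync the temporary file
    drop(file);

    sync_dir(dir)?; // fsync the parent directory
    fs::rename(&tmp, &dst)?; // atomically rename snapshot.tmp to snapshot
    sync_dir(dir)?; // fsync the parent directory again
    Ok(())
}
```

A crash before the rename leaves the previous snapshot in place, plus a stray `snapshot.tmp` that the next call truncates and overwrites.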

WAL segments and retention

A single WAL file that grows indefinitely would be inefficient to manage. Most implementations split the WAL into fixed-size segments.

In Vaylix, the active segment is named to reflect its starting sequence:

wal/
  active-001001.wal       ← current segment, still being written
  000001-000500.wal       ← sealed segment
  000501-001000.wal       ← sealed segment

When the active segment reaches the configured size limit, it is sealed: renamed to include its sequence range, and a new active segment is opened. Segments older than the configured retention window are pruned after a successful snapshot.

The sealed segment names encode their sequence range deliberately. During recovery, the system can determine the correct replay order from the filenames alone, without needing to open every segment to find its position.
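Deriving replay order from filenames can be sketched as follows. The parsing rules here are inferred from the directory listing above, not taken from Vaylix's source, and `parse_sealed` and `replay_order` are hypothetical helpers.

```rust
/// True if `s` is a non-empty run of ASCII digits.
fn all_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// Parses a sealed segment name such as "000001-000500.wal" into (1, 500).
/// Returns None for anything else, including the active segment.
pub fn parse_sealed(name: &str) -> Option<(u64, u64)> {
    let stem = name.strip_suffix(".wal")?;
    let (first, last) = stem.split_once('-')?;
    if !all_digits(first) || !all_digits(last) {
        return None;
    }
    let first: u64 = first.parse().ok()?;
    let last: u64 = last.parse().ok()?;
    if first > last {
        return None;
    }
    Some((first, last))
}

/// Returns the sealed segment names sorted by starting sequence.
pub fn replay_order(names: &[&str]) -> Vec<String> {
    let mut sealed: Vec<(u64, u64, &str)> = names
        .iter()
        .filter_map(|&name| parse_sealed(name).map(|(first, last)| (first, last, name)))
        .collect();
    sealed.sort();
    sealed.into_iter().map(|(_, _, name)| name.to_string()).collect()
}
```

Sorting on the parsed numbers instead of the raw strings keeps the order correct even if the zero padding width ever changes.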

Point-in-time recovery

A WAL that records every change is also a time machine.

If you have a snapshot from Monday night and all WAL segments since then, you can replay the WAL to any point in time: to just before a bad deploy at 2pm Tuesday, to exactly the state the database was in when a bug was first reported, or to any sequence number you choose.

This is what PITR (point-in-time recovery) means in practice. It is not magic. It is just WAL replay stopped at a specific boundary.

vaylix pitr restore \
  --source-dir /var/lib/vaylix \
  --target-dir /var/lib/vaylix-restored \
  --to-timestamp-ms 1749200000000

The restore writes to a new target directory and never touches the source. If the restore produces the wrong state, you still have the original data intact to try again.
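Underneath a command like that, the core of PITR is replay with a stopping condition. The sketch below is an assumption-laden illustration, not the Vaylix implementation: it supposes each entry carries a millisecond timestamp, that timestamps never decrease as sequence numbers grow, and that the boundary is inclusive.

```rust
use std::collections::BTreeMap;

/// A decoded WAL entry with a commit timestamp (hypothetical shape).
pub struct Entry {
    pub seq: u64,
    pub timestamp_ms: u64,
    pub key: String,
    pub value: String,
}

/// Replays `entries` on top of `snapshot`, stopping at the first entry newer
/// than `to_timestamp_ms`. Returns the restored state and the last applied
/// sequence number, if any entry was applied.
pub fn replay_until(
    snapshot: BTreeMap<String, String>,
    entries: &[Entry],
    to_timestamp_ms: u64,
) -> (BTreeMap<String, String>, Option<u64>) {
    let mut state = snapshot;
    let mut last_applied = None;
    for entry in entries {
        if entry.timestamp_ms > to_timestamp_ms {
            break; // boundary reached: everything after this is left out
        }
        state.insert(entry.key.clone(), entry.value.clone());
        last_applied = Some(entry.seq);
    }
    (state, last_applied)
}
```

Returning the last applied sequence number lets the caller report exactly where the restore stopped.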

What this cost in practice

Building the WAL in Vaylix surfaced a few things that are easy to get wrong.

The WAL I/O worker runs on a dedicated thread, separate from the engine coordinator that assigns sequence numbers. Sequence assignment is fast and in-memory. The disk I/O is pushed off the hot path. But those two things have to stay in sync: the engine coordinator must not acknowledge a write to a client until the WAL I/O worker confirms the corresponding entry has been fsynced.

Getting that handoff wrong in either direction is a bug. Acknowledge too early and you have the durability problem again. Acknowledge too late, for example by making every write wait for fsyncs it does not depend on, and you serialize unnecessarily and hurt throughput.
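One way to structure that handoff is with channels: the engine sends the encoded entry along with a reply channel, and only returns to its caller once the worker has answered. The sketch below uses only `std::sync::mpsc`. The names `WalRequest`, `spawn_wal_worker`, and `commit` are hypothetical, and draining the queue so that one fsync covers several entries is a design choice of this example, not a description of how Vaylix does it.

```rust
use std::fs::File;
use std::io::Write;
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Bytes to append, plus a channel on which the worker reports the outcome
/// once those bytes are durable.
pub struct WalRequest {
    pub seq: u64,
    pub bytes: Vec<u8>,
    pub done: Sender<Result<u64, String>>,
}

/// Starts the WAL I/O thread. It exits when every request sender is dropped.
pub fn spawn_wal_worker(mut file: File) -> (Sender<WalRequest>, JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<WalRequest>();
    let handle = thread::spawn(move || {
        // Block for one request, then drain whatever else is already queued
        // so a single fsync can cover the whole batch.
        while let Ok(first) = rx.recv() {
            let mut batch = vec![first];
            while let Ok(next) = rx.try_recv() {
                batch.push(next);
            }
            let mut result: Result<(), String> = Ok(());
            for req in &batch {
                if let Err(e) = file.write_all(&req.bytes) {
                    result = Err(e.to_string());
                    break;
                }
            }
            if result.is_ok() {
                if let Err(e) = file.sync_all() {
                    result = Err(e.to_string());
                }
            }
            // Only now, after the fsync attempt, does anyone hear back.
            for req in batch {
                let _ = req.done.send(result.clone().map(|_| req.seq));
            }
        }
    });
    (tx, handle)
}

/// Engine side: blocks until the worker confirms the entry was fsynced.
pub fn commit(wal: &Sender<WalRequest>, seq: u64, bytes: Vec<u8>) -> Result<u64, String> {
    let (done, ack) = mpsc::channel();
    wal.send(WalRequest { seq, bytes, done }).map_err(|e| e.to_string())?;
    ack.recv().map_err(|e| e.to_string())?
}
```

Because `commit` cannot return `Ok` before the worker has sent its reply, the "too early" bug is ruled out by construction. A real engine would more likely hand the reply channel to an async task than block a thread on it.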

The checksum on each WAL entry is also not optional. Without it, a partial write to the WAL is indistinguishable from a complete one. The checksum is what lets recovery distinguish a truncated entry, which happens when the process is killed mid-append, from a valid entry that happens to contain unusual bytes.
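A decoder can make that distinction explicit. The following sketch assumes a hypothetical frame layout of sequence, length, payload, and checksum, and uses FNV-1a as a stand-in where a production WAL would more likely use a CRC. None of these names or layout details come from Vaylix.

```rust
#[derive(Debug, PartialEq)]
pub enum Decoded {
    Entry { seq: u64, payload: Vec<u8>, consumed: usize },
    Truncated, // the buffer ends before the frame does
    Corrupt,   // the frame is complete but the checksum does not match
}

/// FNV-1a, a stand-in checksum for this sketch.
fn fnv1a(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5;
    for b in bytes {
        hash ^= u32::from(*b);
        hash = hash.wrapping_mul(0x0100_0193);
    }
    hash
}

fn read_u32(bytes: &[u8]) -> u32 {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(bytes);
    u32::from_le_bytes(raw)
}

fn read_u64(bytes: &[u8]) -> u64 {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(bytes);
    u64::from_le_bytes(raw)
}

/// Hypothetical frame: [seq: u64][len: u32][payload][checksum: u32], little-endian.
pub fn encode(seq: u64, payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(16 + payload.len());
    buf.extend_from_slice(&seq.to_le_bytes());
    buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    buf.extend_from_slice(payload);
    let checksum = fnv1a(&buf);
    buf.extend_from_slice(&checksum.to_le_bytes());
    buf
}

/// Decodes the frame at the start of `buf`.
pub fn decode(buf: &[u8]) -> Decoded {
    const HEADER: usize = 12; // 8 bytes of seq + 4 bytes of len
    if buf.len() < HEADER {
        return Decoded::Truncated;
    }
    let seq = read_u64(&buf[0..8]);
    let len = read_u32(&buf[8..12]) as usize;
    let total = HEADER + len + 4;
    if buf.len() < total {
        return Decoded::Truncated;
    }
    let stored = read_u32(&buf[HEADER + len..total]);
    if fnv1a(&buf[..HEADER + len]) != stored {
        return Decoded::Corrupt;
    }
    Decoded::Entry { seq, payload: buf[HEADER..HEADER + len].to_vec(), consumed: total }
}
```

One limitation of this layout: if the length field itself is damaged, a corrupt entry can be misread as truncated. Checksumming the header separately is one way to close that hole.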

Directory fsyncs, as mentioned above, are easy to forget. They are not obviously required by the code. They show up as flaky data loss that only reproduces under specific crash conditions during specific filesystem operations. They are the kind of thing that only gets discovered through careful testing or production incidents.

The short version

write() buffers data in the kernel. The data is not on disk until something forces it there.

fsync() forces it there and blocks until the hardware confirms.

A write-ahead log puts fsync on the path of every acknowledged write, but makes that cost manageable by writing to a sequential append-only log rather than to arbitrary locations in the primary data structure.

On crash, replay the WAL to reconstruct state. On growth, take snapshots and prune old segments.

The details are subtle: sequence gaps, checksums, directory fsyncs, the atomic rename pattern, the ordering of snapshot and WAL pruning steps. Each one is a failure mode that shows up in the wrong conditions at the wrong time.

But the core idea is simple. Write first, confirm later, replay if needed.

Vaylix is an open source key-value database engine built around these principles. The WAL implementation, crash recovery path, and snapshot logic are all in the public repository.

vaylix / vaylix

A key-value database engine in Rust. Custom binary protocol, RBAC, encrypted WAL persistence, Raft-style replication, TLS/mTLS, and binary-safe values with versioned compare-and-set.

Vaylix

Vaylix is a Rust key/value database built around a strict transport boundary:

client -> transport -> TCP/TLS -> transport -> server -> engine

The current server stores UTF-8 keys with opaque byte values using segmented WAL plus encrypted snapshot persistence. It includes a shared framed binary transport, a Tokio multi-client server, authentication with RBAC, optional TLS/mTLS, default-on frame compression, logical backup/restore commands, offline PITR-oriented storage subcommands, maintenance mode, hash-chained audit logging, and Raft-style HA replication with automatic leader election and quorum-backed writes.

Detailed architecture context lives in LLM.md. Benchmark guidance lives in BENCHMARKING.md. Stability and compatibility contracts live in STABILITY.md, COMPATIBILITY_1_0.md, ERROR_CODES.md, NON_GOALS.md, and DEPLOYMENT.md.

Downloads

Release binaries are published from tagged releases:

Server and client archives: https://github.com/vaylix/vaylix/releases
Server image: ghcr.io/vaylix/vaylix:latest
Versioned server image example: ghcr.io/vaylix/vaylix:0.9.0

Release builds also publish SBOMs and keyless Sigstore/cosign attestations.

Run with Docker

docker pull ghcr.io/vaylix/vaylix:latest
docker

…
