- Why strong ACID guarantees matter for a storage engine
- Write-Ahead Log: engineering the ordering, fsync boundaries, and recovery path
- Buffer pool & memory hierarchy: keeping hot pages hot and bounding latency
- MVCC mechanics: snapshots, visibility rules, and transaction lifecycle
- Crash recovery and checkpoints: ARIES-style redo/undo and automated tests
- Practical application: checklists, code patterns, and crash-testing recipes
Durability and isolation are the contract you make with users when you accept their writes; violating that contract produces silent, intermittent corruption that ruins trust faster than any performance bug. Implementing a storage engine that holds up under crash, concurrency, and operational mistakes requires aligning a correct write-ahead log, a well-behaved buffer pool, and a rigorous MVCC model — and proving it with automated crash recovery tests.
You are seeing three common, related failures: (1) committed transactions that vanish after a crash, (2) long-tail latency spikes during checkpoints or flushes, and (3) runaway storage growth because multi-versioned rows never get reclaimed. Those symptoms point at the same root causes: broken ordering between log and page writes, weak or mis-tuned buffer-pool lifecycle management, and MVCC garbage-collection that lacks a safe horizon. The remedy is not clever heuristics — it's engineering discipline: log-first ordering (WAL); explicit, testable fsync boundaries; deterministic snapshot visibility; and repeatable crash-and-recover tests.
Why strong ACID guarantees matter for a storage engine
ACID is not academic punctuation — it's the operational contract: Atomicity and Durability give users confidence that a commit means their change will survive crashes; Isolation prevents subtle anomalies under concurrency. The transaction model and the log manager are the parts of a storage engine that make that contract testable and auditable. Real-world audits and fault-injection tests show that small deviations from these guarantees produce correlated, hard-to-diagnose failures (lost increments, split-brain state in replicas, stale secondary reads) that persist through backups and replication.
Measurable goals you should instrument from the start:
- Durable commit correctness: 100% of committed transactions remain visible after forced crash/restart (per-test).
- Recovery time objective: target a deterministic maximum recovery time (e.g., restart and accept traffic within 30s for a 1TB dataset).
- p99 read latency under normal load: track the baseline and the delta introduced by checkpointing.

Those are the business metrics that link your low-level engine choices to operational risk.
Important: The storage engine is the authoritative source of truth. If log ordering, buffer flushing, or MVCC visibility are wrong, application-level retries will not rescue the data.
Write-Ahead Log: engineering the ordering, fsync boundaries, and recovery path
The central rule is simple and non-negotiable: persist the log that describes a change before you make the on-disk data reflect that change. The log is law: write-ahead logging gives you atomicity and durability at crash time because recovery replays (redo) the log to reconstruct the committed state and rolls back (undo) uncommitted changes. Practically this means: append commit records to the WAL, ensure the WAL commit record hits stable storage (via `fsync()` or equivalent), and only then consider the transaction durable. The canonical recovery architecture (redo then undo) comes from the ARIES family of algorithms and is the basis for modern engines' recovery passes.
Key WAL design elements
- Record format: `LSN | txid | prev_lsn | type | payload | checksum` (LSN = log sequence number). Keep fixed-size headers for fast scans; append payloads for variable data.
- Durable commit: a commit record must be persisted to stable storage before the engine reports success to clients. Use a stable LSN to drive later page flushing.
- Group commit: coalesce multiple commit records into the same disk sync window to amortize `fsync()` latency.
- Checkpointing: move durable changes from WAL into data files and advance the checkpoint LSN so recovery scans start from a later point. Checkpoint frequency trades restart time against foreground latency; tune it to meet recovery-time objectives.
Practical WAL append pseudocode (simplified, C++-style):
```cpp
struct WALRecord { uint64_t lsn; uint64_t txid; uint32_t type; std::vector<char> payload; uint32_t crc; };

uint64_t wal_append(int wal_fd, const WALRecord &rec) {
    auto buf = serialize(rec);  // produce bytes with header + payload
    ssize_t n = pwrite(wal_fd, buf.data(), buf.size(), wal_tail_offset);
    if (n != (ssize_t)buf.size()) throw std::runtime_error("short WAL write");
    // make durable before returning the committed LSN
    fdatasync(wal_fd);          // or fsync(wal_fd) depending on platform
    uint64_t assigned_lsn = update_in_memory_tail(buf.size());
    return assigned_lsn;
}
```
Notes on fsync() and durability: fsync() (and fdatasync()) are the system guarantees that in-core buffers get synchronized to the underlying storage device; relying on the VFS or OS without calling an explicit sync exposes you to power-loss windows and caching behavior. Group commit and background flush threads reduce fsync() pressure while preserving safety.
SQLite’s WAL mode illustrates the separation of commit (append) and checkpoint: commits append to the WAL and readers consult the WAL-index for the correct page version; the checkpoint transfers WAL contents back into the database file later, making commits fast most of the time and occasionally slower when checkpoints run. ARIES then formalizes the recovery pass you must implement — redo from the checkpoint LSN forward, then undo for transactions still active at the crash point.
Buffer pool & memory hierarchy: keeping hot pages hot and bounding latency
Your buffer pool is the primary lever for read latency and for controlling write amplification. Design it with explicit page states and a deterministic lifecycle: pinned (in-use), dirty (modified in-memory), clean (not modified), and evictable (candidate for eviction). Maintain a pin-count and LRU/clock-like policy; do not rely on implicit OS caching to replace a proper buffer pool strategy.
Core buffer-pool responsibilities
- Pin/unpin semantics around I/O and latching to prevent tearing during concurrent access.
- A low-latency path for reads from memory; page faults go to async I/O to avoid blocking the foreground thread.
- Asynchronous flusher: a background thread writes `dirty` pages to disk in LSN order up to the stable checkpoint to constrain recovery work.
- Checkpoint coordination: checkpoints should copy pages up to a target LSN; they must avoid overwriting pages in use by active readers.
Example page lifecycle snippet (pseudo):

```
read_page(page_id):
    if page in buffer and not being evicted: pin and return
    else: read from disk into buffer, pin, return

write_page(page):
    pin page
    mark dirty with new LSN
    unpin page
    schedule for background flush
```
Sizing guidance and realities: for dedicated storage nodes, engines commonly allocate a large fraction of RAM to the buffer pool (MySQL/InnoDB documentation suggests up to ~80% for dedicated servers) to keep hot data resident and reduce I/O pressure; this must be balanced against OS needs and other processes. The buffer pool algorithm choice (single LRU list vs. multi-queue or segmented LRU) matters when the workload has both scanning and hotspot access patterns.
Performance knobs you will tune:
- Buffer pool size and number of instances (reduce contention).
- Dirty-page threshold to trigger flush threads.
- Eviction policy aging windows to avoid evicting pages that will be reused soon.
- Asynchronous write size and concurrency.
MVCC mechanics: snapshots, visibility rules, and transaction lifecycle
MVCC gives you concurrency without turning reads into full stop-the-world operations. In a typical MVCC implementation (the one PostgreSQL uses as a robust example), each tuple (row) carries metadata for the creating transaction and the deleting transaction — usually fields like xmin and xmax — which, combined with a transaction snapshot, determine visibility. A snapshot is a lightweight description of which transactions were in-progress at the snapshot time (often stored as xmin, xmax and an active_txn_list) rather than a physical copy of the database.
Tuple-version example (conceptual):

```
TupleVersion {
    TxId xmin;     // transaction that created this version
    TxId xmax;     // transaction that deleted/replaced this version (0 == alive)
    Payload data;
    LSN lsn;       // LSN at which this version was created (optional, for correlation)
}
```
Read path (high level)
- Acquire a snapshot at statement or transaction start (depending on isolation level).
- For each tuple, evaluate visibility against the snapshot: visible if `xmin` committed before the snapshot and `xmax` did not commit before the snapshot (details depend on engine).
- Return visible versions; do not block writers.
Write path (high level)
- For `UPDATE`: create a new version with `xmin = current_txid`, and set `xmax` on the old version to the same txid when the update commits (or during the update, depending on in-place update policy).
- Writers serialize conflicting writes via row-level locks or by detecting conflicts at commit.
Garbage collection and vacuuming
- MVCC creates historical versions that must be reclaimed safely. The safe reclamation "horizon" equals the oldest active snapshot across the system; versions older than that are unreachable and can be swept.
- Vacuuming or purge threads remove versions below the horizon; if you miss vacuuming you accumulate bloat and slow scans.
Snapshot and isolation edge cases
- Snapshot isolation avoids dirty reads but allows write skew; achieving full serializability requires additional mechanisms (predicate locking, SSI).
- Transaction ID wraparound and long-running snapshots require careful operational guards; engines like PostgreSQL track `xmin`/`xmax` lists and require periodic vacuums.
Crash recovery and checkpoints: ARIES-style redo/undo and automated tests
Recovery design pattern (ARIES-style) you should implement:
- On startup, locate the last checkpoint LSN (written to the control file or a known header).
- Redo pass: scan WAL records from the checkpoint LSN forward and apply idempotent changes to data files up to the end of the log, bringing on-disk state up to the point of crash. Redo is safe because every applied change had its corresponding WAL entry written before it was considered durable.
- Undo pass: identify transactions that were active at crash time (no durable commit record) and apply compensating undo operations to revert their partial effects. Undo can proceed in parallel with accepting connections in many engines, but correctness requires careful sequencing.
Checkpointing design choices
- Incremental versus full checkpoints: incremental checkpoints move the replay start forward while minimizing foreground pauses; full checkpoints truncate the WAL but are costlier.
- Coordinated checkpoints need to honor the oldest reader’s snapshot so they do not overwrite data expected by an active read transaction (SQLite’s WAL-index behavior illustrates reader end-marks and checkpoint stopping logic).
Crash-testing and automated recovery verification
- Use deterministic, repeatable harnesses that:
  - Generate a workload with monotonic markers (sequence numbers, checksums).
  - Periodically force crashes (`kill -9`, stop the VM, or simulate power failure via a test filesystem) at randomized points in the workload.
  - Restart and compare the visible state to the expected post-commit state to detect missing commits or phantom updates.
- Jepsen-style fault injection provides a mature methodology and a library of tests to exercise node-level failures, fsync semantics, and network partitions. Jepsen also recommends filesystem-level fault injection (FUSE) to simulate lost, unsynced writes and to validate your use of `fsync()`.
Simple recovery pseudocode (very high level):

```
on_startup():
    checkpoint_lsn = read_checkpoint()
    redo_from(checkpoint_lsn)
    active_txns = build_active_txn_table()
    parallel_undo(active_txns)
    accept_connections()
```
Practical notes:
- If your WAL or checkpoint metadata is stored separately (for example, a WAL file and a WAL-index like SQLite), make the metadata self-consistent and durable; tests show that mixing file-system semantics and application assumptions causes surprises on some NFS and virtualized filesystems.
- Rely on `fsync()` semantics where specified by POSIX; do not assume the kernel will make your writes durable without explicit synchronization calls. Test on the full range of target platforms and underlying storage (spinning disk, SSD, NVM, virtualized block devices).
Practical application: checklists, code patterns, and crash-testing recipes
Operational checklist — design and implementation
- WAL format: fixed header; per-record `LSN`, `txid`, and `checksum`. Reserve a commit record type and expose a stable `durable_lsn`.
- Commit path: append commit record → persist WAL (group commit or `fsync`) → mark transaction as durable → return success to client → enqueue pages for background flush.
- Buffer pool: implement `pin`/`unpin`, maintain `dirty` flags, and run a background flusher that writes up to the checkpoint LSN. Track pin counts to avoid evicting in-use pages.
- MVCC: store `xmin`/`xmax` or equivalent version metadata; implement snapshot creation that records the active transaction set (or uses a compact representation); implement vacuum/purge threads using the oldest active snapshot as the horizon.
- Checkpoints: incremental checkpoints that move `recovery_lsn` forward without blocking reads; provide an operator-facing tool that can force a restart-time checkpoint for safe backups or upgrades.
- Recovery: implement redo-then-undo, write idempotent apply functions for redo records, and design undo records (or use compensation records) for correct rollback.
Implementation recipe — WAL append & commit (Rust-like pseudocode)
```rust
fn commit(tx: &Transaction, wal: &mut Wal, data_files: &mut DataFiles) -> Result<()> {
    let rec = WalRecord::commit(tx.id, tx.changes());
    let lsn = wal.append(&rec)?;  // append to the WAL file
    wal.fsync()?;                 // durable commit point
    tx.set_durable(lsn);
    // schedule background data-file flushes for pages with page_lsn <= lsn
    data_files.schedule_flush_up_to(lsn);
    Ok(())
}
```
Crash-testing recipe (repeatable harness)
- Create workload generator that writes (key, sequence number) pairs and records expected visible state.
- Start target engine (single-node for unit testing).
- Run the workload with high write concurrency and periodic reads validating sequence monotonicity.
- At random intervals, trigger a crash: `kill -9 <pid>`, or simulate delayed fsync semantics using a test FUSE filesystem that drops unsynced writes (Jepsen-style).
- Restart the engine and validate:
  - All committed sequence numbers are present.
  - No corrupted pages (run checksums or internal consistency checks).
  - Uncommitted transactions were rolled back.
- Repeat thousands of times; automate and record failure histograms to find patterns.
Acceptance checks for a release candidate
- Pass N consecutive crash-recovery runs (N ≥ 1000 for new engines, with a mix of workloads and crash points).
- Verify recovery-time bounds and that WAL growth is controlled across workloads.
- Validate vacuum/purge under long-running read transactions to avoid unbounded MVCC bloat.
Quick validation commands and tools
- Use checksumming of logical state (e.g., aggregated sequence numbers per key) to compare pre-crash expected state and post-crash recovered state.
- Use `strace` or I/O tracing to assert that your commit path issues the expected `pwrite()`/`fsync()` sequence in the correct order during commit.
- Run Jepsen tests or Jepsen-style harnesses to simulate abnormal device behavior and mixed failure modes.
Operational callout: failing to call `fsync()` where you need it, or misordering page writes relative to WAL commits, is by far the most common root cause of silent data loss. Validate at the syscall level and with simulated power-loss tests on each target platform.
Build the parts in the right order, and test the whole with realistic faults. Engineers who treat the WAL as a first-class, auditable artifact — with durable commit semantics, a clear LSN model, and repeatable crash tests — produce engines that survive real operations. Apply the checklist, run the harness, and let the crash logs teach you where assumptions leak. The log is law; design your buffer pool and MVCC to obey that law and your recovery path will be provable.
Sources:
SQLite Write-Ahead Logging - Details on WAL mode semantics, checkpoint behavior, reader end-marks, and practical properties of WAL implementations used as an example for commit/checkpoint separation.
ARIES: A Transaction Recovery Method (IBM Research / ACM) - Foundational description of redo/undo recovery, log ordering, and recovery passes for transactional systems.
Transaction Processing: Concepts and Techniques (Jim Gray & Andreas Reuter) - Classic reference on transaction semantics, log managers, and the theory of ACID for databases.
PostgreSQL MVCC and Concurrency Control (official docs) - Authoritative explanation of snapshot creation, xmin/xmax visibility rules, and MVCC maintenance.
MySQL / InnoDB Recovery and Buffer Pool docs (MySQL Reference Manual) - Practical behavior of InnoDB crash recovery, background rollback, and buffer-pool sizing and eviction behaviors.
Jepsen — Distributed Systems Testing and Fault Injection - Methodology and tooling for crash-injection, fsync-safety tests, and repeatable verification harnesses used to validate durability claims.
fsync(2) and fdatasync(2) manual pages (man7.org) - System-level guarantees for file synchronization methods used to make WAL records durable.