Nasit Sony

Posted on May 29

I Built a Storage Engine from Scratch in C++ — WAL, Raft, and Object Storage

#distributedsystems #cpp #mlop #opensource

I Built a Storage Engine from Scratch in C++ — WAL, Raft, and Object Storage

I wanted to understand one thing: how does data actually survive a crash?

Not what the documentation says. Not what the abstraction promises. What actually happens at the byte level when a process dies mid-write, and how a storage engine recovers from it.

So I built VeriStore — a correctness-first key-value storage engine in C++, built from first principles, evolving from a simple in-memory store to a Raft-replicated distributed system with a mini S3-style object storage layer.

Why Build a Storage Engine?

Every database, stream processor, and distributed system you've ever used is built on top of primitives like:

Write-Ahead Logging (WAL)
Crash-consistent recovery
Group commit batching
Consensus replication Understanding these primitives doesn't just make you a better infrastructure engineer — it makes you better at using these systems because you understand what guarantees they actually provide.

How It Works

v0.1 — In-Memory KV Store

The foundation: a thread-safe key-value map with PUT, GET, and DEL operations, protected by a reader-writer lock using std::shared_mutex.

Simple, but sets the pattern for everything above it.

v0.2 — Write-Ahead Log + Crash Recovery

The first real challenge: making writes survive crashes.

The WAL is an append-only log. Every write is recorded in the log before being applied to the in-memory map. On startup, the log is replayed to reconstruct state.

PUT x 100  → append to WAL → apply to map
PUT y 200  → append to WAL → apply to map
FLUSH      → fsync to disk

[crash]

restart    → replay WAL → x=100, y=200 ✓

CRC validation detects partial or torn writes — if a record is incomplete, it's ignored and replay stops at that point.

Guarantee: If a write returns OK, it survives crashes.

v0.3 — Snapshots + Log Compaction

Replaying the full WAL on every restart gets slow as the log grows. Snapshots solve this:

Serialize the current in-memory state to disk
Truncate the WAL to remove entries before the snapshot
On restart: load snapshot, then replay only the recent WAL entries This keeps recovery time bounded regardless of how long the system has been running.

v0.4 — Group Commit + Performance

fsyncing every write is correct but slow. Group commit batches writes and fsyncs at boundaries:

Mode	Setting	Throughput
Immediate flush	SETBATCH 1	39,216 ops/s
Group commit	SETBATCH 5	104,167 ops/s

~2.7× throughput improvement by reducing fsync frequency. This is the same technique used by PostgreSQL, RocksDB, and etcd.

v0.5 — Raft Consensus Replication

Single-node durability is not enough for production systems. Raft makes the storage engine fault-tolerant across a cluster:

Leader election — nodes elect a leader via randomized timeouts
Log replication — the leader replicates writes to followers before committing
Majority quorum commit — a write is committed only when a majority of nodes acknowledge it
Follower catch-up — a crashed follower replays missed entries on restart Example output from the Raft demo:

[raft] node 3 became LEADER term=1
ProposePut(a=100) -> true
s1.get(a)=100  s2.get(a)=100  s3.get(a)=100

=== Simulating leader crash ===
[raft] node 2 became LEADER term=2
ProposePut(b=200) -> true
s1.get(b)=200  s2.get(b)=200  s3.get(b)=200

Guarantee: The cluster remains consistent despite node failures.

v0.6–v0.8 — Object Storage Layer

On top of the KV engine, I built a mini S3-style object storage system:

Bucket creation and object PUT/GET/DELETE
Chunked storage — large objects are split into fixed-size chunks, each stored as a KV entry
Metadata-based commit — object metadata is written last and acts as the commit point. An object is valid only if committed metadata exists.
Prefix-based listing — ListObjects(bucket, prefix) via prefix scans over the bucket index namespace
Overwrite semantics — new metadata commits atomically replace previous versions
Mark-and-sweep garbage collection — orphaned chunks from overwrites are reclaimed The commit semantics are the key insight:

1. Write chunk data  → KV entries
2. Write metadata    → commit point

Recovery: objects without committed metadata are ignored

This makes crash recovery deterministic — you either have the full object or nothing.

Failure Scenarios Tested

✔ Process crash (kill -9)
✔ Partial disk writes
✔ Leader node crash
✔ Follower crash and recovery
✔ Log divergence repair

- ✔ Replica catch-up via log backtracking

What I Learned

fsync is the durability boundary. A write is only durable once it's fsynced. Group commit is the standard tradeoff — batch writes, fsync at boundaries, accept a small window of potential data loss.

The commit point is everything. Whether it's a WAL record, a metadata entry, or a Raft log index — the commit point is the line between "this happened" and "this might not have happened." Design your commit points explicitly.

Raft is simpler than it looks, but the edge cases are brutal. The basic algorithm is straightforward. But leader crash during replication, log divergence between followers, and split-brain scenarios each required careful handling.

Mac is more forgiving than Linux. The codebase compiled perfectly on macOS but failed on Linux GCC because Apple's headers include <mutex> indirectly. Always test on Linux before shipping.

Try It Yourself

git clone https://github.com/NasitSony/VeriStore.git
cd VeriStore
cmake -S . -B build
cmake --build build

# Run the KV CLI
./build/kv_cli

# Run the Raft demo
./build/raft_demo

GitHub: https://github.com/NasitSony/VeriStore

VeriStore is the storage foundation of a larger AI infrastructure stack I've been building. The next layer up is llm-serving-cache — a KV-cache placement and routing control plane for LLM inference, backed by VeriStore.

If you found this useful, a ⭐ on GitHub goes a long way!

DEV Community

I Built a Storage Engine from Scratch in C++ — WAL, Raft, and Object Storage

I Built a Storage Engine from Scratch in C++ — WAL, Raft, and Object Storage

Why Build a Storage Engine?

How It Works

v0.1 — In-Memory KV Store

v0.2 — Write-Ahead Log + Crash Recovery

v0.3 — Snapshots + Log Compaction

v0.4 — Group Commit + Performance

v0.5 — Raft Consensus Replication

v0.6–v0.8 — Object Storage Layer

Failure Scenarios Tested

- ✔ Replica catch-up via log backtracking

What I Learned

Try It Yourself

Top comments (0)