I Built a Storage Engine from Scratch in C++ — WAL, Raft, and Object Storage
I wanted to understand one thing: how does data actually survive a crash?
Not what the documentation says. Not what the abstraction promises. What actually happens at the byte level when a process dies mid-write, and how a storage engine recovers from it.
So I built VeriStore — a correctness-first key-value storage engine in C++, built from first principles, evolving from a simple in-memory store to a Raft-replicated distributed system with a mini S3-style object storage layer.
Why Build a Storage Engine?
Every database, stream processor, and distributed system you've ever used is built on top of primitives like:
- Write-Ahead Logging (WAL)
- Crash-consistent recovery
- Group commit batching
- Consensus replication Understanding these primitives doesn't just make you a better infrastructure engineer — it makes you better at using these systems because you understand what guarantees they actually provide.
How It Works
v0.1 — In-Memory KV Store
The foundation: a thread-safe key-value map with PUT, GET, and DEL operations, protected by a reader-writer lock using std::shared_mutex.
Simple, but sets the pattern for everything above it.
v0.2 — Write-Ahead Log + Crash Recovery
The first real challenge: making writes survive crashes.
The WAL is an append-only log. Every write is recorded in the log before being applied to the in-memory map. On startup, the log is replayed to reconstruct state.
PUT x 100 → append to WAL → apply to map
PUT y 200 → append to WAL → apply to map
FLUSH → fsync to disk
[crash]
restart → replay WAL → x=100, y=200 ✓
CRC validation detects partial or torn writes — if a record is incomplete, it's ignored and replay stops at that point.
Guarantee: If a write returns OK, it survives crashes.
v0.3 — Snapshots + Log Compaction
Replaying the full WAL on every restart gets slow as the log grows. Snapshots solve this:
- Serialize the current in-memory state to disk
- Truncate the WAL to remove entries before the snapshot
- On restart: load snapshot, then replay only the recent WAL entries This keeps recovery time bounded regardless of how long the system has been running.
v0.4 — Group Commit + Performance
fsyncing every write is correct but slow. Group commit batches writes and fsyncs at boundaries:
| Mode | Setting | Throughput |
|---|---|---|
| Immediate flush | SETBATCH 1 | 39,216 ops/s |
| Group commit | SETBATCH 5 | 104,167 ops/s |
~2.7× throughput improvement by reducing fsync frequency. This is the same technique used by PostgreSQL, RocksDB, and etcd.
v0.5 — Raft Consensus Replication
Single-node durability is not enough for production systems. Raft makes the storage engine fault-tolerant across a cluster:
- Leader election — nodes elect a leader via randomized timeouts
- Log replication — the leader replicates writes to followers before committing
- Majority quorum commit — a write is committed only when a majority of nodes acknowledge it
- Follower catch-up — a crashed follower replays missed entries on restart Example output from the Raft demo:
[raft] node 3 became LEADER term=1
ProposePut(a=100) -> true
s1.get(a)=100 s2.get(a)=100 s3.get(a)=100
=== Simulating leader crash ===
[raft] node 2 became LEADER term=2
ProposePut(b=200) -> true
s1.get(b)=200 s2.get(b)=200 s3.get(b)=200
Guarantee: The cluster remains consistent despite node failures.
v0.6–v0.8 — Object Storage Layer
On top of the KV engine, I built a mini S3-style object storage system:
-
Bucket creation and object
PUT/GET/DELETE - Chunked storage — large objects are split into fixed-size chunks, each stored as a KV entry
- Metadata-based commit — object metadata is written last and acts as the commit point. An object is valid only if committed metadata exists.
-
Prefix-based listing —
ListObjects(bucket, prefix)via prefix scans over the bucket index namespace - Overwrite semantics — new metadata commits atomically replace previous versions
- Mark-and-sweep garbage collection — orphaned chunks from overwrites are reclaimed The commit semantics are the key insight:
1. Write chunk data → KV entries
2. Write metadata → commit point
Recovery: objects without committed metadata are ignored
This makes crash recovery deterministic — you either have the full object or nothing.
Failure Scenarios Tested
- ✔ Process crash (
kill -9) - ✔ Partial disk writes
- ✔ Leader node crash
- ✔ Follower crash and recovery
- ✔ Log divergence repair
- ✔ Replica catch-up via log backtracking
What I Learned
fsync is the durability boundary. A write is only durable once it's fsynced. Group commit is the standard tradeoff — batch writes, fsync at boundaries, accept a small window of potential data loss.
The commit point is everything. Whether it's a WAL record, a metadata entry, or a Raft log index — the commit point is the line between "this happened" and "this might not have happened." Design your commit points explicitly.
Raft is simpler than it looks, but the edge cases are brutal. The basic algorithm is straightforward. But leader crash during replication, log divergence between followers, and split-brain scenarios each required careful handling.
Mac is more forgiving than Linux. The codebase compiled perfectly on macOS but failed on Linux GCC because Apple's headers include <mutex> indirectly. Always test on Linux before shipping.
Try It Yourself
git clone https://github.com/NasitSony/VeriStore.git
cd VeriStore
cmake -S . -B build
cmake --build build
# Run the KV CLI
./build/kv_cli
# Run the Raft demo
./build/raft_demo
GitHub: https://github.com/NasitSony/VeriStore
VeriStore is the storage foundation of a larger AI infrastructure stack I've been building. The next layer up is llm-serving-cache — a KV-cache placement and routing control plane for LLM inference, backed by VeriStore.
If you found this useful, a ⭐ on GitHub goes a long way!
Top comments (0)