Saksham Kapoor
I Built a Key-Value Database from Scratch — Here’s What I Learned

Modern storage systems are built around a small set of fundamental principles: optimize for sequential I/O, minimize in-place mutation, and design for recoverability from failure.

Systems such as Bitcask, Kafka, and LSM-based databases all embrace these ideas in different forms. To better understand these trade-offs at a systems level, I built KestrelCache — a minimal persistent key-value storage engine implemented in C# and .NET.

This article focuses on the core mechanics of the storage engine, including its architecture, data layout, write/read paths, failure handling, and the practical trade-offs that emerge from a log-structured design.

1. Problem Framing

At its core, a storage engine must answer a simple question:

How do we store and retrieve data efficiently while preserving correctness under failure?

Traditional designs rely on in-place updates, but these introduce complexity:

  • random disk I/O
  • fragmentation
  • complicated concurrency control
  • difficult crash recovery

KestrelCache instead adopts a log-structured approach, where all writes are appended to disk. This eliminates in-place mutation entirely.

2. System Architecture

The entire system is defined by two components:

In-Memory Index (HashMap<Key, Offset>)
                ↓
Append-Only Log File (Disk)

The responsibilities are clearly separated:

  • Disk provides durability and write throughput
  • Memory provides fast lookup

There are no secondary indexes or complex data structures. The append-only log is the source of truth.

3. Write Path

The write path (Put) is designed for simplicity and performance:

  1. Serialize the key and value into bytes
  2. Compute a CRC32 checksum
  3. Append the record to the end of the log file
  4. Update the in-memory index with the new offset

This design guarantees:

  • strictly sequential disk writes
  • no in-place updates
  • inherent crash safety (append-only)

Sequential writes are significantly more efficient than random writes, especially under sustained workloads.
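The steps above can be sketched in a few lines. This is a minimal Python stand-in (the actual engine is C#), and the byte layout, field widths, and helper names here are illustrative assumptions rather than KestrelCache's exact format:

```python
import struct
import zlib

def encode_record(key: bytes, value: bytes) -> bytes:
    """Serialize one record as [CRC32][KeySize][ValueSize][Key][Value]."""
    # The checksum covers sizes + payload, so a corrupted header is caught too.
    body = struct.pack("<ii", len(key), len(value)) + key + value
    return struct.pack("<I", zlib.crc32(body)) + body

log = bytearray()   # stand-in for the append-only log file
index = {}          # in-memory index: key -> offset of the latest record

def put(key: bytes, value: bytes) -> int:
    offset = len(log)                  # step 3: append at the current end of file
    log.extend(encode_record(key, value))
    index[key] = offset                # step 4: index points at the newest version
    return offset
```

Note that updating a key never touches the old record; it appends a new one and moves the index pointer, which is exactly why every disk write is sequential.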

4. Read Path

Reads (Get) avoid scanning the log entirely:

  1. Look up the key in the in-memory index
  2. Retrieve the corresponding file offset
  3. Seek directly to the location on disk
  4. Read and return the value

This results in:

  • constant-time lookup in memory
  • a single disk seek

The performance of reads is therefore predictable and efficient.
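As a rough Python sketch (an illustrative stand-in for the C# engine, assuming a [CRC32][KeySize][ValueSize][Key][Value] layout with 4-byte fields), the read path is a dictionary lookup plus one positioned read:

```python
import struct
import zlib

HEADER = struct.Struct("<Iii")  # CRC32, KeySize, ValueSize (assumed widths)

log = bytearray()   # stand-in for the log file
index = {}          # key -> offset of the latest record

def put(key: bytes, value: bytes) -> None:
    body = struct.pack("<ii", len(key), len(value)) + key + value
    index[key] = len(log)
    log.extend(struct.pack("<I", zlib.crc32(body)) + body)

def get(key: bytes):
    offset = index.get(key)            # steps 1-2: O(1) lookup of the offset
    if offset is None:
        return None
    # Step 3: "seek" straight to the record header (a production read
    # would also re-verify the CRC here).
    crc, ksize, vsize = HEADER.unpack_from(log, offset)
    start = offset + HEADER.size + ksize
    return bytes(log[start:start + vsize])   # step 4: read the value
```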

5. Delete Semantics (Tombstones)

Deletes are implemented using tombstones, rather than removing data from disk.

Instead of deleting a record:

[previous value]
[tombstone record]

A tombstone is represented by a record where:

ValueSize = -1

The in-memory index removes the key, but the log retains the history.

This approach:

  • avoids rewriting files
  • maintains write performance
  • simplifies concurrency

However, it introduces the need for periodic compaction.
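A delete, then, is just another append. Here is a hedged Python sketch (the real engine is C#; the encoding details are assumptions of this illustration):

```python
import struct
import zlib

def encode(key: bytes, value: bytes, value_size: int) -> bytes:
    # [CRC32][KeySize][ValueSize][Key][Value]; ValueSize may be -1.
    body = struct.pack("<ii", len(key), value_size) + key + value
    return struct.pack("<I", zlib.crc32(body)) + body

log = bytearray()   # stand-in for the append-only log file
index = {}          # key -> offset of the latest record

def put(key: bytes, value: bytes) -> None:
    index[key] = len(log)
    log.extend(encode(key, value, len(value)))

def delete(key: bytes) -> None:
    # Append a tombstone (ValueSize = -1, empty payload) instead of
    # rewriting anything on disk.
    log.extend(encode(key, b"", -1))
    # The key disappears from the index, but its history stays in the log.
    index.pop(key, None)
```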

6. On-Disk Record Format

Each record is stored in a compact binary format:

[CRC32][KeySize][ValueSize][Key][Value]

Where:

  • CRC32 ensures data integrity
  • KeySize and ValueSize define record boundaries
  • ValueSize = -1 indicates a tombstone

This format enables:

  • corruption detection
  • safe parsing during recovery
  • validation of partial writes

In storage systems, integrity checks are not optional — they are foundational.
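To make the integrity check concrete, here is a small Python illustration (a stand-in for the C# code, with assumed field widths) showing how a single flipped byte is caught by the CRC:

```python
import struct
import zlib

def encode_record(key: bytes, value: bytes) -> bytes:
    """[CRC32][KeySize][ValueSize][Key][Value], checksum over everything after the CRC."""
    body = struct.pack("<ii", len(key), len(value)) + key + value
    return struct.pack("<I", zlib.crc32(body)) + body

def is_valid(record: bytes) -> bool:
    # Recompute the CRC over everything after the stored checksum and compare.
    (stored,) = struct.unpack_from("<I", record, 0)
    return stored == zlib.crc32(record[4:])

record = encode_record(b"user:1", b"alice")
corrupted = bytearray(record)
corrupted[-1] ^= 0xFF     # simulate one flipped byte on disk
```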

7. Memory Index Design

The in-memory index is a simple:

Dictionary<string, long>

It maps each key to its latest offset in the log file.

This design provides:

  • O(1) lookup time
  • minimal overhead
  • simplicity

However, it introduces a key limitation:

The index must fit entirely in memory.

This makes the system memory-bound as the dataset grows.

8. Startup and Recovery

On startup, the system reconstructs its state by scanning the log file:

  1. Read records sequentially
  2. Validate checksums
  3. Update the index for each key

The final state reflects the most recent record for each key.

Advantages:

  • simple and reliable recovery
  • no need for separate metadata

Trade-off:

  • startup time increases with log size
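The recovery scan can be sketched as follows. This is a Python stand-in (not the C# implementation), and one policy it assumes is that the first invalid or truncated record is treated as the logical end of the log:

```python
import struct
import zlib

HEADER = struct.Struct("<Iii")  # CRC32, KeySize, ValueSize (assumed widths)

def encode(key: bytes, value: bytes, vsize: int) -> bytes:
    body = struct.pack("<ii", len(key), vsize) + key + value
    return struct.pack("<I", zlib.crc32(body)) + body

def recover(log: bytes) -> dict:
    """Rebuild the in-memory index by replaying the log from offset 0."""
    index, offset = {}, 0
    while offset + HEADER.size <= len(log):
        crc, ksize, vsize = HEADER.unpack_from(log, offset)
        end = offset + HEADER.size + ksize + max(vsize, 0)  # tombstones carry no payload
        if end > len(log):
            break                                # truncated tail: treat as end of log
        if zlib.crc32(log[offset + 4:end]) != crc:
            break                                # corrupt record: stop replaying
        key = log[offset + HEADER.size:offset + HEADER.size + ksize]
        if vsize == -1:
            index.pop(key, None)                 # tombstone removes the key
        else:
            index[key] = offset                  # later records win
        offset = end
    return index

# Two versions of "a", a "b" that is then tombstoned, plus a torn final write.
sample = (encode(b"a", b"1", 1) + encode(b"a", b"2", 1)
          + encode(b"b", b"9", 1) + encode(b"b", b"", -1) + b"\x00\x07")
```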

9. Observability and Logging

Unlike kernel-level systems, user-space storage engines can easily emit rich, structured, application-level logs.

Typical log events include:

  • write operations with offsets
  • tombstone creation
  • recovery progress
  • checksum failures

Example:

[INFO] Appended key=user:1 at offset=0
[INFO] Tombstone written for key=user:1
[WARN] Checksum mismatch at offset=170

Observability is critical for:

  • debugging
  • detecting corruption
  • understanding system behavior

Without it, diagnosing issues becomes extremely difficult.

10. Failure Scenarios

This design handles several failure modes naturally:

Crash During Write

  • Partial writes can be detected via checksum mismatch
  • Invalid records can be skipped during recovery

Data Corruption

  • CRC32 validation prevents silent data corruption

Power Loss

  • Since writes are append-only, previously written data remains intact

However, durability depends on the underlying filesystem and flush behavior.

11. Performance Characteristics

Strengths

  • high write throughput (sequential I/O)
  • simple and predictable behavior
  • efficient point lookups
  • strong crash recovery guarantees

Limitations

  • unbounded file growth
  • memory-bound index
  • no range queries
  • no transactions or isolation
  • no concurrent write optimization

These limitations reflect deliberate design simplicity.

12. Compaction Strategy

Over time, the log accumulates:

  • outdated values
  • tombstones

Compaction reclaims space:

scan log → keep latest entries → rewrite new file

This reduces:

  • disk usage
  • read amplification

Compaction is a core requirement in all log-structured systems.
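This scan-and-rewrite pipeline can be sketched in Python (again a stand-in for the C# engine; CRC validation is omitted here for brevity, and the record layout is the assumed one from earlier sketches):

```python
import struct
import zlib

HEADER = struct.Struct("<Iii")  # CRC32, KeySize, ValueSize (assumed widths)

def encode(key: bytes, value: bytes, vsize: int) -> bytes:
    body = struct.pack("<ii", len(key), vsize) + key + value
    return struct.pack("<I", zlib.crc32(body)) + body

def compact(log: bytes) -> bytes:
    # Pass 1: replay the old log to find the latest live value per key.
    live, offset = {}, 0
    while offset + HEADER.size <= len(log):
        crc, ksize, vsize = HEADER.unpack_from(log, offset)
        key_start = offset + HEADER.size
        key = log[key_start:key_start + ksize]
        if vsize == -1:
            live.pop(key, None)           # tombstone: drop the key entirely
        else:
            live[key] = log[key_start + ksize:key_start + ksize + vsize]
        offset = key_start + ksize + max(vsize, 0)
    # Pass 2: rewrite a fresh log with exactly one record per live key.
    return b"".join(encode(k, v, len(v)) for k, v in live.items())
```

The output log contains neither stale versions nor tombstones, which is where the disk-space and read-amplification savings come from.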

13. Design Trade-offs

This project highlights several key trade-offs:

| Design Choice      | Benefit       | Cost                |
| ------------------ | ------------- | ------------------- |
| Append-only writes | Fast, simple  | Requires compaction |
| In-memory index    | Fast reads    | Memory usage        |
| Tombstones         | Cheap deletes | Disk growth         |
| Simple format      | Easy recovery | Limited features    |

Storage systems are ultimately about choosing the right trade-offs.

14. Lessons Learned

Building a storage engine from scratch reveals several important insights:

  • Performance is primarily driven by I/O patterns
  • Sequential access is the most important optimization
  • Simplicity reduces failure modes
  • Observability must be designed early
  • Data integrity must be enforced at the lowest level

Most importantly:

Reliable systems are not built by adding features, but by controlling complexity.

Conclusion

KestrelCache is intentionally minimal, but it captures the essence of modern storage engine design.

Even a simple log-structured key-value store demonstrates:

  • how durability is achieved
  • how performance is optimized
  • how systems recover from failure

Understanding these fundamentals is essential for building scalable backend systems and infrastructure.

Top comments (2)

Andre Cytryn

the memory-bound index is the part I'd tackle next. Bitcask addresses this with hint files: compact snapshots written during compaction so startup doesn't require replaying the full log. beyond that, a sparse index combined with bloom filters prevents unnecessary disk seeks for missing keys. did you run into any edge cases during CRC32 validation, specifically around partially flushed records at log boundaries?

Saksham Kapoor

That’s a great point; the memory-bound index is definitely one of the first limitations I’d address next.

Hint files are a really clean solution for reducing startup time without replaying the entire log, and I like how Bitcask keeps it simple by persisting the index snapshot during compaction. I was thinking along similar lines for improving recovery time.

A sparse index + bloom filters also make a lot of sense, especially for avoiding unnecessary disk seeks for non-existent keys. That’s something I haven’t implemented yet, but it would be a natural next step as the dataset grows.

Regarding CRC32 and partially flushed records, yes, I did run into that edge case.

During recovery, I treat any record with:

  • invalid checksum, or
  • incomplete header/body (e.g. truncated write at the end of the file)

as the logical end of the log.

So effectively:

  • valid records are applied
  • the first invalid/truncated record stops the scan

This works reasonably well for crash scenarios where the last write wasn’t fully flushed.

That said, it’s still a fairly basic approach; a more robust design would probably include explicit record framing or length-prefix validation to make boundary detection more reliable.

Appreciate the insights, especially around hint files, that’s something I’m definitely planning to explore next.