Saksham Kapoor
I Built a Key-Value Database from Scratch — Here’s What I Learned

Modern storage systems are built around a small set of fundamental principles: optimize for sequential I/O, minimize in-place mutation, and design for recoverability from failure.

Systems such as Bitcask, Kafka, and LSM-based databases all embrace these ideas in different forms. To better understand these trade-offs at a systems level, I built KestrelCache — a minimal persistent key-value storage engine implemented in C# and .NET.

This article focuses on the core mechanics of the storage engine, including its architecture, data layout, write/read paths, failure handling, and the practical trade-offs that emerge from a log-structured design.

1. Problem Framing

At its core, a storage engine must answer a simple question:

How do we store and retrieve data efficiently while preserving correctness under failure?

Traditional designs rely on in-place updates, but these introduce complexity:

  • random disk I/O
  • fragmentation
  • complicated concurrency control
  • difficult crash recovery

KestrelCache instead adopts a log-structured approach, where all writes are appended to disk. This eliminates in-place mutation entirely.

2. System Architecture

The entire system is defined by two components:

In-Memory Index (HashMap<Key, Offset>)
                ↓
Append-Only Log File (Disk)

The responsibilities are clearly separated:

  • Disk provides durability and write throughput
  • Memory provides fast lookup

There are no secondary indexes or complex data structures. The append-only log is the source of truth.

3. Write Path

The write path (Put) is designed for simplicity and performance:

  1. Serialize the key and value into bytes
  2. Compute a CRC32 checksum
  3. Append the record to the end of the log file
  4. Update the in-memory index with the new offset

This design guarantees:

  • strictly sequential disk writes
  • no in-place updates
  • inherent crash safety (append-only)

Sequential writes are significantly more efficient than random writes, especially under sustained workloads.
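The steps above can be sketched in a few lines. This is a minimal Python stand-in (the actual engine is C#), and the byte layout, field widths, and helper names here are illustrative assumptions rather than KestrelCache's exact format:

```python
import struct
import zlib

def encode_record(key: bytes, value: bytes) -> bytes:
    """Serialize one record as [CRC32][KeySize][ValueSize][Key][Value]."""
    # The checksum covers sizes + payload, so a corrupted header is caught too.
    body = struct.pack("<ii", len(key), len(value)) + key + value
    return struct.pack("<I", zlib.crc32(body)) + body

log = bytearray()   # stand-in for the append-only log file
index = {}          # in-memory index: key -> offset of the latest record

def put(key: bytes, value: bytes) -> int:
    offset = len(log)                  # step 3: append at the current end of file
    log.extend(encode_record(key, value))
    index[key] = offset                # step 4: index points at the newest version
    return offset
```

Note that updating a key never touches the old record; it appends a new one and moves the index pointer, which is exactly why every disk write is sequential.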

4. Read Path

Reads (Get) avoid scanning the log entirely:

  1. Look up the key in the in-memory index
  2. Retrieve the corresponding file offset
  3. Seek directly to the location on disk
  4. Read and return the value

This results in:

  • constant-time lookup in memory
  • a single disk seek

The performance of reads is therefore predictable and efficient.
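As a rough Python sketch (an illustrative stand-in for the C# engine, assuming a [CRC32][KeySize][ValueSize][Key][Value] layout with 4-byte fields), the read path is a dictionary lookup plus one positioned read:

```python
import struct
import zlib

HEADER = struct.Struct("<Iii")  # CRC32, KeySize, ValueSize (assumed widths)

log = bytearray()   # stand-in for the log file
index = {}          # key -> offset of the latest record

def put(key: bytes, value: bytes) -> None:
    body = struct.pack("<ii", len(key), len(value)) + key + value
    index[key] = len(log)
    log.extend(struct.pack("<I", zlib.crc32(body)) + body)

def get(key: bytes):
    offset = index.get(key)            # steps 1-2: O(1) lookup of the offset
    if offset is None:
        return None
    # Step 3: "seek" straight to the record header (a production read
    # would also re-verify the CRC here).
    crc, ksize, vsize = HEADER.unpack_from(log, offset)
    start = offset + HEADER.size + ksize
    return bytes(log[start:start + vsize])   # step 4: read the value
```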

5. Delete Semantics (Tombstones)

Deletes are implemented using tombstones, rather than removing data from disk.

Instead of deleting a record:

[previous value]
[tombstone record]

A tombstone is represented by a record where:

ValueSize = -1

The in-memory index removes the key, but the log retains the history.

This approach:

  • avoids rewriting files
  • maintains write performance
  • simplifies concurrency

However, it introduces the need for periodic compaction.
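A delete, then, is just another append. Here is a hedged Python sketch (the real engine is C#; the encoding details are assumptions of this illustration):

```python
import struct
import zlib

def encode(key: bytes, value: bytes, value_size: int) -> bytes:
    # [CRC32][KeySize][ValueSize][Key][Value]; ValueSize may be -1.
    body = struct.pack("<ii", len(key), value_size) + key + value
    return struct.pack("<I", zlib.crc32(body)) + body

log = bytearray()   # stand-in for the append-only log file
index = {}          # key -> offset of the latest record

def put(key: bytes, value: bytes) -> None:
    index[key] = len(log)
    log.extend(encode(key, value, len(value)))

def delete(key: bytes) -> None:
    # Append a tombstone (ValueSize = -1, empty payload) instead of
    # rewriting anything on disk.
    log.extend(encode(key, b"", -1))
    # The key disappears from the index, but its history stays in the log.
    index.pop(key, None)
```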

6. On-Disk Record Format

Each record is stored in a compact binary format:

[CRC32][KeySize][ValueSize][Key][Value]

Where:

  • CRC32 ensures data integrity
  • KeySize and ValueSize define record boundaries
  • ValueSize = -1 indicates a tombstone

This format enables:

  • corruption detection
  • safe parsing during recovery
  • validation of partial writes

In storage systems, integrity checks are not optional — they are foundational.
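To make the integrity check concrete, here is a small Python illustration (a stand-in for the C# code, with assumed field widths) showing how a single flipped byte is caught by the CRC:

```python
import struct
import zlib

def encode_record(key: bytes, value: bytes) -> bytes:
    """[CRC32][KeySize][ValueSize][Key][Value], checksum over everything after the CRC."""
    body = struct.pack("<ii", len(key), len(value)) + key + value
    return struct.pack("<I", zlib.crc32(body)) + body

def is_valid(record: bytes) -> bool:
    # Recompute the CRC over everything after the stored checksum and compare.
    (stored,) = struct.unpack_from("<I", record, 0)
    return stored == zlib.crc32(record[4:])

record = encode_record(b"user:1", b"alice")
corrupted = bytearray(record)
corrupted[-1] ^= 0xFF     # simulate one flipped byte on disk
```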

7. Memory Index Design

The in-memory index is a simple:

Dictionary<string, long>

It maps each key to its latest offset in the log file.

This design provides:

  • O(1) lookup time
  • minimal overhead
  • simplicity

However, it introduces a key limitation:

The index must fit entirely in memory.

This makes the system memory-bound as the dataset grows.

8. Startup and Recovery

On startup, the system reconstructs its state by scanning the log file:

  1. Read records sequentially
  2. Validate checksums
  3. Update the index for each key

The final state reflects the most recent record for each key.

Advantages:

  • simple and reliable recovery
  • no need for separate metadata

Trade-off:

  • startup time increases with log size
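The recovery scan can be sketched as follows. This is a Python stand-in (not the C# implementation), and one policy it assumes is that the first invalid or truncated record is treated as the logical end of the log:

```python
import struct
import zlib

HEADER = struct.Struct("<Iii")  # CRC32, KeySize, ValueSize (assumed widths)

def encode(key: bytes, value: bytes, vsize: int) -> bytes:
    body = struct.pack("<ii", len(key), vsize) + key + value
    return struct.pack("<I", zlib.crc32(body)) + body

def recover(log: bytes) -> dict:
    """Rebuild the in-memory index by replaying the log from offset 0."""
    index, offset = {}, 0
    while offset + HEADER.size <= len(log):
        crc, ksize, vsize = HEADER.unpack_from(log, offset)
        end = offset + HEADER.size + ksize + max(vsize, 0)  # tombstones carry no payload
        if end > len(log):
            break                                # truncated tail: treat as end of log
        if zlib.crc32(log[offset + 4:end]) != crc:
            break                                # corrupt record: stop replaying
        key = log[offset + HEADER.size:offset + HEADER.size + ksize]
        if vsize == -1:
            index.pop(key, None)                 # tombstone removes the key
        else:
            index[key] = offset                  # later records win
        offset = end
    return index

# Two versions of "a", a "b" that is then tombstoned, plus a torn final write.
sample = (encode(b"a", b"1", 1) + encode(b"a", b"2", 1)
          + encode(b"b", b"9", 1) + encode(b"b", b"", -1) + b"\x00\x07")
```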

9. Observability and Logging

Unlike kernel-level systems, user-space storage engines can easily emit rich, structured, application-level logs.

Typical log events include:

  • write operations with offsets
  • tombstone creation
  • recovery progress
  • checksum failures

Example:

[INFO] Appended key=user:1 at offset=0
[INFO] Tombstone written for key=user:1
[WARN] Checksum mismatch at offset=170

Observability is critical for:

  • debugging
  • detecting corruption
  • understanding system behavior

Without it, diagnosing issues becomes extremely difficult.

10. Failure Scenarios

This design handles several failure modes naturally:

Crash During Write

  • Partial writes can be detected via checksum mismatch
  • Invalid records can be skipped during recovery

Data Corruption

  • CRC32 validation prevents silent data corruption

Power Loss

  • Since writes are append-only, previously written data remains intact

However, durability depends on the underlying filesystem and flush behavior.

11. Performance Characteristics

Strengths

  • high write throughput (sequential I/O)
  • simple and predictable behavior
  • efficient point lookups
  • strong crash recovery guarantees

Limitations

  • unbounded file growth
  • memory-bound index
  • no range queries
  • no transactions or isolation
  • no concurrent write optimization

These limitations reflect deliberate design simplicity.

12. Compaction Strategy

Over time, the log accumulates:

  • outdated values
  • tombstones

Compaction reclaims space:

scan log → keep latest entries → rewrite new file

This reduces:

  • disk usage
  • read amplification

Compaction is a core requirement in all log-structured systems.
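This scan-and-rewrite pipeline can be sketched in Python (again a stand-in for the C# engine; CRC validation is omitted here for brevity, and the record layout is the assumed one from earlier sketches):

```python
import struct
import zlib

HEADER = struct.Struct("<Iii")  # CRC32, KeySize, ValueSize (assumed widths)

def encode(key: bytes, value: bytes, vsize: int) -> bytes:
    body = struct.pack("<ii", len(key), vsize) + key + value
    return struct.pack("<I", zlib.crc32(body)) + body

def compact(log: bytes) -> bytes:
    # Pass 1: replay the old log to find the latest live value per key.
    live, offset = {}, 0
    while offset + HEADER.size <= len(log):
        crc, ksize, vsize = HEADER.unpack_from(log, offset)
        key_start = offset + HEADER.size
        key = log[key_start:key_start + ksize]
        if vsize == -1:
            live.pop(key, None)           # tombstone: drop the key entirely
        else:
            live[key] = log[key_start + ksize:key_start + ksize + vsize]
        offset = key_start + ksize + max(vsize, 0)
    # Pass 2: rewrite a fresh log with exactly one record per live key.
    return b"".join(encode(k, v, len(v)) for k, v in live.items())
```

The output log contains neither stale versions nor tombstones, which is where the disk-space and read-amplification savings come from.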

13. Design Trade-offs

This project highlights several key trade-offs:

| Design Choice      | Benefit       | Cost                |
| ------------------ | ------------- | ------------------- |
| Append-only writes | Fast, simple  | Requires compaction |
| In-memory index    | Fast reads    | Memory usage        |
| Tombstones         | Cheap deletes | Disk growth         |
| Simple format      | Easy recovery | Limited features    |

Storage systems are ultimately about choosing the right trade-offs.

14. Lessons Learned

Building a storage engine from scratch reveals several important insights:

  • Performance is primarily driven by I/O patterns
  • Sequential access is the most important optimization
  • Simplicity reduces failure modes
  • Observability must be designed early
  • Data integrity must be enforced at the lowest level

Most importantly:

Reliable systems are not built by adding features, but by controlling complexity.

Conclusion

KestrelCache is intentionally minimal, but it captures the essence of modern storage engine design.

Even a simple log-structured key-value store demonstrates:

  • how durability is achieved
  • how performance is optimized
  • how systems recover from failure

Understanding these fundamentals is essential for building scalable backend systems and infrastructure.

Top comments (2)

Andre Cytryn

the memory-bound index is the part I'd tackle next. Bitcask addresses this with hint files: compact snapshots written during compaction so startup doesn't require replaying the full log. beyond that, a sparse index combined with bloom filters prevents unnecessary disk seeks for missing keys. did you run into any edge cases during CRC32 validation, specifically around partially flushed records at log boundaries?

Saksham Kapoor

That’s a great point; the memory-bound index is definitely one of the first limitations I’d address next.

Hint files are a really clean solution for reducing startup time without replaying the entire log, and I like how Bitcask keeps it simple by persisting the index snapshot during compaction. I was thinking along similar lines for improving recovery time.

A sparse index + bloom filters also make a lot of sense, especially for avoiding unnecessary disk seeks for non-existent keys. That’s something I haven’t implemented yet, but it would be a natural next step as the dataset grows.

Regarding CRC32 and partially flushed records, yes, I did run into that edge case.

During recovery, I treat any record with:

  • invalid checksum, or
  • incomplete header/body (e.g. truncated write at the end of the file)

as the logical end of the log.

So effectively:

  • valid records are applied
  • the first invalid/truncated record stops the scan

This works reasonably well for crash scenarios where the last write wasn’t fully flushed.

That said, it’s still a fairly basic approach; a more robust design would probably include explicit record framing or length-prefix validation to make boundary detection more reliable.

Appreciate the insights, especially around hint files, that’s something I’m definitely planning to explore next.