Modern storage systems are built around a small set of fundamental principles: optimize for sequential I/O, minimize in-place mutation, and design for recoverability from failure.
Systems such as Bitcask, Kafka, and LSM-based databases all embrace these ideas in different forms. To better understand these trade-offs at a systems level, I built KestrelCache — a minimal persistent key-value storage engine implemented in C# and .NET.
This article focuses on the core mechanics of the storage engine, including its architecture, data layout, write/read paths, failure handling, and the practical trade-offs that emerge from a log-structured design.
1. Problem Framing
At its core, a storage engine must answer a simple question:
How do we store and retrieve data efficiently while preserving correctness under failure?
Traditional designs rely on in-place updates, but these introduce complexity:
- random disk I/O
- fragmentation
- complicated concurrency control
- difficult crash recovery
KestrelCache instead adopts a log-structured approach, where all writes are appended to disk. This eliminates in-place mutation entirely.
2. System Architecture
The entire system is defined by two components:
In-Memory Index (Dictionary<Key, Offset>)
↓
Append-Only Log File (Disk)
The responsibilities are clearly separated:
- Disk provides durability and write throughput
- Memory provides fast lookup
There are no secondary indexes or complex data structures. The append-only log is the source of truth.
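In C#, this two-component layout amounts to very little code. A minimal sketch (the names are illustrative, not the actual KestrelCache source):

```csharp
using System.Collections.Generic;
using System.IO;

// The two components: an append-only log on disk plus an in-memory offset index.
class KestrelCacheSketch
{
    readonly FileStream log;                          // source of truth
    readonly Dictionary<string, long> index = new();  // key -> latest offset

    public KestrelCacheSketch(string path)
    {
        // Open for append-style writes, but keep random-access reads for Get.
        log = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
    }
}
```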
3. Write Path
The write path (Put) is designed for simplicity and performance:
- Serialize the key and value into bytes
- Compute a CRC32 checksum
- Append the record to the end of the log file
- Update the in-memory index with the new offset
This design guarantees:
- strictly sequential disk writes
- no in-place updates
- inherent crash safety (append-only)
Sequential writes are significantly more efficient than random writes, especially under sustained workloads.
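The four steps above can be sketched in C# as follows; the inline Crc32 helper and the exact framing are illustrative assumptions, not the actual KestrelCache code:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class LogWriter
{
    // Minimal CRC-32 (reflected, polynomial 0xEDB88320), inlined for completeness.
    static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFF;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int i = 0; i < 8; i++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }

    // Appends one record to the log and records its offset in the index.
    public static void Put(FileStream log, Dictionary<string, long> index,
                           string key, byte[] value)
    {
        byte[] keyBytes = Encoding.UTF8.GetBytes(key);

        // Body = [KeySize][ValueSize][Key][Value]; the CRC covers the body.
        using var body = new MemoryStream();
        using (var w = new BinaryWriter(body, Encoding.UTF8, leaveOpen: true))
        {
            w.Write(keyBytes.Length);
            w.Write(value.Length);
            w.Write(keyBytes);
            w.Write(value);
        }
        byte[] bodyBytes = body.ToArray();

        long offset = log.Seek(0, SeekOrigin.End);   // strictly sequential append
        var writer = new BinaryWriter(log);
        writer.Write(Crc32(bodyBytes));              // [CRC32]
        writer.Write(bodyBytes);                     // body
        writer.Flush();

        index[key] = offset;                         // latest offset wins
    }
}
```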
4. Read Path
Reads (Get) avoid scanning the log entirely:
- Look up the key in the in-memory index
- Retrieve the corresponding file offset
- Seek directly to the location on disk
- Read and return the value
This results in:
- constant-time lookup in memory
- a single disk seek
The performance of reads is therefore predictable and efficient.
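A sketch of this read path (field names are illustrative, and the CRC re-check a production engine would perform on read is omitted for brevity):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class LogReader
{
    // Resolves a key via the in-memory index, then performs a single seek + read.
    public static byte[]? Get(FileStream log, Dictionary<string, long> index, string key)
    {
        if (!index.TryGetValue(key, out long offset))
            return null;                      // key unknown (or deleted)

        log.Seek(offset, SeekOrigin.Begin);   // the single disk seek
        var r = new BinaryReader(log);

        r.ReadUInt32();                       // [CRC32] (re-validation omitted here)
        int keySize = r.ReadInt32();          // [KeySize]
        int valueSize = r.ReadInt32();        // [ValueSize]
        if (valueSize < 0) return null;       // tombstone

        r.ReadBytes(keySize);                 // skip [Key]
        return r.ReadBytes(valueSize);        // [Value]
    }
}
```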
5. Delete Semantics (Tombstones)
Deletes are implemented using tombstones, rather than removing data from disk.
Instead of removing the existing record, a new tombstone record is appended after it:
[previous value]
[tombstone record]
A tombstone is represented by a record where:
ValueSize = -1
The in-memory index removes the key, but the log retains the history.
This approach:
- avoids rewriting files
- maintains write performance
- simplifies concurrency
However, it introduces the need for periodic compaction.
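A tombstone write might look like the following sketch; the CRC computation is elided here (a real engine would checksum the body exactly as on the normal write path):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class Tombstones
{
    // Appends a tombstone ([ValueSize] = -1, no value bytes) and drops the key
    // from the in-memory index. The old value stays on disk until compaction.
    public static void Delete(FileStream log, Dictionary<string, long> index, string key)
    {
        byte[] keyBytes = Encoding.UTF8.GetBytes(key);

        log.Seek(0, SeekOrigin.End);
        var w = new BinaryWriter(log);
        w.Write(0u);                 // [CRC32] (computed over the body in the real engine)
        w.Write(keyBytes.Length);    // [KeySize]
        w.Write(-1);                 // [ValueSize] = -1 marks a tombstone
        w.Write(keyBytes);           // [Key]; no value bytes follow
        w.Flush();

        index.Remove(key);
    }
}
```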
6. On-Disk Record Format
Each record is stored in a compact binary format:
[CRC32][KeySize][ValueSize][Key][Value]
Where:
- CRC32 ensures data integrity
- KeySize and ValueSize define record boundaries
- ValueSize = -1 indicates a tombstone
This format enables:
- corruption detection
- safe parsing during recovery
- validation of partial writes
In storage systems, integrity checks are not optional — they are foundational.
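The integrity check itself can be only a few lines. A table-free CRC-32 sketch (a production engine might instead use a table-driven or library implementation such as System.IO.Hashing's Crc32):

```csharp
using System;
using System.Text;

static class Checksum
{
    // Bitwise CRC-32 (IEEE, reflected, polynomial 0xEDB88320).
    public static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFF;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int i = 0; i < 8; i++)
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;                 // final bit inversion
    }
}
```

The standard check value makes it easy to verify: CRC-32 of the ASCII string "123456789" must be 0xCBF43926.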
7. Memory Index Design
The in-memory index is a simple:
Dictionary<string, long>
It maps each key to its latest offset in the log file.
This design provides:
- O(1) lookup time
- minimal overhead
- simplicity
However, it introduces a key limitation:
The index must fit entirely in memory.
This makes the system memory-bound as the dataset grows.
8. Startup and Recovery
On startup, the system reconstructs its state by scanning the log file:
- Read records sequentially
- Validate checksums
- Update the index for each key
The final state reflects the most recent record for each key.
Advantages:
- simple and reliable recovery
- no need for separate metadata
Trade-off:
- startup time increases with log size
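The recovery scan above can be sketched as follows (CRC validation is elided; a real engine would verify each checksum and stop or skip on mismatch):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class Recovery
{
    // Rebuilds the index by scanning the log from the start; a truncated record
    // (crash mid-write) is treated as the logical end of the log.
    public static Dictionary<string, long> RebuildIndex(FileStream log)
    {
        var index = new Dictionary<string, long>();
        var r = new BinaryReader(log);
        log.Seek(0, SeekOrigin.Begin);

        while (log.Position + 12 <= log.Length)       // header = CRC + KeySize + ValueSize
        {
            long offset = log.Position;
            r.ReadUInt32();                           // [CRC32] (validate in the real engine)
            int keySize = r.ReadInt32();
            int valueSize = r.ReadInt32();
            int payload = keySize + Math.Max(valueSize, 0);
            if (log.Position + payload > log.Length)  // partial record: stop here
                break;

            string key = Encoding.UTF8.GetString(r.ReadBytes(keySize));
            if (valueSize >= 0)
            {
                log.Seek(valueSize, SeekOrigin.Current);  // skip value bytes
                index[key] = offset;                      // latest record wins
            }
            else
            {
                index.Remove(key);                        // tombstone
            }
        }
        return index;
    }
}
```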
9. Observability and Logging
Unlike kernel-level systems, user-space storage engines can implement structured logging.
Typical log events include:
- write operations with offsets
- tombstone creation
- recovery progress
- checksum failures
Example:
[INFO] Appended key=user:1 at offset=0
[INFO] Tombstone written for key=user:1
[WARN] Checksum mismatch at offset=170
Observability is critical for:
- debugging
- detecting corruption
- understanding system behavior
Without it, diagnosing issues becomes extremely difficult.
10. Failure Scenarios
This design handles several failure modes naturally:
Crash During Write
- Partial writes can be detected via checksum mismatch
- Invalid records can be skipped during recovery
Data Corruption
- CRC32 validation prevents silent data corruption
Power Loss
- Since writes are append-only, previously written data remains intact
However, durability depends on the underlying filesystem and flush behavior.
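In .NET, that flush behavior is an explicit choice: FileStream.Flush() only empties the stream's own buffer, while Flush(flushToDisk: true) also asks the OS to write its file cache through to the device (the moral equivalent of fsync). A sketch of the trade-off:

```csharp
using System.IO;

static class Durability
{
    // Appends a pre-serialized record, with durability chosen by the caller.
    public static void Append(FileStream log, byte[] record, bool fsync)
    {
        log.Seek(0, SeekOrigin.End);
        log.Write(record, 0, record.Length);
        if (fsync)
            log.Flush(flushToDisk: true);   // durable, but much slower per write
        else
            log.Flush();                    // fast, but data may sit in the OS cache
    }
}
```

Whether every Put pays for a disk flush, or flushes are batched, is one of the central durability-vs-throughput knobs in any log-structured engine.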
11. Performance Characteristics
Strengths
- high write throughput (sequential I/O)
- simple and predictable behavior
- efficient point lookups
- strong crash recovery guarantees
Limitations
- unbounded file growth
- memory-bound index
- no range queries
- no transactions or isolation
- no concurrent write optimization
These limitations reflect deliberate design simplicity.
12. Compaction Strategy
Over time, the log accumulates:
- outdated values
- tombstones
Compaction reclaims space:
scan log → keep latest entries → rewrite new file
This reduces:
- disk usage
- read amplification
Compaction is a core requirement in all log-structured systems.
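The scan → keep → rewrite loop can be sketched by walking the live index and copying each referenced record verbatim into a fresh log (names are illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class Compaction
{
    // Copies only the records the index still points to into a fresh log and
    // returns the new index. The old file can then be atomically swapped out.
    public static Dictionary<string, long> Compact(
        FileStream oldLog, FileStream newLog, Dictionary<string, long> index)
    {
        var newIndex = new Dictionary<string, long>();
        var r = new BinaryReader(oldLog);

        foreach (var (key, offset) in index)
        {
            // Read the header to learn the record's total length.
            oldLog.Seek(offset, SeekOrigin.Begin);
            r.ReadUInt32();                       // [CRC32]
            int keySize = r.ReadInt32();          // [KeySize]
            int valueSize = r.ReadInt32();        // [ValueSize]
            int recordLen = 12 + keySize + Math.Max(valueSize, 0);

            // Copy the record verbatim; no re-serialization needed.
            oldLog.Seek(offset, SeekOrigin.Begin);
            byte[] record = r.ReadBytes(recordLen);

            newIndex[key] = newLog.Position;
            newLog.Write(record, 0, record.Length);
        }
        newLog.Flush();
        return newIndex;
    }
}
```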
13. Design Trade-offs
This project highlights several key trade-offs:
| Design Choice | Benefit | Cost |
|---|---|---|
| Append-only writes | Fast, simple | Requires compaction |
| In-memory index | Fast reads | Memory usage |
| Tombstones | Cheap deletes | Disk growth |
| Simple format | Easy recovery | Limited features |
Storage systems are ultimately about choosing the right trade-offs.
14. Lessons Learned
Building a storage engine from scratch reveals several important insights:
- Performance is primarily driven by I/O patterns
- Sequential access is the most important optimization
- Simplicity reduces failure modes
- Observability must be designed early
- Data integrity must be enforced at the lowest level

Most importantly:
Reliable systems are not built by adding features, but by controlling complexity.
Conclusion
KestrelCache is intentionally minimal, but it captures the essence of modern storage engine design.
Even a simple log-structured key-value store demonstrates:
- how durability is achieved
- how performance is optimized
- how systems recover from failure
Understanding these fundamentals is essential for building scalable backend systems and infrastructure.
Comments
The memory-bound index is the part I'd tackle next. Bitcask addresses this with hint files: compact snapshots written during compaction, so startup doesn't require replaying the full log. Beyond that, a sparse index combined with Bloom filters prevents unnecessary disk seeks for missing keys. Did you run into any edge cases during CRC32 validation, specifically around partially flushed records at log boundaries?
That's a great point; the memory-bound index is definitely one of the first limitations I'd address next.
Hint files are a really clean solution for reducing startup time without replaying the entire log, and I like how Bitcask keeps it simple by persisting the index snapshot during compaction. I was thinking along similar lines for improving recovery time.
A sparse index plus Bloom filters also makes a lot of sense, especially for avoiding unnecessary disk seeks for non-existent keys. That's something I haven't implemented yet, but it would be a natural next step as the dataset grows.
Regarding CRC32 and partially flushed records, yes, I did run into that edge case.
During recovery, I treat the first record that fails its CRC check (or can't be read in full) as the logical end of the log.
So effectively, recovery stops at that point and ignores any bytes after it.
This works reasonably well for crash scenarios where the last write wasn’t fully flushed.
That said, it's still a fairly basic approach; a more robust design would probably include explicit record framing or length-prefix validation to make boundary detection more reliable.
Appreciate the insights, especially around hint files. That's something I'm definitely planning to explore next.