geek-electro

Posted on May 20

I built a storage engine from scratch. Here’s everything I learned.

#systemdesign #backenddevelopment #database

I built a storage engine from scratch. Here's everything I learned.

Not a wrapper. Not a library call. A real, working storage engine — written in C++, exposed over gRPC, running inside Docker. This is the story of how I built it and why.

The problem I was trying to solve

I needed to store structured data in a very specific shape:

A document (HTML, text, anything)
A chain of inputs attached to that document
An expected output for each input

Simple enough, right? But every database I looked at felt wrong.

A relational database wanted me to define a schema upfront and write JOIN queries just to read a document with its inputs. A key-value store was too flat — I'd have to model the relationships myself. A document database handled the top level fine but got awkward when I needed an ordered, linked chain of sub-documents.

So I built exactly what I needed. Nothing more.

What is Stratum?

Stratum is a log-structured, hierarchical storage engine with O(1) lookup, built in C++ and exposed over gRPC so any language can talk to it.

Here's the data model:

Document ──→ Input node 1 ──→ Input node 2 ──→ Input node 3
                  │                 │                 │
                  ▼                 ▼                 ▼
             Output node 1    Output node 2    Output node 3

Every document lives at a known byte offset on disk. An in-memory hash map holds document_id → byte_offset. Every read is one hash map lookup (O(1)) plus one disk seek. No indexes to maintain, no query planner, no surprises.

How it actually works under the hood

Log-structured writes

Every write — insert, update, delete — is an append to the end of the active segment file. Nothing is ever mutated in place.

When you update a document, the old version stays on disk. The new version is appended. The in-memory index is updated to point at the new offset. The old version becomes garbage.

This means writes are always O(1) and you never get partial writes corrupting your data.

Segment rotation

When the active segment grows beyond a configured threshold, it gets sealed — renamed to seg_NNN.seg — and a fresh active.seg is opened. You end up with a stack of segments on disk, which is where the name comes from. Like geological strata.

Background compaction

A background C++ thread wakes periodically and checks total segment size. When it crosses a threshold, it:

Scans all sealed segments
For every record ID, keeps only the version with the highest timestamp
Discards tombstoned (deleted) records entirely
Writes a single merged.seg
Atomically replaces the old segments

Storage only grows proportionally to live data — not to total write history. This is the same strategy used by Bitcask and LSM-tree databases like RocksDB, just much simpler.

Thread safety

Reads use a std::shared_mutex — any number of readers can run concurrently. Writes take an exclusive lock only to update the in-memory index. The disk append itself is serialized by a per-segment mutex.

The architecture

Your Python/Go/Node app
        │
        │  gRPC (auto-generated client from a single .proto file)
        ▼
  Stratum server (C++, always running)
        │
        ├── In-memory hash index  ← O(1) lookups
        ├── Segment manager       ← append-only log files
        └── Compactor thread      ← background GC
        │
        ▼
  Disk (log-structured .seg files)

The gRPC part is what makes this useful beyond a single project. You write the .proto file once, run protoc, and get a client in Python, Go, Java, Node, Rust — any language gRPC supports. The C++ engine doesn't care who's calling it.

The data types problem

One thing I underestimated early on: input and output nodes can hold anything. An integer. A string. A list of integers. A list of strings. A map.

I ended up using std::variant in C++ to represent this:

using ValueVariant = std::variant<
    int64_t,
    double,
    std::string,
    std::vector<int64_t>,
    std::vector<std::string>,
    std::unordered_map<std::string, std::string>
>;

On the wire (gRPC), this becomes a oneof in the proto definition:

message Value {
    oneof kind {
        int64      int_val    = 1;
        double     double_val = 2;
        string     str_val    = 3;
        Int64List  vec_int    = 4;
        StringList vec_str    = 5;
        StringMap  map_val    = 6;
    }
}

The Python client wraps this so you never think about it — you just pass a Python int, list, or dict and the client figures out the right proto type.

What calling it from Python looks like

Once the Docker container is running, this is the entire client-side API:

from lse_client import LseClient

lse = LseClient("localhost:50051")

# Create a document
doc_id = lse.create_problem(
    "<h1>My document</h1>",
    category="tutorial",
    version="1.0"
)

# Add input nodes — any data type
node1 = lse.add_test_case(doc_id, [2, 7, 11, 15])
node2 = lse.add_test_case(doc_id, "some string")
node3 = lse.add_test_case(doc_id, {"key": "value"})

# Attach output nodes
lse.set_expected_output(doc_id, node1, [0, 1])

# Read everything back
doc   = lse.get_problem_html(doc_id)
nodes = lse.get_all_test_cases(doc_id)
out   = lse.get_expected_output(doc_id, node1)

No HTTP, no JSON parsing, no URL construction. Just function calls. The generated gRPC stub handles serialization, connection management, and error handling.

Running it

The engine ships as a Docker container. Starting it is one command:

docker-compose up --build

That's it. The server starts on port 50051. Data is written to a named Docker volume so it survives container restarts and rebuilds. To stop: docker-compose down. Data stays. To wipe everything: docker-compose down -v.

What I learned building this

1. Append-only is underrated. The moment I stopped trying to update records in place, the entire concurrency problem got simpler. Readers and writers can't conflict on the same byte offset because writers never touch old offsets.

2. Separate the index from the data. The in-memory hash index is tiny — just id → offset. It loads from disk on startup in milliseconds. The actual data — potentially gigabytes of blobs — never touches RAM until you ask for it. This is the core insight behind Bitcask.

3. gRPC over a custom protocol every time. I briefly considered rolling a custom TCP protocol. The moment I saw what a .proto file + protoc gives you — a typed, versioned, language-agnostic API with generated clients in every language — there was no reason to do anything else.

4. The compactor is the hardest part. Not because the algorithm is complex, but because it has to be correct under concurrent access. Getting the rename-then-reload sequence right — so readers never see a half-compacted state — took more iteration than anything else in the codebase.

5. CRC on every record. I added CRC-32 checksums on every record header from day one. It caught two bugs during development that would have been nearly impossible to find otherwise. Storage engines fail silently at the byte level. Checksums surface those failures immediately.

What's next

Stratum is open source. You can use it for anything that fits the document → inputs → outputs model. I'm actively working on:

A write-ahead log (WAL) for crash recovery
Snapshot / backup support
Benchmarks against SQLite for this specific access pattern
A Rust client

If you build something with it, I'd genuinely love to know.

DEV Community