Haji Rufai

Posted on Jun 11

I Built an LSM-Tree in Pure Python and Finally Understood How RocksDB and Cassandra Handle Millions of Writes

#database #datascience #python #programming

I have built a write-ahead log and a bloom filter from scratch in the last week. Today I wanted to put them together into something bigger.

So I built an LSM-tree. An LSM-tree is the storage engine behind RocksDB, Cassandra, LevelDB, ScyllaDB, and the metadata layer of a dozen databases you use every day. If you have ever wondered why these systems can swallow millions of writes per second, this is the answer.

The whole thing fits in about 150 lines of pure Python. No dependencies. By the end you will understand exactly why an LSM-tree trades read speed for absurd write speed, and how it claws some of that read speed back.

The problem LSM-trees solve

A B-tree updates data in place. Every write finds the right page on disk and modifies it. That means random I/O, and random I/O on spinning disks or even SSDs is slow.

The LSM-tree makes a different bet. What if you never updated anything in place? What if every write was just an append to the end of a file?

Appends are the fastest thing a disk can do. The catch is that reads become harder, because the latest value for a key could be anywhere. The genius of the LSM-tree is the set of tricks that make those reads fast again.

Here is the shape of the design:

Writes go to an in-memory sorted structure called a memtable.
When the memtable fills up, it gets flushed to disk as a sorted, immutable file called an SSTable.
Reads check the memtable first, then the SSTables newest to oldest.
A background process merges SSTables together to keep reads fast. This is compaction.

Let me build each piece.

The memtable

The memtable is just a sorted dictionary in memory. Writes are O(log n) inserts. I will use a plain dict and sort on flush, which keeps the code simple and is fine for our purposes.

class Memtable:
    def __init__(self):
        self.data = {}
        self.size_bytes = 0

    def put(self, key, value):
        self.data[key] = value
        self.size_bytes += len(key) + len(value)

    def get(self, key):
        return self.data.get(key)

    def items_sorted(self):
        return sorted(self.data.items())

Notice there is no delete method that removes a key. In an LSM-tree you never truly delete. Instead you write a special marker called a tombstone. I will represent it with a sentinel value.

TOMBSTONE = "__tombstone__"

def delete(self, key):
    self.put(key, TOMBSTONE)

Why a tombstone instead of removing the key? Because an older SSTable on disk might still have the real value. If you just dropped the key from memory, an old version would resurface on the next read. The tombstone says "this key is dead" and shadows everything older. Compaction cleans it up later.

The SSTable

When the memtable gets too big, we flush it to disk as an SSTable. The key property is that the entries are sorted by key. Sorted order is what makes everything else efficient.

import json

def flush_memtable(memtable, path):
    with open(path, "w") as f:
        for key, value in memtable.items_sorted():
            f.write(json.dumps([key, value]) + "\n")

That is it. A flat file of sorted key-value pairs, one per line. Real systems use a binary format with a block index, but the principle is identical. Sorted on disk, immutable once written.

Immutability is a feature, not a limitation. Because an SSTable never changes after it is written, there are no locks, no in-place updates, and no torn writes. Concurrent readers can scan it freely while new data lands in fresh files.

Reading from an SSTable

To read a key, we scan the file. Because it is sorted we can stop early once we pass the key we want.

def sstable_get(path, target):
    with open(path) as f:
        for line in f:
            key, value = json.loads(line)
            if key == target:
                return value
            if key > target:
                return None  # sorted, so it cannot appear later
    return None

A linear scan sounds slow, and it is. This is where the bloom filter from my last post earns its keep. Before scanning an SSTable, you ask its bloom filter "could this key be here?" If the answer is no, you skip the entire file without touching disk. Most negative lookups never read a single SSTable byte.

Tying it together: the LSM engine

Now the full engine. Writes hit the memtable. When the memtable crosses a size threshold, we flush it and start a new one. Reads walk from newest data to oldest and stop at the first hit.

import os
import time

class LSMTree:
    def __init__(self, directory, flush_threshold=4096):
        self.directory = directory
        self.flush_threshold = flush_threshold
        self.memtable = Memtable()
        self.sstables = []  # newest last
        os.makedirs(directory, exist_ok=True)

    def put(self, key, value):
        self.memtable.put(key, value)
        if self.memtable.size_bytes >= self.flush_threshold:
            self._flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)

    def _flush(self):
        path = os.path.join(self.directory, f"sstable_{time.time_ns()}.log")
        flush_memtable(self.memtable, path)
        self.sstables.append(path)
        self.memtable = Memtable()

    def get(self, key):
        # 1. memtable is freshest
        value = self.memtable.get(key)
        if value is not None:
            return None if value == TOMBSTONE else value
        # 2. SSTables, newest to oldest
        for path in reversed(self.sstables):
            value = sstable_get(path, key)
            if value is not None:
                return None if value == TOMBSTONE else value
        return None

The ordering in get is the whole game. The memtable holds the most recent writes, so it wins. Then we check SSTables from newest to oldest. The first time we find the key, that is the current value, and we stop. A tombstone counts as a hit and returns None, which is how deletes work across files.

Let me prove it behaves like a real key-value store.

db = LSMTree("./data", flush_threshold=64)

db.put("user:1", "alice")
db.put("user:2", "bob")
db.put("user:1", "alice_updated")  # newer write shadows the old one
db.delete("user:2")                # tombstone

print(db.get("user:1"))  # alice_updated
print(db.get("user:2"))  # None
print(db.get("user:9"))  # None

Even after the memtable flushes to disk and later writes land in newer SSTables, the read path resolves the right version every time, because newer always shadows older.

The cost nobody warns you about: read amplification

Run that engine for a while and you will end up with hundreds of SSTables. A read for a missing key might have to check every one of them. That is read amplification, and it is the dark side of the write-optimized design.

Two things fight it. The bloom filter skips files that cannot contain the key. And compaction merges many small SSTables into fewer big ones.

Compaction

Compaction reads several SSTables, merges them by key keeping only the newest value for each, drops tombstones, and writes one fresh SSTable. Because the inputs are already sorted, this is a k-way merge.

def compact(self):
    merged = {}
    # oldest first so newer values overwrite older ones
    for path in self.sstables:
        with open(path) as f:
            for line in f:
                key, value = json.loads(line)
                merged[key] = value
    # drop tombstones during the final compaction
    clean = {k: v for k, v in merged.items() if v != TOMBSTONE}

    new_path = os.path.join(self.directory, f"sstable_{time.time_ns()}.log")
    tmp = Memtable()
    for k, v in clean.items():
        tmp.put(k, v)
    flush_memtable(tmp, new_path)

    for path in self.sstables:
        os.remove(path)
    self.sstables = [new_path]

After compaction, dozens of files collapse into one, stale versions vanish, and deleted keys are physically gone. Reads get fast again. The price is the I/O spent rewriting data, which is called write amplification. Tuning the balance between read amplification, write amplification, and space amplification is the entire art of running RocksDB or Cassandra in production. The RUM conjecture says you can optimize for at most two of read, update, and memory at once. You never get all three.

Where the durability comes from

You might have noticed a gap. If the process crashes while data is still in the memtable, those writes are gone. The memtable lives in RAM.

This is exactly where the write-ahead log comes in. Before a write touches the memtable, you append it to the WAL on disk. On restart you replay the WAL to rebuild the memtable. The WAL plus the memtable plus SSTables plus bloom filters plus compaction is the complete LSM-tree. Every piece I have built this week was a part of this one machine, and I did not plan it that way when I started.

What this taught me

The LSM-tree is not a clever data structure. It is a clever pipeline. Each component is simple. The memtable is a dict. An SSTable is a sorted file. A bloom filter is a bit array. Compaction is a merge. The power comes from how they compose.

That is the real lesson of building storage engines by hand. The famous systems are not magic. They are a handful of boring ideas wired together with care, and you can build the toy version in an afternoon.

If you want to actually understand the database you depend on, stop reading the docs and build the smallest version that runs. You will never look at RocksDB the same way again.

I am publishing one of these from-scratch builds every day. Follow along if you want to understand the systems under your stack one toy at a time, and tell me in the comments which engine I should tear apart next.

DEV Community