AGP Marka
Distributed Database Internals: The Engineering Behind Log-Structured Merge (LSM) Trees

In the world of high-performance distributed databases like Cassandra, ScyllaDB, and RocksDB, the traditional B-Tree architecture often hits a wall. While B-Trees are excellent for read-heavy workloads, they struggle with high-velocity write traffic due to random I/O and page fragmentation.

The industry's answer to this 'write problem' is the Log-Structured Merge (LSM) Tree. This architecture transforms random writes into sequential writes, allowing databases to ingest millions of records per second with minimal latency. In this deep-dive, we will explore the internals of how LSM trees work, why they are so fast, and the trade-offs they make.

1. The Write Path: Sequential is King

The fundamental principle of an LSM tree is that appending to a log is almost always faster than updating a page in a B-Tree. Instead of modifying data in place, an LSM tree treats every write as an 'upsert': it simply appends the new data to a log.

The Three Core Components

  1. Write-Ahead Log (WAL): A persistent append-only log on disk. If the server crashes, the WAL is used to reconstruct the in-memory data.
  2. MemTable: An in-memory data structure (typically a SkipList or a Balanced Tree) that stores incoming writes in sorted order.
  3. Sorted String Tables (SSTables): Once the MemTable reaches a certain size, it is 'flushed' to disk as an immutable, sorted file.
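Putting the three components together, the write path can be sketched in a few lines of Python. The WAL line format and the per-write `fsync` here are simplifying assumptions; real engines batch writes and group-commit the log.

```python
import os

def lsm_put(key, value, memtable, wal_path='wal.log'):
    """Sketch of the LSM write path: durable WAL append first,
    then an in-memory MemTable insert."""
    # 1. Append the write to the WAL and fsync so it survives a crash
    with open(wal_path, 'a') as wal:
        wal.write(f'{key}\t{value}\n')
        wal.flush()
        os.fsync(wal.fileno())
    # 2. Apply the write to the MemTable (the read path sees it immediately)
    memtable[key] = value
```

On restart, replaying the WAL line by line rebuilds exactly this MemTable state.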

LSM Write Path

2. Deep Dive: MemTable Flushes and SSTable Immutability

When the MemTable is full, the database starts a background thread to write its contents to disk. Because the MemTable is already sorted in memory, the resulting SSTable is written sequentially. This is a critical performance win: sequential disk I/O is orders of magnitude faster than random I/O, even on modern NVMe drives.

Once an SSTable is written, it is immutable. It is never changed. If a user updates a key, a new version of that key is written to a new SSTable. This eliminates the need for complex locking mechanisms and page splits found in B-Trees.
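What immutability means for the read path can be shown with a small sketch, assuming each SSTable is a sorted list of `(key, value)` pairs: an update lands in a newer table, and lookups scan tables newest-first with a binary search inside each one.

```python
import bisect

def sstable_get(sstables, key):
    """Look up a key across immutable SSTables, newest first.

    Each SSTable is a sorted list of (key, value) pairs, so a
    binary search finds the key without scanning the whole table.
    """
    for table in reversed(sstables):  # newest version of a key wins
        keys = [k for k, _ in table]
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return table[i][1]
    return None

# An 'update' never touches the old SSTable; it lands in a newer one
old = [('a', 1), ('b', 2)]
new = [('b', 99)]  # newer version of 'b'
assert sstable_get([old, new], 'b') == 99
assert sstable_get([old, new], 'a') == 1
```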

3. The Challenge: Read Amplification and Compaction

If data is spread across dozens of immutable SSTables, how do we find a specific key? We have to check the MemTable first, and then check every SSTable from newest to oldest. This is called Read Amplification.
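Real engines soften this cost with a per-SSTable Bloom filter, which the read path consults before touching a file: it can return false positives but never false negatives, so a `False` answer lets the reader skip that SSTable entirely. A minimal sketch (the sizes and hash scheme are illustrative, not what any particular engine uses):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: a bit array plus k derived hash functions."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int doubles as an arbitrary-size bit array

    def _positions(self, key):
        # Derive k positions by salting the key with the hash index
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f'{i}:{key}'.encode()).digest()
            yield int.from_bytes(digest[:8], 'big') % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_contain(self, key):
        # False means 'definitely absent'; True means 'go check the SSTable'
        return all(self.bits >> pos & 1 for pos in self._positions(key))
```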

To solve this, LSM trees use a process called Compaction. Compaction merges multiple SSTables into a single, larger SSTable, discarding old versions of keys and deleted records (tombstones).

LSM Compaction Strategy

Leveled vs. Size-Tiered Compaction

  • Size-Tiered Compaction Strategy (STCS): Good for write-heavy workloads (Cassandra default). It groups SSTables of similar sizes together and merges them.
  • Leveled Compaction Strategy (LCS): Good for read-heavy workloads (RocksDB/ScyllaDB). It organizes SSTables into hierarchical levels, ensuring that each level contains non-overlapping keys.
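Whatever the strategy, the merge step at the heart of compaction is a k-way merge of sorted runs. A sketch using `heapq.merge`, assuming the input SSTables are sorted lists ordered oldest to newest and deletions are marked with a tombstone value:

```python
import heapq

TOMBSTONE = object()  # marker written in place of a deleted value

def compact(sstables):
    """Merge several SSTables into one, keeping only the newest
    version of each key and discarding tombstones.

    `sstables` is ordered oldest to newest. Tagging each entry with a
    negated sequence number makes the newest version of a key sort
    first, so it is the one we keep.
    """
    tagged = [
        [(key, -seq, value) for key, value in table]
        for seq, table in enumerate(sstables)
    ]
    merged, last_key = [], None
    for key, _neg_seq, value in heapq.merge(*tagged):
        if key == last_key:
            continue  # an older version of a key we already handled
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged
```

Note that a deleted key vanishes entirely from the output, which is exactly why tombstones must survive until compaction has seen every older version.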

4. Engineering Implementation: A Simple MemTable in Python

To understand the logic, let's look at a simplified implementation of a MemTable using a Python dictionary (sorted only at flush time; a real engine keeps the MemTable continuously sorted, e.g. with a skip list) and a simulated flush trigger.

import time

class LSMStore:
    def __init__(self, memtable_limit=1000):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstables = []  # list of (filename, sorted key/value pairs)

    def put(self, key, value):
        # 1. In a real DB, we'd append to the WAL first for durability
        self.memtable[key] = value

        # 2. Flush once the MemTable reaches its size limit
        if len(self.memtable) >= self.memtable_limit:
            self.flush_to_sstable()

    def flush_to_sstable(self):
        filename = f'sstable_{int(time.time())}.db'
        # Sort the MemTable; a real engine writes this sequentially to disk
        sorted_data = sorted(self.memtable.items())
        print(f'[*] Flushing {len(sorted_data)} keys to {filename}')

        # Keep the immutable SSTable (simulated in memory here)
        self.sstables.append((filename, sorted_data))

        # Clear the MemTable for new writes
        self.memtable = {}

    def get(self, key):
        # Check the MemTable first: it always holds the newest versions
        if key in self.memtable:
            return self.memtable[key]

        # Check SSTables from newest to oldest
        for _, data in reversed(self.sstables):
            # In a real DB, we'd consult a Bloom filter here to skip the
            # file, then binary-search it; a linear scan keeps this simple
            for k, v in data:
                if k == key:
                    return v
        return None

5. Performance Comparison: LSM vs. B-Tree

When choosing a storage engine, the decision usually boils down to the RUM Conjecture (Read, Update, Memory overhead).

| Feature | B-Tree (PostgreSQL/MySQL) | LSM Tree (RocksDB/Cassandra) |
| --- | --- | --- |
| Write Throughput | Lower (random I/O) | Ultra-high (sequential I/O) |
| Read Throughput | Very high | Moderate (read amplification) |
| Space Efficiency | Lower (page fragmentation) | High (compressed SSTables) |
| Write Amplification | Moderate | High (due to compaction) |

6. Real-World Applications

LSM trees are the engine behind the world's most scalable data platforms:

  • Apache Cassandra: Uses LSM trees to provide high availability and write performance for massive datasets.
  • RocksDB: Facebook's high-performance embeddable key-value store, which many other databases (like CockroachDB and TiDB) use as their underlying storage engine.
  • ScyllaDB: A C++ rewrite of Cassandra that uses advanced Leveled Compaction to minimize tail latency.

Final Thoughts

The Log-Structured Merge Tree is a masterpiece of systems engineering. By accepting the cost of background compaction, it unlocks a level of write performance that B-Trees simply cannot match. If your application needs to ingest telemetry data, logs, or real-time event streams at scale, understanding the LSM tree is not just useful—it's essential.
