"Choosing between LSM Trees and B-Trees dictates the throughput ceiling of your write-heavy or read-heavy workload."
What We're Building
We are analyzing the fundamental trade-offs between two dominant key-value storage paradigms. The goal is not to declare one superior, but to understand the architectural implications of each. This comparison focuses on write amplification, read latency, and disk seek patterns. We will examine how these structures handle concurrent writes and sequential reads, providing a decision framework for engineering teams selecting a persistence layer.
Step 1 — B-Tree Random Access Optimization
B-Trees enforce a balanced height and sorted order, so insertion, deletion, and lookup all run in O(log n) time. Reads are fast because any key is reachable in a small, predictable number of disk seeks. The cost is on the write side: keeping the tree balanced means updating pages in place, and every node split triggers random writes to disk, which caps throughput under heavy write load.
/// A simplified B-Tree node: keys stay sorted, values pair with keys,
/// and children sit between keys so any lookup takes O(log n) steps.
struct BTreeNode<V> {
    keys: Vec<u32>,                   // sorted separator keys
    values: Vec<V>,                   // values[i] belongs to keys[i]
    children: Vec<Box<BTreeNode<V>>>, // empty in a leaf node
}
This layout keeps each point lookup to a small, predictable number of seeks, but splitting a full node rewrites pages in place, which is where B-Tree write amplification and random write I/O come from.
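To make the O(log n) lookup concrete, here is a minimal search sketch over the node above. It is illustrative only and omits the split and rebalance logic a real B-Tree needs.

impl<V> BTreeNode<V> {
    /// Binary-search each node and descend one level per miss,
    /// giving O(log n) work per lookup overall.
    fn search(&self, key: u32) -> Option<&V> {
        match self.keys.binary_search(&key) {
            Ok(i) => Some(&self.values[i]), // key lives in this node
            Err(i) => self
                .children
                .get(i)                     // child covering the gap
                .and_then(|child| child.search(key)),
        }
    }
}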
Step 2 — LSM Memtable Buffering
LSM Trees separate mutable memory from immutable storage to optimize write performance. Incoming writes land in an in-memory sorted structure called a Memtable. Once the Memtable reaches a size threshold, it is flushed to disk as an immutable Sorted String Table (SSTable). This buffering lets the system absorb bursts of writes at memory speed, deferring disk I/O until the flush turns them into a single sequential write.
use std::collections::BTreeMap;

/// Destination for flushed entries, e.g. an SSTable file writer.
trait SstWriter<K, V> {
    fn write(&mut self, entries: &BTreeMap<K, V>);
}

/// In-memory write buffer: entries stay sorted, so a flush is one
/// sequential write of already-ordered data.
struct Memtable<K: Ord, V> {
    entries: BTreeMap<K, V>,
    max_entries: usize,
    sst: Box<dyn SstWriter<K, V>>, // where flushed entries go
}

impl<K: Ord, V> Memtable<K, V> {
    /// Flush once the buffer hits its size threshold.
    pub fn flush(&mut self) {
        if self.entries.len() >= self.max_entries {
            self.sst.write(&self.entries); // one sequential disk write
            self.entries.clear();          // start a fresh buffer
        }
    }
}
Buffering in memory converts many small random writes into one large sequential flush, which is where the throughput gain comes from.
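The write path that fills this buffer is not shown above; a hypothetical `put` method (name and shape assumed, not from the original snippet) would simply insert into the sorted map and run the flush check:

impl<K: Ord, V> Memtable<K, V> {
    /// Hypothetical write path: a pure in-memory insert, followed by
    /// a flush check so the buffer never grows past its threshold.
    pub fn put(&mut self, key: K, value: V) {
        self.entries.insert(key, value); // no disk I/O here
        self.flush();                    // flushes only when full
    }
}

Each call costs an O(log n) insert into the in-memory map, which is why write throughput stays high regardless of key order.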
Step 3 — Compaction Lifecycle
The disk eventually contains multiple SSTables with overlapping keys. A compaction process merges these sorted files into larger, more compact files. This process is critical for space reclamation and read efficiency. It involves scanning multiple sorted files, removing duplicates, and writing a new file. Over time, this reduces file fragmentation and ensures that sequential reads hit contiguous blocks of data.
Memtable -> SSTable1
Memtable -> SSTable2
Compaction: SSTable1 + SSTable2 -> New SSTable
Over time, smaller files merge into larger ones to optimize sequential read performance.
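The merge itself is a de-duplicating union of sorted inputs. Below is a minimal sketch, assuming each SSTable is modeled as an in-memory sorted vector of key-value pairs; real engines stream the files from disk and also drop deletion tombstones, but the duplicate-resolution logic is the same.

/// Merge two SSTables into one; entries from `newer` win on key collisions.
fn compact(older: &[(u32, String)], newer: &[(u32, String)]) -> Vec<(u32, String)> {
    let mut merged = std::collections::BTreeMap::new();
    // Insert the older table first so newer entries overwrite duplicates.
    for (key, value) in older.iter().chain(newer.iter()) {
        merged.insert(*key, value.clone());
    }
    // BTreeMap iterates in key order, so the output is again a sorted run.
    merged.into_iter().collect()
}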
Step 4 — Handling Read Amplification Costs
Read operations in an LSM Tree are more complex than in a B-Tree. When searching for a key, the engine checks the Memtable first; on a miss, it probes the SSTables from newest to oldest until the key is found or every table has been checked. This multi-level lookup is the read amplification that LSM Trees accept in exchange for their write performance.
/// Check the Memtable first; fall back to the SSTables on a miss.
pub fn get(&self, key: K) -> Option<V> {
    self.memtable.get(&key).cloned()         // hot path: in memory
        .or_else(|| self.find_sstable(&key)) // cold path: on disk
}
This adds a lookup step, and potentially a disk seek, to every read, but in exchange writes land in the Memtable without contending on disk-resident structures, since flushed SSTables are immutable.
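To see where that latency comes from, here is a sketch of the SSTable fallback, modeling each table as an in-memory sorted map; the name and signature are illustrative, and the on-disk version would add a seek per table probed.

/// Probe SSTables from newest to oldest and stop at the first hit,
/// since newer tables shadow older values for the same key.
fn find_in_sstables<K: Ord, V: Clone>(
    sstables: &[std::collections::BTreeMap<K, V>], // assumed newest first
    key: &K,
) -> Option<V> {
    sstables.iter().find_map(|table| table.get(key).cloned())
}

Every table probed is another lookup, and on disk another seek, which is exactly the read amplification this step describes.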
Step 5 — Engine Selection Matrix
The decision depends on the workload. If writes dominate, the Memtable lets an LSM Tree absorb them cheaply, at the price of heavier compaction later; if low-latency point lookups dominate and read amplification is unacceptable, a B-Tree is the safer choice. In practice, LSM engines such as RocksDB back key-value stores that need massive write rates, while B-Tree engines such as InnoDB back relational SQL databases where fast point lookups are vital.
- High Write Load: Choose LSM Trees; the Memtable absorbs writes and flushes them sequentially.
- High Read Load: Choose B-Trees; point lookups take a predictable number of seeks with no multi-SSTable probing.
- Random Writes: Favor LSM Trees, accepting the heavier compaction work that overlapping key ranges create.
- Sequential Writes: Favor LSM Trees; compaction has little to do when key ranges barely overlap.
- HDD Storage: Favor B-Trees for read-heavy work, since every extra SSTable probe costs a mechanical seek.
- SSD Storage: Favor LSM Trees; cheap random reads soften the read-amplification penalty.
Takeaways
- Write Amplification: the primary cost of LSM Trees, since compaction rewrites the same data several times on disk.
- Read Latency: increases with the multi-level lookup through the Memtable and SSTables.
- Sequential I/O: heavily favored by LSM Trees for both compaction and flushing.
- Memory Footprint: higher for LSM Trees because of Memtable buffering.
- Failure Domain: large Memtables risk losing buffered writes if the process crashes before a flush.
What's Next?
Future discussions will cover cloud storage patterns such as S3 object stores and embedded storage engines like RocksDB. We will also explore how to implement custom SSTable merging strategies in Rust.
Further Reading
- Designing Data-Intensive Applications (Kleppmann) — explains the LSM Tree internals and write amplification concepts.
- A Philosophy of Software Design (Ousterhout) — discusses managing complexity when designing distributed storage layers.
This article is part of the Architecture Patterns series.