LSM-Trees vs B-Trees: The Write-Optimized Database Tradeoff

#tutorial #webdev

Every database answers one question before it stores a single byte: when you write a row, do you change the data where it already lives, or do you append the change somewhere new and reconcile later? B-trees pick the first answer. LSM-trees pick the second. That one decision ripples outward into write throughput, read latency, disk usage, and the shape of the workloads each engine handles well. If you have ever wondered why Postgres and Cassandra feel so different under load, this is the root of it.

How each structure actually writes

A B-tree — really a B+tree in almost every production database — keeps keys sorted inside fixed-size pages: 8KB in PostgreSQL, 16KB in MySQL's InnoDB. To update a row, the engine walks from the root to the leaf page holding that key, then modifies that page in place. The walk is shallow (a tree holding millions of rows is usually three or four levels deep), but the final write lands wherever that page already sits on disk. That means random I/O, and it means write amplification: change 50 bytes and the engine still rewrites the whole 8KB page, plus the write-ahead log entry that protects it.

An LSM-tree (log-structured merge tree) refuses to touch existing files. A write first appends to a write-ahead log, then lands in an in-memory sorted structure called the memtable. When the memtable fills, it is flushed to disk as an immutable, sorted file called an SSTable. Because SSTables are never modified, every flush is a large sequential write — the kind of I/O that both spinning disks and SSDs handle far faster than scattered random writes. Updates and deletes do not overwrite anything; a delete writes a small marker called a tombstone, and the newest value simply shadows older ones.

The catch is that those immutable files pile up. A background process called compaction continuously merges SSTables, drops shadowed values, and reclaims space. Compaction is where the LSM-tree pays its bill: the same logical row may be rewritten several times over its life as it moves through compaction levels.

Write-optimized does not mean write-cheap. With the leveled compaction that engines like RocksDB use by default, a single logical write can be physically rewritten roughly ten to thirty times before it settles. On write-heavy workloads that compaction work competes with your foreground traffic for disk and CPU, which is why LSM systems expose so many compaction-tuning knobs. You are trading random-write cost for background rewrite cost, not eliminating cost.

Where the tradeoffs bite

Reads. A B-tree read is predictable: walk root to leaf, read one page, done. An LSM read may have to check the memtable and then several SSTables across levels, because the key could live in any of them. To avoid touching files that cannot contain the key, each SSTable carries a Bloom filter — a probabilistic index that answers "definitely not here" cheaply, with a tunable false-positive rate around 1%. Bloom filters make point lookups close to a single disk read in practice, but range scans still have to merge results from multiple files, so they remain the B-tree's stronger suit.

Space. B-trees fragment. Pages split when they fill and rarely sit perfectly packed, so a B-tree often carries meaningful slack on disk. LSM-trees hold obsolete versions and tombstones until compaction clears them, which means transient bloat that depends on how far behind compaction is running. Leveled compaction keeps steady-state overhead low — often around 10% extra — at the cost of more rewriting; size-tiered compaction rewrites less but can temporarily hold multiple full copies during a merge.

Concurrency. Appending to a log and a memtable serializes cleanly, so LSM-trees absorb bursts of concurrent writes gracefully. B-trees must latch the pages they modify, and hot pages can become contention points under heavy concurrent writes.

A useful frame here is the RUM conjecture: across Read overhead, Update overhead, and Memory (space) overhead, you can optimize for two at the expense of the third. B-trees favor reads and space and pay on random updates. LSM-trees favor updates and space efficiency under compaction and pay on read and rewrite amplification. There is no structure that wins all three — only structures that pick a side.

Dimension	B-tree	LSM-tree
Write pattern	In-place, random I/O	Append-only, sequential I/O
Point read	One root-to-leaf walk	Memtable + SSTables, cut by Bloom filters
Range scan	Strong, keys laid out in order	Weaker, merges across files
Main cost	Page write amplification	Compaction rewrite amplification
Best fit	Read-heavy, mixed OLTP	Write-heavy, high ingest

Which one is in the database you already use

You are almost certainly running both, depending on the project. PostgreSQL, MySQL's InnoDB, and SQLite are B-tree engines — which is why they shine on transactional, read-heavy work with predictable latency. Cassandra, ScyllaDB, HBase, and the embedded engines LevelDB and RocksDB are LSM-trees, built for heavy ingest and write throughput. The line blurs further inside modern systems: CockroachDB and TiKV store their data in RocksDB-derived LSM engines (Pebble and RocksDB respectively), and MyRocks puts an LSM storage engine under a MySQL front end.

When you are choosing or tuning a store, profile the write-to-read ratio of your actual workload before reaching for either answer. An append-heavy event log, a metrics pipeline, or a message queue leans LSM. A system that reads far more than it writes, runs lots of range scans, or needs flat, predictable latency leans B-tree. "It depends" is not a cop-out here — the workload genuinely decides.

The fastest way to internalize this is to read the source. Open the SSTable flush path in RocksDB, or the page-split logic in Postgres's nbtree, and the abstract tradeoff turns concrete.

Originally published at pickuma.com. Subscribe to the RSS or follow @pickuma.bsky.social for new reviews.