Erich

Posted on May 20 • Edited on May 27

B+Tree vs LSM-Tree on a matched budget

#rust #database

B+Trees and LSM-Trees are two on-disk index designs in production databases. The textbook framing is B+Trees win reads, LSMs win writes. Which engine wins, and by how much, depends on the workload. InterchangeDB implements both, so I matched their memory budgets and ran seven benchmarks against the same data on the same disk.

A 256 KB buffer pool for the B+Tree (64 frames at 4 KB, with ARC eviction). A 256 KB memtable for the LSM. Both hit disk on flush, both are single-threaded, both have the same key/value sizes, and the same RNG seeds.

Storage amplification

Metric	B+Tree	LSM	Winner
Space amp (post-insert)	2.55×	1.35×	LSM
Space amp (post-update)	2.55×	1.40×	LSM
Bloat factor (updates / inserts)	1.00×	1.03×	tie

LSM packs user data into roughly half the disk space the B+Tree uses. The B+Tree's 2.55× amplification comes from incomplete leaf pages and internal routing nodes. Leaves aren't 100% full so inserts have somewhere to land, and the routing layer adds a little more on top. The number is stable across inserts and updates because the B+Tree mutates pages in place. An update replaces the old value in its existing slot and the file size doesn't grow.

LSM's 1.35× amplification reflects how sorted runs pack tightly when they don't need to leave room for in-place mutation. The 1.40× amplification post-update figure tells us compaction is keeping up with the update load on this dataset.

Write amplification

Workload	B+Tree WAF	LSM WAF	Advantage
Bulk insert	2.56	3.12	B+Tree slightly better
Uniform updates	189.03	1.23	154× LSM

Bulk inserts come out close because both engines pay roughly the same overhead to land 100K monotonic keys. The B+Tree splits leaves and bumps internal nodes. The LSM flushes the memtable and runs compaction. The B+Tree at 2.56×, the LSM at 3.12×.

Under uniform updates the gap opens to 154×. The B+Tree dirties an entire 4 KB page for every 16-byte update, and on a uniform-random workload almost every update lands on a different page. The maximum write amplification is 4 KB / 16 B = 256×. 189× tells us the BPM is absorbing some same-page hits but not many. The LSM batches writes into the memtable and only flushes whole runs, so for the same 16-byte update it pays 1.23 bytes written per byte changed. O'Neil et al. (1996) framed this as the original motivation for the LSM design, amortizing random-write cost across sorted batches.

The LSM number is a lower bound. Every 10 ms, a thread snapshots the SST directory and counts files that vanished, so anything that flushes and compacts faster than that doesn't get counted. The real number on uniform updates is somewhere between 1.23 and 2.

The LSM's write amplification stays near 1× at this scale because there's barely any compaction yet. Production LSMs at multi-GB scale pay considerably more, in the 10–30× range on leveled compaction.

Range scans

Scan length	B+Tree median	LSM median	Advantage
10	39 µs	15.2 ms	386×
100	76 µs	14.9 ms	195×
1,000	409 µs	14.9 ms	36×
10,000	3.7 ms	14.2 ms	3.8×
100,000	11.9 ms	12.6 ms	1.1×

The 386× gap is partly an InterchangeDB artifact. The LSM scan does a full-tree merge across the memtable and every SSTable, then filters by range. RocksDB and LevelDB seek to the start key at the block level, so a 10-key scan on production LSM is microseconds, not milliseconds.

The LSM pays a fixed ~14 ms per scan cost in this implementation, which is the merge cost. The B+Tree pays for a logarithmic descent plus a linear leaf walk. This scales cleanly from 39 µs at length 10 to 11.9 ms at length 100K. The engines even out around 100K keys, where any fixed cost amortizes.

Even with a block-level seek implementation, the LSM still pays a per-level deduplication cost across however many SSTables overlap the range. The B+Tree's leaves are a sorted linked list, traversed after one descent, which is why it wins short walks. Merge trees win the long ones.

YCSB under cache

Workload	B+Tree ops/s	LSM ops/s	Advantage
A — 50r/50u	27,545	29,325	tie
B — 95r/5u	27,317	29,537	tie
C — 100r	25,869	29,520	tie
D — 95r/5i (latest)	25,662	28,867	tie
E — 95scan/5i	16,208	84	193× B+Tree
F — 50r/50rmw	27,050	29,400	tie

The dataset is 25K keys, and fits in cache. Point-op workloads tie at ~28K ops/s on both engines. At this scale both engines stay in memory. The BPM caches the B+Tree's pages. The memtable holds the LSM's recent writes. They only flush to disk occasionally.

Workload E is the exception. 95% scans, 5% inserts. The B+Tree clears 16K ops/s. The LSM clears 84. Each scan pays the 14 ms fixed cost from the previous section, 95 times per 100 ops. That's 1.33 seconds from scanning. This brings us closer to the textbook definition where B+Trees have a clear advantage.

The performance divide

Workload	B+Tree ops/s	LSM ops/s	Advantage
Uniform writes	100	181,915	1,819×
Read-heavy 95/5	99	184,463	1,863×
Mixed 50/50	92	150,500	1,635×

Pre-fill the working set to 10× the cache size, then run sustained mixed traffic.

The B+Tree caps at ~100 ops/s across all three workloads. Almost every operation has to fetch a page from disk. With 256 KB of cache and 2.5 MB of working data, the BPM hit rate collapses toward (cache size / working set size), about 10%. The other 90% of accesses pay a full disk seek of ~10 ms each, and the I/O cost dominates the throughput measurement.

The LSM doesn't hit this cap. Writes go to the memtable in sub-microsecond time and reads probe the memtable plus a few SSTables, this is made even faster because of Bloom filters. Compaction runs in the background. User-visible latency stays in the µs range and throughput hits 184K ops/s on the read-heavy workload.

The divide develops over multiple cache ratios. Sweeping the working set from half the cache to 25× the cache shows where the divergence starts:

WS/cache	B+Tree reads/s	LSM reads/s	B+Tree inserts/s	LSM inserts/s
0.5×	31,517	24,034	3,249	450,488
1.0×	31,213	24,181	3,662	1,004,417
2.0×	26,005	25,482	3,692	623,208
5.0×	27,869	26,680	3,763	532,816
10.0×	26,449	26,393	3,896	431,245
25.0×	22,974	23,941	3,796	407,747

The engines start to tie on reads after 2x WS/cache. Below that, the B+Tree is faster on reads because the BPM serves cached pages directly while the LSM still has to merge memtable plus SSTables on every read. Above that, both converge at ~25K once neither engine can stay in cache.

The divide is the predictable part of these numbers. The B+Tree insert column is more surprising. It hits a wall at 3–4K writes per second across every working-set size in the sweep, including the row where the data fits in cache twice over. Writes hit disk on flush regardless of the cache state, and the write amplification cost is paid in disk bandwidth.

The LSM peaks at 1M inserts/sec right at 1x WS/cache and declines as compaction overhead grows. Even at 25× the cache size it still does 408K writes/sec, two orders of magnitude faster than the B+Tree at any cache ratio.

Skew

Skew describes how concentrated traffic is on a small set of keys. Uniform spreads it evenly. Zipfian concentrates it on a small hot subset. Real workloads like product catalogs, hot caches, and social feeds are skewed.

Distribution	B+Tree ops/s	LSM ops/s	Advantage
Uniform	125	1,143,614	9,149× LSM
Zipfian θ=0.99	24,541	3,722,961	152× LSM

Both engines benefit from skew. The B+Tree gets a 196× lift from uniform to Zipfian because hot pages stay pinned in the BPM, eliminating most disk I/O. ARC retains the frequent set against scan pressure, so even under spill the BPM hit rate climbs to between 75% and 92%. The eviction policy benchmark on the same engine showed ARC's frequent list surviving one-time scans. This is the same mechanism doing the same work.

The LSM gets a 3.3× lift, smaller in relative terms but starting from 1.1M ops/s on uniform. Writes to the same hot keys coalesce in the memtable before flush, so each new entry doesn't necessarily produce its own SSTable entry. Compaction has less work to do because hot keys produce more dedup-eligible entries.

What production chose

Turso, PostgreSQL, MySQL/InnoDB, and SQLite all use B+Trees. They serve OLTP workloads where the working set typically fits in memory, scans are common (range queries, ORDER BY, foreign-key lookups), and write amplification is acceptable because most writes hit the same hot pages anyway. The performance gap numbers in this post are roughly what you'd expect from any of them on an undersized server.

RocksDB, LevelDB, Cassandra, and ScyllaDB use LSMs. They serve workloads where the dataset is much larger than memory, so write throughput matters more than scan latency.

The pattern is the same as it was for cache policies. Production picks the structure that solves the biggest problem the system actually faces. OLTP databases need range scans on a working set that mostly fits, where B+Trees handle the work cleanly. Write-heavy stores larger than memory run into these performance issues, where LSMs absorb the writes.

Where this goes

Next is concurrent access. The same benchmarks will run against multiple threads, and we will see what the numbers say.

Code and benchmarks: InterchangeDB. Run cargo bench --bench engine_steady_state to reproduce these benchmarks.

References

O'Neil, Cheng, Gawlick, O'Neil (1996). The Log-Structured Merge-Tree. Acta Informatica.

DEV Community