Hello, I'm Maneshwar. I'm currently working on FreeDevTools online, building **one place for all dev tools, cheat codes, and TLDRs**: a free, open-source hub where developers can quickly find and use tools without the hassle of searching all over the internet.
In the previous article, we explored why traditional B-Trees break down under modern workloads—large page granularity, write amplification, and cache pollution driven by the tight coupling between disk pages and cache pages.
Today, let’s look at the other side of the debate: LSM-Trees, the design that emerged to fix the weaknesses of B-Trees in write-heavy environments.
Modern databases often choose between two indexing giants: B-Trees and LSM-Trees.
B-Trees dominate traditional OLTP workloads with balanced read-write behavior, while LSM-Trees shine when writes are heavy and continuous.
But the trade-offs that once made LSM-Trees attractive are being challenged by modern hardware, especially NVMe SSDs.
Understanding LSM-Trees deeply requires acknowledging both their elegance and the pain points they introduce.
What Problem LSM-Trees Were Designed to Solve
Traditional storage systems suffered under random writes.
HDDs had mechanical heads that made seeks extremely expensive.
If the workload is write-heavy, random writes destroy performance.
LSM-Trees solved this by transforming random writes into sequential ones.
How LSM-Trees Work
1. Write Path
Incoming writes are first appended to an in-memory buffer (MemTable). Once full, this buffer is:
- Flushed to disk as an immutable sorted file (SSTable)
- Later merged with existing data through compaction
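A minimal sketch of that flush step, assuming a `std::map`-backed MemTable and a toy tab-separated file format for the SSTable (not any real engine's layout; RocksDB and LevelDB use skip lists, block-based formats, and a write-ahead log):

```cpp
// Toy LSM write path: buffer writes in a sorted in-memory map, then flush
// the whole map as one immutable, sorted, sequentially written file.
#include <cstddef>
#include <fstream>
#include <map>
#include <string>

class MemTable {
public:
    void put(const std::string& key, const std::string& value) {
        bytes_ += key.size() + value.size();   // rough size estimate
        entries_[key] = value;                 // std::map keeps keys sorted
    }

    bool full(std::size_t limit = 64 * 1024 * 1024) const { return bytes_ >= limit; }

    // Flush: write every entry in key order as one immutable SSTable file.
    // Sequential, append-only IO -- the core trick of the LSM write path.
    void flush(const std::string& sstable_path) {
        std::ofstream out(sstable_path, std::ios::trunc);
        for (const auto& [key, value] : entries_) {
            out << key << '\t' << value << '\n';
        }
        entries_.clear();
        bytes_ = 0;
    }

private:
    std::map<std::string, std::string> entries_;
    std::size_t bytes_ = 0;
};
```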
Compaction maintains sorted order and removes old versions of records, but comes with a heavy price:
- Large background IO traffic
- Multiple reads and writes of the same data
- High CPU and bandwidth usage
- Tail latency spikes when compaction storms occur
Write amplification is the unavoidable side effect: writing a record once may result in it being rewritten many times as it moves down the levels.
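A back-of-the-envelope model makes the cost concrete. Assuming leveled compaction where each level is `fanout` times larger than the one above, a commonly cited rough worst-case bound is about one rewrite per level crossed, times the fanout (the exact numbers vary by engine and workload; this is a model, not a measurement):

```cpp
// Rough worst-case write amplification for leveled compaction:
// each byte may be rewritten ~fanout times per level it passes through.
#include <iostream>

int main() {
    const int levels = 6;    // e.g. L1..L6
    const int fanout = 10;   // size ratio between adjacent levels
    const int wal_and_flush = 1;
    const int estimate = wal_and_flush + levels * fanout;
    std::cout << "estimated worst-case write amplification: ~" << estimate << "x\n";
}
```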
2. Point Lookups
To read a single key, the system may need to search:
- MemTable (latest data)
- Multiple SST files across multiple levels
To avoid scanning every file, LSM-Trees use optimizations:
| Technique | Purpose |
|---|---|
| Bloom Filters | Skip files that definitely don’t contain the key |
| Block Cache | Cache data blocks from SSTs to avoid disk IO |
| Row Cache | Cache individual records for hot keys |
Even with all these optimizations, point lookups can still suffer: you may need to probe multiple levels before finding a single value.
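A sketch of that read path, checking the MemTable first and then probing each level's SSTables behind a Bloom filter. The in-memory stand-ins below (a `std::set` for the filter, a `std::map` for the file contents) are hypothetical; a real engine reads blocks from disk at those points:

```cpp
// Toy LSM point lookup: MemTable, then every level, skipping files whose
// Bloom filter says the key is definitely absent.
#include <map>
#include <optional>
#include <set>
#include <string>
#include <vector>

struct SSTable {
    std::set<std::string> bloom;              // stand-in for a real Bloom filter
    std::map<std::string, std::string> data;  // stand-in for on-disk blocks

    bool may_contain(const std::string& key) const {  // false => definitely absent
        return bloom.count(key) > 0;
    }
    std::optional<std::string> get(const std::string& key) const {  // "disk read"
        auto it = data.find(key);
        if (it == data.end()) return std::nullopt;
        return it->second;
    }
};

using Level = std::vector<SSTable>;

std::optional<std::string> lookup(const std::map<std::string, std::string>& memtable,
                                  const std::vector<Level>& levels,
                                  const std::string& key) {
    if (auto it = memtable.find(key); it != memtable.end()) return it->second;
    for (const auto& level : levels) {           // newest level first
        for (const auto& sst : level) {
            if (!sst.may_contain(key)) continue; // Bloom filter: skip the file, no IO
            if (auto v = sst.get(key)) return v; // otherwise pay one or more reads
        }
    }
    return std::nullopt;                         // probed every level, key absent
}
```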
3. Range Scans
Range scans are LSM-Trees’ weak spot:
- Must merge results across all levels
- Row cache does not help
- High memory churn due to block-level caching
This makes LSM-Trees a poor fit for analytic or mixed workloads where scans and point reads coexist.
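The cost shows up clearly in a sketch of the merge: every level contributes a sorted run, and all of them must be visited even if only one holds live data for the range. Here each run is modeled as an already-sorted vector; a real engine merges block iterators, but the shape of the work is the same:

```cpp
// Toy LSM range scan: merge sorted runs from every level, newer data wins.
#include <map>
#include <string>
#include <utility>
#include <vector>

using Run = std::vector<std::pair<std::string, std::string>>;  // sorted by key

// `runs` is ordered newest-first (L0 first). Iterating oldest-first and
// overwriting lets newer levels shadow older versions of the same key.
std::map<std::string, std::string> range_scan(const std::vector<Run>& runs,
                                              const std::string& lo,
                                              const std::string& hi) {
    std::map<std::string, std::string> out;
    for (auto it = runs.rbegin(); it != runs.rend(); ++it) {
        for (const auto& [k, v] : *it) {
            if (k >= lo && k < hi) out[k] = v;  // newer levels overwrite older ones
        }
    }
    return out;  // every level was touched, regardless of where the data lived
}
```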
Why NVMe SSDs Change the Game
LSM-Trees were born for spinning disks. But hardware evolved.
Random vs Sequential Writes
HDDs love sequential writes. NVMe doesn’t care much:
- No mechanical movement
- Flash-backed parallel channels
- Hardware wear leveling and garbage collection
Recent studies show that 4 KB random writes can nearly saturate NVMe bandwidth. That means the original motivation for LSM-Trees, avoiding random writes, no longer holds as strongly.
Log-structured writing becomes less compelling when random writes are cheap.
Kernel Bypass IO
Traditional IO requires:
- A kernel context switch
- Memory copying between kernel and user space
With SPDK and io_uring:
- Applications talk directly to NVMe
- Microsecond-scale latencies become realistic
- Storage bandwidth exceeds 10 GB/s
When IO latency approaches context switch time, software inefficiency becomes the bottleneck, not hardware.
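For a feel of what the io_uring path looks like, here is a minimal single-read sketch, assuming liburing is installed (build with `-luring`) and a hypothetical `data.sst` file; real engines batch many submissions per syscall and poll for completions:

```cpp
// Minimal io_uring read of one 4 KB block, submitted and completed without
// a blocking read() syscall per IO.
#include <fcntl.h>
#include <liburing.h>
#include <cstdio>

int main() {
    struct io_uring ring;
    if (io_uring_queue_init(32, &ring, 0) < 0) return 1;  // 32-entry queue

    int fd = open("data.sst", O_RDONLY);                   // hypothetical file
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);      // 4 KB at offset 0
    io_uring_submit(&ring);                                 // one syscall can carry many IOs

    struct io_uring_cqe* cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```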
Optimal Page Sizes on NVMe
Smaller pages reduce amplification, but extremely tiny pages increase overhead. Evaluations show:
- 4 KB page size is the sweet spot—below that the flash translation layer suffers
- Traditional 8–32 KB B-Tree pages are outdated for NVMe access patterns
This aligns with Bf-Tree choosing 4 KB as its default physical page size.
So Where Does LSM-Tree Stand Today?
Despite limitations, LSM-Trees are everywhere:
- RocksDB, TiDB, Bigtable, DynamoDB, Cassandra, LevelDB, HBase
They are unbeatable for pure ingestion systems like logging, metrics, and time-series.
But when workloads demand:
- Mixed OLTP + analytics
- Latency-sensitive reads
- Balanced scans and writes
The traditional LSM trade-offs become painful.
Compaction storms, memory pressure, and cache inefficiency dominate operational cost.
Hardware changed, but database architecture hasn’t kept up. Both B-Trees and LSM-Trees carry assumptions from disk-era design.
Modern NVMe SSDs demand:
- Fine-grained data management
- Reduced amplification
- Cache and buffer unification
- Page redesign, not tuning
This is where Bf-Tree enters the picture—not as a small optimization, but as a rethinking of pages and memory layout.
👉 Check out: FreeDevTools
Any feedback or contributors are welcome!
It’s online, open-source, and ready for anyone to use.
⭐ Star it on GitHub: freedevtools
