LSM-Trees, NVMe SSDs, and Modern Storage Realities

Hello, I'm Maneshwar. I'm currently building FreeDevTools: **one place for all dev tools, cheat codes, and TLDRs**, a free, open-source hub where developers can quickly find and use tools without the hassle of searching all over the internet.

In the previous article, we explored why traditional B-Trees break down under modern workloads—large page granularity, write amplification, and cache pollution driven by the tight coupling between disk pages and cache pages.

Today, let’s look at the other side of the debate: LSM-Trees, the design that emerged to fix the weaknesses of B-Trees in write-heavy environments.

Modern databases often choose between two indexing giants: B-Trees and LSM-Trees.

B-Trees dominate traditional OLTP workloads with balanced read-write behavior, while LSM-Trees shine when writes are heavy and continuous.

But the trade-offs that once made LSM-Trees attractive are being challenged by modern hardware, especially NVMe SSDs.

Understanding LSM-Trees deeply requires acknowledging both their elegance and the pain points they introduce.

What Problem LSM-Trees Were Designed to Solve

Traditional storage systems suffered under random writes.

HDDs had mechanical heads that made seeks extremely expensive.

If the workload is write-heavy, random writes destroy performance.

LSM-Trees solved this by transforming random writes into sequential ones.

How LSM-Trees Work

1. Write Path

Incoming writes are first appended to an in-memory buffer (MemTable). Once full, this buffer is:

  1. Flushed to disk as an immutable sorted file (SSTable)
  2. Later merged with existing data through compaction

Compaction maintains sorted order and removes old versions of records, but comes with a heavy price:

  • Large background IO traffic
  • Multiple reads and writes of the same data
  • High CPU and bandwidth usage
  • Tail latency spikes when compaction storms occur

Write amplification is the unavoidable side effect: writing a record once may result in writing it many times across levels.
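
To make the flush-and-compact cycle concrete, here is a minimal Python sketch (a toy model, not RocksDB's actual MemTable or compaction code; the two-entry capacity is just to keep the example small):

```python
class MemTable:
    """In-memory write buffer: absorbs writes, flushed to disk when full."""

    def __init__(self, capacity=2):
        self.entries = {}              # latest value per key
        self.capacity = capacity

    def put(self, key, value):
        self.entries[key] = value
        return len(self.entries) >= self.capacity   # True means "flush me"

    def flush(self):
        # Emit an immutable sorted run: an "SSTable" in spirit.
        run = sorted(self.entries.items())
        self.entries = {}
        return run


def compact(runs):
    """Merge sorted runs into one; newer runs win on duplicate keys."""
    merged = {}
    for run in runs:                   # oldest first, so later runs overwrite
        for key, value in run:
            merged[key] = value
    return sorted(merged.items())


# Two flushes, then a compaction that keeps only the latest version of 'a'.
memtable, runs = MemTable(capacity=2), []
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    if memtable.put(key, value):
        runs.append(memtable.flush())
print(compact(runs))                   # [('a', 3), ('b', 2), ('c', 4)]
```

Even in this toy version, every surviving record is written at least twice: once at flush and once at merge. That is write amplification in miniature.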

2. Point Lookups

To read a single key, the system may need to search:

  • MemTable (latest data)
  • Multiple SST files across multiple levels

To avoid scanning every file, LSM-Trees use optimizations:

| Technique | Purpose |
| --- | --- |
| Bloom filters | Skip files that definitely don't contain the key |
| Block cache | Cache data blocks from SSTs to avoid disk IO |
| Row cache | Cache individual records for hot keys |

Even with all these optimizations, point lookups can still suffer: you may need to probe multiple levels before finding a single value.
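
A rough sketch of how a lookup cascades through these structures (the "filter" below is a toy set of hash buckets standing in for a real Bloom filter, and the level/file layout is invented for the example):

```python
class SSTable:
    """An immutable sorted file, here just a dict plus a toy membership filter."""

    def __init__(self, entries):
        self.entries = dict(entries)
        # Toy filter: a set of hash buckets. A real Bloom filter is a bit array
        # with several hash functions and a small false-positive rate.
        self.filter = {hash(k) % 1024 for k in self.entries}

    def might_contain(self, key):
        return (hash(key) % 1024) in self.filter    # never a false negative

    def get(self, key):
        return self.entries.get(key)


def point_lookup(memtable, levels, key):
    if key in memtable:                 # 1. newest data: the MemTable
        return memtable[key]
    for level in levels:                # 2. then each level, newest to oldest
        for sst in level:
            if sst.might_contain(key):  # the filter lets us skip most files
                value = sst.get(key)
                if value is not None:
                    return value
    return None


# 'k1' lives only in the deeper level, so two levels get probed before it is found.
levels = [[SSTable([("k2", "v2")])], [SSTable([("k1", "v1")])]]
print(point_lookup({"k3": "v3"}, levels, "k1"))     # v1
```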

3. Range Scans

Range scans are LSM-Trees’ weak spot:

  • Must merge results across all levels
  • Row cache does not help
  • High memory churn due to block-level caching

This makes LSM-Trees a poor fit for analytic or mixed workloads where scans and point reads coexist.
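
Here is a sketch of the merge a range scan has to perform, assuming the runs are already sorted and ordered newest-first (a toy stand-in for the iterator merging that engines like RocksDB do internally):

```python
import heapq

def range_scan(runs, start, end):
    """
    Merge-scan several sorted runs (MemTable dump + SSTables) over [start, end).
    `runs` is ordered newest-first; on duplicate keys the newest version wins.
    """
    tagged = (
        ((key, value, rank) for key, value in run if start <= key < end)
        for rank, run in enumerate(runs)
    )
    # Stream the sorted runs together; order by key, then by run age (newest first).
    merged = heapq.merge(*tagged, key=lambda t: (t[0], t[2]))
    result, last_key = [], None
    for key, value, rank in merged:
        if key != last_key:            # keep only the newest version of each key
            result.append((key, value))
            last_key = key
    return result


# The same key appears in two runs; the newer run's value is returned.
newer = [("a", "new"), ("c", "3")]
older = [("a", "old"), ("b", "2"), ("d", "4")]
print(range_scan([newer, older], "a", "d"))   # [('a', 'new'), ('b', '2'), ('c', '3')]
```

Notice that every run in every level has to contribute an iterator to the merge; there is no shortcut comparable to a Bloom filter for ranges.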

Why NVMe SSDs Change the Game

LSM-Trees were born for spinning disks. But hardware evolved.

Random vs Sequential Writes

HDDs love sequential writes. NVMe doesn’t care much:

  • No mechanical movement
  • Flash-backed parallel channels
  • Hardware wear leveling and garbage collection

Recent studies show that 4 KB random writes can nearly saturate NVMe bandwidth. Meaning:

The original motivation for LSM-Trees, avoiding random writes, no longer holds as strongly.

Log-structured writing becomes less compelling when random writes are cheap.
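
If you want to sanity-check this on your own hardware, a rough sketch along these lines works (Linux-only because of O_DIRECT, queue depth 1, arbitrary file paths and sizes; fio is the right tool for a serious measurement):

```python
import mmap
import os
import random
import time

def write_bandwidth(path, sequential, block=4096, count=8192):
    """Time 4 KiB writes at sequential vs. shuffled offsets.
    O_DIRECT (Linux-only) keeps the page cache from hiding the access pattern."""
    buf = mmap.mmap(-1, block)                 # page-aligned buffer for O_DIRECT
    buf.write(os.urandom(block))
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    size = block * count
    os.ftruncate(fd, size)
    offsets = list(range(0, size, block))
    if not sequential:
        random.shuffle(offsets)
    start = time.perf_counter()
    for off in offsets:                        # queue depth 1: one write at a time
        os.pwrite(fd, buf, off)
    elapsed = time.perf_counter() - start
    os.close(fd)
    return size / (1 << 20) / elapsed          # MiB/s

# On an NVMe SSD the two numbers tend to land in the same ballpark;
# on a spinning disk the shuffled case collapses.
print("sequential MiB/s:", round(write_bandwidth("/tmp/seq.bin", True)))
print("random     MiB/s:", round(write_bandwidth("/tmp/rand.bin", False)))
```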

Kernel Bypass IO

Traditional IO requires:

  • A kernel context switch
  • Memory copying between kernel and user space

With SPDK and io_uring:

  • Applications talk directly to NVMe
  • Microsecond-scale latencies become realistic
  • Storage bandwidth exceeds 10 GB/s

When IO latency approaches context switch time, software inefficiency becomes the bottleneck, not hardware.
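
A crude way to feel this overhead is to time kernel-mediated reads that never touch a device at all. The figure below also includes Python's own call overhead, so treat it as a loose upper bound:

```python
import os
import time

# Each read below goes through a full system call (kernel entry/exit plus a copy
# to user space) but hits no storage device. When an NVMe read itself takes only
# tens of microseconds, this per-call software cost, multiplied across millions
# of IOs, is exactly what SPDK and io_uring try to shrink or batch away.
fd = os.open("/dev/zero", os.O_RDONLY)
n = 1_000_000
start = time.perf_counter()
for _ in range(n):
    os.pread(fd, 4096, 0)          # one 4 KiB read per system call
elapsed = time.perf_counter() - start
os.close(fd)
print(f"~{elapsed / n * 1e6:.2f} µs per read round trip")
```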

Optimal Page Sizes on NVMe

Smaller pages reduce amplification, but extremely tiny pages increase overhead. Evaluations show:

  • 4 KB page size is the sweet spot—below that the flash translation layer suffers
  • Traditional 8–32 KB B-Tree pages are outdated for NVMe access patterns

This aligns with Bf-Tree choosing 4 KB as its default physical page size.
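
A back-of-the-envelope view of why page size matters, assuming a single 128-byte record is modified per page write (the record size is an arbitrary example):

```python
record = 128                          # bytes actually modified (arbitrary example)
for page in (4096, 8192, 16384, 32768):
    amplification = page / record     # the whole page is rewritten for one small update
    print(f"{page // 1024:>2} KiB page -> {amplification:.0f}x write amplification")
# 4 KiB -> 32x, 8 KiB -> 64x, 16 KiB -> 128x, 32 KiB -> 256x
```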

So Where Does LSM-Tree Stand Today?

Despite limitations, LSM-Trees are everywhere:

  • RocksDB, TiDB, Bigtable, DynamoDB, Cassandra, LevelDB, HBase

They are unbeatable for pure ingestion systems like logging, metrics, and time-series.

But when workloads demand:

  • Mixed OLTP + analytics
  • Latency-sensitive reads
  • Balanced scans and writes

the traditional LSM trade-offs become painful.

Compaction storms, memory pressure, and cache inefficiency dominate operational cost.

Hardware changed, but database architecture hasn’t kept up. Both B-Trees and LSM-Trees carry assumptions from disk-era design.

Modern NVMe SSDs demand:

  • Fine-grained data management
  • Reduced amplification
  • Cache and buffer unification
  • Page redesign, not tuning

This is where Bf-Tree enters the picture—not as a small optimization, but as a rethinking of pages and memory layout.

FreeDevTools

👉 Check out: FreeDevTools

Feedback and contributors are welcome!

It’s online, open-source, and ready for anyone to use.

⭐ Star it on GitHub: freedevtools
