
우병수

Posted on • Originally published at techdigestor.com

LSM Trees: Why Your Database Writes Are Fast and Your Reads Are Lying to You

TL;DR: LSM trees buy enormous write throughput by turning every write into a sequential append (WAL plus MemTable, flushed later as immutable SSTables), and they pay for it with read amplification and background compaction that you have to understand and tune.


What's in this article

  1. The Problem That Made LSM Trees Necessary
  2. How an LSM Tree Actually Works (No Hand-Waving)
  3. Write Amplification vs Read Amplification vs Space Amplification
  4. Leveled Compaction vs Tiered Compaction — When Each Actually Makes Sense
  5. LSM Trees in the Wild: RocksDB, LevelDB, Cassandra, and ClickHouse
  6. Hands-On: Poking at RocksDB Internals
  7. Bloom Filters: The Thing That Saves LSM Reads
  8. Compaction Tuning: The Stuff the Docs Gloss Over

The Problem That Made LSM Trees Necessary

B-Trees Are Excellent — Right Up Until Your Write Rate Destroys You

The thing that caught me off guard the first time I profiled a write-heavy Postgres instance wasn't CPU saturation or memory pressure — it was await time on disk I/O. The B-tree index was healthy, the query plans looked fine, but we were hammering somewhere around 40-50k inserts per second from an event pipeline and the disk controller was just... done. iostat -x 1 was showing 100% utilization with queue depths in the double digits. The problem wasn't the data volume, it was the pattern of writes a B-tree forces.

Every insert into a B-tree index touches a leaf node that lives at some essentially random location on disk. At 50k rows/sec, you're issuing 50k random write seeks per second across your index pages. On a spinning disk, a random seek takes 5-10ms — you physically cannot do more than a few hundred per second per drive. But even on NVMe SSDs, random writes still hurt you, just for a different reason. NAND flash erases in blocks of 128KB–512KB minimum, so writing a single 4KB page forces the SSD controller to read the surrounding block, erase it, and rewrite the whole thing. That's write amplification baked into the hardware. Sequential writes bypass this almost entirely because you're filling erase blocks cleanly in one pass.

# What "write amplification" looks like in practice on a B-tree:
# You insert 1 row → B-tree updates 1 leaf page (4KB)
# NVMe erase block = 256KB
# Worst case: that 4KB logical write turns into a 256KB block rewrite = 64x amplification
# At 50k inserts/sec: 50,000 × 256KB ≈ 12.8 GB/sec of actual flash I/O
# Most NVMe drives top out around 3-7 GB/sec sequential write bandwidth
# You've already exceeded your hardware before accounting for WAL or vacuum

The write amplification compounds because B-trees also need to maintain balance. A leaf node fills up, triggers a page split, which propagates up to parent nodes, which may split again. Each of those splits is another random write. Under sustained high write throughput, you stop doing useful work and start doing maintenance work. I've seen Postgres instances where pg_stat_bgwriter showed buffers_backend_fsync spiking — meaning foreground queries were being forced to flush dirty pages because the background writer couldn't keep up. That's the wall.

LSM trees make a deliberate trade: they absorb writes sequentially into an in-memory buffer (the MemTable), then flush to disk as immutable sorted files (SSTables) in one big sequential write. Reads become more expensive because you potentially have to check multiple SSTables to answer a query — you lose the B-tree's clean O(log n) guarantee for point reads. But your write path is now sequential I/O all the way down. RocksDB, which uses LSM, routinely sustains 400k+ writes/sec on the same hardware where a B-tree index would be completely saturated at 50k. That's not a tuning difference — it's an architectural one. You're trading read overhead (bloom filters and compaction help recover some of this) for write throughput that B-trees structurally cannot match.
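You can feel this difference without a database at all. Below is a crude micro-benchmark sketch, not a rigorous one: it appends 4KB blocks sequentially with a sync after each write, then writes the same blocks at random offsets into a sparse 1GB file. The file paths, block counts, and sizes are arbitrary placeholders, and the size of the gap depends entirely on the device and filesystem, but the shape of the result is the point.

package main

import (
    "fmt"
    "math/rand"
    "os"
    "time"
)

// Crude illustration of sequential appends vs random in-place writes.
// Absolute numbers vary wildly by device and filesystem; the pattern is the point.
func main() {
    const blocks = 2000
    const blockSize = 4096
    buf := make([]byte, blockSize)

    // sequential: append block after block, like a WAL or an SSTable flush
    seq, err := os.Create("/tmp/seq.bin")
    if err != nil {
        panic(err)
    }
    start := time.Now()
    for i := 0; i < blocks; i++ {
        seq.Write(buf)
        seq.Sync() // force it to the device, like a per-write WAL fdatasync
    }
    seq.Close()
    fmt.Println("sequential:", time.Since(start))

    // random: overwrite 4KB pages at random offsets, like B-tree leaf updates
    rnd, err := os.Create("/tmp/rnd.bin")
    if err != nil {
        panic(err)
    }
    rnd.Truncate(1 << 30) // 1GB sparse file so the offsets are spread out
    start = time.Now()
    for i := 0; i < blocks; i++ {
        off := rand.Int63n((1<<30)/blockSize) * blockSize
        rnd.WriteAt(buf, off)
        rnd.Sync()
    }
    rnd.Close()
    fmt.Println("random:", time.Since(start))
}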

How an LSM Tree Actually Works (No Hand-Waving)

The thing that surprised me most when I first read through RocksDB's internals: the "write" you issue doesn't touch a sorted structure at all. It hits a write-ahead log and a skip list in memory. The beautiful B-tree-style sorted order you eventually read from? That gets assembled lazily, in the background, while your application is doing other things. That inversion — writes are cheap, ordering is deferred — is the entire personality of an LSM tree.

The Three Layers You Need to Keep in Your Head

Every LSM implementation you'll encounter — LevelDB, RocksDB, Cassandra's storage engine, FoundationDB, ScyllaDB — maps to the same mental model. First: the MemTable, a mutable in-memory sorted structure (usually a skip list or red-black tree) that absorbs all incoming writes. Second: SSTables on disk, immutable files that get written when a MemTable fills up. Third: the compaction process, background threads that merge and rewrite SSTables to reclaim space and impose ordering. Most bugs and performance surprises in LSM-based databases live in that third layer.
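None of this requires database-grade machinery to picture. Here's a toy skeleton in Go (not any real engine's API) that maps those three layers onto a few dozen lines: writes land in a mutable in-memory table, a flush freezes it into an immutable sorted run, and reads check memory first and then the runs from newest to oldest. Compaction would merge the entries of disk; a sketch of that appears in the compaction section below.

package main

import (
    "fmt"
    "sort"
)

// Toy mental model only. Real engines use skip lists, a WAL, bloom filters,
// block caches, and background threads; the layering is the same.
type memTable map[string]string // mutable, in memory, absorbs every write

type ssTable struct { // immutable and sorted once flushed to "disk"
    keys, vals []string
}

type lsmTree struct {
    active memTable
    disk   []ssTable // newest first; compaction merges and rewrites these
}

// put never touches disk structures: that's the whole point of the write path.
func (t *lsmTree) put(k, v string) { t.active[k] = v }

// flush freezes the MemTable into a sorted, immutable SSTable.
func (t *lsmTree) flush() {
    s := ssTable{}
    for k := range t.active {
        s.keys = append(s.keys, k)
    }
    sort.Strings(s.keys)
    for _, k := range s.keys {
        s.vals = append(s.vals, t.active[k])
    }
    t.disk = append([]ssTable{s}, t.disk...) // newest table goes first
    t.active = memTable{}
}

// get checks memory first, then every sorted run from newest to oldest.
func (t *lsmTree) get(k string) (string, bool) {
    if v, ok := t.active[k]; ok {
        return v, true
    }
    for _, s := range t.disk {
        i := sort.SearchStrings(s.keys, k)
        if i < len(s.keys) && s.keys[i] == k {
            return s.vals[i], true
        }
    }
    return "", false
}

func main() {
    t := &lsmTree{active: memTable{}}
    t.put("user:1", "v1")
    t.flush()
    t.put("user:2", "v2")
    fmt.Println(t.get("user:1")) // v1 true, served from the flushed run
    fmt.Println(t.get("user:2")) // v2 true, served from the MemTable
}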

The Write Path: Why Appends Win

Every write touches two places: the WAL first, then the MemTable. The WAL (write-ahead log) is a sequential append — one fdatasync call and you're done. This is physically fast because rotating disks love sequential writes and NVMe SSDs love them even more. Then the MemTable gets updated in memory. That's it. No page splits, no B-tree rebalancing, no reading a disk block before you can modify it. When the MemTable hits its size threshold (64MB by default in RocksDB; the config below raises it to 128MB), it gets frozen, a new MemTable takes over, and a background thread flushes the frozen one to disk as a new SSTable. The application doesn't block on that flush unless flushes fall so far behind that write stalls kick in (more on that later).

# RocksDB config that controls this behavior
# rocksdb/options.h equivalents in a typical config file

write_buffer_size = 134217728       # 128MB per MemTable
max_write_buffer_number = 3         # how many MemTables can exist simultaneously
min_write_buffer_number_to_merge = 1 # flush when 1 MemTable is full

# WAL behavior
wal_dir = /data/wal                 # separate WAL on a different device if you want
# WAL durability is a per-write setting: WriteOptions sync = true fsyncs before acking — safe but slower
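To see that per-write durability choice from application code, here's a minimal sketch using the grocksdb Go bindings (the same bindings the hands-on section below uses). The path and key are placeholders; the point is that WAL syncing is a property of each write via WriteOptions, not a table-level option.

package main

import "github.com/linxGnu/grocksdb"

func main() {
    opts := grocksdb.NewDefaultOptions()
    opts.SetCreateIfMissing(true)

    db, err := grocksdb.OpenDb(opts, "/tmp/wal-demo")
    if err != nil {
        panic(err)
    }
    defer db.Close()

    // sync=true: fdatasync the WAL before Put returns (durable, slower).
    // sync=false (the default): the write is acknowledged once it's in the WAL buffer.
    wo := grocksdb.NewDefaultWriteOptions()
    wo.SetSync(true)
    defer wo.Destroy()

    if err := db.Put(wo, []byte("user:42"), []byte(`{"name":"ada"}`)); err != nil {
        panic(err)
    }
}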

The Read Path: Where the Pain Starts

Reading is the honest cost you pay for those cheap writes. The lookup order is: check the active MemTable → check any frozen (not-yet-flushed) MemTables → check Level 0 SSTables → check Level 1, Level 2... and so on. In the worst case — a key that doesn't exist — you might touch every SSTable at every level before confirming the miss. That's why bloom filters aren't optional in practice. Each SSTable carries a bloom filter that lets you skip it with ~99% confidence for keys that aren't there. RocksDB's default bloom filter uses 10 bits per key. Without bloom filters, a point lookup on a cold key in a database with deep compaction levels is genuinely slow — I've seen this hit 50ms+ per read in poorly tuned setups where bloom filters were accidentally disabled.

# In a RocksDB ColumnFamilyOptions — bloom filter per SSTable
BlockBasedTableOptions table_options;
table_options.filter_policy.reset(
    NewBloomFilterPolicy(
        10,    // bits per key — 10 gives ~1% false positive rate
        false  // false = full (per-SSTable) filter; true = the legacy block-based filter
    )
);
table_options.whole_key_filtering = true;

What an SSTable Actually Looks Like on Disk

An SSTable is not a B-tree. It's a flat, sorted, immutable file. Physically, it's divided into fixed-size data blocks (4KB default in RocksDB), an index block that maps the last key of each data block to its offset, the bloom filter block, and a footer with magic bytes and metadata pointers. Every key-value pair in the file appears exactly once, sorted lexicographically. Once written, the file is never modified — not even for a delete. A delete writes a special marker called a tombstone with the same key, which shadows any older value. The old value physically persists on disk until compaction runs and rewrites the relevant SSTables without it.

SSTable file layout (simplified):
┌──────────────────────────────┐
│  Data Block 0                │  key0:val, key1:val, key2:val ... (sorted)
│  Data Block 1                │
│  ...                         │
├──────────────────────────────┤
│  Index Block                 │  last_key_of_block0 → offset0
│                              │  last_key_of_block1 → offset1
├──────────────────────────────┤
│  Filter Block (bloom)        │  one filter per data block or whole file
├──────────────────────────────┤
│  Metaindex Block             │  offsets to filter/stats blocks
│  Footer (48 bytes)           │  magic, version, metaindex offset
└──────────────────────────────┘

Compaction: The Real Cost Center

This is where I'd focus first if you're debugging a slow LSM-backed database. Compaction is a background process that picks overlapping SSTables, merges them (resolving duplicates and dropping tombstones past their retention window), and writes new, non-overlapping SSTables in their place. The old files get deleted. The two dominant strategies are leveled compaction (RocksDB default — each level has a size limit, SSTables within a level don't overlap) and size-tiered compaction (Cassandra default — batch similar-sized SSTables together). Leveled gives you better read performance and lower space amplification. Size-tiered gives you better write throughput at the cost of potentially reading more files per query. The hidden cost of compaction is write amplification: in leveled compaction, data can be rewritten 10–30x before it settles at the bottom level. On an NVMe drive rated for 400TB write endurance, this matters if you're doing heavy ingest workloads.
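Conceptually, one compaction step is just a merge that keeps the newest version of each key and drops tombstones on the way out. A toy sketch in Go (maps instead of sorted files, so this is the idea rather than RocksDB's actual merge iterator):

package main

import (
    "fmt"
    "sort"
)

const tombstone = "\x00DEL" // sentinel marking a deleted key in this toy

// compact merges SSTable contents (maps here for brevity), newest first:
// the first table that mentions a key wins, and tombstones are dropped from
// the output, which is exactly the moment deleted data stops using disk.
func compact(newestFirst ...map[string]string) map[string]string {
    out := map[string]string{}
    seen := map[string]bool{}
    for _, tbl := range newestFirst {
        for k, v := range tbl {
            if seen[k] {
                continue // an older version of a key we already resolved
            }
            seen[k] = true
            if v != tombstone {
                out[k] = v
            }
        }
    }
    return out
}

func main() {
    l0New := map[string]string{"user:2": tombstone, "user:3": "v3"}
    l0Old := map[string]string{"user:1": "v1", "user:2": "v2-old"}

    merged := compact(l0New, l0Old)
    keys := make([]string, 0, len(merged))
    for k := range merged {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    for _, k := range keys {
        fmt.Println(k, "=>", merged[k]) // user:1 => v1, user:3 => v3; user:2 is gone
    }
}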

# RocksDB leveled compaction tuning
max_bytes_for_level_base = 268435456   # 256MB at L1
max_bytes_for_level_multiplier = 10    # L2 = 2.5GB, L3 = 25GB, ...
level0_file_num_compaction_trigger = 4 # start L0→L1 compaction at 4 files
level0_slowdown_writes_trigger = 20    # throttle writes at 20 L0 files
level0_stop_writes_trigger = 36        # hard stop at 36 — this is a crisis signal

# If you see level0_stop_writes_trigger firing in production:
# - your compaction threads can't keep up with write rate
# - increase max_background_compactions or reduce write rate
max_background_compactions = 4
max_background_flushes = 2

The thing that catches people off guard: compaction doesn't just run when you ask it to. It's continuous, and it competes with your reads and writes for I/O bandwidth. Under sustained write load, I've watched compaction consume 80% of available disk I/O, causing read latency to spike 5-10x. RocksDB exposes rate_limiter to cap compaction I/O at a bytes-per-second ceiling — use it in production if your workload has SLA-sensitive reads running alongside bulk ingest.
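Wiring that up through the grocksdb Go bindings looks roughly like the sketch below. The 40MB/s ceiling is an arbitrary placeholder you'd size against your device and your read SLA, and it assumes grocksdb exposes RocksDB's rate limiter constructor with its rate_bytes_per_sec, refill_period_us, and fairness parameters.

package main

import "github.com/linxGnu/grocksdb"

func main() {
    opts := grocksdb.NewDefaultOptions()
    opts.SetCreateIfMissing(true)

    // 40 MB/s shared budget for flush + compaction writes,
    // refilled every 100ms, fairness 10 (RocksDB's default).
    limiter := grocksdb.NewRateLimiter(40*1024*1024, 100*1000, 10)
    opts.SetRateLimiter(limiter)

    db, err := grocksdb.OpenDb(opts, "/tmp/rate-limited-db")
    if err != nil {
        panic(err)
    }
    defer db.Close()
}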

Write Amplification vs Read Amplification vs Space Amplification

The thing that caught me off guard when I first dug into LSM internals wasn't the write path — it was realizing that every tuning decision is fundamentally a negotiation between three numbers that push against each other. Improve one, and you're almost always degrading at least one of the others. There's no free lunch here, and database vendors who claim otherwise are hiding the trade-off inside a default config that favors their benchmark workload.

Write Amplification: One User Write Becomes Many Disk Writes

Write amplification (WA) is the ratio of bytes written to disk versus bytes the application actually sent. In an LSM tree, a single 1KB row doesn't just get written once — it gets written to the WAL, then flushed to an L0 SSTable, then compacted into L1, then compacted again into L2. In a leveled compaction setup with RocksDB defaults, that 1KB write can realistically hit disk 10–30 times by the time it reaches the deepest level. I've seen production RocksDB configs where stall_micros was spiking and the actual WA factor was sitting at 40+ measured via rocksdb.compaction-stats.

# Check RocksDB write amplification from the compaction stats:
# pull GetProperty("rocksdb.stats") in-process, or read the periodic stats dump
# RocksDB writes to its LOG file (stats_dump_period_sec, default 600s).
# Look for the "Cumulative compaction" lines, then:
# WA = total_compaction_bytes_written / total_memtable_bytes_flushed

Tiered compaction (what Cassandra calls STCS — Size-Tiered Compaction Strategy) has much lower write amplification because it only merges SSTables of similar sizes. You're not re-writing data as aggressively. The trade-off shows up immediately in the other two metrics.

Read Amplification: The Bloom Filter Doesn't Save You from Everything

Read amplification is how many I/O operations you need to answer a single point read. In an ideal world, a bloom filter eliminates every SSTable that doesn't contain your key. In the real world, you still pay: bloom filter check (in memory, cheap), then index block lookup (one I/O), then data block fetch (one I/O) — per level, per SSTable on that level that the bloom filter didn't rule out. With tiered compaction and 4 tiers of overlapping SSTables, a point read in the worst case touches every single SSTable. I measured this directly in Cassandra with STCS on a cold cache — a read that should have taken 2ms was taking 28ms because it was fanning out across 6 SSTables at the same tier.

Leveled compaction (LCS in Cassandra, default in RocksDB) caps read amplification because each level is sorted and non-overlapping. A point read at most touches one SSTable per level. If you have 6 levels, that's 6 I/Os maximum — but in practice bloom filters knock it down to 1–2. This is why OLTP workloads with heavy random reads almost always use leveled compaction despite its write cost.

-- In Cassandra, switch a table to leveled compaction:
ALTER TABLE keyspace.my_table WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': 160  -- larger = fewer levels = less read amp
};

-- With STCS (tiered), you'd instead set min/max threshold:
ALTER TABLE keyspace.my_table WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'min_threshold': 4,
  'max_threshold': 32
};

Space Amplification: The 2x Spike That Kills Production Disks

Space amplification is the ratio of actual disk used to the logical dataset size. This one bites people in production more than the other two because it's not gradual — it's a cliff. During a major compaction event, you're reading N SSTables and writing 1 merged output. Both exist simultaneously on disk until the compaction finishes and the inputs are deleted. With leveled compaction compacting a full L2 into L3, you can temporarily need 2x the space of that level. I've had an on-call alert fire at 3am because a RocksDB instance hit 95% disk utilization during a compaction cycle on a volume we thought had 40% headroom.

Tiered compaction has even worse space amplification at steady state because multiple generations of data with overlapping keys coexist across tiers. A key updated 5 times can have 5 copies living in 5 different SSTables until compaction eventually merges them. The minimum disk overhead to run STCS safely is roughly 50% free space — not because the docs say so, but because you'll stall compactions and hit disk full otherwise.

# RocksDB: monitor space amplification via sst_files_size vs live_sst_files_size
$ cat /proc/$(pgrep rocksdb_server)/io  # rough I/O pressure check

# Or inside the DB stats output, look for:
# "rocksdb.live-sst-files-size" — bytes in live SSTables
# "rocksdb.total-sst-files-size" — includes obsolete not-yet-deleted files
# Space amp ≈ total-sst-files-size / live-sst-files-size
# Healthy: < 1.5x. Compaction-stalled: you'll see 3x+
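If you're embedding RocksDB rather than shelling in, the same two properties are readable in-process. A sketch with grocksdb (the DB path is a placeholder); GetProperty hands these values back as decimal strings, so the parsing is on you:

package main

import (
    "fmt"
    "strconv"

    "github.com/linxGnu/grocksdb"
)

// spaceAmp reads the two SST-size properties and returns total/live.
func spaceAmp(db *grocksdb.DB) float64 {
    live, _ := strconv.ParseFloat(db.GetProperty("rocksdb.live-sst-files-size"), 64)
    total, _ := strconv.ParseFloat(db.GetProperty("rocksdb.total-sst-files-size"), 64)
    if live == 0 {
        return 1
    }
    return total / live
}

func main() {
    opts := grocksdb.NewDefaultOptions()
    opts.SetCreateIfMissing(true)
    db, err := grocksdb.OpenDb(opts, "/tmp/space-amp-demo")
    if err != nil {
        panic(err)
    }
    defer db.Close()

    fmt.Printf("space amplification ≈ %.2fx\n", spaceAmp(db)) // start worrying somewhere past 2-3x
}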

How Compaction Strategy Choice Moves All Three Needles

The practical mapping looks like this — and I mean practical, not the theoretical best-case each vendor publishes:

  • Leveled Compaction (LCS/RocksDB default): Low read amp (bloom filters + sorted levels = 1–2 I/Os per read), high write amp (data gets rewritten aggressively), moderate space amp (~1.1x at steady state). Right choice for: random point reads, OLTP, anything with heavy read:write ratio.
  • Size-Tiered (STCS / Cassandra default): Low write amp, terrible read amp on cold paths, bad space amp with lots of updates. Right choice for: write-heavy append workloads where you rarely update existing keys — event logs, metrics ingest.
  • TWCS (Time-Window Compaction Strategy): Designed specifically for time-series. It partitions SSTables by time window so compaction only merges within a window. Write amp is very low, read amp is low for recent data (one SSTable per window), space amp is manageable because old windows stop being touched. Falls apart the instant you have out-of-order writes spanning multiple windows — I've seen TWCS leave 200+ tiny SSTables because late data kept opening old windows.
-- TWCS config in Cassandra: the window_unit and window_size matter a lot
ALTER TABLE metrics.sensor_readings WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'HOURS',
  'compaction_window_size': 1,  -- 1 hour per window
  'max_threshold': 32
};
-- NOTE: expired_sstable_check_frequency_seconds controls how often TWCS looks for
-- fully expired SSTables to drop; tune it if TTL'd windows linger on disk longer than expected

The honest summary: if you're tuning an LSM-based database and someone asks "which compaction strategy is best," the answer is always "what's your read:write ratio and what's your acceptable disk overhead?" Leveled is a safe default for mixed workloads. Tiered/STCS is the right call when you're ingesting millions of events per second and can tolerate slower reads. TWCS is a narrow specialist tool that works beautifully for time-series and punishes you for any deviation from clean time-ordered writes.

Leveled Compaction vs Tiered Compaction — When Each Actually Makes Sense

The compaction strategy you pick at table creation will haunt you for years

I learned this the hard way. We started a Cassandra table with Size-Tiered Compaction Strategy (STCS) because it was the default and the docs said "good for write-heavy workloads." Our workload was write-heavy — IoT sensor data coming in at ~40K writes/second. Six months later, read latencies on dashboards were spiking to 800ms+ during business hours. The fix required a live compaction strategy migration on a 2TB table with zero downtime. That experience is burned into me now.

Leveled compaction (the default in RocksDB and LevelDB) works by maintaining sorted runs at each level, where L0 flushes to L1, L1 to L2, and so on. Each level is roughly 10x larger than the previous. The upside: a read only has to look at one SSTable per level in the worst case, so read amplification stays low — usually single digits. The downside is brutal write amplification. RocksDB documentation honestly pegs it at 10-30x in the worst case, meaning for every 1 byte written by the application, the storage engine rewrites 10-30 bytes. On write-heavy workloads with SSDs, this burns through drive endurance faster than you'd expect. Leveled is the right call for mixed read/write workloads, OLTP-style access patterns, or anything where you're doing point lookups alongside writes.

# RocksDB leveled compaction config (options file format)
[CFOptions "default"]
  compaction_style=kCompactionStyleLevel
  max_bytes_for_level_base=268435456   # 256MB — L1 target size
  max_bytes_for_level_multiplier=10    # each level is 10x the previous
  level0_file_num_compaction_trigger=4 # compact L0 when 4 SSTables exist
  level0_slowdown_writes_trigger=20
  level0_stop_writes_trigger=36

Tiered compaction (STCS in Cassandra) runs on a completely different philosophy: accumulate SSTables in size-based tiers, then compact them together when you have enough in the same tier. Writes are cheap because you're almost never rewriting data — you're just appending new SSTables. The catch is reads. In the worst case, a read has to check every single SSTable for that table on the node because the same key can live in multiple tiers. Bloom filters help a lot, but they're probabilistic — you still get false positives that require actual disk reads. Space amplification is also real; you can temporarily end up with 2x the data on disk during a major compaction. Use STCS when your access pattern is truly append-only and you almost never read recent data back. Log aggregation pipelines, event sourcing with async consumers — those are the cases where STCS shines.

Time-Window Compaction Strategy (TWCS) in Cassandra is the specific fix for time-series data, and it's genuinely elegant once you understand the invariant it exploits: old time windows never receive new writes. TWCS uses STCS within each time window, then after a window closes, that window's SSTables get compacted into a single SSTable and are never touched again. This means your compaction overhead is almost entirely bounded to the current active window — old data sits untouched in clean, single SSTables. The hard constraint is that your TTLs must be consistent across the table and your write patterns must actually be time-ordered. If you're doing delayed writes (events arriving hours late), TWCS falls apart fast because late writes land in already-closed windows, triggering compactions you didn't expect.

-- Switch a live Cassandra table from STCS to TWCS
-- (run this, then trigger a major compaction to clean up)
ALTER TABLE sensor_readings
  WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': 1,
    'max_threshold': 32,
    'min_threshold': 4
  };

-- After altering, force compaction so existing SSTables reorganize
-- This is the expensive part — budget 4-6 hours on a 2TB table
nodetool compact keyspace_name sensor_readings

The migration I mentioned earlier: we went from STCS to TWCS on that IoT table. The ALTER TABLE itself is instant — Cassandra just writes the new strategy to schema metadata. The pain is the nodetool compact that follows, which is single-threaded per node by default and blocks on CPU. We ran it node-by-node with a 30-minute gap between nodes so the cluster never had more than one node under compaction pressure simultaneously. Read latencies dropped from ~800ms to under 40ms within 48 hours of the migration completing. The lesson: if your data has a time dimension and old records aren't updated, default to TWCS from day one. The cost of migrating compaction strategy in production is never worth the "I'll fix it later" shortcut.

LSM Trees in the Wild: RocksDB, LevelDB, Cassandra, and ClickHouse

The thing that surprised me most when I started digging into storage engine internals was how much of the database ecosystem is just LSM trees with different UX layers on top. LevelDB, released by Google in 2011, is the common ancestor. It's deliberately minimal — you get a sorted key-value store, compaction happens in levels (Level 0 through 6 by default), and your configuration surface is basically write_buffer_size, max_open_files, and a handful of others. That's intentional. The Chrome browser embeds LevelDB for IndexedDB. SQLite-backed apps sometimes swap it in. It's not the engine you pick for a new production database in 2024 — but it's absolutely the thing running quietly under tools you're already shipping.

RocksDB is what happens when you take LevelDB's core and expose every knob. Facebook built it to handle write-heavy workloads on flash storage, and the config surface is genuinely overwhelming at first. I've seen production configs with 40+ options set. The ones that actually move the needle:

// rocksdb options that matter for write-heavy workloads (C++)
options.write_buffer_size = 128 * 1024 * 1024;   // 128MB memtable before flush
options.max_write_buffer_number = 4;             // concurrent memtables
options.level0_slowdown_writes_trigger = 20;     // back-pressure before L0 floods
options.compression_per_level = {
    kNoCompression,   // L0/L1: random reads, compression not worth it
    kNoCompression,
    kLZ4Compression,  // L2+: sequential, LZ4 hits ~500MB/s compression speed
    kLZ4Compression,
    kZSTD             // deepest level: best ratio for cold data
};
options.use_direct_reads = true;  // bypass the OS page cache; rely on RocksDB's own block cache

MyRocks (RocksDB under MySQL) runs inside Meta's production MySQL fleet because RocksDB's space amplification is dramatically lower than InnoDB's B-tree — they documented roughly 2x compression on their user database. TiKV (TiDB's storage layer) and CockroachDB both use RocksDB as the underlying key-value store and implement MVCC on top of it. The gotcha with RocksDB in distributed systems is that compaction generates significant I/O spikes — if you're running TiKV and your disks suddenly hit 100% utilization every few hours with no obvious query spike, compaction throttling is where I'd look first.

Cassandra exposes compaction strategy as a first-class operator concern, which I respect. You pick between STCS (Size-Tiered Compaction Strategy) and LCS (Leveled Compaction Strategy) per table. STCS groups similarly-sized SSTables together — writes are cheap, reads get expensive as you accumulate tiers, and space amplification can hit 2-3x during a compaction cycle. LCS maintains a leveled structure like RocksDB — read performance is more predictable, but you're doing more I/O continuously. The table-level config looks like this:

CREATE TABLE events (
    user_id uuid,
    event_time timestamp,
    payload text,
    PRIMARY KEY (user_id, event_time)
) WITH compaction = {
    'class': 'LeveledCompactionStrategy',
    'sstable_size_in_mb': 160,   -- default is 160, increase for large partitions
    'fanout_size': 10            -- ratio between level sizes
};

Use LCS on tables with heavy reads and predictable key distribution. Use STCS on write-heavy time-series tables where you're mostly appending. The mistake I see most often is people running LCS on a table that gets 50K writes/sec — the background compaction I/O competes directly with the write path and kills throughput.

ClickHouse's MergeTree is the outlier here. It's LSM-inspired in that writes go to parts that get merged in the background, but the internals differ because the storage is columnar — each column is a separate file on disk. Compaction (ClickHouse calls it merging) combines small parts into larger ones, and the performance profile is different because a merge of 10 parts across 200 columns means 2000 file operations. The practical impact: ClickHouse really wants large batch inserts (aim for 100K-1M rows per INSERT, not one row at a time) because each insert creates a new part, and too many small parts triggers the "Too many parts" exception before the background merger catches up.

Hands-On: Poking at RocksDB Internals

The thing that caught me off guard when I first cracked open a RocksDB database directory was how readable the on-disk layout actually is once you know what to look for. You don't need a custom tool — the ldb CLI ships with RocksDB and lets you poke at SSTables, manifest files, and the level structure without writing a single line of code.

Installing ldb and Getting Your Bearings

On Ubuntu/Debian with RocksDB 8.x installed, ldb is usually bundled. If you built from source, it's in tools/. On macOS with Homebrew:

# installs rocksdb + ldb + sst_dump
brew install rocksdb

# verify it's there (some packagings install the tool as rocksdb_ldb instead of ldb)
ldb --version
# RocksDB version: 8.10.0

If you're on a system where only the shared library got installed (common with apt), you'll need to build from source. Clone the repo and run make ldb -j$(nproc) — takes about 8 minutes on a modest machine. Don't bother with make all unless you want to wait 40 minutes for tests you'll never run right now.

Dumping Keys and Inspecting the Manifest

Once you have a RocksDB directory — even one created by another process like Kafka, TiKV, or your Go service — you can read it directly:

# dump all key-value pairs (careful on large DBs, pipe to head)
ldb --db=/var/lib/myapp/rocksdb scan --max_keys=50

# hex-encode keys if they're binary (very common)
ldb --db=/var/lib/myapp/rocksdb scan --key_hex --value_hex --max_keys=50

# expected output shape:
# 0x000000000001 : 0x7b22757365724964223a313233...
# 0x000000000002 : 0x7b22757365724964223a313234...

The manifest dump is where LSM trees get interesting. This shows you the actual level layout — which SSTable files live at L0, L1, L2, and so on, along with their key ranges:

ldb --db=/var/lib/myapp/rocksdb manifest_dump

# --- Level 0 --- (these overlap, that's expected)
# File[0]: /000023.sst  size:2048KB  smallest:'user:100' largest:'user:999'
# File[1]: /000031.sst  size:1901KB  smallest:'user:001' largest:'user:888'
#
# --- Level 1 ---
# File[0]: /000019.sst  size:8192KB  smallest:'user:001' largest:'user:399'
# File[1]: /000020.sst  size:8010KB  smallest:'user:400' largest:'user:999'

Notice L0 files have overlapping key ranges — that's not a bug, that's how LSM trees work. L0 is written sequentially as memtables flush, so overlaps are intentional. L1 and below should have non-overlapping ranges. If you see overlaps in L1+, something is wrong with your compaction config.

Watching Compaction Happen in Real Time

The stats property is the fastest way to see what the engine is doing without a profiler. In a running process using the C++ or Go bindings, pull it periodically:

// in Go, using grocksdb — GetProperty returns the whole stats table as a string
stats := db.GetProperty("rocksdb.stats")
fmt.Println(stats)

// for a DB that isn't being served by a process, the same table is dumped
// periodically into the LOG file in the DB directory (stats_dump_period_sec, default 600s)

The output is a dense table that looks like this:

** Compaction Stats [default] **
Level  Files  Size     Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  ...
  L0      4   16 MB    0.80      0.0      0.0       0.0       0.2       0.2
  L1      2   64 MB    0.50      0.1      0.0       0.1       0.1       0.0
  L2      8  512 MB    0.10      0.0      0.0       0.0       0.0       0.0

The Score column tells you how close each level is to triggering a compaction. Score > 1.0 means compaction is either in progress or queued. L0 compaction triggers at 4 files by default (level0_file_num_compaction_trigger) — you'll watch the count climb from 0 to 4 and then drop as compaction merges them down.

Go Snippet: Write 100k Keys, Watch the Layout Shift

This is the most concrete way to actually see LSM behavior. Install grocksdb (requires RocksDB 8.x headers and CGO):

go get github.com/linxGnu/grocksdb
package main

import (
    "fmt"
    "os"
    "strings"
    "time"

    "github.com/linxGnu/grocksdb"
)

func main() {
    opts := grocksdb.NewDefaultOptions()
    opts.SetCreateIfMissing(true)
    // keep memtable small so flushes happen often — makes L0 growth visible
    opts.SetWriteBufferSize(4 * 1024 * 1024) // 4MB instead of default 64MB
    opts.SetMaxWriteBufferNumber(2)
    // don't let background compaction run yet so we can see raw L0 files
    opts.SetDisableAutoCompactions(true)

    dbPath := "/tmp/lsm-demo"
    os.RemoveAll(dbPath)

    db, err := grocksdb.OpenDb(opts, dbPath)
    if err != nil {
        panic(err)
    }
    defer db.Close()

    wo := grocksdb.NewDefaultWriteOptions()
    defer wo.Destroy()

    // write 100k sequential keys with ~1KB values: sequential keys are a best-case
    // for LSM, and the ~100MB of padded values forces a stack of 4MB memtable flushes
    padding := strings.Repeat("x", 1024)
    for i := 0; i < 100_000; i++ {
        key := fmt.Sprintf("user:%08d", i)
        val := fmt.Sprintf(`{"id":%d,"ts":%d,"pad":"%s"}`, i, time.Now().UnixNano(), padding)
        if err := db.Put(wo, []byte(key), []byte(val)); err != nil {
            panic(err)
        }
    }

    // check layout BEFORE compaction — expect multiple L0 files
    // (grocksdb's GetProperty returns each property value as a plain string)
    fmt.Println("=== BEFORE COMPACTION ===")
    fmt.Println(db.GetProperty("rocksdb.stats"))
    fmt.Println("L0 file count:", db.GetProperty("rocksdb.num-files-at-level0"))

    // now trigger manual compaction — collapses everything into L1+
    db.CompactRange(grocksdb.Range{})

    // check layout AFTER — L0 should be empty or near-empty
    fmt.Println("=== AFTER COMPACTION ===")
    fmt.Println(db.GetProperty("rocksdb.stats"))
    fmt.Println("L0 file count after:", db.GetProperty("rocksdb.num-files-at-level0"))
}

Run this, then in a second terminal while the compaction is running, do watch -n1 'ldb --db=/tmp/lsm-demo manifest_dump 2>&1 | head -30'. You'll literally watch SSTable files disappear from L0 and new, larger ones appear in L1. The write amplification is visible in the manifest: you wrote roughly 100MB of key-value data but the compaction read-and-rewrote it to produce sorted, non-overlapping L1 files. That's the core trade-off of LSM — write speed now, read efficiency later.

One gotcha: grocksdb requires you to have the RocksDB shared library version match exactly what the bindings expect. If you get a segfault on OpenDb, check that pkg-config --modversion rocksdb returns 8.x, not 6.x or 7.x from an older apt package. The mismatch doesn't always fail at link time, which makes it annoying to diagnose.

Bloom Filters: The Thing That Saves LSM Reads

The thing that caught me off guard when I first read about LSM trees was how catastrophically bad point lookups could be without mitigation. You have a key, you want its value, and the engine has to check the memtable, then every single SSTable from L0 down through L6 (or however deep your tree goes) before it can confirm the key doesn't exist. That "not found" case — the negative lookup — is the killer. Every level means a disk seek, and disk seeks are where milliseconds go to die.

Bloom filters solve this with a beautiful trade-off: they can tell you with certainty that a key is not in a given SSTable, but they can only tell you probabilistically that it might be there. Under the hood, a bloom filter is a bit array with k hash functions. When you insert a key, you set k bits. When you query, if any of those k bits is 0, the key is definitely absent — skip this SSTable entirely. If all k bits are 1, the key is probably present (but might be a false positive). That asymmetry is what makes them so useful. The false positive costs you one unnecessary SSTable read. The false negative doesn't exist.
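If the mechanics feel abstract, a toy bloom filter is about forty lines of Go. This is the concept only (RocksDB's real filters use a cache-friendly block layout and different hash mixing), but the definitely-absent / probably-present asymmetry is exactly the same:

package main

import (
    "fmt"
    "hash/fnv"
)

// Toy bloom filter: m bits, k positions per key derived from one 64-bit hash.
type bloom struct {
    bits []bool
    k    uint32
}

func newBloom(mBits int, k uint32) *bloom {
    return &bloom{bits: make([]bool, mBits), k: k}
}

// positions uses the standard double-hashing trick: pos_i = h1 + i*h2 (mod m).
func (b *bloom) positions(key string) []uint32 {
    h := fnv.New64a()
    h.Write([]byte(key))
    sum := h.Sum64()
    h1, h2 := uint32(sum), uint32(sum>>32)
    h2 |= 1 // avoid a degenerate second hash
    out := make([]uint32, b.k)
    for i := uint32(0); i < b.k; i++ {
        out[i] = (h1 + i*h2) % uint32(len(b.bits))
    }
    return out
}

func (b *bloom) add(key string) {
    for _, p := range b.positions(key) {
        b.bits[p] = true
    }
}

// mayContain: false means "definitely absent", so the SSTable can be skipped.
// true means "probably present", so you still have to go read the file.
func (b *bloom) mayContain(key string) bool {
    for _, p := range b.positions(key) {
        if !b.bits[p] {
            return false
        }
    }
    return true
}

func main() {
    f := newBloom(10_000, 7) // ~10 bits/key for 1000 keys, k=7
    for i := 0; i < 1000; i++ {
        f.add(fmt.Sprintf("user:%d", i))
    }
    fmt.Println(f.mayContain("user:42"))    // true, actually present
    fmt.Println(f.mayContain("user:99999")) // almost always false, SSTable skipped
}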

In RocksDB, the canonical setup looks like this:

#include "rocksdb/filter_policy.h"
#include "rocksdb/table.h"

rocksdb::BlockBasedTableOptions table_options;

// 10 = bits per key. Higher = lower false positive rate = more memory.
// At 10 bits/key, false positive rate is roughly 1%. 
// At 6 bits/key, it climbs to ~5%. At 15, you're under 0.3%.
table_options.filter_policy.reset(
    rocksdb::NewBloomFilterPolicy(10)
);

rocksdb::Options options;
options.table_factory.reset(
    rocksdb::NewBlockBasedTableFactory(table_options)
);

That 10 is bits-per-key. The math behind it comes from the optimal bloom filter formula: for a target false positive rate p, you need roughly -1.44 * log2(p) bits per key. For 1% FP rate that's about 9.6 bits — hence 10 being the common default. If you're running a workload heavy on negative lookups (cache-miss style key checks, or a sparse key space), bump this to 12 or 15. If memory is tight and you can tolerate a few extra disk reads, drop to 7 or 8. I've seen teams set this to 20 thinking "more is better" without realizing they just burned 2x the memory for filter data that's not giving them proportional returns past ~14-15 bits.
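If you want to sanity-check a setting before touching production, the sizing math is easy to plug in. A quick sketch of the standard formulas for bits per key and the optimal number of hash functions:

package main

import (
    "fmt"
    "math"
)

func main() {
    // bits per key for a target false-positive rate p: -log2(p)/ln(2) ≈ -1.44 * log2(p)
    // optimal number of hash functions: k ≈ bitsPerKey * ln(2)
    for _, p := range []float64{0.05, 0.01, 0.001} {
        bitsPerKey := -math.Log2(p) / math.Ln2
        k := bitsPerKey * math.Ln2
        fmt.Printf("p=%.3f -> %.1f bits/key, k≈%.0f hashes\n", p, bitsPerKey, k)
    }
    // p=0.050 -> 6.2 bits/key, k≈4 hashes
    // p=0.010 -> 9.6 bits/key, k≈7 hashes
    // p=0.001 -> 14.4 bits/key, k≈10 hashes
}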

Cassandra exposes this differently — it's per-table configuration and you set a target false positive rate rather than bits-per-key directly:

CREATE TABLE events (
    id uuid PRIMARY KEY,
    payload text
) WITH bloom_filter_fp_chance = 0.01  -- 1% false positive rate
   AND compaction = {'class': 'LeveledCompactionStrategy'};

-- Or alter an existing table:
ALTER TABLE events WITH bloom_filter_fp_chance = 0.001;

The default is 0.01 (1%). Setting this to 0.001 drops your false positive rate to 0.1% but roughly increases bloom filter memory by 50% per SSTable (SSTable count matters here — modern Cassandra keeps these filters off-heap, but one filter per SSTable still adds up). I've watched people set this to 0.0001 on a table with 200 SSTables and then wonder why the node's memory footprint was blowing up. The other direction is equally dangerous: I've seen teams set bloom_filter_fp_chance = 1.0 to "disable" bloom filters to save memory on a table they thought was read-rarely. What they actually did was make every read scan all SSTables unconditionally. Cassandra will let you do this. It will not warn you about what you've done to your p99 latencies.

One practical thing worth knowing: bloom filters in both RocksDB and Cassandra are stored alongside the SSTable data, so they get evicted from the block cache or off-heap memory under pressure just like anything else. If your filter data isn't resident in memory, the lookup falls back to the slow path anyway. In RocksDB you can pin filter blocks in cache with cache_index_and_filter_blocks = true combined with pin_l0_filter_and_index_blocks_in_cache = true — this keeps L0 filters hot since L0 gets hit on every lookup regardless. That's often the highest-ROI tuning move after you've set a sensible bits-per-key value.
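Through the Go bindings, that combination looks roughly like the sketch below. It assumes grocksdb's setters mirror the C++ option names (they generally do), and the cache size and DB path are placeholders.

package main

import "github.com/linxGnu/grocksdb"

func main() {
    bbto := grocksdb.NewDefaultBlockBasedTableOptions()
    bbto.SetFilterPolicy(grocksdb.NewBloomFilter(10)) // ~1% false positives
    bbto.SetBlockCache(grocksdb.NewLRUCache(512 * 1024 * 1024))

    // keep index + filter blocks inside the block cache instead of unbounded memory,
    // and never evict the L0 ones, since L0 is consulted on every lookup
    bbto.SetCacheIndexAndFilterBlocks(true)
    bbto.SetPinL0FilterAndIndexBlocksInCache(true)

    opts := grocksdb.NewDefaultOptions()
    opts.SetCreateIfMissing(true)
    opts.SetBlockBasedTableFactory(bbto)

    db, err := grocksdb.OpenDb(opts, "/tmp/bloom-pinned-db")
    if err != nil {
        panic(err)
    }
    defer db.Close()
}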

Compaction Tuning: The Stuff the Docs Gloss Over

The thing that catches most people off guard with LSM-based databases isn't write performance — it's what happens three days after you deploy, when L0 SSTable count starts climbing and you can't figure out why reads suddenly feel like wading through mud. Compaction debt is quiet until it isn't. By the time writes start stalling, you're already in a bad spot and you're fixing it under pressure.

The root cause in RocksDB almost always comes back to two misunderstood knobs: max_bytes_for_level_base and target_file_size_base. The relationship between them is what matters. target_file_size_base controls individual SSTable size at L1. max_bytes_for_level_base controls the total data budget for L1. If you set target_file_size_base to 64MB but max_bytes_for_level_base to 256MB, you're saying L1 can hold roughly 4 files. That's fine until your write rate pushes more data down faster than the compaction thread can merge it. The debt accumulates at L0 specifically because L0 has no size budget — it's just a count limit. Once level0_file_num_compaction_trigger is hit (default: 4 files), compaction kicks off. If writes are faster than compaction throughput, you start queuing up. Then you hit level0_slowdown_writes_trigger (default: 20), then level0_stop_writes_trigger (default: 36). Those last two are your circuit breakers — they're not bugs, they're the database telling you that your tuning assumptions were wrong.

For a write-heavy time-series workload where keys are monotonically increasing (think sensor data, event logs), I've had good results with a config that aggressively sizes L1 relative to write throughput and gives compaction more threads. Here's the actual config block I use with RocksDB embedded in a custom Go service:

// RocksDB options for write-heavy time-series (Go via grocksdb)
opts := grocksdb.NewDefaultOptions()

// L1 total budget: 512MB. Tune this to ~10x your expected memtable flush rate per second.
opts.SetMaxBytesForLevelBase(512 * 1024 * 1024)

// Individual file size at L1. 128MB works well for sequential key patterns.
// Larger files = fewer compaction jobs but slower individual compactions.
opts.SetTargetFileSizeBase(128 * 1024 * 1024)

// Multiplier for each subsequent level. Default 10 is usually fine.
opts.SetMaxBytesForLevelMultiplier(10)

// Give compaction more threads — default is 1, which is embarrassingly low.
opts.SetMaxBackgroundCompactions(4)
opts.SetMaxBackgroundFlushes(2)

// Push the circuit breakers out — we have the headroom now.
opts.SetLevel0SlowdownWritesTrigger(30)
opts.SetLevel0StopWritesTrigger(50)

// How many L0 files before compaction starts. 
// Keep this low so debt doesn't build silently.
opts.SetLevel0FileNumCompactionTrigger(4)

The level0_slowdown_writes_trigger and level0_stop_writes_trigger are genuinely your circuit breakers — don't just raise them to make the symptoms disappear. I made that mistake once. I bumped both values up because write stalls were annoying and I wanted them to stop. Two days later L0 had 80 SSTables and read latency for recent keys went from ~1ms to ~40ms because every read was fanning out across dozens of uncompacted files. The right move is to fix the compaction throughput, then optionally adjust the triggers. More background compaction threads, a larger max_bytes_for_level_base, or rate-limiting ingest are the actual fixes.

On the Cassandra side, the equivalent visibility tool is nodetool compactionstats. Run it and look at the "pending tasks" number:

$ nodetool compactionstats
pending tasks: 47
          id                                   compaction type   keyspace   table    completed   total       unit    progress
a9f3...   COMPACTION                            sensor_db        readings   1.2 GB   8.7 GB      bytes   13.79%

A pending task count under 5 is healthy. Somewhere between 10 and 30, you're falling behind but not critically. Above 50 consistently? Your compaction strategy is wrong for your workload, or you're under-provisioned on I/O. The thing "pending tasks" actually represents is the number of SSTable merge operations queued up across all tables on that node — it's not per-table. So 47 pending tasks on a node with 3 heavily written tables is less scary than 47 all concentrated on one table. Run nodetool compactionstats -H to get human-readable sizes and see the per-table breakdown. If a single table is consistently generating most of the pending work, switch its compaction strategy: STCS (SizeTieredCompactionStrategy) is cheap to write to but terrible for read-heavy access patterns; TWCS (TimeWindowCompactionStrategy) is the right call for time-series data where you're almost never updating old partitions.

When LSM Trees Are the Wrong Choice

The thing that nobody tells you upfront: LSM trees are an optimization for write-heavy workloads, and if your write rate is modest, you're importing a pile of operational complexity for zero gain. I've watched teams migrate to RocksDB-backed stores because the engineering culture around them felt modern, only to spend months debugging compaction stalls that never would have happened on Postgres. If your app is doing mostly reads with occasional writes — think a user profile service, a product catalog, a CMS — a B-tree database like PostgreSQL or SQLite will be faster on reads, simpler to tune, and will never wake you up at 3am because compaction fell behind.

Point lookups are where LSMs genuinely hurt. A B-tree can answer a lookup in O(log n) with a single traversal of one well-understood data structure. An LSM has to check the memtable, then bloom filters across every level's SSTables, and if a bloom filter gives a false positive (which they do), you're hitting disk on a file that doesn't have your key. RocksDB's bloom filters reduce this significantly, but you're still dealing with read amplification that scales with the number of levels. The deeper the tree, the worse point reads get. With a B-tree, depth is bounded and predictable.

Range scans are the other silent killer. If your queries look like SELECT * FROM events WHERE user_id = 42 AND ts BETWEEN '2024-01-01' AND '2024-06-01' across large time windows, an LSM will scan partial data across multiple SSTables at multiple levels and merge-sort the results on the fly. A clustered B-tree index in Postgres handles this with a single contiguous page scan. I've seen 10x latency differences in range-heavy analytics workloads between RocksDB and Postgres, not because RocksDB was configured wrong, but because the access pattern just doesn't match what LSMs optimize for.

Small datasets are where the mismatch is almost comical. If your entire working set fits in memory — say, under a few hundred MB — both the B-tree and the LSM are serving data from cache anyway. But the LSM still runs background compaction threads, still manages multiple levels of SSTables on disk, still requires you to think about write_buffer_size, max_bytes_for_level_base, and target_file_size_base. You get none of the write throughput benefit and all of the operational overhead. SQLite with WAL mode enabled will outperform a misconfigured RocksDB instance on a small dataset and requires zero compaction tuning.

# RocksDB compaction options you'll eventually need to tune
# (and that have no equivalent in Postgres)
options.compaction_style = kCompactionStyleLevel
options.write_buffer_size = 64 * 1024 * 1024  # 64MB memtable
options.max_bytes_for_level_base = 512 * 1024 * 1024  # 512MB L1
options.target_file_size_base = 64 * 1024 * 1024
options.max_background_compactions = 4
options.compression_per_level = [
    kNoCompression,   # L0 — fast access
    kNoCompression,   # L1
    kSnappyCompression,  # L2+
    kSnappyCompression,
]
# Miss any of these on a write-heavy workload and you get compaction debt

The honest operational calculus: if your team has never tuned compaction before, the first time you hit a compaction backlog under load, you will not know what you're looking at. Compaction debt in RocksDB manifests as write stalls — the store literally throttles or stops accepting writes while it catches up. Postgres under equivalent write load just... slows down gradually. It doesn't stall. B-tree page splits and vacuum are also not free, but they're far more predictable and the operational playbook is 25 years deep. Unless you're ingesting time-series data at high velocity, running a write-heavy event store, or building something where write throughput is the primary constraint, the B-tree databases will cause fewer incidents and require less specialized knowledge to run safely.

Gotchas I Hit in Production

The deletion one stings if you're not expecting it. When you delete a key in an LSM-based database, nothing gets removed. A tombstone record gets written to the current MemTable, and that tombstone just... sits there. Until compaction picks it up and merges it with the original data, the old SSTable files still exist on disk. I had a Cassandra cluster where we were deleting roughly 40% of rows daily for a time-series cleanup job. Disk usage kept climbing for three days straight before the compaction cycle finally caught up. If you're running a workload with heavy deletes, you need to tune compaction aggressiveness, not just wait for it to happen naturally. In RocksDB, DeleteRange() tombstones are especially brutal because they can span huge key ranges and stall compaction if you're not careful.

The read-your-own-writes issue is subtle and only shows up under specific replica routing conditions. The MemTable for a given write is in memory on the primary — a replica that hasn't received the flush yet will return stale data if you immediately read from it. In Cassandra with consistency level ONE reading from a replica, I've seen this cause bugs that look like race conditions but are actually just replication lag compounded by the flush cycle. The fix is either using LOCAL_QUORUM for reads that must see their own writes, or routing reads back to the coordinator that handled the write. Neither is free — quorum reads cost latency. You pick your poison.

RocksDB's column family behavior caught me completely off guard. I assumed column families were more isolated than they are. They share a single WAL by default, but each has its own MemTable with its own flush threshold. Here's the trap: if one column family's MemTable triggers a flush and another CF has a very low write rate, RocksDB still has to keep the WAL around until all column families that reference it have flushed. A low-traffic CF can pin old WAL files on disk indefinitely. You'll see this in the RocksDB LOG file as entries like Skipping flush of CF X because it's too small while your disk fills up. The fix is setting max_total_wal_size explicitly, which forces flushes across all CFs when WAL size gets too big:

Options options;
options.max_total_wal_size = 512 * 1024 * 1024; // 512MB cap
// This forces a flush of the "lagging" CF before WAL grows unbounded

Cassandra's TTL with TimeWindowCompactionStrategy is a landmine if you get the window sizing wrong. TWCS works by grouping SSTables written in the same time window together and compacting within that window. The deal is: once a window is "closed" (no more writes going into it), TWCS compacts it once and never touches it again — that's how you get the low write amplification. But if your TTL expiry happens across window boundaries, you end up with closed SSTables that still contain live data mixed with expired data. TWCS won't recompact those to drop the tombstones. The rule I use: set compaction_window_size to roughly 1/10th of your TTL. So if your TTL is 30 days, use 3-day windows. Misalign this and you'll be running manual nodetool compact to force cleanup, which defeats the whole purpose.

Backup correctness is the one that's bitten teams I've joined who thought they were safe. A naive cp -r /var/lib/rocksdb/data /backups/ mid-write will give you a snapshot of files in inconsistent states — SSTables partially flushed, MemTable contents not on disk, WAL mid-write. RocksDB has a proper checkpoint API that creates a hard-link-based consistent snapshot without copying bytes:

// In C++, but you can also call it via the rocksdb CLI tools
Checkpoint* checkpoint;
Checkpoint::Create(db, &checkpoint);
checkpoint->CreateCheckpoint("/backups/db_snapshot_20240315");
// Hard links mean this is nearly instant and space-efficient

Cassandra's equivalent is nodetool snapshot keyspace_name, which hard-links the current SSTable set into a snapshot directory under /var/lib/cassandra/data/keyspace/table/snapshots/. Both approaches are essentially free in terms of I/O because they rely on hard links — the actual SSTable files don't move. Where people get burned is forgetting to call nodetool clearsnapshot afterwards. Those hard-linked files prevent the original SSTables from being deleted by compaction, so your disk usage grows indefinitely if you're snapshotting regularly without cleanup.




