DEV Community

Cover image for Why Nothing Really Gets Deleted Immediately: How Database Deletion Works Through Tombstones, LSM-Trees, and Compaction
Faizan Firdousi
Faizan Firdousi

Posted on

Why Nothing Really Gets Deleted Immediately: How Database Deletion Works Through Tombstones, LSM-Trees, and Compaction

Do you know that when you delete something, it doesn't actually get deleted and erased immediately? Like, you press the delete button and expect the database to be like "ok boss, I'll remove the existence of this thing" ..... but no, it doesn't happen like that.

When we talk about modern high-performance databases, NoSQL and NewSQL systems based on LSM-trees use append-only data structures. That means you can't just pick a random value and update or delete it.

Why Append-Only?

Because databases are highly optimized and they found out that they need to be performant for sequential writes, which are orders of magnitude faster than random writes on disk.

We all know that traditional hard disks are mechanical and have seek times. If you were to specify the particular location where to read/write the data, it becomes expensive at scale while SSDs do not suffer from mechanical seek time, random writes are still more expensive than sequential writes, just for a different reason which is Write amplification.
That's why databases are optimized for sequential writes and not random writes.

Append-only structures also provide immutability benefits:

  • No in-place updates means simpler concurrency control (no locks needed for reads)
  • Easier crash recovery
  • Natural support for versioning

How Data is Stored

Let's understand the working of the data structures in LSM-based databases first. When you write data, it goes to two places at the same time:

Memtable - A sorted data structure (usually a skip list or red-black tree) that holds recent writes in memory for fast access.

Write Ahead Logging (WAL) - An append-only disk file that records every operation sequentially before applying it.

WAL is very important in making the database reliable because if the system crashes before the memtable is flushed to disk, you can replay the WAL to recover all committed operations.

When the memtable gets bigger than some threshold (typically a few megabytes), it then writes it out to disk as an SSTable (Sorted String Tables - immutable sorted key-value files stored on disk) file. This is efficient because the data is already sorted by keys in the memtable, making it easier to maintain that order in SSTables as well.

How Delete Operations Actually Work

So, we use tombstones. These are one of the most elegant yet complex solutions in append-only databases.

A tombstone is a special write operation that marks data as deleted without removing it. When you execute a DELETE statement, the database treats it as an insert, not a removal. The tombstone contains:

  • The key being deleted
  • A timestamp
  • A deletion marker flag

When you delete a key, the tombstone follows the exact same path as a regular insert: written to Memtable → appended to WAL → flushed to SSTable.

So the actual deletion didn't happen yet.

Compaction: Where Deletion Really Happens

Compaction is the exact step where deletion happens. If you don't already know, compaction happens regularly after some time, where the segments are compacted to remove the old data and keep the most recent ones.

Tombstones are physically removed during compaction, but only when specific conditions are met:

  1. Grace period must expire :- The tombstone must be older than the configured grace period (gc_grace_seconds in Cassandra or similar TTL-based settings in other systems)

  2. All relevant SSTables must participate :- If SSTable-A has the tombstone but SSTable-B has older data for the same key, both must be compacted together

After this is done, the disk space is reclaimed again.

There are more real-world challenges and stuff like multi-level compaction strategies, but that's a topic for another deep dive.

If you liked the blog do follow me and read other of my blogs

Top comments (0)