Your database is slow because it's doing the equivalent of flipping through a physical filing cabinet. Open drawer 47, pull out folder 23, find page 8, update line 3, put it all back. Now do that a thousand times per second.
What if you just grabbed a notebook and wrote everything at the end instead?
That's Bitcask. An append-only log file paired with an in-memory hash table. It sounds absurdly simple, but it powers Riak - a distributed database handling serious production traffic. The secret? Stop fighting how disks actually work.
The Hash Table You Can Persist
Here's the world's simplest database:
```bash
db_set() {
    echo "$1,$2" >> database
}

db_get() {
    grep "^$1," database | tail -n 1
}
```
This actually works. Write a key-value pair by appending to a file. Read by grepping for the key and taking the last match. Updates? Just append the new value. The latest entry wins.
The problem is obvious: reading requires scanning the entire file. That's O(n), which is unacceptable for anything real.
Bitcask fixes this with one addition: keep a hash table in memory mapping each key to its exact disk location. Now reads become O(1) memory lookup + one disk seek. That's it.
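To make that concrete, here's a toy sketch in Python - not Bitcask's real format, just the core trick of pairing an append-only file with an in-memory index (the `TinyLog` class and its comma-delimited layout are my own):

```python
import os

class TinyLog:
    """Toy append-only log with an in-memory index - the core Bitcask idea."""

    def __init__(self, path):
        self.f = open(path, "a+b")   # append mode: writes always go to the end
        self.index = {}              # key -> (offset, length) of the value on disk

    def set(self, key: bytes, value: bytes):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell() + len(key) + 1      # value starts after "key,"
        self.f.write(key + b"," + value + b"\n")   # sequential append, nothing else
        self.f.flush()
        self.index[key] = (offset, len(value))     # remember exactly where it landed

    def get(self, key: bytes) -> bytes:
        offset, length = self.index[key]   # O(1) lookup in memory
        self.f.seek(offset)                # exactly one seek
        return self.f.read(length)
```

Updating a key appends a fresh record and repoints the index entry; the old bytes are simply never read again.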
Inside the Data Files
Bitcask is just a directory with multiple append-only files. At any moment, exactly one file is "active" for writes. When it hits a configurable size threshold, Bitcask closes it and opens a new active file. Closed files never get written to again - they're immutable.
Each record has this structure:
```
[CRC][Timestamp][KeySize][ValueSize][Key][Value]
```
The CRC catches corruption, the sizes tell a reader exactly how many bytes of key and value follow, and the timestamp tracks when the data was written. Simple.
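Here's a sketch of how such a record could be packed, assuming 32-bit fields throughout (the real Bitcask encoding differs in details like field widths):

```python
import struct
import time
import zlib

CRC = struct.Struct(">I")        # 4-byte CRC32 of everything after it
HEADER = struct.Struct(">III")   # timestamp, key_size, value_size

def encode_record(key: bytes, value: bytes) -> bytes:
    body = HEADER.pack(int(time.time()), len(key), len(value)) + key + value
    return CRC.pack(zlib.crc32(body)) + body

def decode_record(buf: bytes):
    (crc,) = CRC.unpack_from(buf)
    ts, ksz, vsz = HEADER.unpack_from(buf, CRC.size)
    body = buf[CRC.size:CRC.size + HEADER.size + ksz + vsz]
    if zlib.crc32(body) != crc:
        raise IOError("corrupt record")   # bit rot or a torn write
    key_start = CRC.size + HEADER.size
    return ts, buf[key_start:key_start + ksz], buf[key_start + ksz:key_start + ksz + vsz]
```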
The in-memory hash table (KeyDir) stores this for each key:
```
file_id, value_position, value_size, timestamp
```
With that information, retrieving any value is exactly one disk seek. No B-tree traversal, no page cache misses, no secondary indexes - just seek and read.
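As a sketch, the whole read path fits in a few lines (field names follow the KeyDir description above; `open_files`, a cache of open file handles, is an assumption):

```python
import io
from dataclasses import dataclass

@dataclass
class KeyDirEntry:
    file_id: int      # which data file holds the current value
    value_pos: int    # byte offset of the value within that file
    value_size: int   # how many bytes to read
    timestamp: int    # when the value was written

keydir: dict[bytes, KeyDirEntry] = {}

def get(key: bytes, open_files: dict[int, io.BufferedReader]) -> bytes:
    entry = keydir[key]              # O(1) hash lookup - no disk touched yet
    f = open_files[entry.file_id]
    f.seek(entry.value_pos)          # the one and only seek
    return f.read(entry.value_size)  # no tree walk, no extra reads
```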
Writes Are Just Appends
When you write or update a key, Bitcask appends the new value to the active file and atomically updates KeyDir. Both happen or neither does. The old value on disk? Ignored but still there.
Deletes work the same way. Append a tombstone marker, update KeyDir to point at it. Done.
This is why writes are fast: the disk head stays at the end of the file, writing continuously. No seeking around to find the right page to update. Write throughput typically saturates sequential I/O bandwidth, which is exactly what you want.
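A sketch of that write path (the CRC is omitted here for brevity, and `TOMBSTONE` is a made-up sentinel - real Bitcask has its own delete marker):

```python
import os
import struct
import time

TOMBSTONE = b"__tombstone__"   # hypothetical sentinel marking a deleted key
keydir = {}                    # key -> (file_id, value_pos, value_size, timestamp)

def put(active_file, file_id: int, key: bytes, value: bytes):
    ts = int(time.time())
    header = struct.pack(">III", ts, len(key), len(value))
    active_file.seek(0, os.SEEK_END)
    start = active_file.tell()
    active_file.write(header + key + value)       # one sequential append
    active_file.flush()
    value_pos = start + len(header) + len(key)    # where the value bytes begin
    keydir[key] = (file_id, value_pos, len(value), ts)   # then flip the pointer

def delete(active_file, file_id: int, key: bytes):
    put(active_file, file_id, key, TOMBSTONE)     # a delete is just another append
```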
The trade-off is obvious: your disk fills up with garbage. Old values nobody will ever read again. That's where compaction comes in.
Compaction: Garbage Collection for Disk
Think of compaction like a garbage collector for your disk. Your application keeps writing new values, creating "garbage" - old versions that KeyDir no longer references. Eventually you're wasting gigabytes on data nobody will ever read.
A background process periodically merges old immutable files. It scans each file, keeps only the latest version of each key, and discards everything else. Tombstoned keys vanish completely. The result is a fresh set of compacted files with zero redundancy.
KeyDir gets updated to point to the new locations. Old files get deleted - space reclaimed.
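Here's a sketch of that merge pass, reusing `encode_record` from the earlier encoding sketch; `read_records`, an iterator yielding `(timestamp, key, value)` from a data file, is assumed:

```python
TOMBSTONE = b"__tombstone__"   # same made-up delete marker as before

def compact(old_paths: list[str], merged_path: str) -> dict:
    """Scan old immutable files, keeping only each key's newest version."""
    latest = {}   # key -> (timestamp, value)
    for path in old_paths:
        for ts, key, value in read_records(path):       # assumed iterator
            if key not in latest or ts > latest[key][0]:
                latest[key] = (ts, value)

    new_keydir = {}
    with open(merged_path, "wb") as out:
        for key, (ts, value) in latest.items():
            if value == TOMBSTONE:
                continue                                 # deleted keys vanish here
            start = out.tell()
            out.write(encode_record(key, value))
            value_pos = start + 16 + len(key)            # 16-byte header, then key
            new_keydir[key] = (merged_path, value_pos, len(value), ts)
    return new_keydir   # swap this in, then unlink the old files
```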
The recovery trick: During compaction, Bitcask generates a "hint file" next to each data file. Think of it as a table of contents - every key and where to find it, but without the actual values.
```
HintFile: [Key1 -> file_id, offset, size]
          [Key2 -> file_id, offset, size]
DataFile: [full records with actual values]
```
When a node crashes and restarts, KeyDir is gone. Normally you'd read every data file to rebuild it - minutes for large datasets. Instead, Bitcask reads the tiny hint files and reconstructs KeyDir in seconds. You're trading a bit of disk space during compaction for dramatically faster recovery.
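A sketch of both halves of that trade, using a made-up hint entry layout (key size, value position, value size, then the key bytes):

```python
import struct

HINT = struct.Struct(">III")   # key_size, value_pos, value_size

def write_hint(hint_path: str, entries: dict):
    """entries: key -> (value_pos, value_size), produced during compaction."""
    with open(hint_path, "wb") as f:
        for key, (pos, size) in entries.items():
            f.write(HINT.pack(len(key), pos, size) + key)   # keys only, no values

def load_hint(hint_path: str, file_id: int) -> dict:
    """Rebuild this file's slice of KeyDir without reading the data file."""
    keydir = {}
    with open(hint_path, "rb") as f:
        while header := f.read(HINT.size):
            ksz, pos, size = HINT.unpack(header)
            key = f.read(ksz)
            keydir[key] = (file_id, pos, size)
    return keydir
```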
Why This Works for Riak
Bitcask isn't a database - it's a storage engine. It handles the "write bytes to disk, read bytes back" part. Riak is a full distributed database that uses Bitcask as its persistence layer.
Riak handles distributed system complexity: replication, node failures, request routing. Bitcask handles local storage: put this key-value pair on disk, get it back fast.
In a typical Riak cluster with 64 partitions, each node manages several partitions (vnodes). Each vnode runs its own independent Bitcask instance. So if a node handles 8 vnodes, you'll see 8 separate Bitcask directories on disk, each with its own log files and KeyDir.
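As a rough sketch of that layout (Riak actually places vnodes via consistent hashing on a 160-bit ring; the modulo hash and `FakeBitcask` stand-in below are simplifications):

```python
import hashlib

RING_SIZE = 64                            # partitions in the cluster
MY_PARTITIONS = range(0, RING_SIZE, 8)    # pretend this node owns 8 of them

class FakeBitcask:
    """Stand-in for one Bitcask instance: its own directory, files, KeyDir."""
    def __init__(self, directory: str):
        self.directory = directory

def partition_for(key: bytes) -> int:
    # Crude stand-in for Riak's consistent hashing.
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big") % RING_SIZE

# Eight fully independent stores - eight directories, eight KeyDirs.
local_vnodes = {p: FakeBitcask(f"data/bitcask/{p}") for p in MY_PARTITIONS}
```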
Why this separation matters:
- If one Bitcask instance corrupts, only that partition's data is affected. The other 7 keep working.
- Replication is straightforward - append the same log entry to Bitcask instances on multiple nodes. No coordination needed at the storage layer.
- Each Bitcask recovers independently using its hint files. No cascading failures.
Bitcask's simplicity makes it perfect for distributed systems. Each node does its own thing without coordination overhead.
The Memory Constraint
Here's the limitation: all keys must fit in RAM. KeyDir stores metadata for every single key - a few dozen bytes each - so hundreds of millions of keys can consume gigabytes of memory.
If your keyspace doesn't fit in memory, Bitcask won't work for you. But if it does? You get predictable, single-seek read latency regardless of dataset size. That one-seek guarantee is powerful.
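A back-of-envelope sketch, assuming roughly 40 bytes of per-key KeyDir overhead on top of the key bytes themselves (the true figure depends on the runtime and struct layout):

```python
def keydir_ram_gb(num_keys: int, avg_key_bytes: int, overhead: int = 40) -> float:
    """Rough KeyDir footprint: (key bytes + metadata) per key, in GiB."""
    return num_keys * (avg_key_bytes + overhead) / 1024**3

print(keydir_ram_gb(10_000_000, 20))    # ~0.56 GiB: fits comfortably
print(keydir_ram_gb(500_000_000, 20))   # ~28 GiB: now RAM is the budget
```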
When to Use Bitcask
Choose Bitcask when:
- Your entire keyspace fits comfortably in RAM
- You need predictable, low-latency reads and writes
- Write patterns are random (traditional databases suffer here)
- You're building distributed systems where nodes manage subsets of data
Avoid Bitcask when:
- Your keyspace exceeds available RAM
- You need range queries (Bitcask is strictly key-value lookups)
- You can't schedule compaction during low-traffic windows
- Disk space is extremely constrained
What Bitcask Teaches Us
The database world is full of sophisticated designs: B-trees with write-ahead logs, LSM trees with bloom filters, complex page caches. They all try to work around the fundamental problem: random disk seeks are slow.
Bitcask takes a different approach - embrace sequential writes completely. Don't fight the hardware. The append-only design aligns perfectly with how disks want to operate.
Yes, you pay the cost of keeping all keys in memory. Yes, you need periodic compaction. But in exchange, you get a system so simple you can understand it in an afternoon, with performance characteristics that are completely predictable.
Sometimes the best solution isn't the most sophisticated one. It's the one that works with the grain of your hardware instead of against it.
That's the real lesson from Bitcask: understand your constraints, optimize for them ruthlessly, and don't be afraid of designs that seem "too simple" on the surface. The simplest design that actually works is often the best one.
And don't tell your database that it's basically been keeping a diary all this time (it has feelings too), or let your project manager know their $200K migration is just "append to a file and remember the spot." Some truths don't belong in standup.