https://www.youtube.com/watch?v=cSzh7YwDvYI
Two lines of bash, and you have a key-value store:
db_set() { echo "$1,$2" >> /tmp/db; }                      # append "key,value" to the end of the file
db_get() { grep "^$1," /tmp/db | tail -1 | cut -d, -f2; }  # last matching line wins, i.e. the newest write
Append to a file. Grep what you wrote. Run it. It works. That's not a joke — it's the opening of Martin Kleppmann's Designing Data-Intensive Applications. Three methods on the API surface — get, put, delete — and a working KV store in two shell functions. So why is Redis 80,000 lines of C? Why has Meta's ZippyDB been in production since 2013? Why does Discord's trillion-message data layer expose the same get/put/delete shape, yet have to be rebuilt from scratch?
Because the API is a trap. The three-method interface hides five enormous design decisions, and every production KV store made a different choice on every axis. The signature looks substitutable. The systems behind it are not.
Five Dimensions Hiding Behind get / put / delete
Here are the five — durability, consistency, latency tails, sharding, and hot keys — each grounded in a real production system.
1. Durability — what does "OK" mean?
You call put. The server returns OK. What just happened?
That depends on the system. OK in one KV means the bytes are queued in memory on this one machine. OK in another means a majority of replicas have it in their write-ahead log and the primary fsync'd to disk. The function signature is identical. The promise is wildly different.
Meta's ZippyDB exposes the choice as an option flag on the same put call. Default mode: ack only after a majority of replicas have logged the write to their Paxos logs and the primary has flushed to RocksDB. Strong, slow. Fast-acknowledge mode: ack the moment the primary has the write queued for replication. Fast, fragile. Same client code. Two completely different durability stories.
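To make the contrast concrete, here is a minimal sketch of what that choice looks like from the caller's side. The KVClient class, the Durability enum, and the put signature are hypothetical stand-ins, not ZippyDB's actual API; the point is that a single flag flips the meaning of OK.
```python
from enum import Enum

class Durability(Enum):
    QUORUM_FSYNC = "quorum_fsync"  # ack after a majority has logged the write and the primary flushed
    FAST_ACK = "fast_ack"          # ack as soon as the primary queues the write for replication

class KVClient:
    """Hypothetical client: the same put() call, two very different promises."""
    def put(self, key: bytes, value: bytes,
            durability: Durability = Durability.QUORUM_FSYNC) -> None:
        ...  # network call elided in this sketch

client = KVClient()
client.put(b"user:42", b"profile-blob")                                   # strong, slow
client.put(b"user:42", b"profile-blob", durability=Durability.FAST_ACK)   # fast, fragile
```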
At the floor of the spectrum sits default Redis — in-memory only. Crash the box and you lose every write since the last snapshot. Turn on AOF (append-only file) with everysec flushing and you still have a one-second loss window on power failure.
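Those are real Redis configuration directives; a minimal sketch of flipping them, assuming the redis-py client (the same settings can live in redis.conf):
```python
import redis

r = redis.Redis(host="localhost", port=6379)

r.config_set("appendonly", "yes")        # log every write to the append-only file
r.config_set("appendfsync", "everysec")  # fsync once per second: up to ~1s of acknowledged
                                         # writes can still vanish on power failure
# "always" closes that window at a throughput cost; "no" leaves the flush timing to the OS.
```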
In 2013, Kyle Kingsbury (Aphyr) ran Jepsen on a Redis cluster with async replication and a network partition. He sent 2,000 writes. Redis claimed 1,998 of them succeeded. Only 872 were actually present after the failover settled. Redis dropped 56% of the writes it had explicitly told the client succeeded — because async replication plus failover plus partition is a recipe for silent data loss, and the API never said which mode you were in.
2. Consistency — what can the next read see?
Two clients. One writes a value. The other reads — immediately, on a different node. What does the reader see?
If the system is eventually consistent, the reader might see the new value, the old one, or briefly both. Stale reads aren't a bug; they're a feature. Werner Vogels framed the trade-off best while building Dynamo at Amazon: strong consistency is non-negotiable for a bank balance, and overkill for a shopping cart. The cart can lose an item and add it back. The KV store has no idea which one you're storing — that's your problem.
Dropbox built Panda to be the cart-and-balance-and-everything-else metadata layer for their filesystem. Two petabytes of data. Tens of millions of QPS. Single-digit-millisecond latency. Linearizable reads. ACID transactions across multiple keys with two-phase commit. Hybrid logical clocks (HLCs) tag every write with a monotonic version that's tied to wall-clock time, so the system can answer reads at a consistent snapshot and know exactly what was visible.
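A toy sketch of the hybrid-logical-clock idea (the standard send/receive rules, not Panda's code): the clock never runs behind the local wall clock or any timestamp it has observed, so versions stay monotonic even when machines' clocks drift.
```python
import time

def wall_ms() -> int:
    return int(time.time() * 1000)

class HLC:
    """Toy hybrid logical clock: (wall, logical) versions that never move backward."""
    def __init__(self):
        self.l = 0  # highest wall-clock millisecond seen so far
        self.c = 0  # logical counter to order events within the same millisecond

    def send(self):
        """Timestamp a local event, e.g. tagging a write."""
        pt = wall_ms()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1          # wall clock didn't advance; bump the logical part
        return (self.l, self.c)

    def recv(self, m):
        """Fold in a remote timestamp m = (l, c) without ever running backward."""
        pt, old_l = wall_ms(), self.l
        self.l = max(old_l, m[0], pt)
        if self.l == old_l and self.l == m[0]:
            self.c = max(self.c, m[1]) + 1
        elif self.l == old_l:
            self.c += 1
        elif self.l == m[0]:
            self.c = m[1] + 1
        else:
            self.c = 0
        return (self.l, self.c)
```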
That's not a normal KV store. It's a transactional KV. Same put call you'd write against Redis — but commits or rolls back atomically across multiple keys. Dropbox explicitly rejected CockroachDB (quorum replication added 80ms write latency, incompatible with their target), FoundationDB (centralized timestamp oracle hit a single-process scaling cap), and Vitess (no production cross-shard ACID). The transactional-KV space is real, and the ergonomics matter as much as the API.
When consistency goes wrong, Jepsen calls it "multiple timelines of a single key." That's split-brain in three words: two halves of the cluster see different histories of the same key, and when the partition heals, one of them gets quietly overwritten. Panda makes that impossible by design. Default-configured Redis can produce it under failure. The function signature, again, doesn't tell you.
3. Latency — p99, not p50
"My database is fast." What does that mean?
p50 (the median) is the marketing number — half your requests are this fast or faster. The number that pages you at 3 AM is p99 — the slowest 1% of requests. Marc Brooker, who writes more clearly about tail latency than anyone, calls it "those times when your system is weirdly slow."
Why does it matter at scale? Because tails compound. Imagine one server with a 1% chance of being slow. That's fine. Now your request fans out to 100 servers and you wait for all of them. The probability that at least one is slow is 63%. Jeff Dean and Luiz Barroso laid this out in The Tail at Scale (2013) — what was rare on a single server becomes normal on a fleet.
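The arithmetic behind that 63%, as a quick check:
```python
p_slow = 0.01  # one server: 1% of requests land in the slow tail
fanout = 100   # one user request fans out to 100 servers, and we wait for all of them

p_any_slow = 1 - (1 - p_slow) ** fanout
print(f"{p_any_slow:.0%}")  # 63%: the tail is now the common case
```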
Discord saw the math live in production. On Cassandra, their insert p99 swung between 5 and 70 ms — Java GC pauses, compaction backlogs, hot partitions cascading into quorum reads. Different symptoms, one disease: the tail. In 2022 they migrated trillions of messages to ScyllaDB — same data model, written in C++, no JVM, shard-per-core architecture. p99 dropped to a steady 5 ms. The migration ran at 3.2M messages/sec for nine days, and the 2022 World Cup Final hit during the cutover window. The system didn't notice.
4. Sharding — who owns this key?
Above some scale, the cluster has to decide which node owns which key. That decision — the partition map — is where 90% of operational pain lives.
The naive answer: hash(key) % num_servers. Add a server, and ~90% of keys reshuffle. Catastrophe.
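The damage is easy to measure; a quick sketch of going from 10 servers to 11 with mod hashing:
```python
import hashlib

def owner(key: str, num_servers: int) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_servers

keys = [f"user:{i}" for i in range(100_000)]
moved = sum(owner(k, 10) != owner(k, 11) for k in keys)
print(f"{moved / len(keys):.0%} of keys changed owner")  # roughly 91%
```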
The real answers: consistent hashing (only ~1/N keys move when a node joins or leaves), range partitioning (keys are stored in sorted ranges, great for ordered scans, vulnerable to hot ranges), or composite keys (Discord's choice). Discord shards on (channel_id, time_bucket) with Snowflake IDs that sort chronologically — old conversations live on cold nodes, active channels stay on hot nodes, and any single channel's messages cluster together for cache locality.
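For contrast, a minimal consistent-hashing ring (with virtual nodes to even out the arcs; a sketch, not something to deploy):
```python
import bisect
import hashlib

def h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Consistent hashing: each key belongs to the first node clockwise from its hash."""
    def __init__(self, nodes, vnodes=100):
        # Many virtual points per node smooth out the arc sizes.
        self.points = sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self.hashes = [p for p, _ in self.points]

    def owner(self, key: str) -> str:
        i = bisect.bisect(self.hashes, h(key)) % len(self.points)
        return self.points[i][1]

keys = [f"user:{i}" for i in range(100_000)]
before = Ring([f"node{i}" for i in range(10)])
after = Ring([f"node{i}" for i in range(11)])  # add one node
moved = sum(before.owner(k) != after.owner(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys changed owner")  # about 1/11, or ~9%, versus ~91% with mod hashing
```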
Pick the partition key wrong and one shard burns while the others sleep. Pick it right and your cluster scales linearly until the next problem hits.
5. Hot keys — when one key is everyone's key
Even a perfect partition map can get you killed by one key.
Discord said it best in their migration writeup: "A server with hundreds of thousands of people sends orders of magnitude more messages than a small group of friends." One channel ID. One partition. Ten times the traffic of any other partition.
Why does that cascade across the cluster? In a quorum-replicated system, every read pulls in two or three nodes that hold copies of that partition. The hot partition's nodes get hammered, fall behind, and now every other query that happens to touch one of those nodes — for completely unrelated keys — slows down too. The hot partition pollutes the rest of the cluster.
Discord's fix lives above the database, not inside it. They built a Rust data services layer between the gateway and the storage cluster. Its job: when a thousand users open the same channel at the same moment, the layer collapses that into a single database query and fans the same answer back to all thousand readers. One row, one query, a thousand happy clients. Request coalescing.
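The pattern is often called singleflight. Here is a minimal asyncio sketch of the shape; Discord's actual layer is written in Rust and sits in front of ScyllaDB, so treat this only as an illustration of the idea.
```python
import asyncio

class Coalescer:
    """Collapse concurrent requests for the same key into a single backend call."""
    def __init__(self, fetch):
        self.fetch = fetch   # the expensive lookup, e.g. a database query
        self.inflight = {}   # key -> Future shared by every concurrent waiter

    async def get(self, key):
        fut = self.inflight.get(key)
        if fut is None:
            # First caller does the work; everyone else awaits the same future.
            fut = asyncio.ensure_future(self.fetch(key))
            self.inflight[key] = fut
            fut.add_done_callback(lambda _: self.inflight.pop(key, None))
        return await fut

async def demo():
    calls = 0

    async def query_db(channel_id):
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stand-in for the storage cluster
        return f"messages for {channel_id}"

    coalescer = Coalescer(query_db)
    # A thousand clients open the same hot channel at the same moment...
    results = await asyncio.gather(*(coalescer.get("channel:42") for _ in range(1000)))
    print(len(results), "responses from", calls, "database query")  # 1000 responses from 1 database query

asyncio.run(demo())
```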
The API never told you which keys are hot. It can't — hot keys come from your users, and your users change every hour. The mitigation has to live in the layer that knows about your users: caches, coalescers, replication strategies, sometimes app-level pre-sharding. Never in get, put, or delete.
Five Questions to Ask Before You Pick a KV Store
Three methods on top. Five enormous decisions underneath. Next time you reach for a key-value store, you're not picking an interface. You're picking five answers to five hard questions:
- Durability — what does OK actually mean? Bytes in RAM? Replicated? Fsynced? Quorum-acked across regions?
- Consistency — what can the next read see? Eventual? Read-your-writes? Linearizable? Tunable per query?
- Latency — what's the p99, not the p50? What happens when one request fans out to many nodes?
- Sharding — how does the cluster pick the node that owns a key? What happens when you add or remove one?
- Hot keys — when 80% of reads hit one key, who absorbs the heat? The database, or the layer above it?
Pick on purpose. The API is the same. The systems are not.