Key-Value Stores: Design and Architecture
A key-value store is a non-relational database that maps unique keys to values. Keys must be unique, and each key maps directly to a value — typically a string or object.
It supports two core operations: put(key, value) and get(key). Simple in concept, but serious tradeoffs emerge at scale — every design must balance read performance, write performance, and memory usage.
Single-Node Simplicity
On a single node, a key-value store is straightforward. All operations are atomic — reads always return the latest state, writes are immediately consistent, and there's no coordination overhead.
The problem is scale. As data grows, a single instance hits capacity limits and you need to distribute data across multiple nodes. That's where things get complicated.
The CAP Theorem
Distributed systems must contend with network partitions — they're unavoidable. The CAP theorem states that when a partition occurs, you can guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance.
Ideal scenario — no partition, full consistency and availability:
When a partition occurs, you must choose:
- Choose consistency — block all writes until the failed node recovers, preventing stale data. Essential for systems like banking where inconsistency is unacceptable.
- Choose availability — keep accepting reads and writes, sync the failed node when it returns. Reads may be stale, but the system stays responsive.
Data Partitioning
To distribute data across nodes, we use consistent hashing. Servers and keys are both mapped onto a hash ring. A key is stored on the first server encountered when moving clockwise from the key's position.
Naive consistent hashing creates two problems: uneven key distribution and unequal partition sizes. Virtual nodes solve this — each physical server occupies multiple positions on the ring, smoothing out distribution proportionally to each server's capacity.
Key benefits of partitioning: automatic scaling, higher storage capacity, improved data locality, and heterogeneous server support.
Data Replication
Partitioning spreads data but doesn't protect it. Replication does — by storing copies of each key across N distinct nodes, ideally across separate data centers to guard against localized failures like power outages.
If one node goes down, others can still serve the request. But if enough nodes fail simultaneously, requests may block while waiting for the cluster to recover. That's where quorum comes in.
Quorum
Quorum controls how many nodes must acknowledge a read or write before it's considered successful. Given N total nodes, R read replicas, and W write replicas:
| Configuration | Effect |
|---|---|
| R = 1, W = N | Optimized for fast reads |
| W = 1, R = N | Optimized for fast writes |
| W + R > N | Strong consistency guaranteed (e.g. N=3, W=R=2) |
| W + R ≤ N | Strong consistency not guaranteed |
The key insight behind W + R > N: any read set and any write set must overlap, so at least one node always has the latest data.
Sloppy Quorum
Standard quorum can still return stale data. If a write lands on node s1 but a subsequent read hits s2, the client sees stale data — since reads and writes are load-balanced, you can't control which nodes they hit.
Sloppy quorum addresses this. Rather than requiring acknowledgement from a strict set of N nodes, it allows any available nodes in the cluster to temporarily handle requests when the designated nodes are unavailable. This keeps the system responsive during partial failures while still satisfying the W + R > N invariant once the cluster stabilizes.
For a 3-node cluster with W = R = 2, a majority of nodes must acknowledge every read and write. This guarantees that the read and write sets always overlap — so at least one node in any read quorum will always have the latest data.
Inconsistency Resolution
Even with quorum, replicas can diverge. Versioning handles this: each write increments a version number, and on subsequent writes clients must first fetch the latest version before incrementing and writing. Readers resolve conflicts by comparing versions and discarding stale data.
DynamoDB uses this reconciliation technique on the read path.
Consistency Models
| Model | Behaviour |
|---|---|
| Strong consistency | All reads return the most up-to-date value |
| Weak consistency | Reads may return stale data |
| Eventual consistency | Reads may be stale, but replicas converge over time |
Failure Detection
Relying on a single server to report node failures is unreliable and noisy at scale. The Gossip Protocol offers a better approach:
Each node maintains a membership list with heartbeat counters. Nodes periodically increment their counters and share them with random peers. If a node's heartbeat stops incrementing past a defined threshold, it's marked offline.
This is decentralized, low-overhead, and scales well.
Handling Failures
Temporary failures use hinted handoff: when a node is unavailable, a neighbor absorbs its writes and syncs them back once the original node recovers.
Permanent failures use anti-entropy with Merkle trees — a data structure that efficiently identifies which portions of two replicas differ, minimizing the data transferred during resync. For full data center outages, cross-datacenter replication is essential.
System Architecture
Core design principles:
- Clients interact via
get(key)andput(key, value) - A coordinator node acts as a proxy between the client and the cluster
- Nodes are distributed on a consistent hashing ring
- The system is fully decentralized — no single point of failure
- Every node is symmetric, handling the same set of responsibilities
Write Path
- Write is appended to a commit log (durability)
- Data is written to an in-memory cache
- When the cache exceeds a threshold, it flushes to an SSTable on disk
Read Path
The system first checks the in-memory cache and returns immediately on a hit. On a miss:
- Check the in-memory cache — on a miss, proceed
- Consult the Bloom filter to identify which SSTables likely contain the key
- Query the relevant SSTables and return the result to the client
Summary
| Component | Purpose |
|---|---|
| Consistent hashing | Distribute keys evenly across nodes |
| Virtual nodes | Fix uneven distribution in consistent hashing |
| Replication | Prevent data loss; ensure availability |
| Quorum (W + R > N) | Balance consistency and availability |
| Versioning | Resolve conflicts between replicas |
| Gossip protocol | Decentralized failure detection |
| Hinted handoff | Handle temporary node failures |
| Merkle trees | Efficient sync after permanent failures |
| Bloom filters | Fast disk lookups on the read path |















Top comments (0)