JosephAkayesi

Posted on May 22

Design a Distributed Key-Value Store

#systemdesign #distributedsystems

Key-Value Stores: Design and Architecture

A key-value store is a non-relational database that maps unique keys to values. Keys must be unique, and each key maps directly to a value — typically a string or object.

It supports two core operations: put(key, value) and get(key). Simple in concept, but serious tradeoffs emerge at scale — every design must balance read performance, write performance, and memory usage.

Single-Node Simplicity

On a single node, a key-value store is straightforward. All operations are atomic — reads always return the latest state, writes are immediately consistent, and there's no coordination overhead.

The problem is scale. As data grows, a single instance hits capacity limits and you need to distribute data across multiple nodes. That's where things get complicated.

The CAP Theorem

Distributed systems must contend with network partitions — they're unavoidable. The CAP theorem states that when a partition occurs, you can guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance.

Ideal scenario — no partition, full consistency and availability:

When a partition occurs, you must choose:

Choose consistency — block all writes until the failed node recovers, preventing stale data. Essential for systems like banking where inconsistency is unacceptable.
Choose availability — keep accepting reads and writes, sync the failed node when it returns. Reads may be stale, but the system stays responsive.

Data Partitioning

To distribute data across nodes, we use consistent hashing. Servers and keys are both mapped onto a hash ring. A key is stored on the first server encountered when moving clockwise from the key's position.

Naive consistent hashing creates two problems: uneven key distribution and unequal partition sizes. Virtual nodes solve this — each physical server occupies multiple positions on the ring, smoothing out distribution proportionally to each server's capacity.

Key benefits of partitioning: automatic scaling, higher storage capacity, improved data locality, and heterogeneous server support.

Data Replication

Partitioning spreads data but doesn't protect it. Replication does — by storing copies of each key across N distinct nodes, ideally across separate data centers to guard against localized failures like power outages.

If one node goes down, others can still serve the request. But if enough nodes fail simultaneously, requests may block while waiting for the cluster to recover. That's where quorum comes in.

Quorum

Quorum controls how many nodes must acknowledge a read or write before it's considered successful. Given N total nodes, R read replicas, and W write replicas:

Configuration	Effect
R = 1, W = N	Optimized for fast reads
W = 1, R = N	Optimized for fast writes
W + R > N	Strong consistency guaranteed (e.g. N=3, W=R=2)
W + R ≤ N	Strong consistency not guaranteed

The key insight behind W + R > N: any read set and any write set must overlap, so at least one node always has the latest data.

Sloppy Quorum

Standard quorum can still return stale data. If a write lands on node s1 but a subsequent read hits s2, the client sees stale data — since reads and writes are load-balanced, you can't control which nodes they hit.

Sloppy quorum addresses this. Rather than requiring acknowledgement from a strict set of N nodes, it allows any available nodes in the cluster to temporarily handle requests when the designated nodes are unavailable. This keeps the system responsive during partial failures while still satisfying the W + R > N invariant once the cluster stabilizes.

For a 3-node cluster with W = R = 2, a majority of nodes must acknowledge every read and write. This guarantees that the read and write sets always overlap — so at least one node in any read quorum will always have the latest data.

Inconsistency Resolution

Even with quorum, replicas can diverge. Versioning handles this: each write increments a version number, and on subsequent writes clients must first fetch the latest version before incrementing and writing. Readers resolve conflicts by comparing versions and discarding stale data.

DynamoDB uses this reconciliation technique on the read path.

Consistency Models

Model	Behaviour
Strong consistency	All reads return the most up-to-date value
Weak consistency	Reads may return stale data
Eventual consistency	Reads may be stale, but replicas converge over time

Failure Detection

Relying on a single server to report node failures is unreliable and noisy at scale. The Gossip Protocol offers a better approach:

Each node maintains a membership list with heartbeat counters. Nodes periodically increment their counters and share them with random peers. If a node's heartbeat stops incrementing past a defined threshold, it's marked offline.

This is decentralized, low-overhead, and scales well.

Handling Failures

Temporary failures use hinted handoff: when a node is unavailable, a neighbor absorbs its writes and syncs them back once the original node recovers.

Permanent failures use anti-entropy with Merkle trees — a data structure that efficiently identifies which portions of two replicas differ, minimizing the data transferred during resync. For full data center outages, cross-datacenter replication is essential.

System Architecture

Core design principles:

Clients interact via get(key) and put(key, value)
A coordinator node acts as a proxy between the client and the cluster
Nodes are distributed on a consistent hashing ring
The system is fully decentralized — no single point of failure
Every node is symmetric, handling the same set of responsibilities

Write Path

Write is appended to a commit log (durability)
Data is written to an in-memory cache
When the cache exceeds a threshold, it flushes to an SSTable on disk

Read Path

The system first checks the in-memory cache and returns immediately on a hit. On a miss:

Check the in-memory cache — on a miss, proceed
Consult the Bloom filter to identify which SSTables likely contain the key
Query the relevant SSTables and return the result to the client

Summary

Component	Purpose
Consistent hashing	Distribute keys evenly across nodes
Virtual nodes	Fix uneven distribution in consistent hashing
Replication	Prevent data loss; ensure availability
Quorum (W + R > N)	Balance consistency and availability
Versioning	Resolve conflicts between replicas
Gossip protocol	Decentralized failure detection
Hinted handoff	Handle temporary node failures
Merkle trees	Efficient sync after permanent failures
Bloom filters	Fast disk lookups on the read path

DEV Community