DEV Community

Cover image for Design a Distributed Key-Value Store
JosephAkayesi
JosephAkayesi

Posted on

Design a Distributed Key-Value Store

Key-Value Stores: Design and Architecture

A key-value store is a non-relational database that maps unique keys to values. Keys must be unique, and each key maps directly to a value — typically a string or object.

Key-Value Store

It supports two core operations: put(key, value) and get(key). Simple in concept, but serious tradeoffs emerge at scale — every design must balance read performance, write performance, and memory usage.


Single-Node Simplicity

On a single node, a key-value store is straightforward. All operations are atomic — reads always return the latest state, writes are immediately consistent, and there's no coordination overhead.

The problem is scale. As data grows, a single instance hits capacity limits and you need to distribute data across multiple nodes. That's where things get complicated.


The CAP Theorem

Distributed systems must contend with network partitions — they're unavoidable. The CAP theorem states that when a partition occurs, you can guarantee at most two of three properties: Consistency, Availability, and Partition Tolerance.

Ideal scenario — no partition, full consistency and availability:

Ideal replication scenario

When a partition occurs, you must choose:

Network partition scenario

  • Choose consistency — block all writes until the failed node recovers, preventing stale data. Essential for systems like banking where inconsistency is unacceptable.
  • Choose availability — keep accepting reads and writes, sync the failed node when it returns. Reads may be stale, but the system stays responsive.

Data Partitioning

To distribute data across nodes, we use consistent hashing. Servers and keys are both mapped onto a hash ring. A key is stored on the first server encountered when moving clockwise from the key's position.

Consistent hashing ring

Naive consistent hashing creates two problems: uneven key distribution and unequal partition sizes. Virtual nodes solve this — each physical server occupies multiple positions on the ring, smoothing out distribution proportionally to each server's capacity.

Key benefits of partitioning: automatic scaling, higher storage capacity, improved data locality, and heterogeneous server support.


Data Replication

Partitioning spreads data but doesn't protect it. Replication does — by storing copies of each key across N distinct nodes, ideally across separate data centers to guard against localized failures like power outages.

Data replication across nodes

If one node goes down, others can still serve the request. But if enough nodes fail simultaneously, requests may block while waiting for the cluster to recover. That's where quorum comes in.


Quorum

Quorum controls how many nodes must acknowledge a read or write before it's considered successful. Given N total nodes, R read replicas, and W write replicas:

Quorum diagram

Configuration Effect
R = 1, W = N Optimized for fast reads
W = 1, R = N Optimized for fast writes
W + R > N Strong consistency guaranteed (e.g. N=3, W=R=2)
W + R ≤ N Strong consistency not guaranteed

The key insight behind W + R > N: any read set and any write set must overlap, so at least one node always has the latest data.

Sloppy Quorum

Standard quorum can still return stale data. If a write lands on node s1 but a subsequent read hits s2, the client sees stale data — since reads and writes are load-balanced, you can't control which nodes they hit.

Sloppy quorum addresses this. Rather than requiring acknowledgement from a strict set of N nodes, it allows any available nodes in the cluster to temporarily handle requests when the designated nodes are unavailable. This keeps the system responsive during partial failures while still satisfying the W + R > N invariant once the cluster stabilizes.

For a 3-node cluster with W = R = 2, a majority of nodes must acknowledge every read and write. This guarantees that the read and write sets always overlap — so at least one node in any read quorum will always have the latest data.


Inconsistency Resolution

Even with quorum, replicas can diverge. Versioning handles this: each write increments a version number, and on subsequent writes clients must first fetch the latest version before incrementing and writing. Readers resolve conflicts by comparing versions and discarding stale data.

Versioning and conflict resolution

DynamoDB uses this reconciliation technique on the read path.


Consistency Models

Model Behaviour
Strong consistency All reads return the most up-to-date value
Weak consistency Reads may return stale data
Eventual consistency Reads may be stale, but replicas converge over time

Failure Detection

Relying on a single server to report node failures is unreliable and noisy at scale. The Gossip Protocol offers a better approach:

Multicast failure detection

Each node maintains a membership list with heartbeat counters. Nodes periodically increment their counters and share them with random peers. If a node's heartbeat stops incrementing past a defined threshold, it's marked offline.

Gossip protocol

This is decentralized, low-overhead, and scales well.


Handling Failures

Temporary failures use hinted handoff: when a node is unavailable, a neighbor absorbs its writes and syncs them back once the original node recovers.

Permanent failures use anti-entropy with Merkle trees — a data structure that efficiently identifies which portions of two replicas differ, minimizing the data transferred during resync. For full data center outages, cross-datacenter replication is essential.


System Architecture

System architecture diagram

Core design principles:

  • Clients interact via get(key) and put(key, value)
  • A coordinator node acts as a proxy between the client and the cluster
  • Nodes are distributed on a consistent hashing ring
  • The system is fully decentralized — no single point of failure
  • Every node is symmetric, handling the same set of responsibilities

Node responsibilities


Write Path

Write path

  1. Write is appended to a commit log (durability)
  2. Data is written to an in-memory cache
  3. When the cache exceeds a threshold, it flushes to an SSTable on disk

Read Path

Read path — cache hit

The system first checks the in-memory cache and returns immediately on a hit. On a miss:

Read path — bloom filter and SSTable

  1. Check the in-memory cache — on a miss, proceed
  2. Consult the Bloom filter to identify which SSTables likely contain the key
  3. Query the relevant SSTables and return the result to the client

Summary

Summary table

Component Purpose
Consistent hashing Distribute keys evenly across nodes
Virtual nodes Fix uneven distribution in consistent hashing
Replication Prevent data loss; ensure availability
Quorum (W + R > N) Balance consistency and availability
Versioning Resolve conflicts between replicas
Gossip protocol Decentralized failure detection
Hinted handoff Handle temporary node failures
Merkle trees Efficient sync after permanent failures
Bloom filters Fast disk lookups on the read path

Top comments (0)