Rajkiran

Posted on Jun 13

System Design - 22. Consistent Hashing: The Algorithm That Lets Cassandra Add a Server Without Breaking Everything

#distributedsystems #database #systemdesign #softwaredevelopment

Consistent Hashing: The Algorithm That Lets Cassandra Add a Server Without Breaking Everything

Covers: The Modulo Problem, Hash Ring, Virtual Nodes, Real Implementations in Cassandra and Dynamo

The Promise We Made on Day 3 (Now Fulfilled)

Back on Day 3, when discussing hash-based sharding, we hit a wall: adding a server to a hash-based shard remaps almost everything.

4 shards: shard = hash(key) % 4
5 shards: shard = hash(key) % 5

Adding ONE server changed the modulo from 4 to 5 — 
and remapped roughly 80% of all keys to different shards.

We promised a solution: consistent hashing. Today we deliver on that promise — and it's one of the most elegant algorithms in distributed systems.

The Core Idea: A Ring, Not a Line

Instead of mapping keys to shard numbers via modulo, consistent hashing maps both keys and servers onto the same circular space — a "ring" — using a hash function.

Hash space: 0 to 2^32 - 1 (a circle, where 2^32-1 wraps back to 0)

                    0 / 2^32
                       │
            Server D ──┼── Server A
                       │
        Server C ──────┼────── 
                       │
                  Server B

Placing servers on the ring:

import hashlib

def hash_to_ring(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

servers = ["server_A", "server_B", "server_C", "server_D"]
ring_positions = {hash_to_ring(s): s for s in servers}

# Result (example positions on the ring):
# server_A → position 500,000,000
# server_B → position 1,800,000,000
# server_C → position 2,900,000,000
# server_D → position 4,000,000,000

Placing data on the ring:

key = "user_12345"
key_position = hash_to_ring(key)  # e.g., position 2,100,000,000

The assignment rule: A key belongs to the first server clockwise from its position on the ring.

Ring positions (clockwise):
  Server A: 500M
  Server B: 1.8B
  Server C: 2.9B
  Server D: 4.0B

Key "user_12345" at position 2.1B
  → Walk clockwise from 2.1B → first server is Server C (at 2.9B)
  → "user_12345" is stored on Server C

The Magic: Adding or Removing a Server

This is where consistent hashing earns its name. Watch what happens when we add Server E at position 2.5B (between Server B and Server C):

Before:
  ...Server B (1.8B) ────────────────► Server C (2.9B)...
     Keys in range (1.8B, 2.9B] all belong to Server C

After adding Server E at 2.5B:
  ...Server B (1.8B) ──► Server E (2.5B) ──► Server C (2.9B)...
     Keys in range (1.8B, 2.5B] now belong to Server E
     Keys in range (2.5B, 2.9B] still belong to Server C

ONLY keys between 1.8B and 2.5B move — to Server E.
ALL other keys (on Server A, B, D, and most of Server C) — UNCHANGED.

Compare to modulo hashing:

Modulo hashing (4 → 5 servers): ~80% of ALL keys remap
Consistent hashing (4 → 5 servers): only ~20% of keys remap 
  (specifically, only the keys that "belonged" to the segment 
   now split by the new server)

The general rule: Adding or removing one server out of N only affects keys in the immediate neighboring segment(s) — roughly 1/N of all keys, not all of them. This is the property that makes horizontal scaling of stateful systems (databases, caches) practical without massive, system-wide data migrations.

The Hotspot Problem (And Why Virtual Nodes Fix It)

There's a catch with the basic algorithm above. With only 4-5 servers randomly placed on a ring spanning 0 to 2^32, the segments between servers can be wildly uneven:

Random placement might produce:
  Server A: 100M  ─┐
  Server B: 150M   │ Server B handles only 50M of keyspace (tiny segment)
                   │
  Server C: 2.5B  ─┘ Server C handles 2.35B of keyspace (huge segment!)
  Server D: 3.9B

Server C gets WAY more traffic and data than Server B — 
even though they're supposedly equal peers.

With few servers, random hash positions create uneven segments purely by chance — just like randomly throwing 4 darts at a circular dartboard rarely divides it into 4 equal slices.

Virtual Nodes: The Solution

Instead of placing each physical server at one position on the ring, place it at many positions (virtual nodes, or "vnodes") — typically 100-256 per physical server.

Without vnodes (4 physical servers, 4 ring positions):
  Uneven segments — Server C might get 50% of keyspace, Server B gets 2%

With vnodes (4 physical servers, 256 vnodes each = 1024 ring positions):
  Server A: vnodes at positions [12M, 89M, 156M, ... 256 positions total]
  Server B: vnodes at positions [34M, 102M, 198M, ... 256 positions total]
  Server C: vnodes at positions [...]
  Server D: vnodes at positions [...]

  1024 small segments, scattered across the ring.
  Each physical server "owns" ~256 of these segments — 
  on average, ~25% of the ring each (with much less variance 
  than the 4-position version).

import hashlib

def hash_to_ring(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

VNODES_PER_SERVER = 256

ring = {}  # position -> physical server
for server in ["server_A", "server_B", "server_C", "server_D"]:
    for vnode_id in range(VNODES_PER_SERVER):
        position = hash_to_ring(f"{server}#vnode{vnode_id}")
        ring[position] = server

sorted_positions = sorted(ring.keys())

def get_server(key):
    key_position = hash_to_ring(key)
    # Find first vnode position >= key_position (clockwise)
    for pos in sorted_positions:
        if pos >= key_position:
            return ring[pos]
    return ring[sorted_positions[0]]  # wrap around

The law of large numbers at work: With 256 vnodes per server spread across the ring, the sum of each server's vnode segments averages out close to 1/N of the total ring — even though any individual vnode segment might be small or large. More vnodes = more even distribution.

Bonus Benefit: Easier Rebalancing

With vnodes, when you add a new physical server, instead of taking one large chunk from one neighbor, the new server's 256 vnodes each take a small chunk from 256 different existing vnodes (scattered across all other physical servers). The data migration load is spread evenly across the entire cluster — not concentrated on one unlucky neighbor.

Real Implementation: Cassandra

Cassandra uses consistent hashing with virtual nodes (256 by default, configurable) as the foundation of its entire architecture.

Cassandra cluster: 6 nodes, 256 vnodes each = 1536 total ring positions

When you write a row:
  1. Hash the partition key → position on the ring
  2. Walk clockwise to find the "owning" vnode → identifies physical node
  3. Replicate to N nodes clockwise from there (N = replication factor)
     (this is how Cassandra achieves the W/R quorum from Day 2)

Adding a 7th node:
  - New node gets 256 new vnode positions, scattered across the ring
  - Each new vnode "steals" a small range from an existing vnode
  - Data for those ranges streams to the new node
  - ~1/7 of total data moves (not 6/7 or some larger fraction)
  - Cluster remains fully operational during this rebalancing — 
    reads/writes continue normally

This is why Cassandra clusters can grow from 10 nodes to 100 nodes over time, incrementally, without ever taking the cluster offline for a "resharding operation" — directly solving the resharding catastrophe from Day 3.

Real Implementation: Amazon Dynamo

Amazon's Dynamo paper (2007) — which inspired Cassandra, Riak, and DynamoDB — used consistent hashing as its core innovation specifically to solve the incremental scalability problem for their shopping cart and session storage systems, where adding capacity during traffic growth (especially around peak shopping seasons) couldn't require downtime.

Dynamo's specific contribution was combining consistent hashing with the quorum-based replication (W + R > N) from Day 2 — the ring determines which nodes are responsible for a key, and quorum determines how many of those nodes must agree for reads/writes. Consistent hashing answers "where," quorum answers "how consistent."

Real Implementation: Memcached Clusters

Memcached itself has no built-in clustering — each Memcached instance is independent and unaware of others. Consistent hashing happens client-side.

# Client-side consistent hashing for Memcached
memcached_servers = ["cache1:11211", "cache2:11211", "cache3:11211"]
ring = build_consistent_hash_ring(memcached_servers, vnodes=256)

def get_from_cache(key):
    server = ring.get_server(key)  # client decides which server
    return memcached_client[server].get(key)

When a Memcached server is added or removed, the client library's ring recalculates — and because of consistent hashing's core property, only ~1/N of cache keys "miss" on the new ring topology (they'll be re-fetched from the database and re-cached on their new server). Without consistent hashing, adding/removing a Memcached server would invalidate the entire cache — a massive spike in database load as everything is re-fetched simultaneously.

Interview Scenario: "Design a Distributed Cache Using Consistent Hashing"

The structured answer:

"I'd build a ring using a hash function like MD5 or SHA-1, mapping both cache server identifiers and keys onto a fixed-size space — say 0 to 2^32. Each physical cache server would be assigned multiple virtual node positions on the ring — I'd start with around 150-256 vnodes per server, which gives good load distribution without excessive memory overhead for the ring structure itself.

For lookups, a key is hashed to a ring position, and I walk clockwise to find the first vnode — that identifies the owning physical server.

When a server is added, its vnodes claim small ranges from existing vnodes scattered across the ring — so only roughly 1/N of keys need to move or be re-fetched, not the entire cache. This is the critical property: cache availability during scaling events.

For replication and fault tolerance, I'd replicate each key to the next 2 vnodes clockwise from its primary position — so if one server is down, requests fall through to a replica without a full cache miss.

One detail I'd watch for: hot keys. If a single key (a viral post) gets disproportionate traffic, consistent hashing alone doesn't help — that key still lands on one server. I'd combine this with the key salting technique from Day 3 for known hot keys."

Key Takeaways

Consistent hashing maps both servers and keys onto a circular hash space (a "ring") — a key belongs to the first server clockwise from its position.
Adding/removing one server out of N only remaps ~1/N of keys — solving the "modulo remaps everything" problem from Day 3.
Virtual nodes (100-256 per physical server) solve the uneven-segment problem of having too few ring positions, and spread rebalancing load across the entire cluster.
Cassandra uses consistent hashing + vnodes as its core architecture, enabling incremental scaling without downtime.
Amazon Dynamo combined consistent hashing (for "where") with quorum (for "how consistent") — the foundation of DynamoDB, Cassandra, and Riak.
Memcached clusters rely on client-side consistent hashing — without it, adding/removing a cache server invalidates the entire cache.
Consistent hashing doesn't solve hot keys — combine with key salting (Day 3) for that.

What's Next

Topic 23 covers Bloom Filters — the probabilistic data structure that lets Chrome check billions of malicious URLs using almost no memory, and how Cassandra uses them to avoid disk reads for keys that don't exist.

Topic 22 of the System Design Mastery series. The advanced data structures finale begins.*

Tags: system-design consistent-hashing distributed-systems cassandra backend databases interview-prep

DEV Community