Leonardo Tonezi

Posted on Jun 11

Consistent Hashing: Scaling Stateful Systems Without Reshuffling Your Data

#systemdesign #distributedsystems #backend #programming

In the era of AI, how we handle data matters more than ever. System design decisions like where data lives, how it scales, and how it survives failure are exactly the kind of decisions an LLM can suggest but can't be accountable for. Understanding them is still our job.

One of the techniques quietly powering most modern distributed systems is consistent hashing. Before going further, let's be clear about what problem it solves: this isn't about CPU power or handling more requests per second. It's about distributing stateful work across machines, and figuring out which machine owns which piece of state. That state might be rows in a database, cached values, rate-limit counters, or anything else with locality.

Two kinds of scaling

When engineers say "scale," they usually mean one of two very different things.

Stateless scaling is about handling more requests. You put a load balancer in front of a fleet of identical app servers, and any server can handle any request. Adding capacity is trivial: spin up another instance, register it with the load balancer, done. No server "owns" anything; they're interchangeable.

Stateful scaling is fundamentally harder. The state is too big, too hot, or too specific to live on one machine, so you split it across many. Now every request has a specific destination: the machine that owns the state you're touching. You can't round-robin a read for user:42, because only one machine has that user's data, and you need to route there.

This article is about the second problem.

A quick note on vocabulary

We'll use the word server throughout, because it's concrete and matches the mental model most engineers already have. Just know that in the literature you'll see the same idea called a node (Cassandra, DynamoDB), a shard (MongoDB, Vitess), a worker (distributed job systems), or a bucket (in-memory hash tables, S3 internals). Same concept, different vocabulary depending on the system.

The usual approach: hash modulo N (traditional hashing)

Say you're storing user sessions across 4 Redis servers. Each session is a key like session:user42, and you need a rule that decides which of the 4 servers holds it. The rule is simple: hash(key) % 4. The result is always 0, 1, 2, or 3, one index per server.

server_index = hash("session:user42") % 4

For a static cluster this is hard to beat: one calculation gives you the answer, keys spread evenly, and there is nothing to maintain. The catch is that word "static." And static clusters are a rare luxury today.

You have a system designed to grow, built around a number that can never change. That is the problem. The moment 4 becomes 5, almost every key lands on the wrong server.

Run the numbers and you get roughly 80% of keys reassigned. Not a few. Not some. Most of them.

Your cache exists to protect your database. It absorbs the repetitive reads so the database only has to handle what the cache cannot. A cache stampede inverts that relationship: the cache stops working all at once, and every request that would have been answered in memory is now a query hitting the database simultaneously. Databases are not built for that kind of sudden spike. They slow down, queue up, and eventually fall over.

And if the trigger was not adding a server but losing one, the result is the same: most keys are now pointing at the wrong place.

Modulo hashing is a good default until it isn't. Stable cluster, predictable size, planned changes: use it without hesitation. Dynamic cluster, autoscaling, unplanned failures: that is when you need something better.

The core idea: the ring

Consistent hashing keeps the same starting point: you still hash keys into numbers, but it changes what you do with that number. Instead of mapping it onto a line with N slots, imagine bending the entire hash space into a circle. The hash function produces a number from 0 to some huge maximum, and when you reach the top, you wrap back around to 0. That circle is the ring.

You place your servers on this ring by hashing something stable about each one (its hostname, its IP) to get a position. The ring now has a handful of fixed points on it, one per server:

To find which server owns a key, you hash the key to get its own position on the ring, then walk clockwise until you hit the first server. That server owns the key. That's the whole rule.

When you add a server, only one thing changes: a new point appears on the ring. Keys that sit in the arc just behind that new point now walk clockwise to it instead of continuing to the next server. Every other key on the ring reaches the same server it always did. Nothing else moves.

Only the keys between server-A and the new server-D move, and they move to server-D. On average that's about 1/N of all keys, not the ~80% you lose with modulo hashing. Removing a server is the mirror image: its keys get absorbed by the next server clockwise, and nothing else shifts.

This is the property the whole technique is built around. Add a server, remove a server, lose a server to failure: in every case, only a small slice of keys moves.

What if server positions are distributed unevenly?

Since positions come from a hash function, they are effectively random. Get unlucky and three servers land close together, leaving one enormous arc on the opposite side of the ring.

That arc belongs to a single server. Every key that falls in it goes there, which can easily mean 70% of all traffic hitting one machine while the other two sit mostly idle.

How to fix uneven distribution? Virtual servers.

Instead of placing each physical server once on the ring, you place it many times under different labels. Each of those points is a virtual server, and all of them map back to the same physical machine.

With enough virtual placements spread around the ring, the large gaps disappear. No single machine ends up owning a disproportionate arc because the more entry points you have, the more evenly they fill the circle. Production systems take this pretty far: Cassandra, for example, uses 256 virtual placements per server by default.

Virtual placements also solve hardware imbalance for free. A stronger server just gets more points on the ring, which means more arcs, which means more traffic. A machine with twice the capacity gets twice the points and twice the load. The routing logic never changes.

What happens when a server dies?

Consistent hashing tells you where data lives. It does not protect you if that place stops responding.

When a server goes down, the ring skips it and points to the next server clockwise. If the data was replicated there, the request succeeds and nothing looks broken from the outside. If it was not, the next server has no idea what you are asking for. The rerouting happens either way: what determines whether it works is whether replication was set up in the first place.

This is why production systems that use consistent hashing also replicate each key to the next server clockwise. The data is not exclusive to one machine: its neighbor always has a copy. When the primary goes down, the next server clockwise already has what you need.

Where consistent hashing actually runs

Consistent hashing isn't a textbook curiosity. It's the routing layer inside some of the most widely deployed distributed systems today.

Cassandra, DynamoDB, and Riak all use it as their core data distribution mechanism. Each item's partition key is hashed onto the ring; the first node clockwise owns it. The official Cassandra documentation describes this directly, and the original Amazon Dynamo paper, which both DynamoDB and Riak descend from, is worth reading if you want to see the technique applied at scale alongside the quorum replication it enables.

Memcached client libraries are where many engineers encounter it without realizing. Memcached has no clustering logic of its own; the client decides which server holds a key. Naive clients used modulo hashing, which caused a complete remap of all keys on any topology change. libketama was written at Last.fm specifically to fix this: the original post by its author is a good read and describes exactly the problem we covered earlier in this article.

A concrete example: a Rust web crawler

I built Sandslash, an async SEO auditing crawler in Rust, and consistent hashing shows up there in a concrete way.

The crawler maintains two pieces of per-host state: a token bucket rate limiter (HostRateLimiter, one per hostname, to respect crawl delays) and a robots.txt cache (RobotsCache, one parsed robots file per hostname). In the current single-process design, every worker task shares these via Arc: any worker can handle any URL, and the shared map always has the right state.

Now imagine scaling to multiple processes. Without routing, any process can receive any URL. If four workers all fetch docs.example.com independently, each holds a partial view of the token bucket. The crawler thinks it's consuming 1 request/second per host; the actual rate is 4x. The robots.txt cache gets redundantly populated on every node. State that should be local is now fragmented across machines.

Consistent hashing fixes this at the routing layer. Before dispatching a URL, you hash the hostname onto the ring:

"https://docs.example.com/page1"  ->  ring.assign("docs.example.com")  ->  worker 2
"https://docs.example.com/page2"  ->  ring.assign("docs.example.com")  ->  worker 2
"https://api.example.com/v1"      ->  ring.assign("api.example.com")   ->  worker 0

Worker 2 is now the sole owner of everything related to docs.example.com. Its rate limiter entry is always accurate. Its robots.txt cache is always warm. No cross-process synchronization needed: locality replaces coordination.

One implementation detail worth flagging: the crawler currently uses std::hash::DefaultHasher, which is not guaranteed to be stable across Rust versions or process restarts. In a real multi-process deployment, you'd want a stable algorithm like FNV-1a or xxHash so that all processes agree on the same ring assignments.

Beyond the ring

The ring is the most common implementation, but not the only one.

Rendezvous hashing skips the ring entirely: it scores every server with hash(key + server_id) and picks the highest. No data structures to maintain, but O(n) lookup cost since every server must be scored on each request.

Jump consistent hash (Google, 2014) is stateless and fits in ~10 lines of code. O(1) memory and O(log n) lookup, but only supports adding buckets sequentially: removing an arbitrary server is not supported, which makes it a poor fit for systems where machines can fail unpredictably.

Maglev hashing (Google, 2016) precomputes a lookup table for true O(1) lookups. Built for load balancers where failures are rare; regenerating the table on node loss is slow.

Damian Gryski's Consistent Hashing: Algorithmic Tradeoffs covers all three in depth if you want to go further.

The tradeoff in one sentence

If your cluster never changes, modulo hashing is strictly better: simpler to implement, cheaper to compute, and perfectly uniform. Consistent hashing only earns its place when the cluster is dynamic: machines are added under load, failures happen at runtime, or capacity scales automatically. In those systems, moving 1/N of keys instead of almost all of them isn't an academic property. It's the difference between a topology change being invisible and it taking down your cache layer.

DEV Community