Gabriel Anhaia

Posted on May 23

Consistent Hashing's 3 Production Gotchas Nobody Warns You About

#systemdesign #distributedsystems #caching #architecture

Book: System Design Pocket Guide: Fundamentals — Core Building Blocks for Scalable Systems
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

Your consistent-hash ring "minimizes rebalancing." Then one node fails, the rebalance "only" moves 1/N of the keys, and that 1/N happens to be the hottest tenant, hammering the two adjacent nodes until they fall over too. Now you're moving 3/N. Then 5/N. The ring is doing exactly what the paper promised, and your service is down anyway.

The Karger 1997 algorithm is correct. The implementations are usually correct. What goes wrong is the gap between "minimizes movement" on a whiteboard and what 1/N of your traffic actually looks like at 11pm on a Friday.

This post skips the algorithm refresher. If you've drawn the ring you know how it works. What follows is three production failures the diagrams hide, the signals that catch them, and a checklist you can put in your runbook tomorrow.

A 60-second refresher (and a link out)

You hash node IDs onto a ring (usually a 64-bit integer space). You hash each key onto the same ring. A key belongs to the first node you find walking clockwise. Add or remove a node and only the keys between that node and its predecessor move. That's the headline property.

In practice nobody runs the ring with one slot per physical node. The variance is too high. Each node owns many virtual nodes (vnodes) scattered across the ring, which smooths the distribution. The classic reference is Karger et al., Consistent Hashing and Random Trees (STOC 1997). Dynamo's paper (2007) is the more practical read.

That's it. Onto what the diagrams don't show.

Gotcha 1: the rebalance storm

When a node dies, "only 1/N of keys move" is true and useless. Those keys move all at once, to two specific nodes, while every client cache for those keys is cold.

What you see on the dashboard: a step-function p99 spike on two adjacent nodes the moment a third one drops out. Their CPU pegs. Their connection pools saturate. Backend stores behind them get hit fresh for every miss. If the dead node held the hottest tenant's session data, the two neighbors don't just take 1/N of normal traffic. They take 1/N of normal traffic plus a cold-cache stampede for the hottest slice.

A team running a sharded Redis fleet I talked to last year had exactly this. Healthy steady state: 40% CPU per node. One node OOMs at 02:14. Two neighbors jump to 95% within 30 seconds. One of them ages out unrelated keys to make room for the new arrivals, which evicts another tenant's hot set, which now misses against its own shard. Cascade. Total outage: 22 minutes.

The fix is bounded-load consistent hashing. Mirrorball (Vahab Mirrokni, Mikkil Thorup, Morteza Zadimoghaddam, Consistent Hashing with Bounded Loads (2016), arxiv 1608.01350) adds one rule: each node has a capacity cap (typically c = (1 + ε) * average_load, with ε somewhere between 0.1 and 0.5). If the natural owner is over capacity, you walk the ring further until you find a node under cap. Same minimal-movement property, hard ceiling on the worst-case load per node.

Google uses this on their public-facing serving. NGINX has had a consistent variant of upstream hashing since 1.7.2. Envoy ships maglev and ring_hash load balancers, and ring_hash with minimum_ring_size plus the hash_balance_factor knob is bounded-load in practice. Use it.

The other half of the fix is shadow reads on join/leave. When a node enters or leaves the ring, the new owner serves the request but also async-fetches from the predecessor for a short window (say, 60 seconds). You eat the warm-up traffic instead of cold-missing every request. Pinterest's Mcrouter implements this for memcached.

Here's the minimum ring you actually want to read. Python, 80-ish lines, no dependencies beyond stdlib hashlib and bisect:

import bisect
import hashlib
from typing import Optional

class HashRing:
    def __init__(self, nodes=None, vnodes_per_node=150,
                 load_factor=1.25):
        # load_factor = (1 + epsilon) for bounded-load
        self.vnodes_per_node = vnodes_per_node
        self.load_factor = load_factor
        self.ring_keys: list[int] = []
        self.ring: dict[int, str] = {}
        self.loads: dict[str, int] = {}
        self.total_load = 0
        for n in (nodes or []):
            self.add_node(n)

    def _hash(self, key: str) -> int:
        # md5 is fine here; we're not authenticating anyone
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        self.loads[node] = 0
        for i in range(self.vnodes_per_node):
            h = self._hash(f"{node}#{i}")
            bisect.insort(self.ring_keys, h)
            self.ring[h] = node

    def get_node(self, key: str) -> Optional[str]:
        if not self.ring_keys:
            return None
        h = self._hash(key)
        idx = bisect.bisect(self.ring_keys, h) % len(self.ring_keys)
        # bounded-load walk: skip over-capacity owners
        cap = self._capacity()
        for _ in range(len(self.ring_keys)):
            node = self.ring[self.ring_keys[idx]]
            if self.loads[node] < cap:
                return node
            idx = (idx + 1) % len(self.ring_keys)
        # everyone's saturated; return original owner anyway
        idx = bisect.bisect(self.ring_keys, h) % len(self.ring_keys)
        return self.ring[self.ring_keys[idx]]

    def _capacity(self) -> float:
        if not self.loads:
            return float("inf")
        avg = max(1, self.total_load / len(self.loads))
        return avg * self.load_factor

    def track(self, node: str, delta: int = 1) -> None:
        self.loads[node] = max(0, self.loads[node] + delta)
        self.total_load = max(0, self.total_load + delta)

That track() method matters more than the rest. Bounded-load only works if you actually know each node's current load, which means the client has to count requests in flight or RPS per node. If your hash ring doesn't have a feedback signal, you have a ring, not bounded-load.

Gotcha 2: the hot ring segment

Vnodes spread nodes evenly across the ring. They do not spread keys evenly across the nodes.

This is the part the diagrams obscure. The pretty drawing shows little colored dots distributed nicely around a circle and you assume traffic follows. Traffic doesn't follow. Traffic follows Zipf: in most production workloads a tiny fraction of keys gets the majority of requests. The top 1% of keys can take 50% of QPS. The top 0.01% can take 20%. Adding more vnodes does not fix this. Vnodes can't tell that key tenant:apple:session gets a million requests per second while tenant:obscure-startup-42:session gets three.

What you see: one node sits at 80% CPU regardless of how you tune the vnode count. You shuffle vnodes, you redeploy, the same node (or a different node) pegs. It's not your tuning. It's the workload.

Detection has to live at the client. Per-vnode load metrics on the server side show which slice is hot, but by then your nines are already gone. You want to detect a single misbehaving key before it eats a node.

Count-min sketch is the right tool. Sub-linear memory, false positives but no false negatives, constant-time update and query. Every client maintains a sketch over a sliding window. When any key crosses a threshold (say, more than 5% of recent traffic), you mark it hot and route it differently.

import hashlib
import time

class HotKeyDetector:
    def __init__(self, width=4096, depth=4,
                 window_seconds=10, hot_fraction=0.05):
        self.width = width
        self.depth = depth
        self.window = window_seconds
        self.hot_fraction = hot_fraction
        self.counts = [[0] * width for _ in range(depth)]
        self.total = 0
        self.window_start = time.monotonic()

    def _indexes(self, key: str):
        # depth independent hash families via seeded md5
        for i in range(self.depth):
            h = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.width

    def observe(self, key: str) -> bool:
        # reset the window if we've rolled over
        now = time.monotonic()
        if now - self.window_start > self.window:
            self.counts = [[0] * self.width
                           for _ in range(self.depth)]
            self.total = 0
            self.window_start = now

        mins = []
        for row, col in enumerate(self._indexes(key)):
            self.counts[row][col] += 1
            mins.append(self.counts[row][col])
        self.total += 1
        est = min(mins)
        # "hot" if estimated share > threshold
        return est > self.total * self.hot_fraction

Once you've flagged a key hot, you have options:

Replicate it. Pick the next 1-2 nodes clockwise on the ring and write the hot key to all of them. Reads round-robin across the replica set. Mcrouter does this with its "key splitting" mode.
Sidecar it. Pull the hot key out of the ring entirely into a small fast tier: local LRU on every client, or a single shared in-memory store. The price is a more complex invalidation path; the benefit is the ring stops seeing the hot key at all.
Eject the tenant. If one tenant is reliably 10x everyone else, give them their own pool. The ring was never the right tool for a workload where one customer is the workload.

Option 1 is the cheap fix. Option 3 is the right fix for a real heavy tail.

Gotcha 3: virtual-node count mis-tuning

Pick the wrong vnode count and you trade two different problems. Too few: bad distribution variance, with some nodes owning 30% more keys than others, even on a uniform workload. Too many: the ring becomes a fat data structure that costs real time to rebuild on every membership change.

The math is mostly settled. With V vnodes per node and N physical nodes, the standard deviation of per-node load (on uniform keys) scales roughly as 1/sqrt(V*N). So 10 vnodes per node with 10 physical nodes gives you σ ≈ 10%, meaning some nodes will sit 20-30% above average just from variance. That's terrible. 100 vnodes brings σ to about 3%. 200 to about 2%. Beyond 500 the marginal gain is noise.

The cost side: each ring lookup is O(log(V*N)) (binary search on sorted vnode hashes), which is fine. Each membership change is O(V) insertions or deletions, which is also fine for a single event. What hurts is recomputing the ring from a config blob on every node join. That's O(V*N) operations, and at 1000 vnodes × 1000 nodes you're talking real CPU for a routing decision.

The rule of thumb I'd actually write in a runbook:

Cluster size	Vnodes per node
< 10 nodes	200-500
10-100 nodes	100-200
100-1000 nodes	40-100
> 1000 nodes	20-40

Cassandra's default is 256, which assumes you'll grow into a moderate cluster. Riak defaults to 64 per node assuming a small cluster. Dynamo went much higher because they wanted near-perfect distribution. Pick the cell that matches your scale and don't fiddle until you see actual variance numbers in production.

The thing that catches people: deploy-time membership change tail latency. A new node joins, the ring recomputes, every client refreshes its view, and for ~200ms request routing is slow because the ring is being rebuilt. You see it as a P99 blip on every rolling restart. The fix is not "fewer vnodes." The fix is to swap the ring data structure atomically on update (read-copy-update style), so requests in flight see the old ring until the new one is committed.

Detection signals that catch all three

You want three signals wired into your dashboards, plus one that lives in your client:

Per-vnode load: RPS, CPU, p99 latency per vnode, not per node. Hot ring segments show up here before they show up as node-level. Most off-the-shelf metric setups don't expose this; you have to instrument the lookup path to tag each request with the vnode hash it landed on.
Ring-recompute time: gauge the time spent rebuilding the ring on membership change. Alert if it exceeds your p99 SLA. This catches vnode-count drift.
Client-side key frequency: output of the count-min sketch above. Top-k hot keys per minute, exported as a metric.
Rebalance velocity: bytes (or keys) per second moving between nodes during a rebalance. If this saturates inter-node bandwidth you're in a storm, not a rebalance.

When consistent hashing is the wrong tool

Consistent hashing is great when nodes come and go often. It's overkill when they don't. For a cluster where membership changes monthly, rendezvous hashing (also called Highest Random Weight, HRW, from David Thaler and Chinya Ravishankar, 1996) is simpler and often better.

The algorithm: for each key, compute hash(key, node_id) for every node, pick the node with the highest score. That's it. No ring, no vnodes, no rebuild. Adding a node means every key that scored highest on it now lives there. Same minimal-movement property as the ring, without the data structure.

def hrw_pick(key: str, nodes: list[str]) -> str:
    return max(nodes, key=lambda n:
               int(hashlib.md5(f"{key}|{n}".encode())
                   .hexdigest(), 16))

Lookup is O(N) instead of O(log(V*N)). That's the catch. For 20 nodes, HRW is faster than a ring lookup because you skip the sorted-array overhead. For 2000 nodes, the ring wins. HRW also gives you a natural top-k ordering for free, which is useful for replication: "store the key on the top 3 nodes by HRW score" beats "store it on the next 2 vnodes clockwise" because the latter can pick two vnodes that map to the same physical node.

If you're running a fixed-size cluster or one that scales by 10s of nodes per year, HRW is probably simpler than what you have now.

A production checklist

Five things, pin them above your desk:

Bounded-load consistent hashing, not raw consistent hashing. ε between 0.1 and 0.5 depending on how spiky your traffic is.
Client-side hot-key detection via count-min sketch. Threshold around 5% of recent traffic, window 10-30 seconds. Replicate, sidecar, or eject anything that crosses it.
Vnode count tuned for your scale: 100-200 per node under 100 nodes, fewer at higher scale. Variance ≤ 5% on the synthetic distribution test.
Per-vnode metrics, not just per-node. RPS, p99, CPU. Hot ring segments are invisible without this.
Shadow reads on node join and leave for 30-60 seconds. The new owner async-warms from the predecessor; cold-miss storms get absorbed.

The ring works. The math is right. What goes wrong is everything around the ring: the load that isn't uniform, the cache that isn't warm, the membership change that takes longer than your p99 budget. Most production consistent-hashing outages aren't bugs in the hash. They're missing pieces in the system around the hash.

What's the worst consistent-hash incident you've had to debug? Drop the post-mortem in the comments.

If this was useful

Consistent hashing sits in the section of System Design Pocket Guide: Fundamentals on load distribution and sharding. The chapter walks the math for vnode variance, the bounded-load extension, and rendezvous hashing side by side, with the operational checklist this post ends on expanded into a longer runbook. If you operate any sharded store or distributed cache, that chapter is worth the price of the book on its own.