- Why shard a cache and what success looks like
- When consistent hashing beats rendezvous — and when it doesn't
- Tactics for hotspots, rebalancing, and the metadata you need
- Client-side routing, failure modes, and automated recovery
- Practical runbook: implementable checklist and code snippets
Sharding a cache at millions of RPS is a mapping problem with operational consequences: the mapping you choose determines how much data moves on every join/leave, how concentrated hot keys become, and whether a single failure turns into a backend storm. Get the mapping, rebalancing and routing wrong and you trade sub-millisecond p50s for cascading p99s and pages at 02:00.
The symptoms that bring you here are familiar: sudden drops in cache hit ratio during resizes, one node taking the brunt of a hot key, rebalancing that triggers a spike in backend QPS, and client libraries diverging on the live mapping so invalidations miss targets. At very large scale those failures don’t look like small blips — they translate into measurable business impact (high p99s, user-visible errors and long tail latency that ruins UX) and expensive firefighting.
Why shard a cache and what success looks like
Sharding (or partitioning) turns a monolithic cache into many smaller, horizontally-scaled stores so you can scale memory and throughput linearly while keeping single-node latency low. Your design goals should be explicit and measurable:
- Capacity and throughput: linear or near-linear scaling of QPS and memory as you add nodes.
- Minimal disruption: adding/removing a node should move only a small fraction of keys (the minimal disruption property).
- Operational predictability: rebalances must be staged and observable; operations should be automatable.
- Cost per request: avoid over-replicating and keep the cache cost-efficient.
- Low stale-data rate: your chosen consistency trade-offs must be explicit.
These goals map directly to metrics you must monitor: cache_hit_ratio, p50/p95/p99 latency per operation, per-node QPS/CPU, eviction rate, and the rate of origin DB fallbacks when the cache misses spike.
When consistent hashing beats rendezvous — and when it doesn't
You have two widely used families of approaches: ring-based consistent hashing (with virtual nodes/vnodes) and rendezvous hashing (Highest Random Weight, HRW). Each solves the minimal-disruption requirement but with different operational trade-offs.
| Characteristic | Consistent hashing (ring + vnodes) | Rendezvous hashing (HRW) |
|---|---|---|
| Concept | Place many token points per server on a ring; key goes to nearest clockwise token. |
Score every server for a key with h(key, server); pick highest score. |
| Rebalancing behavior | Minimal if you use many vnodes; movement concentrated on neighbors unless vn/planned tokens used. | Minimal and uniform: removed/added node only affects keys that chose that node. |
| Memory/metadata | Small routing table: sorted token list; needs vnode count + token list. | Needs full node list and hash function; client computes nodes * keys scores for naive selection. |
| Performance at high node counts | O(log N) lookup (binary search) per key; needs O(V) metadata per node. |
Naive O(N) hash ops per lookup; can be optimized (partial evaluation, caching). |
| Weighted nodes | Supported via vnode counts or repeated tokens. | Natural: add node weight into score computation. |
| Simplicity | Conceptually older; widely used in caching/memcached implementations. | Simpler to reason about; often preferred for weighted selection. |
Key references: the ring approach originated in the consistent hashing work that targeted distributed caching and hot-spot relief . Rendezvous/HRW hashing predates it and is described in Thaler & Ravishankar’s name-based mappings work . Use cases and production notes (Dynamo, Cassandra, large-scale load-balancers) show both algorithms in practice .
Contrarian, practical insight: at very large node counts (hundreds to thousands), the operational cost (configuration metadata and client/library behavior) matters more than asymptotic complexity. Rendezvous looks more CPU-heavy per lookup, but it eliminates the need for virtual nodes and complex token management; consistent hashing + vnodes reduces variance but trades more metadata and careful token assignment. Jump consistent hash provides a fast, low-memory mapping into numbered buckets but requires bucket numbering to be compact and sequential — making it cleaner for storage partitioning but less flexible for node lifecycles in arbitrary ID spaces .
Tactics for hotspots, rebalancing, and the metadata you need
Hot keys and rebalances break otherwise-good mappings. Your playbook must combine detection, surgical mitigation, and safe rebalancing.
Detection and telemetry
- Track per-key QPS with sampling or a heavy-hitters sketch (e.g., Count-Min or top-k sampling). Set alerts on keys crossing operational thresholds.
- Observe per-node
evictions/sec,cpu, and headroom (connection queue length). Hot nodes frequently show high CPU and risingevictions/seclong before p99 degrades. - Measure origin fallback QPS — this is the signal that cache misses are hurting the backend.
Hotspot mitigation patterns
- Replication of hot keys: Create N replicas of a hot key and route reads to the least-loaded replica. Use rendezvous hashing over the replica set to choose the least-loaded target for a given client (this keeps routing deterministic and cheap to compute).
- Dynamic fan-out (read splitting): For heavy multi-key fetches, split the query across replicas to avoid a single server handling all fan-in. Facebook’s memcache engineering work shows replication and “shunting” patterns to handle storms and to convert failures into cache hits for a period .
- Sub-sharding (logical splits): For very hot keys, split the key namespace for that single key into shards (append a suffix produced by hashing a request attribute) and aggregate on read-side client code. This turns a single hot key into many smaller hot keys.
- Traffic shaping: Backpressure or token-bucket rate limit per key at the proxy/client layer to avoid backend overload on misses.
Safe rebalancing and warm-up
- Use vnodes (virtual nodes / many tokens per physical server) to spread the reshuffle across the cluster; DataStax/Cassandra docs recommend dozens to hundreds of tokens per node depending on cluster heterogeneity and scale .
-
Pre-warm new nodes: stage a new node in a
drain/copymode and perform background key pulls (or streaming replication) before exposing it to full traffic. Mark nodenot-readyin routing metadata until warm-up completes. Facebook and other large deployments prefill caches during rebalances to avoid a miss storm . - Staged config rollout: publish a new ring/config with a version id, deploy to clients as a staged rollout (e.g., % of clients), watch hit ratio and origin QPS, ramp if safe. Use sticky clients (delay ring switch by a small window) to allow warm-up while reducing simultaneous cold-starts.
Metadata you must persist and distribute
-
ring_version/ config epoch (atomic updates reduce split-brain in clients) - Token list (for consistent hashing) or node list + weights (for HRW)
- Node health and
stateflags (up,draining,maintenance,not-ready) - Replica preference lists and zone/rack affinity (for locality-aware routing)
- Per-node capacity weights (for heterogeneous hardware) Choose a coordination mechanism that fits your availability model: gossip for decentralized resilience or a central store (etcd/consul) for strong, easily observable, atomic updates (trade-offs exist; Dynamo-style systems use decentralized membership and preference lists) .
Important: Invalidation and mutation propagation is the trickiest part of cache correctness at scale — if your mapping and membership diverge across clients, invalidations miss and stale reads multiply.
Client-side routing, failure modes, and automated recovery
You must choose where routing logic lives: in the client library, in a local sidecar/proxy (mcrouter, twemproxy), or in a central service. Each has different failure and automation trade-offs.
Proxies vs client libraries
- Client libraries reduce network hops and can exploit in-process caches and batching, but you must update library configuration atomically and consistently across thousands of clients.
-
Sidecar/proxy layer (e.g.,
mcrouter,twemproxy) centralizes routing, simplifies client binaries and allows richer routing policies, online reconfiguration, and health checks; Twitter’stwemproxyand Facebook’smcrouterare production-proven examples with server ejection, online reconfiguration and stats . Use proxies when you want uniform control over routing behavior or when client updates are expensive at scale.
Common failure modes and responses
- Node crash / transient network blips: immediate remap of keys to surviving nodes. If remap is not staged, you get sudden miss spikes. Mitigate with replication and local fallback caches.
-
Network partition and split-brain: avoid concurrent incompatible
ring_versionupdates; require a quorum/health check policy for flipping a config toactive. - Flapping nodes: avoid immediate removal of flapping nodes; use exponential backoff and require multiple consecutive health-check failures before auto-ejection.
- Cold-start storms: when many clients see a new node simultaneously, origin QPS spikes. Stage rollouts and pre-warm to prevent this.
Automation and observability primitives you should implement
-
Auto-eject: temporarily mark hosts as down after N consecutive failures; automatically reintroduce after health check passes (both
twemproxyandmcroutersupport auto-ejection features) . -
Versioned config delivery: publish
ring_versionand atomically swap in the new configuration. Clients should checkring_versionand delay swap untilprewarmOR be able to prefer old mapping for short windows. - Automated reheating: background copy jobs to move hot items to new nodes before fully enabling them.
- Shadowing and traffic mirroring: mirror a percentage of production traffic to a candidate node/pool before committing it to the ring (mcrouter-style traffic shadowing used for safety) .
-
Instrumentation:
node.qps,node.cpu,node.evictions_per_sec,key.qps_sampled,origin_qps— set clear SLIs and automated rollbacks on threshold breaches.
Practical runbook: implementable checklist and code snippets
Below are concrete steps and code you can drop into a design doc and use as a checklist.
Checklist — initial design
- Decide mapping algorithm:
consistent-hash(ring + vnodes) orrendezvous(HRW). - Choose
num_vnodesper physical node (start 64–256 for uniform hardware; DataStax docs have guidance). - Establish metadata service:
etcd/consulfor atomic ring updates or a gossip protocol for decentralized membership (document your reasoning). - Build client libraries and/or deploy a proxy (
mcrouter/twemproxy) with health-check + auto-eject support. - Implement heavy-hitter telemetry and alerts (per-key QPS sampling).
- Plan a staged rebalancing process with pre-warm and rolling traffic ramp.
Checklist — safe node add/remove procedure (operational)
- Provision node and mark
not-readyin metadata. - Pre-warm: background-copy hot keys or stream partitions from neighbors.
- Expose the node to a small percentage (e.g., 5–10%) of clients for 5–15 minutes while monitoring
origin_qpsandcache_hit_ratio. (Adjust windows to your workload.) - If metrics stable, ramp to 25%, then 50%, then 100%. Each step should be surfaced with an automated health gate.
- If adverse signals appear, immediately remove the node from the ring and trigger an automated rollback. Monitor origin QPS for 10 minutes after rollback to confirm recovery.
Hot-key mitigation runbook
- If
key.qps> hot-threshold:- Create logical replicas for the key and update the replica list in metadata.
- Use rendezvous hashing to pick which replica a client should read from: compute
hrw(key, replica)and prefer the least loaded of the top-K candidates. - For writes, perform a single-writer or strongly coordinated path (depends on your consistency model) to avoid write races.
Code: simple Rendezvous (HRW) selection (Python)
import hashlib
from typing import List, Tuple
def hrw_choose(key: str, nodes: List[Tuple[str, float]]) -> str:
"""
nodes: list of (node_id, weight)
returns chosen node_id for key using weighted HRW
"""
best = None
best_score = -1
for node_id, weight in nodes:
h = hashlib.sha256(f"{key}|{node_id}".encode()).digest()
score = int.from_bytes(h[:8], "big")
# incorporate weight (e.g., multiply score by weight or use more advanced mapping)
scaled = score * weight
if scaled > best_score:
best_score = scaled
best = node_id
return best
# Example usage:
nodes = [("nodeA", 1.0), ("nodeB", 0.5), ("nodeC", 1.5)]
winner = hrw_choose("user:42", nodes)
Code: consistent hashing with vnodes (Python skeleton)
import bisect
import hashlib
class ConsistentRing:
def __init__(self):
self.ring = [] # sorted list of token ints
self.token_to_node = {} # token -> node_id
def _hash(self, key: str) -> int:
return int.from_bytes(hashlib.md5(key.encode()).digest(), 'big')
def add_node(self, node_id: str, vnode_count: int = 128):
for i in range(vnode_count):
token = self._hash(f"{node_id}#{i}")
bisect.insort(self.ring, token)
self.token_to_node[token] = node_id
def remove_node(self, node_id: str):
tokens = [t for t, n in self.token_to_node.items() if n == node_id]
for token in tokens:
idx = bisect.bisect_left(self.ring, token)
if idx < len(self.ring) and self.ring[idx] == token:
self.ring.pop(idx)
del self.token_to_node[token]
def get_node(self, key: str) -> str:
token = self._hash(key)
idx = bisect.bisect_right(self.ring, token) % len(self.ring)
return self.token_to_node[self.ring[idx]]
Operational knobs you should expose in config
-
num_vnodesper node (if using ring) -
node_weightfor heterogeneous capacity -
auto_eject_fail_limitandauto_eject_retry_ms(for proxies) -
prewarm_enabledandprewarm_window_seconds -
ring_versionandmin_clients_for_version_swap
Monitoring and automation thresholds (examples you should tune)
- Alert if
origin_qpsincreases by >20% over baseline during a rebalance (rollback). - Alert if
cache_hit_ratiodrops >5 percentage points in 5 minutes post-change. - Auto-eject node after N consecutive request failures (e.g., 3) with exponential backoff.
A few pragmatic optimizations you’ll use in practice
- Use vnodes to spread ownership and reduce variance on join/leave .
- Use shadow traffic to pre-validate routing changes before making them authoritative (mcrouter style) .
- Prefer replication for hot keys to sharding them finer — replication simplifies reads and provides headroom quickly .
- Use jump consistent hash for storage-oriented mappings where buckets are linearly numbered — it’s fast and memory-light but requires sequential bucket ids .
Sources
Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web (Karger et al., STOC 1997) - Introduced consistent hashing and the ring continuum idea used in distributed caching.
Using Name-Based Mappings to Increase Hit Rates (Thaler & Ravishankar, Microsoft Research, 1998) - Describes the Highest Random Weight / rendezvous hashing algorithm and analysis.
Dynamo: Amazon’s Highly Available Key-value Store (DeCandia et al., 2007) - Real-world use of consistent hashing, preference lists, and operational practice for large-scale key-value systems.
A Fast, Minimal Memory, Consistent Hash Algorithm (Jump Consistent Hash) — Lamping & Veach (2014) - Describes jump consistent hash: low-memory, fast mapping suited to sequential bucket IDs.
Maglev: A Fast and Reliable Software Network Load Balancer (Google Research, NSDI 2016) - Practical design of a stable mapping (Maglev) used for connection consistency with discussion of table-based mapping and minimal disruption.
Scaling Memcache at Facebook (Rajesh Nishtala et al., NSDI 2013) - Production engineering lessons for huge memcache deployments including replication and mitigation patterns for hotspots.
mcrouter (Facebook) — GitHub project and docs - Production memcached router with online reconfiguration, shadowing and routing features used at scale.
twemproxy / nutcracker (Twitter) — GitHub project and docs - Lightweight proxy supporting consistent hashing modes and auto-eject features for memcached/redis pools.
Virtual nodes (vnodes) documentation — Apache Cassandra / DataStax - Practical guidance on vnode counts and how vnodes affect rebalancing and heterogeneity.
libketama: consistent hashing library for memcached clients (background and usage notes) - Historical practical implementation (Ketama) and how it places multiple server points on a continuum for memcached routing.
Top comments (0)