
Allen Helton

Originally published at gomomento.com

Cache Rebalancing Was Broken. Here's How They Fixed It.

Few things make SREs more nervous than rebalancing a cache cluster.

You know the feeling. You add a node, trigger a rebalance, and suddenly the latency graphs start jumping. It's a familiar risk of the job, especially when your cache sits between your users and your database. A small configuration mistake here can unleash a storm of GET requests on your primary data store.

I admit, I never really understood the concept of a slot or why it was needed. But after listening to a recent episode of the Cache It podcast on the new atomic slot migration feature in Valkey 9.0, I finally decided to dig in. The deeper I went (and the more times I replayed the episode), the more it clicked.

My learning adventure led me to ask, and finally answer, four important questions regarding slots and what happens to them when a cluster resizes. Let's dive in.

What are slots?

Most caching systems rely on consistent hashing to decide where data lives. It keeps keys evenly balanced across nodes while allowing clusters to grow or shrink with minimal reshuffling.

Valkey uses a fixed hash-slot model, specifically 16,384 slots, that together represent the entire keyspace. While 16,384 might seem arbitrary, it's actually 2¹⁴, a power of two that offers enough granularity to spread data evenly across large clusters without bloating the routing tables and metadata each node has to track.

Every key hashes to one of those slots, and each node in a cluster owns a subset of them. Rather than mapping every key directly to a node, Valkey maps slots to nodes. Since keys are deterministically hashed to slots, this makes scaling predictable. When you add or remove a node, Valkey only has to move the slots it owns, not millions of individual keys.
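To make that concrete, here's a minimal Python sketch of the two mappings: key to slot (Valkey computes CRC16 of the key modulo 16,384, the CRC-16/XMODEM variant; hash tags are ignored here for brevity) and slot to node. The three-node layout in slot_owners is a made-up example, not output from a real cluster.

```python
# Minimal sketch: how a key maps to a slot, and how a slot maps to a node.
# Valkey uses CRC-16/XMODEM (poly 0x1021, init 0x0000) modulo 16,384.

NUM_SLOTS = 16384  # 2**14

def crc16_xmodem(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_to_slot(key: str) -> int:
    # Real clients also honor hash tags like "{user:123}.profile"; omitted here.
    return crc16_xmodem(key.encode()) % NUM_SLOTS

# Hypothetical three-node cluster: each node owns a contiguous range of slots.
slot_owners = {
    range(0, 5461): "node-a:6379",
    range(5461, 10923): "node-b:6379",
    range(10923, 16384): "node-c:6379",
}

def node_for_key(key: str) -> str:
    slot = key_to_slot(key)
    return next(node for slots, node in slot_owners.items() if slot in slots)

print(key_to_slot("user:1234"), node_for_key("user:1234"))
```

Scaling only ever edits the slot_owners table; the key-to-slot math never changes, which is what makes rebalancing a matter of moving slot ranges rather than rehashing keys.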

When the Valkey client library hashes a key, it already knows which node owns the corresponding slot. If the topology changes, that is, if slots get reassigned to a different node during a scaling event, the node that received the request issues a quick MOVED redirect so the client can retry against the correct owner.
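Client libraries handle this redirect dance for you, but the shape of it is worth seeing. The sketch below fakes the network layer with a stand-in send_command function; only the "MOVED <slot> <host>:<port>" reply format is real, and the node addresses are made up.

```python
# Conceptual sketch of a cluster-aware client following a MOVED redirect.
# The network layer is faked; real client libraries do this under the hood.

class MovedError(Exception):
    """Raised when a node replies: MOVED <slot> <host>:<port>."""
    def __init__(self, slot: int, address: str):
        super().__init__(f"MOVED {slot} {address}")
        self.slot, self.address = slot, address

def send_command(address: str, *args: str) -> str:
    # Stand-in for a real network call: pretend node-a no longer owns slot 1234.
    if address == "node-a:6379":
        raise MovedError(1234, "node-b:6379")
    return f"value from {address}"

def cluster_get(key: str, slot_cache: dict, max_redirects: int = 3) -> str:
    slot = 1234                                    # in practice: CRC16(key) % 16384
    address = slot_cache.get(slot, "node-a:6379")  # best-known owner of the slot
    for _ in range(max_redirects):
        try:
            return send_command(address, "GET", key)
        except MovedError as err:
            slot_cache[err.slot] = err.address     # refresh the local slot map
            address = err.address                  # retry against the new owner
    raise RuntimeError("too many redirects; topology may still be settling")

cache: dict = {}
print(cluster_get("user:1234", cache))  # -> value from node-b:6379
print(cache)                            # slot 1234 now maps to node-b:6379
```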

What was broken about the old migration model?

The old migration model moved one key at a time between nodes, triggering a flurry of redirects and topology changes. Clients were constantly told, "Sorry, this key moved, try over there."

It was essentially a brute-force way of moving slots from one node to another. It worked, but it wasn't elegant – and it definitely wasn't fast.

Each key transfer required multiple round trips, and every slot migration forced clients to refresh cluster topology. Large values could even block the node's main thread while being serialized.

When you're working with millions of keys across your cluster, that adds up to a resource-intensive process that can take minutes to complete, all while the cluster remains live and serving traffic.
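For context, this is roughly what that brute-force flow looks like if you drive it by hand for a single slot with a redis-py/valkey-py-style client. The execute_command calls exist in those clients, but the hosts, ports, and node IDs below are placeholders, and resharding tooling normally automates this loop.

```python
# Rough sketch of the legacy per-key migration loop for one slot, driven with a
# redis-py / valkey-py style client. Hosts, ports, and node IDs are placeholders.
import valkey  # valkey-py; redis-py's redis.Redis works the same way

SLOT = 1234
SOURCE_ID = "source-node-id-placeholder"
TARGET_ID = "target-node-id-placeholder"

source = valkey.Valkey(host="node-a", port=6379)
target = valkey.Valkey(host="node-b", port=6379)

# 1. Mark the slot as in-flight on both sides.
target.execute_command("CLUSTER", "SETSLOT", SLOT, "IMPORTING", SOURCE_ID)
source.execute_command("CLUSTER", "SETSLOT", SLOT, "MIGRATING", TARGET_ID)

# 2. Move keys in batches: one round trip to list them, then a blocking
#    MIGRATE per batch (this is where large values stall the node).
while True:
    keys = source.execute_command("CLUSTER", "GETKEYSINSLOT", SLOT, 100)
    if not keys:
        break
    source.execute_command("MIGRATE", "node-b", 6379, "", 0, 5000, "KEYS", *keys)

# 3. Flip ownership at the end -- and every client has to refresh topology.
for node in (source, target):
    node.execute_command("CLUSTER", "SETSLOT", SLOT, "NODE", TARGET_ID)
```

Multiply that loop by thousands of slots and millions of keys, and it's easy to see where the minutes go.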

The result was instability and elevated tail latencies during migration, which made it something you'd postpone unless you absolutely had to run it.

How does atomic slot migration fix it?

Valkey 9.0 redesigned the slot migration process using the same principles that power replication. Instead of moving keys one by one, atomic slot migration runs in three distinct phases:

Snapshot – The source node forks a background process and captures a point-in-time snapshot of the slots being migrated while continuing to serve live traffic.

Streaming – Any writes that happen during the snapshot are captured in a buffer and streamed incrementally to the target node.

Finalization – Once all data is synchronized, Valkey briefly pauses new writes, sends a final marker, and performs a single, atomic handover.

This three-phase approach eliminates the per-key round trips, the constant client redirects, and the fragile experience that used to come with slot migration. Because the process runs in the background and atomically switches ownership when complete, there is no risky in-between state. Once it finishes, your slots are already assigned to the target nodes without interruption or confusion.
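To illustrate the idea, and only the idea, here's a toy Python model of the three phases for a single slot. It is not Valkey's implementation or API; the real process forks a child for the snapshot, streams over the network, and coordinates ownership through the cluster bus.

```python
# Toy model of the three phases for a single slot. Illustrative only.

source_data = {"user:1": "alice", "user:2": "bob"}   # keys living in the slot
target_data: dict = {}
slot_owners = {1234: "node-a:6379"}                  # what clients route by
write_buffer: list = []                              # writes arriving mid-flight

# Phase 1: snapshot. The source keeps serving traffic; writes that land after
# the snapshot are applied locally *and* recorded for later replay.
snapshot = dict(source_data)
source_data["user:3"] = "carol"                      # a write during migration
write_buffer.append(("user:3", "carol"))

target_data.update(snapshot)

# Phase 2: streaming. Replay the buffered writes so the target converges.
for key, value in write_buffer:
    target_data[key] = value

# Phase 3: finalization. Briefly pause writes, confirm the target has caught
# up, then flip ownership in one step -- no half-owned in-between state.
assert target_data == source_data
slot_owners[1234] = "node-b:6379"
source_data.clear()                                  # source no longer owns the slot

print(slot_owners)   # {1234: 'node-b:6379'} -- clients now route straight to node-b
```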

Why is this better?

For operators, this is a big deal.

Fewer round trips, fewer topology changes, and no split-brain state during migration. You can rebalance entire clusters without disturbing workloads or waking up the on-call engineer (hooray!).

Both models still exist today, but atomic slot migration represents a new standard for reliability. It shows how thoughtful engineering can make the hardest operational tasks feel invisible.

As Khawaja said in the latest Cache It podcast episode with Jacob Murphy, "This moves us closer to a world where scaling a cache never means taking it offline."

And that's a world every SRE wants to live in.

Happy coding!
