Fatih

Posted on Aug 15

Engineering Strategies For Resolving Redis Cluster Imbalance

#redis #distributedsystems #systemdesign #backend

Introduction

I recently recalled a feature I implemented in the past — a Redis cluster with multiple shards, serving millions of requests per day. We had extensive monitoring and alerting in place before it went live, but one risk slipped past my radar: unevenness across shards. Luckily, it never happened in production, but it could have.

In this article, I’ll explain how Redis clustering works, why unevenness occurs, and the strategies I’d use to address it.

Redis Cache, Hash Slots, & Sharded Redis Cluster

Redis is an in-memory key-value store known for its speed and support for rich data structures. To cache data, you provide a key, a value, and (usually) a TTL (time-to-live) so that expired entries are automatically removed.

A single-instance Redis setup needs no slot routing, because every key lives on the same server. A sharded Redis cluster, by contrast, divides the keyspace into 16,384 hash slots (0–16383), and a key’s hash slot is calculated as:

CRC16(key) % 16384

Each shard in the cluster owns a range of hash slots. When you store or fetch a key, Redis determines its hash slot, then routes the request to the shard responsible for that range.
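To make the formula concrete, here is a minimal Python sketch of the slot calculation. It assumes the CRC16 variant Redis Cluster uses is the CCITT/XMODEM one, which Python's binascii.crc_hqx computes when started from 0. The function names are my own, not part of any Redis client. It also handles hash tags, which come up again in the Fix Bad Hashing section.

```python
import binascii

SLOT_COUNT = 16384  # number of hash slots in a Redis cluster


def hash_tag_or_key(key: bytes) -> bytes:
    """Return the part of the key that gets hashed (hypothetical helper)."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # An empty tag such as "{}" is ignored and the whole key is hashed.
        if end != -1 and end != start + 1:
            return key[start + 1:end]
    return key


def key_slot(key: str) -> int:
    """Compute CRC16(key) % 16384 for a key."""
    data = hash_tag_or_key(key.encode("utf-8"))
    # crc_hqx with an initial value of 0 is CRC16-CCITT (XMODEM).
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


print(key_slot("foo"))
# Keys that share a hash tag always land in the same slot.
print(key_slot("{user:1000}.following") == key_slot("{user:1000}.followers"))
```

On a live cluster, the CLUSTER KEYSLOT command reports the same value, so it is a handy way to cross-check a sketch like this.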

Common Threats of An Uneven Redis Cluster

Hot Shard Performance Bottlenecks: Disproportionate traffic on certain shards can max out their CPU or memory and cause latency issues.
Premature Key Evictions: Shards nearing their maximum memory capacity start evicting keys according to their maxmemory-policy configuration, removing eligible entries before they would otherwise expire. When the cluster serves as a caching layer, the resulting cache misses fall through to the primary data source, which can be severely impacted.
Replication Traffic and Failover Issues: High write traffic on hot shards also generates high replication traffic. If the replicas cannot keep up, replication lag grows and they fall too far behind the master. If a lagging replica is then promoted to master during a failover, cache reads will return stale results.

Why Shards Become Uneven

Uneven load across shards can happen for many reasons. Here are the most common:

TTL Skew: Some keys live longer than others. If certain keys have much longer TTLs, their shards may accumulate more data over time, leading to memory imbalance.

Slot Bias: Engineers often format cache keys predictably (e.g., user:12345), so the same parts of the key end up influencing the hash slot calculation. Without randomness, traffic can cluster around specific slot ranges.

Hot Keys: Even with a fair key distribution, some keys get far more traffic. For example, if a certain range of user IDs is more active, those keys can overwhelm the shard(s) that store them, causing CPU and memory hotspots.

Solutions

While there are a number of different workarounds to these issues, it's important as a software engineer to know some of the more common ones.

Isolation

If certain key formats need a much longer life, it might be better to host them on a dedicated cluster to isolate their effects. Separating them helps engineers predict the peaks and troughs of each cluster's performance metrics, since a cluster filled with key-value pairs that behave very differently is hard to monitor.

We can apply the same separation to hot keys. A cluster dedicated to them lets engineers tune the replica count or use larger compute instances to absorb their traffic.

However, keep in mind that the isolation strategy increases management overhead, because you have to scale and monitor more than one cluster. More often than not, the clusters will not share the same scale-out multiplier, because their data grows at different rates. Engineers therefore have to keep monitoring the resulting behavior in production until the scale-out of each cluster has stabilized in a safe position.

Key Sharding

Rather than spinning up a separate cluster, you can put identified hot keys through an additional hashing step that spreads them across other hash slots.
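One way to read this, sketched below, is to keep several copies of a hot value under suffixed keys so that the copies land in different hash slots, and to have each read pick one copy at random. This is an illustration under my own assumptions, not the only way to shard a key: the ":copy:N" suffix, the fan-out of 8, and the function names are all hypothetical.

```python
import binascii
import random

SLOT_COUNT = 16384
HOT_KEY_COPIES = 8  # hypothetical fan-out; tune to your traffic


def key_slot(key: str) -> int:
    # These keys contain no hash tag, so the whole key is hashed.
    return binascii.crc_hqx(key.encode("utf-8"), 0) % SLOT_COUNT


def copy_keys(hot_key: str, copies: int = HOT_KEY_COPIES) -> list:
    """All physical keys that hold a copy of the hot value."""
    return [f"{hot_key}:copy:{i}" for i in range(copies)]


def pick_read_key(hot_key: str, copies: int = HOT_KEY_COPIES) -> str:
    """Choose one copy at random so reads spread across slots."""
    return f"{hot_key}:copy:{random.randrange(copies)}"


# Writers would SET every key in copy_keys(); readers GET pick_read_key().
for key in copy_keys("product:42"):
    print(key, key_slot(key))
```

The trade-off is that every write now touches all copies, so this fits read-heavy hot keys better than write-heavy ones.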

Standardization

Sometimes TTL differences are unintentional. In that case, setting a uniform TTL for all keys might be a feasible option.

Fix Bad Hashing

Avoiding sequential IDs, such as integer entity IDs, in cache keys can help reduce over-grouping, because the hashes of the resulting Redis keys might land in neighboring hash slots.

The Redis hash tag feature lets you tell Redis which part of the key determines the hash slot. When a key contains a non-empty substring wrapped in curly braces, such as {user:1000}, only that substring is hashed. This makes it possible to choose a deterministic hash function to tune your shard selection.

Tips for what goes inside the curly braces (hash tag):

If your primary entity ID seems random enough (UUIDs, hashes, etc.), placing it inside the hash tag can give you a pretty good distribution.
For sequential IDs like integers, it might be worth hashing the ID before placing it inside the curly braces.
Low-cardinality values (fields with very few distinct values) can skew your Redis load across shards even further, because they map to only a small number of hash slots. Therefore, it is important for engineers to analyze the range of possible values before deciding what goes into the hash tag.

Optionally, you can test your key design and simulate the cache data evenness across shards before you release it to production.
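A rough way to run such a simulation offline is sketched below. It assumes each shard owns an equal, contiguous range of hash slots (real clusters can assign slots differently), and the key formats, shard count, and function names are made up for illustration.

```python
import binascii

SLOT_COUNT = 16384


def key_slot(key: str) -> int:
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end > start + 1:  # only a non-empty hash tag is honored
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


def keys_per_shard(keys, shard_count: int) -> list:
    """Count keys per shard, assuming equal contiguous slot ranges."""
    counts = [0] * shard_count
    for key in keys:
        counts[key_slot(key) * shard_count // SLOT_COUNT] += 1
    return counts


def skew(counts) -> float:
    """Busiest shard relative to the mean; 1.0 means perfectly even."""
    return max(counts) / (sum(counts) / len(counts))


# Two hypothetical key designs for the same 100,000 users.
user_keys = [f"user:{i}:profile" for i in range(100_000)]
region_keys = [f"{{region:{i % 3}}}:user:{i}" for i in range(100_000)]

for name, keys in (("user id", user_keys), ("region tag", region_keys)):
    counts = keys_per_shard(keys, shard_count=6)
    print(name, counts, round(skew(counts), 2))
```

The region-tagged design can reach at most three slots, so at least half of the six shards stay empty no matter how many keys you add. This only models key counts. Hot keys need traffic weights on top of it.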

Resharding

As a short-term fix, adding shards partitions the hash slot space into smaller ranges, which can temporarily spread the load more evenly.

However, if the underlying cause of imbalance remains—such as bad key hashing or hot keys—the skew will eventually reappear. If the newly split hash slot ranges don’t accurately isolate the hot keys, the imbalance in GET/SET operations will simply shift from one shard to another. In cases where the root problem is bad hashing, resharding only becomes an effective remedy after reaching very high shard counts—an approach that wastes resources by assigning many shards to underutilized hash slot ranges, reducing cost efficiency.
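The toy model below illustrates why adding shards does not dilute a hot key. It assumes equal contiguous slot ranges per shard and uses invented traffic numbers: one hot slot receiving 10,000 requests on top of light background traffic.

```python
SLOT_COUNT = 16384


def shard_for_slot(slot: int, shard_count: int) -> int:
    """Assumes each shard owns an equal, contiguous slot range."""
    return slot * shard_count // SLOT_COUNT


def load_per_shard(slot_traffic: dict, shard_count: int) -> list:
    load = [0] * shard_count
    for slot, requests in slot_traffic.items():
        load[shard_for_slot(slot, shard_count)] += requests
    return load


# Hypothetical traffic: one request on every 16th slot, plus one hot slot.
slot_traffic = {slot: 1 for slot in range(0, SLOT_COUNT, 16)}
slot_traffic[12182] = 10_000

for shard_count in (3, 6, 12):
    load = load_per_shard(slot_traffic, shard_count)
    hottest = max(load)
    print(shard_count, hottest, round(hottest / sum(load), 2))
```

However many shards are added, the hot slot's requests still arrive at a single shard. Only the background traffic gets spread more thinly.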

Resharding also carries significant operational overhead. During shard addition, Redis migrates keys between shards, which can consume bandwidth and CPU, leading to latency spikes that may exceed acceptable SLAs.

Conclusions

A Redis Cluster imbalance is a serious threat to many aspects of a system.
It can take different forms.
It can be addressed using various strategies, depending on the underlying cause.
Some strategies provide a temporary remedy, while others seek to tackle the root cause permanently.
Each approach has its own benefits and trade-offs, so it’s important to assess urgency, ongoing maintenance needs, and cost implications.

DEV Community