Aonnis

Posted on • Originally published at docs.aonnis.com

Thundering Herds: The Scalability Killer

Imagine it’s 3:00 AM. Your pager goes off. The dashboard shows a 100% CPU spike on your primary database, followed by a total service outage. You look at the logs and see a weird pattern: the traffic didn't actually increase, but suddenly every single request started failing at the exact same millisecond.

You’ve just been trampled by the Thundering Herd.

In this post, we’re going to dive into one of the most common yet misunderstood performance bottlenecks in distributed systems: the Thundering Herd problem, and how to use a combination of Request Collapsing and Jitter to build systems that don’t collapse under their own weight.

What is the Thundering Herd?

At its core, the Thundering Herd occurs when a large number of processes are waiting for the same event. When the event happens, they all wake up at once, even though only one of them can actually "handle" it.

While the term originated in OS kernel scheduling, modern web engineers most frequently encounter it in the form of a Cache Stampede.

The Anatomy of a Crash:

  1. The Golden State: You have a high-traffic endpoint cached in Redis. Everything is fast.
  2. The Expiry: The cache TTL (Time-to-Live) hits zero.
  3. The Stampede: 5,000 concurrent users refresh the page. They all see a cache miss.
  4. The Collapse: All 5,000 requests hit your database simultaneously to re-generate the same data. The database load surges, latency skyrockets, and the service goes down.

Note: 5,000 requests is an arbitrary number; the actual threshold depends on your system's capacity.

Beyond the Cache: Other Thundering Herd Scenarios

While cache stampedes are the most common, the Thundering Herd can manifest across your entire stack:

1. The "Welcome Back" Surge (Downstream Recovery)

Imagine your primary Auth service goes down for 5 minutes. During this time, every other service in your cluster is failing and retrying. When the Auth service finally comes back up, it is immediately hit by a large number of requests per second from all the other services trying to "catch up." This often knocks the service right back down again, a phenomenon known as a Retry Storm.

2. The Auth Token Expiry

In microservices, many internal services might share a common access token (like a machine-to-machine JWT). If that token has a hard expiry and 50 different microservices all see it expire at the exact same second, they will all "thunder" toward the Identity Provider to get a new one.
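
One common mitigation (a preview of the Jitter strategy later in this post) is for each service to refresh the token a little before it expires, offset by a per-service random amount so the refreshes don't line up. Here is a minimal sketch; the fetch_new_token callable, which returns a token and its lifetime in seconds, is a hypothetical stand-in for your Identity Provider client:

import random
import time

class TokenCache:
    """Caches a shared machine-to-machine token and refreshes it early, with jitter."""

    def __init__(self, fetch_new_token, early_window=60, jitter=30):
        # fetch_new_token: hypothetical callable returning (token, seconds_until_expiry)
        self._fetch = fetch_new_token
        self._early = early_window  # start refreshing this many seconds before expiry
        self._jitter = jitter       # spread refreshes over this extra window
        self._token = None
        self._refresh_at = 0.0

    def get(self):
        now = time.time()
        if self._token is None or now >= self._refresh_at:
            token, ttl = self._fetch()
            self._token = token
            # Each service picks a slightly different refresh time, so 50 services
            # don't all hit the Identity Provider in the same second.
            self._refresh_at = now + ttl - self._early - random.uniform(0, self._jitter)
        return self._token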

3. "Top of the Hour" Scheduled Tasks

A classic ops mistake is scheduling a heavy cleanup cron job to run at 0 0 * * * (midnight) across 100 different server nodes. At precisely 12:00:00 AM, your database or shared storage is hit by 100 heavy processes simultaneously.
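
One common fix, sketched below in Python as an illustration, is to add a random "splay" at the start of the job so the 100 nodes spread their work over a window instead of starting in the same second. The run_cleanup() function is a placeholder for the actual heavy job:

import random
import time

def run_cleanup():
    # Placeholder for the actual heavy job (DB cleanup, report generation, etc.)
    pass

def main():
    # Each node sleeps a different amount (0-10 minutes) before doing the heavy work,
    # spreading 100 simultaneous midnight jobs across a 10-minute window.
    splay = random.uniform(0, 600)
    print(f"Sleeping {splay:.0f}s before cleanup to avoid the herd...")
    time.sleep(splay)
    run_cleanup()

if __name__ == "__main__":
    main()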

4. CDN "Warm-up" and Deployment Surge

When you deploy a new version of a 500MB mobile app binary, it isn't in any CDN edge caches yet. If you immediately notify 1 million users to download it, the first thousands of requests will all miss the edge and hit your origin server at once, potentially melting your storage layer.


How to Detect the Herd (Monitoring & Metrics)

You don't want your first notification of a thundering herd to be a total outage. Look for these "herd signatures" in your dashboard:

  • Correlation of Cache Misses and Latency: A sudden spike in cache miss rates that perfectly aligns with a surge in p99 database latency.
  • Connection Pool Exhaustion: If you see your database connection pool hitting its max limit within milliseconds, you likely have a stampede.
  • CPU Context Switching: On your application servers, a massive spike in "System CPU" or context switches indicates that thousands of threads are waking up and fighting for the same locks.
  • Error Logs: Thousands of "lock wait timeout" or "connection refused" errors occurring in a tight cluster.

Strategy 1: Request Collapsing (The "Wait in Line" Approach)

Request collapsing (also known as Promise Memoization) is the practice of ensuring that for any given resource, only one upstream request is active at a time.

If Request A is already fetching user_data_123 from the database, Requests B, C, and D shouldn't start their own fetches. Instead, they should "subscribe" to the result of Request A.

The Problem with Naive Collapsing

If you implement a simple lock, you often run into a secondary issue: Busy-Waiting. If 4,999 requests are waiting for that one database call to finish, how do they know when it's done? If they all check "Is it ready yet?" every 10ms, you’ve just created a new herd in your application memory.

The Solution: Event-Based Notification

To fix this, we need to move from a polling (pull) model to an event-based (push) notification model. Instead of repeatedly asking "Is it done?", the waiting requests should simply go to sleep and be woken up when the data is ready.

In Python or Node.js, this is often handled natively by Promises or Futures. In other languages, you might use Condition Variables or Channels.

Here is a Python example using asyncio. Notice how we use a shared Event object. The "followers" simply await the event, consuming zero CPU while they wait for the "leader" to finish the work.

import asyncio

class RequestCollapser:
    def __init__(self):
        # Stores the events for keys currently being fetched
        self.inflight_events = {}
        self.cache = {}

    async def get_data(self, key):
        # 1. Check if data is already in cache
        if key in self.cache:
            return self.cache[key]

        # 2. Check if someone else is already fetching it
        if key in self.inflight_events:
            print(f"Request for {key} joining the herd (waiting)...")
            event = self.inflight_events[key]
            await event.wait()  # <--- Crucial: Zero CPU usage while waiting
            return self.cache.get(key)

        # 3. Be the "Leader"
        print(f"Request for {key} is the LEADER. Fetching from DB...")
        event = asyncio.Event()
        self.inflight_events[key] = event

        try:
            # Simulate DB fetch
            await asyncio.sleep(1) 
            data = "Fresh Data"
            self.cache[key] = data
            return data
        finally:
            # 4. Notify the herd
            event.set() # Wakes up all waiters instantly
            del self.inflight_events[key]
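
As a quick sanity check, a driver like the following (a minimal sketch, not part of the class above) fires five concurrent requests for the same key; only one of them should log the LEADER line, while the rest join the herd and wait:

async def main():
    collapser = RequestCollapser()
    # Five concurrent requests for the same key: one becomes the leader,
    # the other four wait on the shared Event instead of hitting the DB.
    results = await asyncio.gather(
        *(collapser.get_data("user_data_123") for _ in range(5))
    )
    print(results)  # ['Fresh Data', 'Fresh Data', 'Fresh Data', 'Fresh Data', 'Fresh Data']

asyncio.run(main())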

The Giant Herd: Distributed Collapsing

The Python example above works perfectly for a single server. But what if you have 100 app servers? You still have 100 "leaders" hitting your database at once, which may or may not be a problem, depending on your database. If you want to protect your system from this edge case, you need to coordinate leadership across the entire cluster.

To solve this at scale, you can use:

  1. Distributed Locks (Redis/Etcd): Use a library like Redlock to ensure only one node in the entire cluster becomes the leader for a specific key (a simplified sketch follows this list).
  2. The "Singleflight" Pattern: In Go, the golang.org/x/sync/singleflight package is the gold standard for this. It handles the local collapsing logic efficiently, and when combined with a distributed lock, it protects both your app memory and your database.
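
As a concrete illustration of option 1, here is a simplified sketch using redis-py. It is not full Redlock (which coordinates across several Redis nodes); it relies on a single best-effort SET NX lock, which is usually enough to shrink 100 cluster-wide leaders down to roughly one. The fetch_from_db() function and the key names are placeholders:

import json
import random
import time
import uuid

import redis

r = redis.Redis()

def fetch_from_db(key):
    # Placeholder for the expensive query you are protecting.
    return {"key": key, "value": "fresh data"}

def get_with_distributed_collapse(key, ttl=60, lock_ttl=10):
    cached = r.get(f"cache:{key}")
    if cached is not None:
        return json.loads(cached)

    token = str(uuid.uuid4())
    # Best-effort lock: only one node in the cluster wins and becomes the leader.
    if r.set(f"lock:{key}", token, nx=True, ex=lock_ttl):
        try:
            data = fetch_from_db(key)
            r.set(f"cache:{key}", json.dumps(data), ex=ttl)
            return data
        finally:
            # Simplified release; production code should use a Lua script to delete
            # only if the token still matches (avoids releasing someone else's lock).
            r.delete(f"lock:{key}")

    # Followers: wait briefly (with jitter) for the leader to populate the cache.
    for _ in range(20):
        time.sleep(0.05 + random.uniform(0, 0.02))
        cached = r.get(f"cache:{key}")
        if cached is not None:
            return json.loads(cached)

    # Last resort if the leader failed or is slow: fetch directly ourselves.
    return fetch_from_db(key)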

Strategy 2: Jitter (The "Social Distancing" for Data)

Request collapsing protects a single hot key, but many herds are caused by synchronized timing: retries, expirations, and scheduled jobs that all line up. This is where Jitter comes in. Jitter is the introduction of intentional, controlled randomness to stagger execution.

Staggered Retries

When a request finds that a resource is being "collapsed" (someone else is already fetching it), don't let it retry on a fixed interval.

  • Bad: Retry every 50ms.
  • Good: Retry every 50ms + random(0, 20ms) (a minimal sketch follows this list).
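
Here is a minimal sketch of such a jittered wait loop; check_if_ready is a placeholder for whatever "is the collapsed result available yet?" check your system exposes:

import random
import time

def wait_for_result(check_if_ready, base_delay=0.05, jitter=0.02, max_attempts=50):
    """Poll for a collapsed result with jittered delays so waiters don't re-synchronize."""
    for _ in range(max_attempts):
        result = check_if_ready()
        if result is not None:
            return result
        # 50ms + random(0, 20ms): every waiter lands at a slightly different time.
        time.sleep(base_delay + random.uniform(0, jitter))
    raise TimeoutError("Result was not ready after max_attempts retries")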

Staggered Expirations

Never set a hard TTL on a batch of keys. If you update 10,000 products and set them all to expire in exactly 1 hour, you are scheduling a disaster for exactly 60 minutes from now. Instead, use TTL = 3600 + (rand() * 120). This spreads the "thundering" over a 2-minute window, which your database can likely handle.
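
With redis-py, that might look like the following sketch (the base TTL and jitter window are illustrative, and r is assumed to be a connected Redis client):

import random

def set_with_jittered_ttl(r, key, value, base_ttl=3600, jitter=120):
    # Spread expirations across [base_ttl, base_ttl + jitter] instead of letting
    # 10,000 keys expire in the same second.
    ttl = base_ttl + random.randint(0, jitter)
    r.set(key, value, ex=ttl)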

The Pro Move: Probabilistic Early Refresh

The most resilient systems I've built use a technique called X-Fetch. Instead of waiting for the cache to expire, we use jitter to trigger a refresh slightly before expiration.

As the TTL approaches zero, each request performs a "dice roll." If the roll is low, that specific request takes the lead, re-fetches the data, and resets the cache. Because the "roll" is random for every request, it is overwhelmingly likely that only one (or a handful of) requests trigger the update, while everyone else keeps getting the "stale but safe" data.

import time
import random
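
# Assumed helpers (not shown): `cache` is an async cache client whose get() returns
# an object with .data and .expiry (a unix timestamp), collapse_request() applies the
# collapsing pattern from Strategy 1, and fetch_from_db() is the expensive loader.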

async def get_resilient_data(key):
    cached = await cache.get(key)

    should_refresh = False

    # 1. Handle Cache Miss
    if cached is None:
        should_refresh = True
    else:
        # 2. Calculate time remaining
        time_remaining = cached.expiry - time.time()

        # 3. Handle Negative Time (Expired) or Probabilistic Check
        if time_remaining <= 0:
            should_refresh = True
        else:
            # Probability increases as time_remaining approaches 0
            # Note: We check <= 0 above to avoid DivisionByZero or negative probability
            should_refresh = random.random() < (1.0 / time_remaining)

    if should_refresh:
        try:
            # Collapse requests using a distributed lock or local future map
            return await collapse_request(key, fetch_from_db)
        except Exception:
            if cached: 
                return cached.data # Fallback to stale data on DB failure
            raise

    return cached.data

Final Defense: Safety Nets

Sometimes, despite your best efforts with Jitter or Collapsing, a herd still breaks through. In those moments, you need a final line of defense to keep your system alive:

  1. Load Shedding: When your database connection pool is full, don't keep queuing requests (which just increases latency). Start dropping them with a 503 Service Unavailable. It’s better to fail 10% of users quickly than to make 100% of users wait 30 seconds for a timeout (a minimal sketch follows this list).
  2. Circuit Breakers: If your database is struggling, the circuit breaker "trips" and stops all traffic for a cool-down period. This gives your DB the breathing room it needs to recover without being continuously bombarded by retries.
  3. Rate Limiting: By capping the number of requests per second (globally or per-user), you ensure that even a massive "herd" can't exceed your system's hard limits. Excess requests are throttled with a 429 Too Many Requests, protecting your infrastructure from being overwhelmed.
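
As an illustration of the first point, load shedding can be as simple as a counter in front of the expensive call. This is a framework-agnostic sketch (the status codes and the fetch coroutine are illustrative), not a drop-in implementation:

MAX_CONCURRENT_DB_CALLS = 100  # size this to your real connection pool
_in_flight = 0

async def handle_request(fetch):
    """Run `fetch` only if capacity remains; otherwise shed the request immediately."""
    global _in_flight
    if _in_flight >= MAX_CONCURRENT_DB_CALLS:
        # Pool is saturated: fail fast with a 503 instead of queuing and timing out.
        return 503, "Service Unavailable"
    _in_flight += 1
    try:
        data = await fetch()
        return 200, data
    finally:
        _in_flight -= 1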

Choosing Your Weapon: Strategy Comparison

| Strategy | Implementation Complexity | Best Used For... | Main Drawback |
| --- | --- | --- | --- |
| Jitter | Low | Retries, TTL Expirations | Doesn't stop the initial spike, just spreads it. |
| Request Collapsing | Medium | High-traffic single keys (e.g., Homepage) | Can become a complex "leader" bottleneck. |
| X-Fetch (Probabilistic) | High | Mission-critical low-latency data | Adds pre-emptive load to your database. |

Closing Thoughts

Scaling isn't just about adding more servers; it's about managing the coordination between them. By implementing Request Collapsing, you protect your downstream resources. By adding Jitter, you protect your coordination layer from itself.

The next time you set a cache TTL, ask yourself: "What happens if 10,000 people ask for this at the same time?" If the answer is "they all wait for the DB," it's time to add some jitter.

If you enjoyed this deep dive into systems engineering, feel free to follow for more insights on building resilient distributed systems.


Build More Resilient Systems with Aonnis

If you're managing complex caching layers and want to avoid the pitfalls of manual scaling and configuration, check out the Aonnis Valkey Operator. It helps you deploy and manage high-performance Valkey compatible clusters on Kubernetes with built-in best practices for reliability and scale.

Surprise: It is free for a limited time.

Visit www.aonnis.com to learn more.
