The Thundering Herd Problem
Introduction
In 2021, I was working on an EdTech ERP system. COVID had pushed everything online, and institutions were scrambling to run education over the internet. We built this system and deployed it on AWS, using horizontal scaling to handle high traffic. We set up a load balancer, Docker, auto scaling, and caching to make sure the servers could handle heavy load.
The client was expecting 10,000 student logins, so we load tested for 10,000 users, as the client had asked. After this testing, we decided to organise a mock test with real students on Sunday at 5 PM. We were ready.
The exam started at 5 PM, and 25,000 students hit the system simultaneously. Still, everyone stayed calm because auto scaling was already in place. But we had forgotten one thing: auto scaling isn't instant. Spinning up new servers can take anywhere from seconds to minutes, and in that window the servers went down. After five minutes, the system was ready to handle the load, but another problem hit us: all the cache entries expired at once, because the cache TTL was set to expire between 5 PM and 6 PM every Sunday.
So 25,000 students hit the system, the cache expired simultaneously, every request went straight to the database, and the database slowed everything down. The exam hadn't even started properly. We were frantically checking logs while students kept refreshing their screens.
The Thundering Herd Problem
Before getting to the technical definition, think about how users behave when an online sale opens. Everyone is ready with their wishlists, fingers on the 'Buy Now' button, so the website faces high traffic. When the site starts slowing down, users start refreshing continuously. So here is the situation: the website is slow because of high load, and everyone refreshes the page hoping for a faster response, which increases the load even further and can lead to total system failure.
This is exactly what happens inside your system too. When thousands of requests hit the system at the same moment and it can't handle them, that's the Thundering Herd Problem.
This problem commonly occurs in three places in your system.
Load Balancer / Auto Scaling: A load balancer distributes incoming traffic across all available servers. If you set up three servers, the load balancer makes sure traffic is routed to all of them instead of piling up on any single one.
But when traffic suddenly spikes, the available servers can't handle it all. The system tries to scale by adding more servers, but auto scaling isn't instant. Those new servers take time to spin up: sometimes seconds, sometimes minutes. In that gap, your existing servers take all the hits alone.
This is exactly what happened with us. 25,000 students hit the system, auto scaling triggered, but the server went down before new servers could join.
Cache: A cache stores frequently accessed data in high-speed memory (e.g., RAM) to reduce database load and improve latency. Each cache entry has a TTL (time to live), which defines when the entry expires and must be refreshed. When an entry expires, requests for that data go straight to the database.
If cache entries expire one by one at different times, the database can absorb the misses easily. But if all entries expire at the same time, every incoming request finds an empty cache and hits the database, spiking database load.
That's exactly what happened to us. Our cache TTL was set to expire between 5 PM and 6 PM every Sunday, the same window in which our exam started.
Database: A database holds a large amount of information, and some of that data is requested frequently. As discussed above, a cache stores this frequently accessed data. When the cache is working properly, those requests never reach the database; the cache handles them directly, keeping the database free for heavy or complex queries.
But when the cache is unavailable, all requests hit the database directly, causing a sudden, massive spike of identical queries and severe performance degradation, sometimes a complete slowdown or outage.
This same thing happened with us on the exam day. Cache expired simultaneously and all student requests hit the database for the same exam data, which eventually slowed down the system.
Real-world Example
We already discussed how the Thundering Herd Problem affected our mock exam. Now let’s look at some more real world examples that will help you understand this problem better:
- Live Streaming: On an India vs Pakistan match day, hundreds of millions of users watch the match online. On these important match days, platforms receive months' worth of traffic in a single day, so their systems must be ready to handle the simultaneous load without going down.
- Fantasy apps: At the time of toss, millions of people are active on their fantasy apps to build their teams to win big rewards. This requires fantasy app servers to be ready to handle this simultaneous, high-volume traffic.
- UPI / Payment Gateway: Nowadays, everyone likes to pay digitally, and payment platforms process millions of transactions. Any sudden traffic spike can bring these systems down almost instantly. I am sure you have experienced this during peak hours, on salary day, or in the festival season.
- Tax Filing: We all need to file taxes to the government. Also, we need to submit this information online using government portals. Now think about the traffic, when everyone is trying to file their income tax / GST returns before the deadline. This high-volume traffic slows down systems and sometimes brings it down for some time.
How Traffic Spikes Overload Systems?
Before discussing this, we first need to understand that every system component, CPU, memory, database, disk, bandwidth, has limited capacity. It cannot be infinite. More resources definitely increase system capacity, but every system eventually hits a bottleneck.
In a normal scenario, a system is designed to handle a fixed number of requests. When a request arrives, the server first checks the cache. If the data exists in cache, it returns the response immediately. Otherwise, the request goes to the database.
When a sudden traffic spike hits the server simultaneously, the system first tries to handle them with the available resources. The server also starts scaling when necessary.
Now if the cache expires during this high traffic, all requests start hitting the database, which overloads the connection pool, and query responses slow down under the flood of queries.
During this time, users get delayed responses, and sometimes requests time out. Clients start retrying failed or slow requests, which adds even more traffic to the system.
This creates a dangerous cycle: more traffic, more retries, more load. It continues until the system slows down or crashes completely.
Why is it Dangerous in Distributed Systems?
In a single-server system, the Thundering Herd Problem is bad. But in distributed systems, it becomes truly dangerous.
In distributed systems, multiple services depend on each other. When one service slows down, all dependent services start struggling too. Threads get blocked, memory fills up, and connection pools get exhausted. New requests find no available resources and start failing. This creates a chain reaction:
Failure → Retry → More Load → More Failure.
What started as a simple cache expiry now spreads across every service in your architecture. A small synchronization issue becomes a full system breakdown. This is why distributed systems must be designed assuming failures will happen and must be contained before they spread.
Normal Spike vs Thundering Herd
Not every traffic spike is a Thundering Herd. Here is the difference:
Normal spike: When traffic spikes gradually, the system can predict this and manage it accordingly. It also provides a breathing window for the system for auto-scaling.
Thundering Herd: When a high-volume traffic spike hits the server all at once, the system struggles to cope. It is pushed straight to peak capacity with no breathing window for auto scaling, which can lead to total failure.
Impact on System Components
Let’s understand the impact of the Thundering Herd Problem on different system components:
CPU: When the herd hits, CPU utilization spikes to 100%. The processor struggles with slow processing and heavy context switching, which increases response times. The system tries to scale by adding more resources, but by then the damage is already done.
Database: The database starts receiving a high volume of identical queries simultaneously which exhausts all available connections. This leads to lock contention and possible deadlocks. Queries wait in a long queue, resulting in slow query responses and sometimes complete database failure.
Cache: When the Thundering Herd hits, thousands of requests miss the cache at the same time and every one of them tries to regenerate the same data simultaneously. This creates massive CPU and network pressure on the cache server, causing memory spikes and rapid eviction cycles. Instead of protecting your database, your cache becomes part of the problem.
Load Balancer: The load balancer struggles to distribute requests evenly as servers become unresponsive or slow to reply.
Latency: All of the above directly impacts latency. Slow cache, overloaded database, exhausted CPU: everything adds delay to every single request. Users start seeing slow page loads, timeouts, and failed requests. What normally takes milliseconds can take seconds. This is when users lose trust in your system.
Each of these situations can contribute to a complete system breakdown.
Prevention Techniques
Once we understand its impact, now let’s look into the prevention techniques:
Request Coalescing: Remember the notice board in school? When the principal wants to inform everyone about an upcoming exam, he publishes one notice instead of informing each student individually. In system design, this is called Request Coalescing: only the first request for a piece of data hits the database while all duplicate requests wait, and once the response is ready, it is shared with all of them. It's simple, but on its own it only deduplicates requests within a single server process; in a distributed system, each server still fires its own request unless you combine it with a shared lock.
Cache Locking / Mutex: Think about the biometric machine in your office. When your colleague is punching attendance, others wait till it’s done. Cache Locking works in the same way. When cache expires, only the first request acquires the lock and hits the database to regenerate the cache. All other requests wait till the cache is ready. The lock is released and all waiting requests get the data directly from the cache. It’s important to always set the expiry on the lock. If the request that acquired the lock crashes, the lock must auto-release. Otherwise, all waiting requests will be stuck forever.
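Here is a minimal single-process sketch of cache locking, including the auto-releasing lock described above. In production you would typically use a distributed lock instead (for example, Redis `SET` with the `NX` and `EX` options); all names here are illustrative:

```python
import threading
import time

class LockingCache:
    """Cache with a per-key regeneration lock. The lock has a timeout so it
    auto-releases if the holder crashes, as the text above recommends."""

    def __init__(self, ttl=60.0, lock_timeout=5.0):
        self.ttl = ttl
        self.lock_timeout = lock_timeout
        self._data = {}    # key -> (value, expires_at)
        self._locks = {}   # key -> time the regeneration lock was taken
        self._guard = threading.Lock()

    def get(self, key, regenerate):
        start = time.monotonic()
        with self._guard:
            entry = self._data.get(key)
            if entry and entry[1] > start:
                return entry[0]                    # fresh cache hit
            taken = self._locks.get(key)
            # Acquire the lock if it is free, or expired (auto-release).
            if taken is None or start - taken > self.lock_timeout:
                self._locks[key] = start
                leader = True
            else:
                leader = False
        if leader:
            value = regenerate()                   # only one request hits the DB
            with self._guard:
                self._data[key] = (value, time.monotonic() + self.ttl)
                self._locks.pop(key, None)         # release the lock
            return value
        # Followers poll briefly until the leader repopulates the cache.
        while time.monotonic() - start < self.lock_timeout:
            time.sleep(0.01)
            with self._guard:
                entry = self._data.get(key)
                if entry and entry[1] > time.monotonic():
                    return entry[0]
        return regenerate()                        # lock holder likely crashed
```

Even with thousands of concurrent misses on the same key, the database sees a single regeneration query.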
Stale-While-Revalidate: Got the salary-credited message, but the banking app is showing the old balance? Wondering what happened? It's not a system issue; the balance is updating in the background, and the correct balance shows after some time. Banks use this approach to ensure a smooth experience. In system design, this is called Stale-While-Revalidate. It lets all requests keep getting data without hitting the database: the cache refreshes in the background, and once the refresh completes, the new entry serves future requests. This approach reduces latency, prevents spikes on the backend, and keeps traffic smooth even during cache regeneration.
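A minimal sketch of the idea: expired entries are served immediately while a single background thread refreshes them. The `SWRCache` name and structure are illustrative, assuming an in-memory cache; real deployments usually implement this inside Redis/CDN layers:

```python
import threading
import time

class SWRCache:
    """Stale-while-revalidate sketch: a stale entry is returned instantly
    while exactly one background thread refreshes it."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._data = {}          # key -> (value, expires_at)
        self._refreshing = set() # keys with a refresh already in flight
        self._guard = threading.Lock()

    def get(self, key, fetch):
        now = time.monotonic()
        with self._guard:
            entry = self._data.get(key)
            if entry is None:
                value = fetch()  # cold start: must hit the backend once
                self._data[key] = (value, now + self.ttl)
                return value
            value, expires_at = entry
            stale = expires_at <= now and key not in self._refreshing
            if stale:
                self._refreshing.add(key)
        if stale:
            # Refresh in the background; the caller is never blocked.
            threading.Thread(target=self._refresh, args=(key, fetch), daemon=True).start()
        return value             # stale or fresh, served immediately

    def _refresh(self, key, fetch):
        try:
            value = fetch()
            with self._guard:
                self._data[key] = (value, time.monotonic() + self.ttl)
        finally:
            with self._guard:
                self._refreshing.discard(key)
```

The trade-off is explicit: callers may briefly see slightly outdated data, in exchange for zero blocking and at most one backend call per refresh cycle.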
Staggered Expiry: Different roads get the green signal at different times to avoid all traffic moving simultaneously; if every road got a green signal at once, it would be chaos. Similarly, expiring all cache entries at once triggers a Thundering Herd. In the Staggered Expiry technique, the system adds a random factor to the expiry TTL of each cache key. This ensures all keys don't expire at once, keeps the database safe from simultaneous queries, reduces load spikes, and helps maintain system stability.
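The random factor can be as simple as adding jitter to a base TTL before every cache write. The numbers here (one-hour base, ±20% spread) are illustrative, not a recommendation:

```python
import random

BASE_TTL = 3600  # base TTL in seconds (one hour, chosen for illustration)

def jittered_ttl(base=BASE_TTL, spread=0.2):
    """Return the base TTL plus up to +/-20% random jitter, so keys
    cached at the same moment do not all expire at the same moment."""
    return base + int(base * random.uniform(-spread, spread))

# Usage (hypothetical cache API): cache.set(key, value, ttl=jittered_ttl())
```

With a one-hour base and 20% spread, expiries land anywhere between 48 and 72 minutes, smearing the misses out instead of concentrating them.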
Exponential Backoff: You buy an iPhone and try to make the payment, but the bank server is down. You wait a minute before retrying. If it still does not work, you wait a few minutes before the next retry. This growing delay gives the server time to recover. In system design, the same technique is known as Exponential Backoff. It limits retry pressure when a server is under load: instead of retrying immediately after a failed request, each subsequent retry doubles the delay (1s, 2s, 4s, 8s, and so on). A small random jitter is also added to each retry so that all clients don't retry at the exact same moment. Without this, all failed requests retry simultaneously and pile more load onto an already struggling server.
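A sketch of the retry loop, with doubling delays, a cap, and jitter. The function name and default values are illustrative:

```python
import random
import time

def retry_with_backoff(call, max_retries=5, base_delay=1.0, cap=30.0):
    """Retry `call`, doubling the delay after each failure (capped),
    with random jitter so clients don't retry in lockstep."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise                         # out of retries: surface the error
            delay = min(cap, base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
            time.sleep(delay * random.uniform(0.5, 1.0))   # jittered wait
```

The cap matters: without it, a long outage would push the delay into minutes and make recovery feel broken to the user.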
Rate Limiting: We all have experienced OTP delays on websites. Sometimes, we need to wait before requesting another OTP. That’s Rate Limiting in action. In system design, Rate Limiting controls the number of incoming requests to prevent system overload. During a thundering herd, it acts as the first line of defence. It blocks excessive requests before they reach the database. Because it’s always better to reject some requests than to crash the entire system.
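One common way to implement this is a token bucket, sketched below; the class name and limits are illustrative, and a real deployment would enforce this per client or per IP at the gateway:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allows a small burst, then admits
    requests only at the sustained refill rate and rejects the rest."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True           # request admitted
        return False              # request rejected up front
```

Rejected requests get a fast, cheap "try again later" instead of a slot in an overloaded queue, which is exactly the trade the text describes.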
What We Did to Fix It
After the failed exam, we decided to analyse our system design and list all the bottlenecks. After checking this carefully, we identified the issue and started working on solutions:
- First, we decided to pre-warm our servers before any big event for at least 2x the expected load, ensuring the servers are ready for high-volume traffic. We also lowered the auto scaling trigger from 90% to 70% utilization so new servers start spinning up earlier.
- We switched to staggered cache TTLs and set entries to expire at midnight, well outside exam hours, to avoid such situations again.
- We also worked on our indexing and cache optimization.
- After completing these changes, we load tested the server for 1 lakh (100,000) users, recorded the results, and built a flow chart to handle these situations better in future.
- Finally, we set up Slack/email alerts so team members are notified before things get worse.
These things take time, but this time we were ready. The next Sunday, we conducted a successful exam for 30,000 students.
Conclusion
A Sunday in 2021 taught me some important lessons about system design and handed me a challenge. I took it on and solved it, and you can too, simply by knowing the bottlenecks in your system. Here is what I learned that evening:
- Thundering Herd is not just a traffic problem, it’s a synchronization problem.
- Every system has a breaking point, we just need to know before it happens.
- Sometimes small misconfigurations (like a wrong TTL) can bring down entire systems.
So, what we need to consider while designing a system:
- Always test the system at its peak limits, which tells you where its breaking points are.
- Always track traffic to understand system load.
- Using a proper cache TTL can save your system from going down.
- Before any big event, check if server pre-warming is required. Don’t wait for the herd to arrive.
The herd will come. The question is: will your system be ready?
