The Thundering Herd Problem
Imagine H&M announces a sale starting at 9:00 AM. A massive crowd is waiting outside the doors. When the doors open, everyone rushes in at the same time. The crowd becomes difficult for store staff to manage, the doorway jams, and there is no room to move inside the store. This is the Thundering Herd problem.
In software terms, the Thundering Herd problem occurs when a large number of clients or processes are waiting for the same event, and when that event fires, they all act at once.
Where does this problem occur?
Caching: When a cache entry expires and all processes try to hit the database at the same time.
Database: When many requests query the same record simultaneously.
Load Balancers: When a server comes back online and is suddenly hit with a burst of requests.
Real-world scenario: Cache Expiration
The normal flow is: App -> Cache -> Database.
When a user sends a request, the app first checks the cache. If the data is there (a cache hit), the app returns it.
If it is not in the cache (a cache miss), the app reads it from the database and stores it in the cache for later requests.
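The flow above can be sketched as a small cache-aside read path. This is a minimal illustration, not a production cache: the in-memory `cache` dict, the `query_database` stand-in, the `stats` counter, and the 300-second TTL are all assumptions made for the example.

```python
import time

CACHE_TTL_SECONDS = 300  # assumed TTL for this sketch

cache = {}                 # key -> (value, expires_at)
stats = {"db_reads": 0}    # counts how often we fall through to the database


def query_database(key):
    """Hypothetical stand-in for a real (and slower) database query."""
    stats["db_reads"] += 1
    return f"value-for-{key}"


def get(key, now=None):
    """App -> Cache -> Database: return cached data, or load and cache it."""
    now = time.monotonic() if now is None else now
    entry = cache.get(key)
    if entry is not None and entry[1] > now:
        return entry[0]                       # cache hit
    value = query_database(key)               # cache miss: go to the database
    cache[key] = (value, now + CACHE_TTL_SECONDS)
    return value
```

The first `get("score")` reads from the database. Later calls within the TTL are served from the cache, and the next call after expiry goes back to the database.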
The Thundering Herd Problem:
Let's suppose the cache expires every 5 minutes. If 1,000 users request data exactly at the 5-minute mark when the cache expires:
The first request goes: App -> Cache -> Database
The second request goes: App -> Cache -> Database
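...and so on, until all 1,000 requests hit the database. The cache is only refilled once the first database query finishes, so every request that arrives before then sees a miss.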
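To see this happen, here is a rough simulation using threads. It assumes a single expired key and a database query that takes `query_seconds` to return. The function name `simulate_herd` and all parameters are made up for this sketch.

```python
import threading
import time


def simulate_herd(num_requests=1000, query_seconds=0.2):
    """Fire num_requests at an empty cache and count the database queries."""
    cache = {}
    db_hits = 0
    counter_lock = threading.Lock()
    start = threading.Barrier(num_requests)  # lets all requests go at once

    def query_database():
        nonlocal db_hits
        with counter_lock:
            db_hits += 1
        time.sleep(query_seconds)  # simulate a slow query
        return "fresh value"

    def handle_request():
        start.wait()
        # The entry has expired, so requests that arrive before the first
        # query finishes all see a miss and all go to the database.
        if "key" not in cache:
            cache["key"] = query_database()

    threads = [threading.Thread(target=handle_request) for _ in range(num_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return db_hits


if __name__ == "__main__":
    print("database queries:", simulate_herd())
```

The exact count depends on thread scheduling, but as long as the query takes longer than it takes the threads to check the cache, the number of database queries should be close to the number of requests rather than 1.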
How do traffic spikes overload the system?
When many requests hit the database simultaneously:
- CPU usage can spike toward 100%.
- The database connection pool can be exhausted.
- The cache offers no protection, because every request is a miss.
- Response time increases significantly.
Why is it more dangerous in a distributed system?
In a distributed system, services depend on one another, so a problem in one spreads to the next:
- If the cache misses, many requests hit the database.
- If the database is slow, the application threads block while waiting for it.
- If the application stops responding in time, the load balancer may time out and retry the requests.
- Those retries add even more traffic to a system that is already overloaded.
Once a spike occurs, it can bring multiple services down in a cascading failure.
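A quick back-of-the-envelope calculation shows how retries amplify a spike. This sketch assumes the worst case, where every attempt times out and every layer retries the maximum number of times. The function name and the retry counts are illustrative, not taken from any particular load balancer.

```python
from math import prod


def worst_case_attempts(initial_requests, retries_per_layer):
    """Each layer sends 1 original attempt plus its retries; layers multiply."""
    return initial_requests * prod(1 + r for r in retries_per_layer)


# 1,000 requests, load balancer retries each one up to 3 times.
print(worst_case_attempts(1000, [3]))      # 4000

# 1,000 requests, client retries twice and load balancer retries twice.
print(worst_case_attempts(1000, [2, 2]))   # 9000
```

Retries at several layers multiply rather than add, which is why a spike that starts at the cache can turn into a much larger load further down.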
Real-world examples
- A popular show is released and everyone starts watching it at the same time.
- An IPL cricket match is in progress and everyone refreshes the score at the same time.
- An Amazon sale begins at a fixed time and everyone tries to buy an item at that exact second.
So, the Thundering Herd problem is not just about high traffic. It is about everyone requesting the same data at the same moment. It can be avoided by controlling and distributing the traffic.
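As one possible way to control and distribute the traffic, the sketch below combines two common ideas: letting only one request per key reload the cache while the others wait (controlling), and adding random jitter to the TTL so entries do not all expire at the same instant (distributing). The class name `CoalescingCache`, the `loader` callback, and the default TTL and jitter values are assumptions for this example, not a reference implementation.

```python
import random
import threading
import time


class CoalescingCache:
    def __init__(self, loader, ttl_seconds=300.0, jitter_seconds=30.0):
        self._loader = loader            # hypothetical function that queries the database
        self._ttl = ttl_seconds
        self._jitter = jitter_seconds
        self._entries = {}               # key -> (value, expires_at)
        self._key_locks = {}             # key -> lock guarding reloads of that key
        self._guard = threading.Lock()   # protects _key_locks

    def _lock_for(self, key):
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]              # cache hit, no locking needed
        with self._lock_for(key):
            # Re-check: another request may have reloaded the key while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = self._loader(key)    # only one request per key reaches here at a time
            # Jitter spreads expirations out so keys do not all expire together.
            expires_at = time.monotonic() + self._ttl + random.uniform(0, self._jitter)
            self._entries[key] = (value, expires_at)
            return value
```

With this in place, a burst of simultaneous requests for one expired key results in a single call to `loader`, and the remaining requests are served from the freshly filled cache. The trade-off is that those requests wait for the first one to finish instead of each going to the database.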