In my last post, we talked about the Thundering Herd—that chaotic moment when your workers stampede a resource all at once.
But what causes that stampede? Very often, it’s a Celebrity. No, I don't mean a movie star is literally crashing your servers (though they might!). In System Design, the "Celebrity Problem" (also known as a Hot Key or Hot Partition) is one of the biggest hurdles to scaling a modern app.
The Scenario: The "VIP" Lane Bottleneck
Imagine you have a huge warehouse (your database) where items are stored in different aisles (shards).
Usually, customers are spread out across all aisles. But suddenly, a "Celebrity" item goes on sale in Aisle 7. Now, 99% of your customers are trying to squeeze into that one aisle.
The other aisles are empty, but Aisle 7 is a disaster. People are shouting, the floor is breaking, and no one can get their item.
This is the Celebrity Problem: When one specific piece of data is so popular that the server holding it can't keep up, even if the rest of your system is idle.
Why is this so hard to fix?
In a perfect world, we scale systems by Sharding. If we have 10 million users, we put 1 million on Server A, 1 million on Server B, and so on.
But data isn't distributed equally.
- On Instagram, a regular user has around 200 followers. Cristiano Ronaldo has 600+ million.
- On Amazon, a random spatula gets 5 views a day. The new iPhone gets 5 million.
If you shard your database by "User ID," the server holding the "Celebrity ID" will catch fire while the other servers are basically on vacation.
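To make that concrete, here's a minimal Python sketch of the hash-based routing most sharded systems use. The shard count and user IDs are invented for illustration:

```python
# A minimal sketch of hash-based sharding (shard count and IDs are made up).
NUM_SHARDS = 10

def shard_for(user_id: int) -> int:
    # Deterministic: every request for the same user_id lands on the same shard.
    return user_id % NUM_SHARDS

print(shard_for(4242))  # a regular user -> shard 2, a handful of reads a day
print(shard_for(123))   # the "celebrity" -> shard 3... along with ALL
                        # 5 million requests for them. Shard 3 catches fire.
```

The routing is perfectly fair to *users*, but completely blind to *traffic*. That blindness is the Celebrity Problem.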
🛠️ 3 Ways to Tame the "Celebrity"
If you’re a Project Technical Lead or a Senior Engineer, these are the tools you use to keep the system standing:
1. Data Replication (The "Fan Out" Strategy)
Instead of keeping the Celebrity's data in just one place, you copy it across multiple "Read Replicas."
- How it works: When a celebrity posts, you write that post to 10 different servers. Now, instead of 1 million people hitting one server, you spread them across 10.
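Here's a hedged sketch of that idea; the in-memory dicts stand in for real read-replica connections, and the function names are my own:

```python
import random

# Sketch only: 10 dicts stand in for 10 read replicas.
replicas = [dict() for _ in range(10)]

def write_celebrity_post(post_id: str, body: str) -> None:
    # One write is copied to every replica (celebrity writes are rare).
    for replica in replicas:
        replica[post_id] = body

def read_post(post_id: str) -> str:
    # Reads pick a replica at random, so each one sees ~1/10th of the traffic.
    return random.choice(replicas)[post_id]

write_celebrity_post("post-1", "Hello, 600 million people!")
print(read_post("post-1"))
```

The trade-off: writes get 10x more expensive. That's usually fine, because celebrities post rarely but are read constantly.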
2. Local Caching (Layering the Defense)
Don't even let the request reach the database.
- How it works: Use an in-memory cache like Redis. If even the round trip to Redis becomes a bottleneck, add In-Process Caching: keep the "Celebrity" data directly in the memory of the application server itself for a few seconds. Even 5 seconds of local caching can save your database from millions of hits.
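A minimal sketch of that layering, assuming a hypothetical fetch_from_redis helper and a 5-second local TTL:

```python
import time

LOCAL_TTL_SECONDS = 5.0
_local_cache: dict = {}  # key -> (value, expires_at)

def fetch_from_redis(key: str) -> str:
    # Stand-in for a real Redis GET (which itself shields the database).
    return f"profile-data-for-{key}"

def get(key: str) -> str:
    now = time.monotonic()
    hit = _local_cache.get(key)
    if hit and hit[1] > now:
        return hit[0]  # served straight from this process's own memory
    value = fetch_from_redis(key)
    _local_cache[key] = (value, now + LOCAL_TTL_SECONDS)
    return value

print(get("celebrity-123"))  # first call goes to Redis
print(get("celebrity-123"))  # next 5 seconds of calls never leave the process
```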
3. Adaptive Sharding (The "Pro" Move)
This is what the big players do.
- How it works: The system detects a "Hot Key" in real-time. If it sees that User ID 123 (the celebrity) is getting 100x more traffic than others, it automatically moves that specific user's data to a dedicated, high-performance shard or splits their data across multiple nodes.
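Here's a rough sketch of what real-time detection might look like. The threshold, window, and shard names are all assumptions; production systems do this far more robustly:

```python
import time
from collections import Counter

# Invented numbers for illustration only.
HOT_THRESHOLD = 1000   # requests per window before a key counts as "hot"
WINDOW_SECONDS = 10.0

_counts: Counter = Counter()
_window_start = time.monotonic()

def route(user_id: int) -> str:
    global _window_start
    if time.monotonic() - _window_start > WINDOW_SECONDS:
        _counts.clear()               # start a fresh counting window
        _window_start = time.monotonic()
    _counts[user_id] += 1
    if _counts[user_id] > HOT_THRESHOLD:
        return "dedicated-hot-shard"  # isolate the celebrity's traffic
    return f"shard-{user_id % 10}"    # everyone else gets normal hash routing
```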
🛑 The "Thundering Herd" Connection
As I promised in my last post: How do these two relate?
The Celebrity is the "Hot Key." The Thundering Herd is what happens when the cache for that Hot Key expires.
If Taylor Swift's profile is cached for 10 minutes, everything is fine. But at the 10-minute and 1-second mark, when that cache expires, everyone rushes to the database at once.
The Fix? Use "Soft TTLs" or Background Refreshing. Instead of letting the cache die, have a background task refresh the celebrity's data before it expires so the users never hit the database.
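A minimal sketch of background refreshing, with a hypothetical load_from_db helper and invented intervals:

```python
import threading
import time

_cache: dict = {}

def load_from_db(key: str) -> str:
    return f"fresh-data-for-{key}"  # stand-in for the expensive DB query

def refresher(key: str, interval: float) -> None:
    while True:
        _cache[key] = load_from_db(key)  # refresh BEFORE anyone can miss
        time.sleep(interval)

# Hypothetical example: if the hard TTL is 10 minutes, refresh every 8,
# so readers only ever see a warm cache and the database never sees the herd.
threading.Thread(target=refresher, args=("taylor_swift", 480), daemon=True).start()

time.sleep(0.1)                    # give the refresher a moment to run once
print(_cache.get("taylor_swift"))  # always warm, never a stampede
```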
Wrapping Up
Understanding the Celebrity Problem is crucial because it teaches us that data is not equal. Designing for the "average user" is easy; designing for the "outlier" is where true System Design begins.
What happens when the system simply can't keep up?
In my first post, we looked at how to stop a Thundering Herd from crashing your database. In this post, we’ve tackled the Celebrity Problem and how to manage hotspots for high-profile entities.
But sometimes, even with the best caching and sharding strategies, the sheer volume of incoming traffic exceeds your total capacity. At that point, trying to save everyone often means saving no one.
How do the world's largest platforms choose which requests to drop to keep the lights on?
Join me for the next part of this series tomorrow, where we dive into Load Shedding: The Art of Failing Gracefully.
Let’s Connect! 🤝
If you’re enjoying this series, please follow me here on Dev.to! I’m a Project Technical Lead sharing everything I’ve learned about building systems that don't break.
Question for you: How would you handle a "Celebrity" post on a site like Reddit? Would you use a different strategy? Let's discuss in the comments! 👇