DEV Community

Joud Awad

18/30 Days System Design Questions!

Your Redis cache just expired on a key that 8,000 users hit every second.
Every single one of those requests is now flying straight at your database.

This is the thundering herd. You didn't have a traffic problem — you had a cache problem. Now you have both.

Here's the setup:
Service → Node.js API, 8,000 req/sec on the /feed endpoint
Cache → Redis, TTL = 60s on the feed key
DB → Postgres, comfortable at ~200 req/sec sustained
What happened → TTL expired at peak traffic, all 8,000 req/sec hit Postgres simultaneously
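
For a sense of scale, here is a back-of-the-envelope sketch of those numbers. The function names are made up for illustration, and the 500 ms rebuild time is only an assumed figure for how long one cache rebuild might take.

```typescript
// Numbers from the setup above.
const requestsPerSecond = 8000; // traffic on /feed
const dbCapacityPerSecond = 200; // what Postgres sustains comfortably

// How far over capacity the database is while the key is missing.
function overloadFactor(incoming: number, capacity: number): number {
  return incoming / capacity;
}

// How many requests arrive while a single rebuild is still in flight.
function requestsDuringRebuild(incoming: number, rebuildMs: number): number {
  return Math.round(incoming * (rebuildMs / 1000));
}

console.log(overloadFactor(requestsPerSecond, dbCapacityPerSecond)); // 40
console.log(requestsDuringRebuild(requestsPerSecond, 500)); // 4000
```

So an uncached /feed puts roughly 40 times the load on Postgres that it handles comfortably.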

The DB is on its knees. You have minutes before it falls over. And the next TTL expiry is in 60 seconds.

What do you do?

A) Mutex lock — only one request queries the DB to rebuild the cache, the rest wait behind it.
B) Probabilistic early expiry — start randomly rebuilding the cache before the TTL actually hits zero.
C) Request coalescing — collapse all in-flight requests for the same key into a single DB query, return the same result to all of them.
D) Cache pre-warming — a background job rebuilds the key on a schedule, TTL never reaches zero in prod.

All four ship in production systems. Only one of them prevents the thundering herd without introducing a new failure mode under load.

Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments (including which answer is the senior-engineer trap that works in theory but falls apart when 8,000 requests are piling up).

If your team has ever had a cache expiry take down a database, share this with them. The debate is worth more than the post.

Drop your answer 👇

#30DaysOfSystemDesign #SystemDesign #DistributedSystems #SoftwareArchitecture

Top comments (4)

Joud Awad

Answer: D — Cache pre-warming ✅

Here's why, and why the other three trick smart engineers:

Why D wins (Cache pre-warming):
Pre-warming eliminates the thundering herd at the root. A background job — a cron, a scheduled Lambda, a Sidekiq worker — rebuilds the cache key on a fixed schedule that's shorter than the TTL. The key never goes cold in production. There's no expiry cliff for 8,000 requests to fall off.

You know the data is hot. You know when it expires. You have the compute to rebuild it proactively. The cost is a background job running every ~45 seconds; the benefit is that your database sees one rebuild query per interval instead of the spike. Netflix has written about warming its caches ahead of traffic. Twitter has described precomputing home timelines at write time, while handling very-high-follower accounts separately at read time. The pattern is everywhere; it's just not glamorous enough to make it into most architecture posts.
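
A minimal sketch of that schedule, under some loud assumptions: `FakeCache` is an in-memory stand-in for Redis, `loadFeedFromDb` is a hypothetical loader that would query Postgres in a real service, and time is simulated so the example runs instantly.

```typescript
// In-memory stand-in for Redis so the sketch runs on its own.
type FeedEntry = { value: string; expiresAt: number };

class FakeCache {
  private entries = new Map<string, FeedEntry>();

  set(key: string, value: string, ttlMs: number, now: number): void {
    this.entries.set(key, { value, expiresAt: now + ttlMs });
  }

  get(key: string, now: number): string | undefined {
    const entry = this.entries.get(key);
    return entry !== undefined && entry.expiresAt > now ? entry.value : undefined;
  }
}

const TTL_MS = 60000; // TTL from the setup
const REFRESH_MS = 45000; // rebuild interval, shorter than the TTL

// Hypothetical loader: a real service would query Postgres here.
let dbQueries = 0;
function loadFeedFromDb(): string {
  dbQueries += 1;
  return `feed-v${dbQueries}`;
}

// One pre-warm tick: rebuild the value and overwrite the key, resetting its TTL.
function prewarm(cache: FakeCache, now: number): void {
  cache.set("feed", loadFeedFromDb(), TTL_MS, now);
}

// Simulate five minutes: a pre-warm tick every 45s, one read every second.
const feedCache = new FakeCache();
let misses = 0;
for (let now = 0; now <= 300000; now += 1000) {
  if (now % REFRESH_MS === 0) prewarm(feedCache, now);
  if (feedCache.get("feed", now) === undefined) misses += 1;
}
console.log({ dbQueries, misses }); // { dbQueries: 7, misses: 0 }
```

In a running service the loop would be replaced by a scheduler, for example `setInterval(() => prewarm(feedCache, Date.now()), REFRESH_MS)` or a cron job. The point of the simulation is only that readers never see a miss while the job keeps up.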

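One detail that matters: pair it with stale-while-revalidate. Serve the stale value while the background job refreshes, so a slightly late rebuild never causes a miss. This only works if the stale value is still in Redis, so the key's real TTL has to be longer than the freshness window (for example, store a "fresh until" timestamp alongside the value) instead of letting Redis drop it at exactly 60 seconds.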
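
Here is one way that read path could look. It's a sketch, not a definitive implementation: the `Map` stands in for Redis, the 120-second hard TTL is an assumed value, and `writeFeed` / `readFeed` are hypothetical helper names.

```typescript
// Each entry carries a soft "fresh until" time plus a later hard expiry.
type SwrEntry = { value: string; freshUntil: number; hardExpiresAt: number };
type SwrRead = { status: "fresh" | "stale"; value: string } | { status: "miss" };

const FRESH_MS = 60000; // freshness window: the 60s TTL from the setup
const HARD_TTL_MS = 120000; // assumed: how long the key actually stays in the cache

function writeFeed(store: Map<string, SwrEntry>, value: string, now: number): void {
  store.set("feed", { value, freshUntil: now + FRESH_MS, hardExpiresAt: now + HARD_TTL_MS });
}

function readFeed(store: Map<string, SwrEntry>, now: number): SwrRead {
  const entry = store.get("feed");
  // Past the hard expiry the cache would have evicted the key: a real miss.
  if (entry === undefined || now >= entry.hardExpiresAt) return { status: "miss" };
  // Past the soft expiry the value is still served; the refresher is just late.
  return { status: now < entry.freshUntil ? "fresh" : "stale", value: entry.value };
}

const feedStore = new Map<string, SwrEntry>();
writeFeed(feedStore, "feed-v1", 0);
console.log(readFeed(feedStore, 30000).status); // "fresh"
console.log(readFeed(feedStore, 75000).status); // "stale"
console.log(readFeed(feedStore, 130000).status); // "miss"
```

A refresher that runs 15 seconds late costs readers slightly old data, not a trip to Postgres. A true miss only happens if the job has been down for the whole hard TTL.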

Joud Awad

Why C is the trap answer (Request coalescing):
Coalescing sounds most sophisticated — collapse all concurrent requests for the same key into one DB query, return the same result to all of them. No stampede, no lock contention. Elegant.

Here's where it breaks: coalescing requires in-process coordination. It works beautifully on a single server. On 50 Node.js instances behind a load balancer, each instance coalesces independently, so every TTL expiry produces up to 50 concurrent DB queries (one per instance) instead of 8,000. Better, but still 50 times the single query you wanted, and the number grows with every instance you add. To coalesce across instances you need a distributed coordination layer, and at that point you've rebuilt a distributed mutex and given up the simplicity that made coalescing attractive.

Right answer for a single-process server. Partial solution at scale that engineers convince themselves is complete.
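
To make the per-instance behavior concrete, here is a small simulation. It assumes each "instance" is just a separate in-flight map inside one process, and `createCoalescer` and `simulate` are hypothetical names; the 5 ms timer stands in for the Postgres query.

```typescript
type Loader = () => Promise<string>;

// Each call creates the in-flight map that one API instance would own.
function createCoalescer(): (key: string, load: Loader) => Promise<string> {
  const inFlight = new Map<string, Promise<string>>();
  return (key, load) => {
    const existing = inFlight.get(key);
    if (existing !== undefined) return existing; // join the query already running
    const pending = load();
    inFlight.set(key, pending);
    const clear = (): void => {
      inFlight.delete(key);
    };
    pending.then(clear, clear); // forget the entry once it settles, success or failure
    return pending;
  };
}

// Fire every request at once and count how many DB queries actually start.
async function simulate(instances: number, requestsPerInstance: number): Promise<number> {
  let dbQueries = 0;
  const load: Loader = async () => {
    dbQueries += 1;
    await new Promise<void>((resolve) => setTimeout(resolve, 5)); // stand-in for Postgres
    return "feed";
  };
  const pending: Promise<string>[] = [];
  for (let i = 0; i < instances; i++) {
    const coalesce = createCoalescer();
    for (let r = 0; r < requestsPerInstance; r++) pending.push(coalesce("feed", load));
  }
  await Promise.all(pending);
  return dbQueries;
}

simulate(1, 8000).then((n) => console.log("1 instance:", n)); // 1 instance: 1
simulate(50, 160).then((n) => console.log("50 instances:", n)); // 50 instances: 50
```

Same 8,000 requests in both runs. The only thing that changes the query count is how many independent in-flight maps exist.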

Joud Awad

Why A is wrong (Mutex lock):
Mutex looks clean: one request gets the lock, rebuilds the cache, everyone else waits. No duplicate DB queries.

The problem is "everyone else waits." Under 8,000 req/sec, waiting means requests held open (in Node.js that is pending promises and sockets piling up rather than blocked threads), the request queue filling up, and p99 latency spiking to seconds. You've traded a DB overload for an application-layer backup. The database survives; the user experience dies. And if the lock holder is slow (the DB is struggling and the rebuild takes 500ms), roughly 4,000 requests for that key stall for that window.
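
A rough sketch of that waiting behavior, with several assumptions: `FakeLock` is an in-memory stand-in that mimics the acquire-if-absent semantics of Redis `SET key value NX PX <ms>`, waiters poll every 10 ms, and `runStampede` is a hypothetical harness. A production lock would also need an owner token so that only the holder can release it.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// In-memory stand-in for a Redis lock key with an expiry.
class FakeLock {
  private locks = new Map<string, number>(); // lock key -> expiry timestamp

  tryAcquire(key: string, ttlMs: number, now: number): boolean {
    const expiresAt = this.locks.get(key);
    if (expiresAt !== undefined && expiresAt > now) return false; // someone holds it
    this.locks.set(key, now + ttlMs);
    return true;
  }

  release(key: string): void {
    this.locks.delete(key);
  }
}

// Fire `concurrent` requests at a cold cache and count who had to wait.
async function runStampede(concurrent: number, rebuildMs: number) {
  const lock = new FakeLock();
  let cached: string | undefined;
  let dbQueries = 0;
  let waiters = 0;

  async function getFeed(): Promise<string> {
    let counted = false;
    for (;;) {
      if (cached !== undefined) return cached;
      if (lock.tryAcquire("lock:feed", 5000, Date.now())) {
        try {
          dbQueries += 1;
          await sleep(rebuildMs); // stands in for the Postgres query
          cached = "feed";
          return cached;
        } finally {
          lock.release("lock:feed");
        }
      }
      if (!counted) {
        waiters += 1;
        counted = true;
      }
      await sleep(10); // poll until the holder has filled the cache
    }
  }

  await Promise.all(Array.from({ length: concurrent }, () => getFeed()));
  return { dbQueries, waiters };
}

runStampede(1000, 50).then((r) => console.log(r)); // { dbQueries: 1, waiters: 999 }
```

One query reaches the database, which is the win. Every other request sits in the polling loop for the full length of the rebuild, which is the cost.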

Joud Awad

Why B is wrong (Probabilistic early expiry):
XFetch-style probabilistic early expiry is clever: as the TTL counts down, each cache read has a random chance of triggering a rebuild early, and that chance grows as expiry gets closer. No lock, no coordination layer. Elegant math.

The problem is it's probabilistic, not guaranteed. Under the exact wrong timing (low traffic right before TTL, then a sudden spike) the key still expires cold. You've reduced the probability of a thundering herd without eliminating it. For an endpoint running at 8,000 req/sec 24/7, you want a deterministic guarantee — not good odds.
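
For reference, a minimal sketch of the commonly cited XFetch check. The parameter names are my own, `rand` is injectable only so the function can be tested, and `triggerRate` is a hypothetical helper that estimates how often a single read would trigger a rebuild.

```typescript
// deltaMs: how long the last rebuild took. beta: tuning knob, where values
// above 1 make early rebuilds more likely.
function shouldRefreshEarly(
  nowMs: number,
  expiresAtMs: number,
  deltaMs: number,
  beta: number = 1,
  rand: () => number = Math.random,
): boolean {
  // 1 - rand() lies in (0, 1], so the log is <= 0 and the offset is >= 0.
  const offsetMs = -deltaMs * beta * Math.log(1 - rand());
  return nowMs + offsetMs >= expiresAtMs;
}

// Estimate how often one read triggers a rebuild at a given distance from expiry.
function triggerRate(msBeforeExpiry: number, deltaMs: number, trials: number = 100000): number {
  let hits = 0;
  for (let i = 0; i < trials; i++) {
    if (shouldRefreshEarly(-msBeforeExpiry, 0, deltaMs)) hits += 1;
  }
  return hits / trials;
}

// Printed rates vary from run to run because they are sampled.
for (const ms of [5000, 1000, 500, 100]) {
  console.log(`${ms}ms before expiry:`, triggerRate(ms, 500).toFixed(4));
}
```

With beta = 1, the chance that a single read triggers a rebuild d milliseconds before expiry is e^(-d/delta). For a 500 ms rebuild that is about 37% at 500 ms out but only about 0.005% at 5 seconds out, so the protection depends on reads actually arriving in that last stretch before the TTL runs out.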