The Day the Cache Avalanche Buried Our Treasure Hunt

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

In 2025 we launched a live event platform for Veltrix that let thousands of players race through augmented-reality clues in real time. By month three, daily concurrent users had jumped from 5 000 to 85 000, and the simple Redis-backed leaderboard we started with began returning 503 Service Unavailable responses every few seconds. The infra team swore the Redis cluster was fine (CPU 42 %, memory 78 %, no evictions), but every time a clue unlocked, P99 latency for GET /hunt/{id} spiked from 12 ms to 2.4 s. Our event staff were literally standing in a server room restarting Redis by hand between clue drops, because the keys matching hunt:{id}:leaderboard (two million at peak) were all invalidated in the same millisecond, sending one million identical cache-miss round trips to DynamoDB. We were dealing with a cache avalanche that turned the 200-connection Redis pool into a traffic jam. The alternative was to throttle players with feature flags, which would have killed the viral loop we had spent a year building.

What We Tried First (And Why It Failed)

Our first fix was to bucket the invalidation window: instead of invalidating the entire leaderboard on every clue completion, we updated the cache key only when the top 100 ranks changed. That pushed P99 back down to 280 ms, but it introduced the "97th place curse". Players who moved from rank 101 to 98 were invisible on the map until the next bucket refresh, which happened every five minutes. Complaints spiked on the DEV.to Discord: "I beat that speedrunner, where is my badge?" We then disabled the bucket entirely and switched to Redis Streams to queue incremental updates. The Streams consumer fell behind by 400 ms during peak and we started dropping updates; the leaderboard now showed ghost positions that vanished on refresh. Both attempts failed because they treated the symptom rather than the boundary: the cache layer was still a single, monolithic write path for 120 000 concurrent state changes.
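To make the 97th place curse concrete, here is a minimal in-memory sketch of one reading of the bucketed scheme: the top-100 snapshot is rebuilt at most once per five-minute bucket, so a player who climbs into the top 100 mid-bucket stays invisible until the next rebuild. The class name `BucketedTopCache`, the constants, and the explicit `now` argument are all hypothetical, and a plain list stands in for the Redis entry.

```python
from typing import Dict, List

TOP_N = 100           # size of the cached window
BUCKET_SECONDS = 300  # five-minute refresh bucket, as described in the text


class BucketedTopCache:
    """Serves a top-N snapshot that is rebuilt at most once per bucket."""

    def __init__(self) -> None:
        self.scores: Dict[str, int] = {}          # stand-in for the source of truth
        self.snapshot: List[str] = []             # stand-in for the cached leaderboard
        self.last_refresh: float = float("-inf")  # forces a rebuild on first read

    def submit(self, player: str, score: int) -> None:
        # Writes update the source of truth only; the snapshot is untouched.
        self.scores[player] = score

    def top(self, now: float) -> List[str]:
        # Rebuild the snapshot only when the current bucket has expired.
        if now - self.last_refresh >= BUCKET_SECONDS:
            ranked = sorted(self.scores, key=lambda p: (-self.scores[p], p))
            self.snapshot = ranked[:TOP_N]
            self.last_refresh = now
        return self.snapshot
```

A player who submits a top-100 score ten seconds after a rebuild will not appear in `top()` until `now` reaches the next bucket boundary, which is the stale window players were complaining about.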

The Architecture Decision

We tore the cache layer apart and rebuilt it as three explicit boundaries.

Hot cache (Redis) for the top 100 ranks only, with a 5-second TTL and sliding window invalidation based on rank number, not on every write.
Warm cache (DynamoDB DAX) for the next 9 900 ranks, keyed by {hunt_id, score_bucket} so a rank update only touches one of 1 000 partition keys instead of every player's aggregate row.
Cold store (Postgres) for the full leaderboard, updated asynchronously by an event-sourced write-behind worker that retries on 409 conflicts.
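The routing implied by these three boundaries can be sketched as below. This is an illustration under stated assumptions, not the production code: the function names, the `MAX_SCORE` upper bound, and the linear mapping from score to bucket are all hypothetical; only the rank cut-offs (100 and 10 000) and the 1 000 partition keys come from the description above.

```python
from typing import Tuple

HOT_MAX_RANK = 100       # Redis serves ranks 1..100
WARM_MAX_RANK = 10_000   # DAX serves ranks 101..10 000
SCORE_BUCKETS = 1_000    # warm-tier partition keys per hunt
MAX_SCORE = 1_000_000    # assumed exclusive upper bound on a score (hypothetical)


def tier_for_rank(rank: int) -> str:
    """Return which tier should serve a read for the given 1-based rank."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    if rank <= HOT_MAX_RANK:
        return "hot"
    if rank <= WARM_MAX_RANK:
        return "warm"
    return "cold"


def score_bucket(score: int) -> int:
    """Map a score onto one of SCORE_BUCKETS equal-width buckets."""
    clamped = min(max(score, 0), MAX_SCORE - 1)
    return clamped * SCORE_BUCKETS // MAX_SCORE


def warm_key(hunt_id: str, score: int) -> Tuple[str, int]:
    """Composite warm-cache key: a score change touches a single bucket."""
    return (hunt_id, score_bucket(score))
```

With equal-width buckets, a score change rewrites at most two partition keys (the old bucket and the new one), and usually just one. A real deployment might prefer quantile-based buckets if scores cluster heavily.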

The trade-off was accepting eventual consistency for ranks below the top 10 000 in exchange for a 99.9 % cache hit ratio on the hot path. We instrumented with OpenTelemetry and set an SLO: P99 ≤ 100 ms for players in the top 100, ≤ 300 ms for everyone else. The infra cost jumped from $2 400/month (Redis) to $4 800/month (Redis + DAX + Postgres + worker pools), but that was still cheaper than sharding Redis into four clusters with 128 namespaces each.
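A check against that two-tier SLO might look like the sketch below. It assumes latency samples have already been exported per cohort (for example from trace spans); the cohort names, `p99`, and `slo_report` are hypothetical, and the percentile uses the simple nearest-rank method rather than whatever the metrics backend computes.

```python
from typing import Dict, List

# SLO targets from the text, in milliseconds.
SLO_P99_MS: Dict[str, float] = {"top100": 100.0, "everyone_else": 300.0}


def p99(samples: List[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # ceil(0.99 * n) computed with integer arithmetic to avoid float rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]


def slo_report(latencies_ms: Dict[str, List[float]]) -> Dict[str, bool]:
    """Return, per cohort, whether its P99 is within the SLO target."""
    return {
        cohort: p99(samples) <= SLO_P99_MS[cohort]
        for cohort, samples in latencies_ms.items()
    }
```

Keeping the two cohorts separate matters here: a single global P99 could look healthy while the top-100 path, which has the tighter target, is out of budget.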

What The Numbers Said After

After the cutover the cache-avalanche 503s dropped to zero and global P99 settled at 89 ms. The real surprise was the worker pool: we expected an 8 % update conflict rate, but because the write-behind path used a conditional update on the score column, the 409s came in at only 3 %, mostly from players on the same Wi-Fi network who simultaneously submitted the same timestamped GPS coordinate. DAX throughput averaged 420 000 reads/sec with 1.8 ms latency, and Postgres kept the cold store write lag below 300 ms 99 % of the time. The cost delta paid for itself inside six weeks because customer support tickets about leaderboard stalls fell from 112 per event to 3, and our NPS rose from 48 to 67.
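The conditional update with retry on conflict can be illustrated with a small in-memory sketch. Everything here is a stand-in: `ColdStore`, `Conflict`, and `apply_score_event` are hypothetical names, a dict replaces Postgres, and the exception plays the role of the 409. In SQL the same idea is usually an UPDATE whose WHERE clause checks the previously read score.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


class Conflict(Exception):
    """The stored score changed between read and write (the 409 case)."""


@dataclass
class ColdStore:
    rows: Dict[str, int]  # player id -> score; stand-in for the Postgres table

    def read(self, player: str) -> int:
        return self.rows.get(player, 0)

    def conditional_update(self, player: str, expected: int, new: int) -> None:
        # Write only if nobody else changed the score since we read it.
        if self.rows.get(player, 0) != expected:
            raise Conflict(player)
        self.rows[player] = new


def apply_score_event(
    store: ColdStore, player: str, delta: int, max_attempts: int = 3
) -> Tuple[int, int]:
    """Apply a score delta; return (new_score, conflicts_seen)."""
    conflicts = 0
    for _ in range(max_attempts):
        current = store.read(player)
        try:
            store.conditional_update(player, current, current + delta)
            return current + delta, conflicts
        except Conflict:
            conflicts += 1  # re-read and try again with the fresh value
    raise Conflict(f"{player}: gave up after {max_attempts} attempts")
```

Because each retry re-reads the current score, a conflicting write is never overwritten; the worker simply reapplies its delta on top of it.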

What I Would Do Differently

I would not have introduced DAX. In practice DAX's eventually consistent reads (a three-second window) created a brief stale-read window that a few speedrunners exploited to resubmit a clue after seeing an outdated rank. We patched it by forcing every client to wait for the write-behind worker's 200 OK before allowing a new clue submission, but that added 50–120 ms of client-side stutter. Next time I'd push the sliding-window invalidation all the way down to DynamoDB, using a custom Go worker that streams the leaderboard delta every 250 ms, instead of relying on DAX's proprietary cache. The infra cost would drop back to $3 600/month and the stale-read window would collapse to one update cycle.
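The core of such a delta-streaming worker is the diff between two consecutive leaderboard snapshots. The sketch below shows only that step, in Python rather than Go for brevity; `Delta`, `leaderboard_delta`, and `apply_delta` are hypothetical names, and the 250 ms scheduling and transport are left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Delta:
    upserts: Dict[str, int]  # players whose score is new or changed
    removed: List[str]       # players no longer on the leaderboard


def leaderboard_delta(prev: Dict[str, int], curr: Dict[str, int]) -> Delta:
    """Compute the minimal change set that turns prev into curr."""
    upserts = {p: s for p, s in curr.items() if prev.get(p) != s}
    removed = sorted(p for p in prev if p not in curr)
    return Delta(upserts=upserts, removed=removed)


def apply_delta(snapshot: Dict[str, int], delta: Delta) -> Dict[str, int]:
    """Return a new snapshot with the delta applied (input is not mutated)."""
    merged = dict(snapshot)
    merged.update(delta.upserts)
    for player in delta.removed:
        merged.pop(player, None)
    return merged
```

A consumer that applies every delta in order holds a copy that is at most one cycle behind the source, which is the "one update cycle" stale window described above.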