Why We Got Ditched By Our Own Treasure Hunt Engine Before Scaling to 10K Concurrent Users

#webdev #programming #dataengineering #python

My team built a real-time treasure hunt game on top of our existing event-driven architecture. Players had to collect virtual "chests" as they progressed through the game. The game engine kept track of player scores, displayed real-time leaderboards, and adjusted game difficulty based on the scores.

## The Problem We Were Actually Solving

We noticed a peculiar anomaly in our system: whenever a player's score passed the one-million threshold, our game engine started returning incorrect scores. This was more than a score-accuracy problem. The leaderboards and the difficulty adjustment both read those scores, so the errors were tied to the game's entire state. The result was an unpredictable system and an unstable game.

## What We Tried First (And Why It Failed)

Our first guess was the batch processing pipeline. It ran every 30 minutes, and we suspected that score updates were sometimes getting lost between batches. We added an extra processor to run a mini-batch every 15 minutes, thinking this would solve the problem. It didn't.

## The Architecture Decision

After more careful analysis, we realized the issue had nothing to do with batch processing. It came from an implementation detail that only showed up under high concurrency: every time a player's score reached one million or more, our engine recomputed the entire leaderboard. In effect, we ran an O(n) pass over all player scores each time someone crossed the one-million threshold, and at scale that cost was killing our performance.
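To make the cost concrete, here is a minimal sketch of the pattern described above. It is not our engine's code: the names `THRESHOLD`, `rebuild_leaderboard`, and `apply_score` are hypothetical, and the scores live in a plain in-memory dict for illustration. The point is only that the rebuild touches every player, so its cost grows at least linearly with the player count (the sort used here is actually O(n log n)).

```python
THRESHOLD = 1_000_000  # assumed trigger point, taken from the description above


def rebuild_leaderboard(scores):
    """Re-rank every player from scratch; the cost grows with the player count."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def apply_score(scores, player, points):
    """Add points to a player and rebuild the whole board at or past the threshold.

    Returns the rebuilt leaderboard when a rebuild was triggered, else None.
    """
    new_score = scores.get(player, 0) + points
    scores[player] = new_score
    if new_score >= THRESHOLD:
        # Every update at or above the threshold pays for a full pass over all players.
        return rebuild_leaderboard(scores)
    return None
```

With thousands of concurrent players, many of them above the threshold, each of their updates would pay for a pass over everyone else's scores.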

## What The Numbers Said After

After moving to a more scalable leaderboard system, our latency dropped from 3000 ms to 300 ms. The query cost for leaderboard results fell by 75%, and we met our 1-second freshness SLA. Removing the O(n) operation also reduced the likelihood of score inconsistencies, so players' final scores accurately reflected their game performance.
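This post does not go into which leaderboard design replaced the full recompute, so the following is only one possible shape, not the system we shipped. It is a sketch of a bounded top-k board, assuming scores never decrease and that only the top `k` entries need to be displayed. The class name `TopKLeaderboard` and its methods are hypothetical. Each update inspects at most `k` entries, so the cost no longer depends on the total number of players.

```python
from __future__ import annotations


class TopKLeaderboard:
    """Tracks all scores but maintains only the top `k` entries incrementally."""

    def __init__(self, k: int) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.scores: dict[str, int] = {}  # every player's current score
        self.top: dict[str, int] = {}     # at most k entries

    def add_points(self, player: str, points: int) -> None:
        if points < 0:
            # Assumption of this sketch: scores only ever go up.
            raise ValueError("this sketch assumes scores never decrease")
        new_score = self.scores.get(player, 0) + points
        self.scores[player] = new_score
        if player in self.top or len(self.top) < self.k:
            self.top[player] = new_score
            return
        # Board is full: compare against the lowest entry only (O(k) scan).
        lowest = min(self.top, key=self.top.get)
        if new_score > self.top[lowest]:
            del self.top[lowest]
            self.top[player] = new_score

    def leaderboard(self) -> list[tuple[str, int]]:
        """Return the top entries, highest score first (sorts k items, not n)."""
        return sorted(self.top.items(), key=lambda item: item[1], reverse=True)
```

A short usage example:

```python
board = TopKLeaderboard(k=2)
board.add_points("ana", 500_000)
board.add_points("bo", 300_000)
board.add_points("cy", 1_050_000)
print(board.leaderboard())  # cy first, then ana; bo is evicted from the top 2
```

The trade-off is that this only answers "who is in the top k". Arbitrary rank lookups would need a different structure, such as a sorted set in an external store.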

## What I Would Do Differently

If I were to do it again, I would tackle this problem much sooner. Measuring latency and query cost from the start would have shown us the moment we hit the performance bottleneck, letting us catch the mistake before we scaled to 10,000 concurrent users.
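As a rough illustration of what "measuring early" could look like, here is a small sketch that times a block of code and checks a percentile against a budget. The `LatencyRecorder` class, its method names, and the 1000 ms budget in the usage example are all hypothetical. A production setup would more likely use an existing metrics library than hand-rolled code like this.

```python
import math
import time
from contextlib import contextmanager


class LatencyRecorder:
    """Collects wall-clock durations in milliseconds for a code path."""

    def __init__(self):
        self.samples_ms = []

    @contextmanager
    def measure(self):
        """Time the enclosed block and record the duration, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples_ms.append((time.perf_counter() - start) * 1000.0)

    def percentile(self, pct):
        """Nearest-rank percentile of the recorded samples."""
        if not self.samples_ms:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples_ms)
        rank = max(1, math.ceil(pct / 100 * len(ordered)))
        return ordered[rank - 1]

    def breaches(self, budget_ms, pct=95):
        """True when the chosen percentile exceeds the latency budget."""
        return self.percentile(pct) > budget_ms
```

Usage, with a hypothetical budget:

```python
recorder = LatencyRecorder()
with recorder.measure():
    sorted(range(100_000), reverse=True)  # stand-in for a leaderboard query
if recorder.breaches(budget_ms=1000):
    print("leaderboard latency is over budget")
```

Tracking a high percentile rather than the average matters here, because a slow path that only triggers for some players can hide behind a healthy mean.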
