Our Treasure Hunt Engine Was a Disaster Because We Let Math Overflow

#webdev #programming #security #appsec

The Problem We Were Actually Solving

At Veltrix, we built a treasure hunt engine to boost customer engagement and retention. The more events users triggered, the more the system learned about and adapted to individual behavior. To keep up with a growing user base, our engineers focused on minimizing response times and maximizing event throughput, even though we knew this would increase system complexity.

What We Tried First (And Why It Failed)

Initially, we tried to predict event outcomes mathematically and cache the predictions. The idea was to keep likely outcomes in memory so that we could cut database queries and relieve system overload. It seemed like a good trade-off: less load on the database while still returning instant results. In practice, the weakness showed quickly. As the number of events grew, the cache went stale quickly, and a series of high-priority events then triggered math overflows that crashed the system.
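To make the staleness problem concrete, here is a minimal sketch of an in-memory cache that expires entries instead of serving them forever. It is not the code we ran at Veltrix: the `TTLCache` name, the `ttl_seconds` parameter, and the injectable `clock` are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Hashable, Optional, Tuple


class TTLCache:
    """Hypothetical in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, object]] = {}

    def put(self, key: Hashable, value: object) -> None:
        # Store the value together with the time at which it stops being valid.
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: Hashable) -> Optional[object]:
        entry = self._entries.get(key)
        if entry is None:
            return None  # cache miss
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Stale prediction: drop it so the caller falls back to the database.
            del self._entries[key]
            return None
        return value


# Usage with a fake clock, so the example does not depend on wall time.
now = [0.0]
cache = TTLCache(ttl_seconds=5.0, clock=lambda: now[0])
cache.put("event-123", "gold")
fresh = cache.get("event-123")    # still within the TTL
now[0] = 5.0
expired = cache.get("event-123")  # TTL elapsed, treated as a miss
```

An expiry like this only bounds how stale a cached prediction can get. It does nothing about the arithmetic itself, which is where our real problem was.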

The Architecture Decision

We made the crucial mistake of scaling horizontally without addressing the root cause: math overflows triggered by high-priority events flooding the system. Our team rationalized that the cost of extra hardware paled in comparison to the benefits of a more robust, dynamic system. So we added more servers to handle the increased load, which only led to more resource contention and longer response times. Meanwhile, the overflows kept compounding and kept crashing the system.
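Guarding the arithmetic at the source could look something like the sketch below. It assumes, purely for illustration, that scores were accumulated in a signed 32-bit field; Python integers do not overflow on their own, so the bounds are enforced explicitly. `checked_add`, `saturating_add`, and `ScoreOverflow` are hypothetical names, not part of our engine.

```python
# Assumed storage width: signed 32-bit. The real width is not stated in this post.
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


class ScoreOverflow(ArithmeticError):
    """Raised when an update would leave the assumed 32-bit range."""


def checked_add(total: int, delta: int) -> int:
    """Add delta to total, failing loudly instead of wrapping or crashing later."""
    result = total + delta
    if result < INT32_MIN or result > INT32_MAX:
        raise ScoreOverflow(f"{total} + {delta} is outside the 32-bit range")
    return result


def saturating_add(total: int, delta: int) -> int:
    """Add delta to total, clamping the result to the 32-bit range."""
    return max(INT32_MIN, min(INT32_MAX, total + delta))


# Usage: a burst of high-priority events pushes the total toward the limit.
total = INT32_MAX - 10
total = saturating_add(total, 1_000)  # clamps at INT32_MAX instead of overflowing
```

Whether to fail fast with `checked_add` or clamp with `saturating_add` is a design choice: failing fast surfaces the bug, while clamping keeps the request alive at the cost of a less accurate score. Either way, no amount of extra servers substitutes for the check.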

What The Numbers Said After

After scaling out, our monitoring showed a 45% increase in request timeouts, a 21% rise in system crashes, and a 9% drop in user engagement. These numbers exposed both the flaw in our math-based solution and the cost of leaving the root cause unaddressed. Scaling without fixing the underlying problem had only made it worse.

What I Would Do Differently

Looking back, I would address the math overflows at their source. A probabilistic simulation or an efficient approximation algorithm could have prevented the crashes and kept response times within acceptable levels. We should also have factored in the impact of crashes on user engagement and retention from the start. A more holistic view of system performance, user experience, and business goals would have led us to a more robust and reliable system from the outset.
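As one example of what an "efficient approximation" might mean here, the sketch below tracks an exponential moving average instead of an ever-growing running total. This is a technique I am swapping in for illustration, not something we built: `EmaEstimator` and `alpha` are hypothetical names. Each update is a weighted average of the new observation and the previous estimate, so the estimate stays within the range of the values seen so far.

```python
from typing import Optional


class EmaEstimator:
    """Hypothetical bounded estimator based on an exponential moving average."""

    def __init__(self, alpha: float) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self._alpha = alpha  # weight given to the newest observation
        self._value: Optional[float] = None

    def update(self, observation: float) -> float:
        if self._value is None:
            # The first observation seeds the estimate.
            self._value = float(observation)
        else:
            # Weighted average of new and old: no accumulator that grows forever.
            self._value = (self._alpha * observation
                           + (1.0 - self._alpha) * self._value)
        return self._value


# Usage: feed per-event scores and read back a smoothed estimate.
estimator = EmaEstimator(alpha=0.5)
first = estimator.update(10)   # 10.0, seeded by the first observation
second = estimator.update(20)  # 15.0, halfway between 10 and 20
```

The trade-off is precision: older events fade from the estimate at a rate set by `alpha`, which is acceptable only if the engine needs a recent trend rather than an exact lifetime total.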