Most Treasure Hunts Are Actually Just Denial-of-Service Attacks Waiting to Happen

#webdev #programming #security #appsec

The Problem We Were Actually Solving

When we first implemented T.H.E., we were trying to create an engaging user experience for our players. We wanted to hide items in chests and let them be discovered through a fun, interactive process. We thought this would increase user satisfaction and encourage players to explore more of our world. In hindsight, our goal was to create a game within the game.

What We Tried First (And Why It Failed)

Initially, we tried to solve this problem by throwing more power at it. We added more CPU, RAM, and storage to our servers, on the assumption that they just needed a bit more oomph to handle the increased load. But as we scaled, the problems only got worse. Our database queries became slower, our network requests took longer to process, and our users grew frustrated with the delays. We were caught in a vicious cycle of throwing hardware at the problem while neglecting the root cause.

The Architecture Decision

The problem was deeply rooted in our architecture. We had designed T.H.E. to run as a separate service, communicating with our game server through RESTful APIs. This allowed for a certain level of isolation and modularity, but it also created a bottleneck. Every time a user interacted with T.H.E., it would trigger a cascade of requests to our game server, database, and other services. This increased latency and created a single point of failure.

What The Numbers Said After

Our monitoring data showed that around 90% of T.H.E. requests came from a mere 10% of our players. These power users generated an enormous amount of load on our servers, causing the delays and frustration we saw in the wild. We also noticed that around 70% of T.H.E. requests retrieved the same set of items. This told us that our system was a classic case of the 80-20 rule: a small group of players, asking for a small group of items, accounted for the majority of the load.
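For readers who want to check their own traffic for the same skew, here is a minimal sketch of how such a concentration figure can be computed from a request log. The log format (pairs of player ID and item ID), the `top_share` helper, and the sample data are all hypothetical; they are not taken from our monitoring stack.

```python
from collections import Counter


def top_share(keys, top_fraction):
    """Return the share of all requests produced by the busiest
    `top_fraction` of distinct keys (players, items, ...)."""
    counts = Counter(keys)
    if not counts:
        return 0.0
    # Always include at least one key so tiny samples still give a result.
    top_n = max(1, int(len(counts) * top_fraction))
    top_total = sum(count for _, count in counts.most_common(top_n))
    return top_total / sum(counts.values())


# Hypothetical log: one (player_id, item_id) pair per request.
# One player hammers the same item; nine others make one request each.
request_log = [("p0", "gold_key")] * 11 + [
    (f"p{i}", f"item_{i}") for i in range(1, 10)
]

player_share = top_share([player for player, _ in request_log], 0.1)
item_share = top_share([item for _, item in request_log], 0.1)
print(f"top 10% of players: {player_share:.0%} of requests")
print(f"top 10% of items:   {item_share:.0%} of requests")
```

Running the same function over player IDs and over item IDs gives the two views described above: who generates the load, and what they are asking for.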

What I Would Do Differently

Looking back, I would have taken a much more holistic approach to designing T.H.E. from the start. I would have built security and performance considerations into the architecture, rather than treating them as an afterthought. For instance, I would have implemented rate limiting and caching to mitigate the impact of power users. I would also have optimized our database schema and queries to reduce the latency of retrieving items. And I would have considered a more distributed architecture, in which T.H.E. was not a single point of failure. By taking a more security-minded approach up front, we could have avoided the problems we encountered later on.
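To make the rate limiting and caching idea concrete, here is a minimal in-process sketch. It assumes a token bucket per player and a time-to-live cache per item, which is one common way to do this and not necessarily the right fit for every setup. Every name in it (`TokenBucket`, `TTLCache`, `ChestService`, `open_chest`, `load_item`) and every default value is hypothetical.

```python
import time


class TokenBucket:
    """Allows bursts of up to `capacity` requests, then `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class TTLCache:
    """Keeps loaded values for `ttl_seconds` so hot items skip the backend."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)
        self.entries[key] = (now + self.ttl, value)
        return value


class ChestService:
    """Hypothetical front door: rate-limit per player, cache per item."""

    def __init__(self, load_item, rate=5, burst=10, ttl_seconds=30,
                 clock=time.monotonic):
        self.load_item = load_item  # assumed to be the slow backend lookup
        self.rate, self.burst, self.clock = rate, burst, clock
        self.cache = TTLCache(ttl_seconds, clock)
        self.buckets = {}  # grows per player; a real service would evict

    def open_chest(self, player_id, item_id):
        bucket = self.buckets.get(player_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, self.clock)
            self.buckets[player_id] = bucket
        if not bucket.allow():
            return None  # a caller could map this to HTTP 429
        return self.cache.get_or_load(item_id, self.load_item)
```

The limiter runs before the cache lookup on purpose: a throttled power user costs almost nothing, and everyone else's repeated requests for popular items are served from memory instead of the database. With several service instances, both pieces of state would need to live in a shared store, which this sketch does not attempt.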