The High Cost of Premature Optimisation in a Distributed Treasure Hunt Engine

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

What we were really fighting was a classic case of premature optimisation. We were trying to engineer away server crashes before we knew whether crashes were even the problem worth solving. In reality, our Friday night spikes only hurt because we had no user profiling to block bots and abuse, and no robust error handling to roll back the game state when things went south.
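To make "roll back the game state" concrete, here is a minimal sketch of the idea in Python. It assumes the game state is a plain in-memory dict, which is an illustration and not how our engine stored it; the name `transactional` is hypothetical.

```python
import copy
from contextlib import contextmanager


@contextmanager
def transactional(game_state):
    """Restore game_state to its prior contents if the block raises.

    Assumption: game_state is a plain dict that is cheap enough to deep-copy.
    """
    snapshot = copy.deepcopy(game_state)
    try:
        yield game_state
    except Exception:
        # Undo every change made inside the block, then re-raise the error
        # so the caller still sees what went wrong.
        game_state.clear()
        game_state.update(snapshot)
        raise


if __name__ == "__main__":
    state = {"p1": {"gold": 1}}
    try:
        with transactional(state) as s:
            s["p1"]["gold"] = 99
            raise RuntimeError("validation failed halfway through the update")
    except RuntimeError:
        pass
    print(state)
```

A deep copy per update would be far too expensive for a large state, so treat this as a picture of the behaviour we were missing rather than a design.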

What We Tried First (And Why It Failed)

In our first attempt at scaling, we tried using NGINX as a reverse proxy and Memcached to store the game state. We configured our server to use multiple cores and launched a small pool of workers to handle the load. It sounded great on paper, but in practice our NGINX configuration kept causing problems, and Memcached could not keep up with the cache eviction rate. We experienced frequent timeouts, and I recall getting a call from Sarah at 3 AM saying, "We're getting a 504 Gateway Timeout from one of the servers." Our "solution" was causing more problems than it was solving.

The Architecture Decision

As we dug deeper, we discovered that most of our problems were related to a lack of control in our Veltrix configuration. We were trying to dynamically generate game states and player connections on the fly, which made it impossible to predict and prepare for our Friday night spikes. I convinced the team to adopt a more event-driven architecture, using Apache Kafka as our message broker to decouple player connections from game state generation. We introduced a system of producer-consumer worker nodes that would handle tasks like game state validation, player profiling, and caching in isolation. This forced us to rethink our service boundaries, and we quickly found that we could build a more robust and scalable system by focusing on one task at a time.
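The decoupling is easier to see in code. The sketch below is not our Kafka setup: it swaps in Python's standard-library `queue.Queue` for the broker topics so it runs on its own, and the event shape `(player_id, action)`, the allowed actions, and the function names are all made up for illustration.

```python
import queue
import threading

# Assumption: an event is a (player_id, action) tuple. Hypothetical shape.
ALLOWED_ACTIONS = {"join", "move", "dig"}


def validate_worker(inbox, outbox):
    """Consume raw player events and forward only the valid ones."""
    while True:
        event = inbox.get()
        if event is None:       # sentinel: upstream is finished
            outbox.put(None)    # tell the next stage to stop too
            break
        _player_id, action = event
        if action in ALLOWED_ACTIONS:
            outbox.put(event)


def state_worker(inbox, game_state):
    """Apply validated events to the game state, one at a time."""
    while True:
        event = inbox.get()
        if event is None:
            break
        player_id, action = event
        game_state.setdefault(player_id, []).append(action)


def run_pipeline(events):
    """Push events through validation and state updates via two queues."""
    raw, validated = queue.Queue(), queue.Queue()
    game_state = {}
    workers = [
        threading.Thread(target=validate_worker, args=(raw, validated)),
        threading.Thread(target=state_worker, args=(validated, game_state)),
    ]
    for w in workers:
        w.start()
    for event in events:        # the "player connection" side only enqueues
        raw.put(event)
    raw.put(None)
    for w in workers:
        w.join()
    return game_state


if __name__ == "__main__":
    print(run_pipeline([("p1", "join"), ("p1", "hack"), ("p2", "join")]))
```

The point is that the code accepting player events never calls the code that mutates game state. Each stage only sees a queue, which is what let us reason about, and scale, one task at a time.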

What The Numbers Said After

The results spoke for themselves. We implemented our new system over a weekend, and when the next Friday night spike arrived, it didn't even register. The 1200 players we had in 2019 barely caused a hiccup, and our error rates dropped from 5% to 0.5%, a tenfold reduction. More importantly, we never had to deal with those late-night calls from Sarah again.

What I Would Do Differently

If I were to do it again, I'd take a different approach from the beginning. I'd invest more time in understanding the real pain points of our ops team and our users before even thinking about scaling. I'd ask more questions, like "What if we treated our 20% Friday night spike as normal variation, rather than a problem to be solved?" or "What would happen if we just said no to users trying to join the game and showed a clear 'game full' message?" That way, we could have avoided a lot of pain and confusion.
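For what it's worth, the "game full" idea is only a few lines. This is a minimal sketch, assuming a single in-memory lobby with a fixed capacity; the `GameLobby` class and the capacity value are hypothetical, not something we shipped.

```python
class GameLobby:
    """Admit players up to a fixed capacity and refuse the rest clearly."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.players = set()

    def try_join(self, player_id):
        """Return (admitted, message) instead of letting the server overload."""
        if player_id in self.players:
            return True, "already joined"
        if len(self.players) >= self.capacity:
            return False, "game full"
        self.players.add(player_id)
        return True, "joined"

    def leave(self, player_id):
        """Free a slot; leaving twice is harmless."""
        self.players.discard(player_id)


if __name__ == "__main__":
    lobby = GameLobby(capacity=2)  # capacity chosen only for the demo
    for player in ["ana", "ben", "cy"]:
        print(player, lobby.try_join(player))
```

A real deployment would need the player count shared across servers, but even this version turns an overload into a predictable, explainable refusal.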