DEV Community

Lillian Dube

The Folly of Treating Treasure Hunts as Synchronous Operations

The Problem We Were Actually Solving

On the surface, the problem was a Treasure Hunt engine that, for all its features, couldn't keep up with the number of concurrent requests it was receiving. We soon discovered that the root cause was more fundamental: the Treasure Hunt engine was being called as a synchronous operation, so every caller blocked until it finished, even though nothing required that.

What We Tried First (And Why It Failed)

Our first approach was simply to increase the CPU and memory allocated to the Treasure Hunt engine. It seemed like a straightforward solution to a straightforward problem, so we spun up a high-powered instance and pointed the Treasure Hunt engine at it. We quickly ran into resource contention: the server's other components (the API, the database, and so on) were competing with the Treasure Hunt engine for the same resources. The bottleneck remained, even on the high-powered instance.

The Architecture Decision

The solution we ultimately settled on was to re-architect the Treasure Hunt engine so that its operations are asynchronous. We implemented a message-queue-based system in which the Treasure Hunt engine publishes requests to a queue and picks up the results on completion, instead of holding the caller for the duration of each request. This allowed the server's other components to continue operating without interference from the Treasure Hunt engine. We used RabbitMQ as the message broker and set up monitoring to make sure messages weren't getting stuck in the queue.
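To make the pattern concrete, here is a minimal sketch of the enqueue-then-await-completion flow. It is not our production code: it uses Python's in-process `asyncio.Queue` as a stand-in for RabbitMQ so that it runs without a broker, and `solve_hunt`, `worker`, `submit`, and `run_demo` are hypothetical names chosen for illustration.

```python
import asyncio


async def solve_hunt(request: dict) -> dict:
    """Hypothetical stand-in for one unit of Treasure Hunt work."""
    await asyncio.sleep(0.01)  # simulate slow work without blocking the event loop
    return {"player": request["player"], "found": True}


async def worker(queue: asyncio.Queue) -> None:
    """Consume requests from the queue and resolve each caller's future."""
    while True:
        request, done = await queue.get()
        try:
            done.set_result(await solve_hunt(request))
        except Exception as exc:  # hand failures back to the caller
            done.set_exception(exc)
        finally:
            queue.task_done()


async def submit(queue: asyncio.Queue, request: dict) -> dict:
    """Enqueue a request and await its result; other tasks keep running meanwhile."""
    done = asyncio.get_running_loop().create_future()
    await queue.put((request, done))
    return await done


async def run_demo(players: int = 20, workers: int = 4) -> list:
    # Bounded queue (an assumption): producers wait when it is full.
    queue: asyncio.Queue = asyncio.Queue(maxsize=100)
    tasks = [asyncio.create_task(worker(queue)) for _ in range(workers)]
    results = await asyncio.gather(
        *(submit(queue, {"player": i}) for i in range(players))
    )
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results


if __name__ == "__main__":
    print(len(asyncio.run(run_demo())))
```

With a real broker, the queue and the workers would live in separate processes, and the completion signal would typically travel back over a reply queue rather than an in-memory future.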

What The Numbers Said After

After implementing this new architecture, we saw a significant drop in disconnections and 'unable to connect to the server' errors. We were able to scale up to 1500 active players without seeing a single error related to the Treasure Hunt engine. Our message queue metrics showed that we were processing around 500 messages per second, with an average latency of around 100ms. CPU and memory usage on the server also fell significantly, which made it easier to scale up further.
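One way to read those two queue metrics together is Little's law, which relates throughput and latency to the average number of messages in flight. The sketch below assumes the 100ms figure is the full time a message spends in the system (queueing plus processing); `in_flight` is a hypothetical helper, not something from our monitoring stack.

```python
def in_flight(throughput_per_s: float, latency_s: float) -> float:
    """Little's law: average items in the system = arrival rate * time in system."""
    return throughput_per_s * latency_s


# Figures from the metrics above: ~500 messages/s at ~100 ms average latency.
print(in_flight(500, 0.100))
```

Under that assumption, the system was holding roughly 50 messages at any moment, which gives a rough baseline for spotting a queue that is starting to back up.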

What I Would Do Differently

If I were to do this again, I would spend more time on load testing before deploying the new architecture. While our message queue-based system performed well in production, we did see some edge cases where messages got stuck in the queue. If I had done more load testing, I might have caught these issues earlier and avoided some downtime.



