Lillian Dube

Scaling Treasure Hunts Without Losing Your Shirt

The Problem We Were Actually Solving

I was tasked with leading development of a real-time treasure hunt engine for a mobile gaming platform, with the goal of handling millions of concurrent users. The engine had to generate clues, track user progress, and evaluate winning conditions, all with sub-second latency. Our initial prototype used Node.js and MongoDB, with Redis providing simple pub/sub messaging. As we started to scale the system, however, we ran into three recurring problems: high CPU utilization, memory leaks, and frequent Redis connection timeouts.

What We Tried First (And Why It Failed)

Our first attempt was to throw more resources at the problem. We moved our Node.js instances to larger machines, added more Redis nodes to the cluster, and tried to optimize our MongoDB queries. This bought only temporary relief, and the same issues soon returned. CPU utilization would spike until the system became unresponsive, and the memory leaks would eventually crash the Node.js instances. Redis connection timeouts also increased significantly, and each one caused the treasure hunt engine to fail. I recall one particularly egregious error message: Error: Redis connection timed out after 5000ms. It became clear that scaling up was not sustainable and that we needed to rethink our architecture.

The Architecture Decision

After much discussion and analysis, we decided to refactor the treasure hunt engine into a microservices architecture, with each service responsible for one function: clue generation, user progress tracking, or winning condition evaluation. We chose Apache Kafka as our messaging system, which gave us a highly scalable and fault-tolerant way to handle high volumes of data. For storage we combined PostgreSQL and Apache Cassandra, with PostgreSQL holding the relational data and Cassandra holding the high-volume, high-velocity data. We did not take this decision lightly, as it required a significant amount of rework. It ultimately proved to be the right call, because it allowed us to scale the system to handle millions of concurrent users.
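
To make the service split more concrete, here is a minimal sketch of how progress tracking and winning condition evaluation might communicate purely through events. It is an illustration, not the production code: a small in-memory TopicBus stands in for Kafka, and every name in it (TopicBus, the topic strings, the event shapes, TOTAL_CLUES) is a hypothetical chosen for the example.

```typescript
// In-memory stand-in for a message broker such as Kafka (assumption: the real
// system would use a Kafka client; this only mimics topic-based pub/sub).
type Handler<T> = (event: T) => void;

class TopicBus {
  private handlers = new Map<string, Handler<any>[]>();

  subscribe<T>(topic: string, handler: Handler<T>): void {
    const list = this.handlers.get(topic) ?? [];
    list.push(handler);
    this.handlers.set(topic, list);
  }

  publish<T>(topic: string, event: T): void {
    // Deliver synchronously to every subscriber of the topic.
    for (const handler of this.handlers.get(topic) ?? []) {
      handler(event);
    }
  }
}

// Hypothetical event shapes exchanged between services.
interface ClueSolved { userId: string; clueId: number; }
interface ProgressUpdated { userId: string; solvedCount: number; }
interface HuntWon { userId: string; }

const TOTAL_CLUES = 3; // assumed hunt length for this example

// Progress tracking service: counts distinct solved clues per user.
function startProgressService(bus: TopicBus): void {
  const solved = new Map<string, Set<number>>();
  bus.subscribe<ClueSolved>("clue.solved", (e) => {
    const clues = solved.get(e.userId) ?? new Set<number>();
    clues.add(e.clueId); // a Set ignores duplicate submissions of the same clue
    solved.set(e.userId, clues);
    bus.publish<ProgressUpdated>("progress.updated", {
      userId: e.userId,
      solvedCount: clues.size,
    });
  });
}

// Winning condition service: announces each winner exactly once.
function startWinService(bus: TopicBus): void {
  const winners = new Set<string>();
  bus.subscribe<ProgressUpdated>("progress.updated", (e) => {
    if (e.solvedCount >= TOTAL_CLUES && !winners.has(e.userId)) {
      winners.add(e.userId);
      bus.publish<HuntWon>("hunt.won", { userId: e.userId });
    }
  });
}

// Wires the services together and replays a few events.
function runDemo(): string[] {
  const bus = new TopicBus();
  const won: string[] = [];
  startProgressService(bus);
  startWinService(bus);
  bus.subscribe<HuntWon>("hunt.won", (e) => { won.push(e.userId); });

  for (const clueId of [1, 2, 2, 3]) {
    bus.publish<ClueSolved>("clue.solved", { userId: "alice", clueId });
  }
  bus.publish<ClueSolved>("clue.solved", { userId: "bob", clueId: 1 });
  return won;
}
```

Neither service calls the other directly; each one only reads from and writes to topics. That decoupling is what lets the services be deployed and scaled independently once a real broker replaces the in-memory bus.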

What The Numbers Said After

After implementing the new architecture, we saw a significant improvement in the system's performance and scalability. CPU utilization dropped by over 50%, and the memory leaks were all but eliminated. Redis connection timeouts also decreased sharply, with the error rate dropping from 10% to less than 1%. Latency improved as well, with the average response time dropping from 500ms to less than 100ms. We also recorded a 30% increase in throughput, a 25% decrease in latency, and a 40% decrease in error rate. These numbers confirmed that the new architecture was working as intended.

What I Would Do Differently

In retrospect, I would have done more experimentation and testing before settling on our final architecture. We were under a tight deadline and had to make some decisions quickly, which meant we did not fully evaluate all of our options. If I had to do it again, I would explore other messaging systems, such as Amazon SQS or Google Cloud Pub/Sub, to see whether they would have been a better fit for our use case. I would also test our database systems more thoroughly to ensure they were properly optimized for our workload. Finally, I would implement more monitoring and alerting, so that we learned about issues before they became critical. Despite these regrets, I am proud of what we accomplished, and I believe that our treasure hunt engine is a testament to the power of good architecture and design.
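
As one example of the kind of alerting meant here, the sketch below tracks an error rate over a sliding time window and flags when it crosses a threshold. It is a hypothetical illustration rather than something the team built: ErrorRateMonitor, its options, and the values used in the usage note are all assumptions, and a real deployment would more likely rely on a dedicated monitoring stack.

```typescript
// Hypothetical sliding-window error-rate monitor (names and values are assumptions).
interface Sample { timestampMs: number; ok: boolean; }

class ErrorRateMonitor {
  private samples: Sample[] = [];
  private readonly windowMs: number;
  private readonly threshold: number;
  private readonly minSamples: number;

  constructor(windowMs: number, threshold: number, minSamples: number) {
    this.windowMs = windowMs;     // how far back to look
    this.threshold = threshold;   // alert when the error rate exceeds this fraction
    this.minSamples = minSamples; // avoid alerting on a handful of requests
  }

  // Record the outcome of one request at the given time.
  record(ok: boolean, nowMs: number): void {
    this.samples.push({ timestampMs: nowMs, ok });
    this.evict(nowMs);
  }

  // Fraction of failed requests inside the window (0 when the window is empty).
  errorRate(nowMs: number): number {
    this.evict(nowMs);
    if (this.samples.length === 0) return 0;
    const failures = this.samples.filter((s) => !s.ok).length;
    return failures / this.samples.length;
  }

  // True when there is enough traffic and the error rate is above the threshold.
  shouldAlert(nowMs: number): boolean {
    const rate = this.errorRate(nowMs);
    return this.samples.length >= this.minSamples && rate > this.threshold;
  }

  // Drop samples older than the window. Assumes records arrive in time order.
  private evict(nowMs: number): void {
    const cutoff = nowMs - this.windowMs;
    while (this.samples.length > 0) {
      const oldest = this.samples[0];
      if (oldest === undefined || oldest.timestampMs >= cutoff) break;
      this.samples.shift();
    }
  }
}
```

A monitor created with a 60-second window, a 5% threshold, and a 10-sample minimum would call record after each request and check shouldAlert on a timer. The minimum sample count is there so that one failure during a quiet period does not page anyone.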
