DEV Community

Lillian Dube


The Treasure Hunt Engine Was a Scaling Nightmare and We Should Have Seen It Coming

The Problem We Were Actually Solving

I was on the operations team responsible for scaling the Treasure Hunt Engine, a system built to handle high volumes of user-generated content and complex queries. As we approached 100,000 concurrent users, the system started failing intermittently and latency climbed. The Veltrix documentation offered some guidance on scaling, but we clearly needed to dig into the system's architecture to find the root cause. Our metrics showed a sharp rise in error rates, a mean time to recovery of over 30 minutes, and an average latency of 500ms, which was unacceptable for our use case.

What We Tried First (And Why It Failed)

Initially, we tried to fix the issues by adding more nodes to the cluster and giving each node more memory. We used Prometheus to monitor the system's performance and Grafana to visualize the metrics. Despite the extra resources, the failures and performance degradation continued. The error messages in the logs were cryptic, but they pointed at the database connection pool and the message queue. The connection pool was being exhausted, which surfaced as a java.sql.SQLException: Connection limit exceeded error, and the message queue was backing up, which surfaced as an org.apache.kafka.common.errors.TimeoutException. We spent several weeks tuning the configuration before it became clear that we were treating the symptoms, not the underlying cause.
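To make the two failure modes concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from our codebase, and every figure in it (node counts, pool sizes, message rates, the database limit) is a hypothetical value chosen only to show the arithmetic. It illustrates one way that adding nodes can push a cluster past a fixed database connection limit, and how a queue backlog grows whenever producers outpace consumers.

```python
import math

def total_db_connections(nodes: int, pool_size_per_node: int) -> int:
    """Connections the cluster can open if every node fills its pool."""
    return nodes * pool_size_per_node

def connections_needed(requests_per_second: int, hold_time_ms: int) -> int:
    """Little's law: average connections in use = arrival rate x hold time."""
    return math.ceil(requests_per_second * hold_time_ms / 1000)

def backlog_after(seconds: int, produced_per_second: int, consumed_per_second: int) -> int:
    """Messages waiting in the queue after `seconds`, starting from empty."""
    return max(0, (produced_per_second - consumed_per_second) * seconds)

# Hypothetical figures, not our production values.
DB_MAX_CONNECTIONS = 500
before = total_db_connections(nodes=10, pool_size_per_node=40)
after = total_db_connections(nodes=20, pool_size_per_node=40)

print(before, before <= DB_MAX_CONNECTIONS)  # fits under the limit
print(after, after <= DB_MAX_CONNECTIONS)    # doubling the nodes overshoots it
print(connections_needed(requests_per_second=2000, hold_time_ms=50))
print(backlog_after(seconds=60, produced_per_second=12000, consumed_per_second=10000))
```

The point of the sketch is that more nodes and more memory do nothing for a shared limit: if the database caps connections and the consumers cannot keep up, extra application capacity only makes the system reach those limits faster.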

The Architecture Decision

After weeks of struggle, we stepped back and re-evaluated the system's architecture. We realized that the Treasure Hunt Engine had not been designed to scale horizontally, and that the existing implementation created bottlenecks in the database and the message queue. We decided to refactor the system into microservices, with separate services for handling user requests, processing queries, and managing the database connection pool. We also decided to use a message queue with a more robust configuration, such as Apache Kafka, to handle the high volume of messages. We did not take this decision lightly, because it required significant changes to the codebase and infrastructure, but we believed it was necessary for the long-term scalability and reliability of the system.
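The idea behind a dedicated service for the connection pool is that one component owns the limit and everything else has to ask for a slot. The following is a minimal Python sketch of that idea, not our actual implementation: `ConnectionGate`, `PoolExhausted`, and the parameter names are hypothetical, and a real service would hand out live database connections rather than bare slots.

```python
import threading
from contextlib import contextmanager

class PoolExhausted(Exception):
    """Raised when no connection slot frees up within the timeout."""

class ConnectionGate:
    """Caps how many callers may hold a database connection at once."""

    def __init__(self, max_connections: int, acquire_timeout_s: float):
        self._slots = threading.BoundedSemaphore(max_connections)
        self._timeout = acquire_timeout_s

    @contextmanager
    def connection(self):
        # Wait briefly for a free slot, then fail fast instead of queueing forever.
        if not self._slots.acquire(timeout=self._timeout):
            raise PoolExhausted("no free connection slot")
        try:
            yield  # a real service would yield a live connection object here
        finally:
            # Always return the slot, even if the caller's query raised.
            self._slots.release()

# Usage: two slots, so a third concurrent caller is rejected quickly.
gate = ConnectionGate(max_connections=2, acquire_timeout_s=0.01)
with gate.connection():
    with gate.connection():
        try:
            with gate.connection():
                pass
        except PoolExhausted as exc:
            print("rejected:", exc)
```

Failing fast with a clear error is a design choice: callers can retry or shed load, which is easier to reason about than a pile of requests all waiting on a pool that will never have room for them.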

What The Numbers Said After

After implementing the new architecture, performance and reliability improved markedly. The error rate fell by over 90%, and average latency dropped to less than 50ms. The system handled over 200,000 concurrent users without significant issues. The database connection pool was no longer a bottleneck, and the message queue kept up with the message volume without backing up. Mean time to recovery fell to under 5 minutes, and errors per second dropped sharply. We were also able to reduce the number of nodes in the cluster, which cut costs significantly.
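For a sense of scale, the before-and-after figures can be turned into percentage reductions. This short Python sketch uses the boundary values from this post (500ms to 50ms, 30 minutes to 5 minutes); since the real numbers were "less than 50ms", "over 30 minutes", and "under 5 minutes", the results should be read as lower bounds rather than exact measurements.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) * 100 / before

latency_cut = percent_reduction(before=500, after=50)  # milliseconds
mttr_cut = percent_reduction(before=30, after=5)       # minutes

print(f"latency reduced by at least {latency_cut:.1f}%")
print(f"MTTR reduced by at least {mttr_cut:.1f}%")
```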

What I Would Do Differently

In hindsight, I would have addressed the scalability issues proactively. We should have anticipated the problems that growth would bring and acted on them earlier. I also wish we had had better visibility into the system's performance and behavior, which would have let us find the root cause much sooner, and more prior experience with microservices and message queues, which would have made the transition easier. Despite the challenges, I am proud of what we accomplished, and I believe the experience has made me a better engineer. I learned the importance of proactive planning, careful monitoring, and a deep understanding of the system's architecture, and I will carry those lessons into future projects.



