Designing a Treasure Hunt Engine to Thrive in the Face of Server Scaling: A Cautionary Tale of Premature Optimisation

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Our initial performance metrics told us that the system was struggling under heavy load. Concurrent requests were spiking, causing timeouts, and users were complaining of unresponsive interfaces. We thought the problem was rooted in the database queries, which were taking too long to execute. The error message "database transaction timeout" echoed in my mind, reinforcing our assumption that we needed to speed up the queries. The solution seemed simple: upgrade the database instance, and voilà!

What We Tried First (And Why It Failed)

We began by upgrading the database instance from 16 GB to 32 GB, expecting a significant performance boost. We also tweaked the database indexing, hoping to reduce query times. However, we soon realised that the database was not the bottleneck. The problem lay in the application's event handling, which was creating a new thread for each incoming request. These threads were hogging resources, causing memory leaks, and ultimately leading to the dreaded "OutOfMemoryError". Our initial fix had inadvertently masked the underlying issue, which made it much harder to diagnose.
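The thread-per-request pattern described above can be sketched roughly as follows. This is an illustration rather than our actual handler: the class and method names (`ThreadPerRequestDemo`, `handle`, `simulate`) are hypothetical. It shows that the number of threads created grows one-for-one with the number of requests, and since every platform thread reserves its own stack, nothing caps memory use under a spike.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of the anti-pattern: one new thread per request.
final class ThreadPerRequestDemo {
    // Counts every thread this class has ever created.
    static final AtomicInteger threadsCreated = new AtomicInteger();

    // Handles a request by spawning a dedicated thread (no upper bound).
    static Thread handle(Runnable request) {
        Thread worker = new Thread(request);
        threadsCreated.incrementAndGet();
        worker.start();
        return worker;
    }

    // Fires the given number of no-op requests and returns how many
    // threads were created to serve them.
    static int simulate(int requests) throws InterruptedException {
        int before = threadsCreated.get();
        List<Thread> started = new ArrayList<>();
        for (int i = 0; i < requests; i++) {
            started.add(handle(() -> { }));
        }
        for (Thread worker : started) {
            worker.join(); // wait so the demo exits cleanly
        }
        return threadsCreated.get() - before;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("threads created: " + simulate(100));
    }
}
```

With trivial requests the threads exit quickly. With slow requests, such as ones waiting on a database, they pile up instead.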

The Architecture Decision

After some investigation, we decided to implement a more robust event handling mechanism. Instead of creating new threads for each request, we opted for a thread pool-based approach with a maximum of 500 worker threads. We also introduced a request queuing mechanism to prevent overloading the system. This would ensure that even if the system reached its capacity, new requests would be queued and processed once resources became available. Our application was now able to handle a much larger number of concurrent requests without experiencing memory issues.
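A minimal sketch of this kind of setup, using the standard `java.util.concurrent.ThreadPoolExecutor`, might look like the following. The 500-thread cap is the figure mentioned above. Everything else is an assumption made for illustration: the class name `BoundedRequestExecutor`, the queue capacity of 1,000, the 60-second idle timeout, and the choice to reject requests once the queue is full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a capped worker pool fed by a bounded request queue.
final class BoundedRequestExecutor {
    static final int MAX_WORKERS = 500;      // figure quoted in the article
    static final int QUEUE_CAPACITY = 1_000; // assumed value, not from the article

    // Builds a pool that never exceeds maxWorkers threads. Once all workers
    // are busy, requests wait in the queue; once the queue is also full,
    // execute() throws RejectedExecutionException (AbortPolicy).
    static ThreadPoolExecutor newExecutor(int maxWorkers, int queueCapacity) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                maxWorkers, maxWorkers, 60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
        // Let idle workers exit after the timeout so a quiet system shrinks.
        executor.allowCoreThreadTimeOut(true);
        return executor;
    }

    // Returns true if the request was accepted (running or queued),
    // false if the pool and the queue were both full.
    static boolean trySubmit(ThreadPoolExecutor executor, Runnable request) {
        try {
            executor.execute(request);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor executor = newExecutor(MAX_WORKERS, QUEUE_CAPACITY);
        for (int i = 0; i < 10; i++) {
            int id = i;
            trySubmit(executor, () -> System.out.println("handled request " + id));
        }
        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Setting the core and maximum pool sizes to the same value matters here. With a bounded queue, `ThreadPoolExecutor` only grows beyond its core size after the queue fills, so a small core size would leave most of the allowed workers unused. A caller that gets `false` from `trySubmit` could return a "try again later" response instead of letting the backlog grow without limit.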

What The Numbers Said After

The impact of our changes was nothing short of spectacular. We saw a 30% reduction in request processing time, a 25% decrease in memory usage, and a significant decrease in the number of "database transaction timeout" errors. We went from an average response time of 3.5 seconds to under 2 seconds, which was a major win for our users. Our system was now able to handle the increased load with ease, and the treasure hunt engine was finally able to thrive.

What I Would Do Differently

Looking back, I would have approached the problem with a more incremental mindset. We tried to tackle the issue with a big-bang approach, which ultimately led to more complications than necessary. If I were to do it again, I would start with smaller, more targeted improvements. For instance, I would introduce the thread-pool-based event handling mechanism first, without making aggressive database upgrades. This would have allowed us to diagnose the issue more accurately and make more informed decisions about the optimal solution. In the end, it's a hard-earned lesson in the importance of incremental progress and a caution against premature optimisation.