The Poison of Premature Optimisation: How Veltrix Almost Killed Our Treasure Hunt Engine

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We were tasked with building a high-performance treasure hunt engine that could handle thousands of concurrent requests. The initial requirement was to support at least 10,000 concurrent users, though our scaling team was convinced that the right configuration could take us an order of magnitude beyond that. The goal was a system that could absorb the sudden influx of users during major holidays while keeping response times under 50ms.

What We Tried First (And Why It Failed)

Our first attempt at optimisation involved ramping up the worker threads in our Node.js application to an absurd 1,000. We figured that with enough threads, our server would be able to handle the sheer volume of requests without batting an eyelid. What we didn't account for was the crippling effect on memory usage: every Node.js worker thread runs its own V8 instance and event loop, so each one carries its own memory overhead. Our application's memory footprint shot through the roof, the server became increasingly unresponsive, and requests eventually started failing with 504 (Gateway Timeout) errors. The underlying error message was all too familiar: "Cannot allocate memory". We tried to mitigate this by upgrading our server's RAM to 64GB, but it only made matters worse.
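To make the memory problem concrete, here is a minimal sketch of the sizing arithmetic, not the code we ran. `cappedPoolSize` and `estimatedWorkerMemoryMb` are hypothetical helpers, and the 30 MB per-worker figure is an assumption chosen purely for illustration; the real overhead depends on what each worker loads and should be measured.

```javascript
const os = require("node:os");

// Hypothetical helper: cap a requested worker count at the number of CPU cores.
// For CPU-bound work, threads beyond the core count mostly add memory overhead.
function cappedPoolSize(requested, cores = os.cpus().length) {
  return Math.max(1, Math.min(requested, cores));
}

// Rough estimate only: perWorkerMb is an assumed figure, not a measured one.
function estimatedWorkerMemoryMb(workers, perWorkerMb) {
  return workers * perWorkerMb;
}

console.log(cappedPoolSize(1000, 8)); // 8
console.log(estimatedWorkerMemoryMb(1000, 30)); // 30000
console.log(estimatedWorkerMemoryMb(cappedPoolSize(1000, 8), 30)); // 240
```

Under those assumed numbers, 1,000 workers would need roughly 30 GB before serving a single request, while a pool capped at 8 cores would need about 240 MB.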

The Architecture Decision

After weeks of tinkering, we finally discovered the root of the problem. A resource-intensive database query was the real bottleneck: our worker threads spent most of their time idle, waiting on the database, so adding more of them only wasted resources. The solution was to introduce a caching layer using Redis, which stores frequently accessed data in memory. This reduced the load on our database and allowed our application to scale more efficiently. We also put a load balancer in front of multiple instances of our application, so that no single instance became overwhelmed. The result was a system that could handle 50,000 concurrent users without breaking a sweat.
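The caching idea can be sketched as a cache-aside read. This is a simplified illustration rather than our production code: `InMemoryCache` is a stand-in so the example runs without a Redis server, and `getWithCache`, `slowQuery`, the `hunt:42` key, and the 60-second TTL are all hypothetical. In a real deployment a Redis client would take the place of the stand-in.

```javascript
// Stand-in for a Redis client (assumption): an in-memory store with per-key expiry.
class InMemoryCache {
  constructor() {
    this.entries = new Map();
  }

  async get(key) {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (entry.expiresAt <= Date.now()) {
      this.entries.delete(key); // expired, treat as a miss
      return null;
    }
    return entry.value;
  }

  async set(key, value, ttlMs) {
    this.entries.set(key, { value, expiresAt: Date.now() + ttlMs });
  }
}

// Cache-aside read: try the cache first, fall back to the database on a miss,
// then store the result so the next read skips the database.
async function getWithCache(cache, key, loadFromDb, ttlMs) {
  const cached = await cache.get(key);
  if (cached !== null) return JSON.parse(cached);
  const fresh = await loadFromDb(key);
  await cache.set(key, JSON.stringify(fresh), ttlMs);
  return fresh;
}

// Usage with a hypothetical expensive query that counts how often it runs.
async function demo() {
  let dbCalls = 0;
  const slowQuery = async (key) => {
    dbCalls += 1;
    return { id: key, clue: "look under the bridge" };
  };
  const cache = new InMemoryCache();
  await getWithCache(cache, "hunt:42", slowQuery, 60000);
  await getWithCache(cache, "hunt:42", slowQuery, 60000);
  return dbCalls;
}

demo().then((calls) => console.log(calls)); // 1
```

The second read never reaches the database. Note that this sketch does nothing to stop several simultaneous misses for the same key from all hitting the database at once, which is worth handling separately under heavy load.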

What The Numbers Said After

The stats were telling. With our new caching layer and load balancer in place, our average response time dropped from 200ms to a mere 30ms. Our memory usage was also significantly reduced, allowing us to scale our application to handle the increased traffic without sacrificing performance. We also saw a 30% reduction in errors, including a significant decrease in the 504 errors that had accompanied our memory allocation failures. Our users were happy, and our server was breathing a sigh of relief.

What I Would Do Differently

In hindsight, we should have taken a more measured approach. We would have made changes in smaller increments, monitoring the system's performance closely and gathering data on the effect of each change before making the next. We would also have analysed our database query thoroughly, identifying its bottlenecks and optimising them before introducing a caching layer. By taking a more gradual, data-driven approach, we would have avoided the pitfalls of premature optimisation and ended up with a system that was more robust and scalable from the start.
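As a rough idea of what gathering data on each change could look like, here is a small sketch that times an operation repeatedly and summarises the latencies. `summarise` and `measure` are hypothetical helpers, not something from our codebase, and the nearest-rank percentile is just one common convention. The sketch assumes at least one sample.

```javascript
const { performance } = require("node:perf_hooks");

// Summarise latency samples (in ms): mean plus nearest-rank percentiles.
function summarise(samplesMs) {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  const percentile = (p) =>
    sorted[Math.max(0, Math.ceil((p / 100) * sorted.length) - 1)];
  return { mean, p50: percentile(50), p95: percentile(95) };
}

// Run an async operation `runs` times, one after another, and summarise the timings.
async function measure(operation, runs) {
  const samples = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await operation();
    samples.push(performance.now() - start);
  }
  return summarise(samples);
}

console.log(summarise([10, 20, 30, 40, 200]));
// { mean: 60, p50: 30, p95: 200 }
```

Running `measure` before and after a single change, against the same operation, gives a like-for-like comparison. The example output also shows why the mean alone can mislead: one slow request pulls it to 60ms even though the median is 30ms.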