Treasure Hunt Engine: When We Traded 1% Latency for $10k Revenue in a Single Night

#webdev #programming #devops #kubernetes

The Problem We Were Actually Solving

Our task was to scale the treasure hunt engine to support an expected 10 million users in a single night. The engine itself was simple: a series of geospatial queries that found the closest treasure to each user's location. Easy enough, right? The catch was that it sat behind a load balancer that had only ever been tuned for demo days, not production-scale traffic. We knew this, and we knew that if we didn't fix it fast, we were going to get crushed.
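For a sense of what a "closest treasure" query involves, here is a minimal sketch using the haversine great-circle distance and a linear scan. It is not the engine's actual code: the function names, the tuple layout, and the sample coordinates in `TREASURES` are all made up for illustration, and a real deployment would more likely use a spatial index than a scan.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_treasure(user_lat, user_lon, treasures):
    """Return the (name, lat, lon) tuple nearest to the user, or None if empty."""
    if not treasures:
        return None
    return min(
        treasures,
        key=lambda t: haversine_km(user_lat, user_lon, t[1], t[2]),
    )


# Hypothetical sample data: (name, latitude, longitude)
TREASURES = [
    ("harbor", 40.7003, -74.0122),
    ("park", 40.7829, -73.9654),
    ("bridge", 40.7061, -73.9969),
]

nearest = closest_treasure(40.78, -73.97, TREASURES)
```

A linear scan costs one distance computation per treasure per request, which is part of why keeping unnecessary requests away from the engine matters at this scale.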

What We Tried First (And Why It Failed)

Our first attempt at scaling was to add more nodes to the load balancer. We threw a bunch of cheap cloud instances at the problem and, to our surprise, it worked for a while. It held until we reached about 500,000 users, at which point the load balancer started throwing "ConnectionRefused" errors left and right. It turned out that the load balancer was so poorly tuned that it was slowing the engine down rather than helping it. This was not what we expected. We had assumed that adding more nodes would magically fix the problem, but it only masked it.

The Architecture Decision

We realized that we needed to rethink the whole load balancer strategy. We decided to put an edge proxy in front of the load balancer to pre-filter incoming traffic and forward only relevant requests to the engine. This turned out to be a game-changer: the proxy cut the number of requests reaching the engine by 75%, and the engine was now able to handle the increased traffic without breaking a sweat.
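To make the pre-filtering idea concrete, here is a minimal sketch of the kind of decision an edge proxy could make before anything reaches the engine. The post does not say which rules the real proxy applied, so both rules below (rejecting impossible coordinates, and forwarding at most one request per user per interval) are assumptions, as are the `EdgePreFilter` name and its parameters.

```python
import time


class EdgePreFilter:
    """Decides whether a request should be forwarded to the engine.

    Hypothetical rules, not the actual proxy's configuration.
    """

    def __init__(self, min_interval_s=5.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self.clock = clock  # injectable so the filter can be tested
        self._last_forwarded = {}  # user_id -> time of last forwarded request

    def should_forward(self, user_id, lat, lon):
        # Assumed rule 1: drop requests with impossible coordinates.
        if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
            return False

        # Assumed rule 2: forward at most one request per user per interval.
        now = self.clock()
        last = self._last_forwarded.get(user_id)
        if last is not None and now - last < self.min_interval_s:
            return False

        self._last_forwarded[user_id] = now
        return True


prefilter = EdgePreFilter(min_interval_s=5.0)
first = prefilter.should_forward("user-1", 40.78, -73.97)   # forwarded
repeat = prefilter.should_forward("user-1", 40.78, -73.97)  # filtered out
```

Note that the per-user dictionary in this sketch grows without bound; with millions of users it would need expiry or a shared cache, which is left out here for brevity.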

What The Numbers Said After

The numbers were staggering. Before the proxy, the engine was handling an average of 50 requests per second. With the proxy in place, it handled an average of 3,000 requests per second with no errors. We made it through the night with no issues, and the TV show even got a decent ratings bump thanks to the timely treasure finds.

What I Would Do Differently

If I had to do it again, I'd invest in a load balancer that's actually tuned for production-scale traffic from day one. We wasted a lot of time and money trying to fix the wrong problem (even so, we came out over $10,000 ahead in revenue that night alone). I'd also invest in better monitoring, not just for the load balancer but for the entire proxy-engine chain. We were flying blind for most of the night and only got lucky because the TV show producers didn't care about the exact timeline, just that the treasure finds happened on time.
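As a rough idea of what monitoring the whole chain could look like, here is a minimal sketch that tracks a sliding-window error rate per stage and reports which stages exceed a threshold. The stage names, window size, and 1% threshold are illustrative assumptions, and a production setup would normally rely on a dedicated metrics system rather than in-process counters.

```python
from collections import deque


class StageMonitor:
    """Tracks the error rate over the most recent `window` requests."""

    def __init__(self, window=1000):
        # True = success, False = error; old entries fall off automatically.
        self.outcomes = deque(maxlen=window)

    def record(self, ok):
        self.outcomes.append(bool(ok))

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)


def unhealthy_stages(monitors, threshold=0.01):
    """Return the names of stages whose error rate exceeds the threshold."""
    return [name for name, m in monitors.items() if m.error_rate() > threshold]


# Hypothetical stage names for the proxy-engine chain.
monitors = {
    "edge_proxy": StageMonitor(window=4),
    "load_balancer": StageMonitor(window=4),
    "engine": StageMonitor(window=4),
}

for ok in (True, True, True, True):
    monitors["edge_proxy"].record(ok)
for ok in (True, False, True, True):
    monitors["load_balancer"].record(ok)

alerts = unhealthy_stages(monitors)
```

Tracking each stage separately is the point: an aggregate error rate would not have shown whether the proxy, the load balancer, or the engine was the one failing.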