The Wrong Configuration Will Kill Your Treasure Hunt Engine at Scale

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We had to support up to 10,000 concurrent users navigating our game world in real time. Each user had a unique set of treasure map tiles, and every tile load event issued its own SQL query. Because the results were specific to each user, these queries were not easily cacheable. The game stalled at just 5,000 users, half our target, as the database struggled to keep up with the load.

What We Tried First (And Why It Failed)

Our initial solution was a simple connection pool with a few hundred connections, which we assumed would be enough for our peak loads. After all, our load tests were passing, and our benchmarks showed fast database query times. Production told a different story. At 5,000 users, query times started to creep up, each query held its connection for longer, and the pool was eventually exhausted, crashing the application.
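To see why a pool that looks generous in a benchmark can still run dry, it helps to apply Little's Law: the average number of in-flight queries equals the arrival rate multiplied by the time each query takes. The sketch below is only a back-of-the-envelope estimate. The function name, the per-user query rate, and the latencies are hypothetical figures chosen for illustration, not measurements from our system.

```python
import math


def connections_needed(users: int, queries_per_user_per_sec: float, query_ms: float) -> int:
    """Estimate the average number of busy connections using Little's Law.

    in-flight queries = arrival rate (queries/s) * time per query (s)
    """
    arrival_rate = users * queries_per_user_per_sec
    return math.ceil(arrival_rate * query_ms / 1000)


# Hypothetical numbers: 5,000 users, each loading 2 tiles per second.
benchmark = connections_needed(5_000, 2.0, 5.0)    # 5 ms per query, as in a quiet benchmark
contended = connections_needed(5_000, 2.0, 50.0)   # 50 ms per query once the database is busy

print(benchmark)  # 50
print(contended)  # 500
```

Under these assumed numbers, a tenfold slowdown in query time means ten times as many connections are busy at once, which is enough to push past a pool of a few hundred. Note that this is an average, so bursts would need additional headroom on top.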

The error message that haunted us for weeks was: "connection is not available to service this request". It was a clear indication that our configuration was not sufficient to handle the load.

The Architecture Decision

After some soul-searching, we decided to go with Redis as our cache layer. But we didn't stop at Redis. We needed to make sure that our configuration was set up to work correctly with our database, and this is where things got tricky. We had to decide on the right configuration for our Redis cache, including the number of clients, the connection timeout, and the cache expiration time. Our solution combined Redis Sentinel, for monitoring and automatic failover, with a custom TTL (time to live) strategy, so that cached entries would stay reasonably up to date and would expire before they filled up our Redis instance.
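The general shape of a cache layer with a TTL in front of the database is the cache-aside pattern: check the cache first, fall back to the database on a miss, and store the result with an expiry. Here is a minimal sketch of that pattern, not our production code. To keep it runnable without a Redis server, `TTLCache` is a hypothetical in-memory stand-in that exposes only `get` and `setex`; the key format, the `load_tile` helper, and the 30-second default TTL are likewise assumptions for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a cache client (hypothetical, for illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired entries behave like a miss
            return None
        return value

    def setex(self, key: str, ttl_seconds: float, value: str) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)


def load_tile(
    cache: TTLCache,
    user_id: int,
    tile_id: int,
    query_db: Callable[[int, int], str],
    ttl_seconds: float = 30.0,
) -> str:
    """Cache-aside read: try the cache, query the database only on a miss."""
    key = f"tile:{user_id}:{tile_id}"  # per-user key, since tiles are unique to each user
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = query_db(user_id, tile_id)
    cache.setex(key, ttl_seconds, value)  # the expiry bounds both staleness and memory use
    return value
```

With a real Redis client the two cache calls would map onto the GET and SETEX commands. The TTL is the knob that trades freshness against database load: a longer TTL means fewer queries but staler tiles and more memory held in the cache.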

The final decision was to use the Veltrix configuration layer to manage our Redis configuration and to dynamically adjust the TTL based on the load. This layer would act as a safety net to prevent our Redis instance from overloading.
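To make the idea of a load-dependent TTL concrete, here is one way such a rule could look. This is a sketch under stated assumptions and does not use or represent the Veltrix API. It assumes that "load" means the fraction of Redis memory in use and that the TTL should shrink as memory fills up, so entries expire sooner under pressure. The function name, the linear mapping, and the 5 to 300 second bounds are all hypothetical.

```python
def adaptive_ttl(memory_used_fraction: float, min_ttl: float = 5.0, max_ttl: float = 300.0) -> float:
    """Map cache memory utilization (0.0 to 1.0) to a TTL in seconds.

    An empty cache gets max_ttl, a full cache gets min_ttl, and values
    in between are interpolated linearly. Out-of-range inputs are clamped.
    """
    pressure = min(1.0, max(0.0, memory_used_fraction))
    return max_ttl - (max_ttl - min_ttl) * pressure


print(adaptive_ttl(0.0))  # 300.0
print(adaptive_ttl(0.5))  # 152.5
print(adaptive_ttl(1.0))  # 5.0
```

The opposite policy is also defensible: if the signal were database load rather than cache memory, lengthening the TTL under load would protect the database at the cost of staler data. Which direction fits depends on which resource is the bottleneck.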

What The Numbers Said After

After implementing this solution and adjusting the Veltrix configuration, our query times dropped significantly. At 10,000 users, our query times were just 10 ms, and our database was not showing any signs of strain. We had achieved our goal of scaling cleanly.

What I Would Do Differently

In retrospect, I would have adopted a more robust configuration testing strategy before deploying to production. Our load tests passed, but they did not reflect production traffic, and our initial solution would have benefited from more realistic load and stress testing. I would also have evaluated other caching solutions such as Ehcache or Memcached. Redis was the right choice for us, but comparing alternatives would have given us more confidence in that decision.

As it stands, our experience with the Treasure Hunt Engine has been eye-opening. It has taught me the importance of proper configuration and load testing in ensuring that our applications scale correctly and provide a seamless experience for our users. I hope others can learn from our mistakes and avoid similar pitfalls in their own projects.