The One Configuration Decision That Turned Our Server into a Slowing Down Metaphor

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We knew that our game would see a massive influx of users the moment it went live. The initial user base prediction was 100k concurrent users, with an expected growth rate of 20% per quarter. Any configuration mistakes at the server layer would be immediately apparent, with potential users waiting in queues for minutes, then hours, and eventually giving up.

Our team was responsible for scaling the server layer, which was built as a set of microservices on Kubernetes. Our main goal was to make sure the server layer scaled cleanly as the user base grew.

What We Tried First (And Why It Failed)

When we first started, we fell into the trap of optimizing for throughput. We followed popular advice and set up a load balancer with a least-connections strategy, which sends each new connection to the instance that currently has the fewest open ones. This seemed like a good idea, but we had not considered how it would behave under a real burst. As the first wave of users came in, CPU usage on our application server shot up to 100% within 2 minutes. The server was designed to handle 5k concurrent users, but under the least-connections strategy we hit 20k within that same time frame.
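To make the selection rule concrete, here is a minimal sketch of least-connections balancing. It is an illustration only, not our production load balancer; `Backend` and `LeastConnectionsBalancer` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    active: int = 0  # connections currently open to this backend


class LeastConnectionsBalancer:
    """Hypothetical sketch: route each new connection to the least-busy backend."""

    def __init__(self, backends):
        self.backends = list(backends)

    def acquire(self):
        # min() returns the first backend with the lowest count, so ties
        # are broken by list order.
        backend = min(self.backends, key=lambda b: b.active)
        backend.active += 1
        return backend

    def release(self, backend):
        # Called when a connection closes.
        backend.active -= 1
```

Note that nothing in this rule caps how many connections an instance receives. The balancer only compares counts, so it keeps handing out connections even when every instance is already past what its thread pool can serve.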

Unfortunately, our application server did not have enough threads for that many connections and quickly exhausted its thread pool. We were getting connection timeouts left and right, which the load balancer interpreted as the application responding slowly, so it did not increase its own connection pool. This was a classic case of the "hot potato" problem: the load balancer was essentially passing the problem on to the application server.

The Architecture Decision

We had to rethink our strategy and make a configuration change that would let the server layer scale more gracefully. We switched to round-robin load balancing, though not for the most obvious reason, and paired it with a dynamically sized thread pool in the application server. The load balancer now distributed requests evenly across all application instances. As incoming traffic grew, each application server throttled the number of connections it accepted, which kept it from running out of threads and so prevented the connection timeouts. To do this, the application server monitored its own CPU usage and created or destroyed threads as needed.
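The pairing described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code we ran: the class names, the CPU thresholds (0.85 and 0.60), the step size, and the limits are all hypothetical, and the CPU reading is passed in by the caller rather than measured here.

```python
import itertools
import threading


class RoundRobinBalancer:
    """Hypothetical sketch: hand out backends in a fixed rotation."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))
        self._lock = threading.Lock()

    def next_backend(self):
        with self._lock:
            return next(self._cycle)


class AdaptiveLimiter:
    """Hypothetical sketch: cap in-flight requests, resizing the cap from CPU usage."""

    def __init__(self, min_limit=8, max_limit=256, start=64,
                 high_cpu=0.85, low_cpu=0.60, step=8):
        self.min_limit, self.max_limit = min_limit, max_limit
        self.high_cpu, self.low_cpu, self.step = high_cpu, low_cpu, step
        self.limit = start
        self.in_flight = 0
        self._lock = threading.Lock()

    def adjust(self, cpu_fraction):
        # Shrink the cap when CPU is hot, grow it when there is headroom.
        with self._lock:
            if cpu_fraction > self.high_cpu:
                self.limit = max(self.min_limit, self.limit - self.step)
            elif cpu_fraction < self.low_cpu:
                self.limit = min(self.max_limit, self.limit + self.step)
            return self.limit

    def try_acquire(self):
        # Reject immediately instead of queueing until a timeout fires.
        with self._lock:
            if self.in_flight >= self.limit:
                return False
            self.in_flight += 1
            return True

    def release(self):
        with self._lock:
            self.in_flight -= 1
```

The design choice worth noting is that the instance, not the load balancer, decides how much work it can take. Round robin keeps the distribution predictable, and a rejected `try_acquire` gives the caller a fast, explicit signal rather than a slow timeout.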

The switch to round robin was not just about distributing load evenly. It also ensured that no individual application server was starved of resources. We still had to monitor our servers closely to make sure they were not running out of threads and leaving the application unresponsive.
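One way to watch for thread exhaustion is to track how much work is outstanding relative to the pool size. The sketch below assumes a Python service built on `concurrent.futures.ThreadPoolExecutor`; `MonitoredPool` and `saturation` are hypothetical names for illustration and do not describe our actual monitoring stack.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class MonitoredPool:
    """Hypothetical sketch: a thread pool that reports how saturated it is."""

    def __init__(self, max_workers):
        self.max_workers = max_workers
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._lock = threading.Lock()
        self._pending = 0  # tasks submitted but not yet finished (queued + running)

    def submit(self, fn, *args, **kwargs):
        with self._lock:
            self._pending += 1
        future = self._pool.submit(fn, *args, **kwargs)
        future.add_done_callback(self._on_done)
        return future

    def _on_done(self, _future):
        with self._lock:
            self._pending -= 1

    def saturation(self):
        # Values above 1.0 mean tasks are waiting in the queue for a free thread.
        with self._lock:
            return self._pending / self.max_workers

    def shutdown(self):
        # Blocks until all submitted tasks have finished.
        self._pool.shutdown(wait=True)
```

Exporting `saturation()` as a metric and alerting when it stays above 1.0 would flag an instance that is queueing work faster than its threads can drain it.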

What The Numbers Said After

After this configuration change, we were able to handle 30k concurrent users on our initial server size, and the server layer scaled cleanly as our user base grew. Connection timeouts dropped from 1,000+ per minute to 10+ per minute, which was a significant improvement. Response time fell by 40% and latency by 30%.

What I Would Do Differently

If I were to do it over again, I would take a more nuanced approach. I would monitor the application server more closely and tune the thread pool size against the CPU usage we actually observed. I would also use an existing scheduling framework to manage the dynamic thread pool rather than rolling my own solution. Finally, I would evaluate a more advanced load balancing strategy, such as IP hash.
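For reference, IP hash balancing can be sketched in a few lines. This is a generic illustration, not something we deployed; `pick_backend` is a hypothetical helper, and the choice of SHA-256 is an assumption made only to get a hash that is stable across processes.

```python
import hashlib
from typing import Sequence


def pick_backend(client_ip: str, backends: Sequence[str]) -> str:
    """Hypothetical sketch: map a client IP to a backend deterministically."""
    # Python's built-in hash() is salted per process for strings, so use a
    # stable digest instead.
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]
```

The same client always lands on the same instance as long as the backend list is unchanged, which helps when per-client state lives on the server. The trade-offs are that many users behind a single NAT address all hit one instance, and that adding or removing a backend remaps most clients with this simple modulo scheme.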