Veltrix Documents Don't Tell You How to Handle 10,000 Concurrent Users

#devops #kubernetes #webdev #programming

The Problem We Were Actually Solving

Our load tests were routinely exceeding the server capacity we had provisioned. We were not worried about whether the system would return the expected results; we were worried about getting operators online to handle the errors they were going to encounter. Users were getting frustrated, but more importantly, the ops team did not know how to intervene meaningfully when something failed.

What We Tried First (And Why It Failed)

We started by modifying the Veltrix configuration files to increase the server's capacity. We added more RAM and CPU cores and upgraded the storage so that no single resource would become a bottleneck. Unfortunately, every time we increased capacity in one place, another component began to fail. We were stuck in a cycle of firefighting instead of making meaningful improvements to the system.

The Architecture Decision

One of my senior colleagues, who had previously worked on a similar system, suggested we refactor our database access to use connection pooling. Instead of opening a new connection for every request, the application reuses a fixed set of open connections, which keeps the database from becoming the bottleneck. We implemented this using the dbpools library alongside our existing database driver. This change alone reduced the load on the database by 30% and cut query latency by 20%.
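To make the pooling idea concrete, here is a minimal sketch using only the Python standard library. It is not the dbpools API and not our production code: the `ConnectionPool` class, its `connection()` method, and the use of `sqlite3` as the driver are all assumptions chosen so the example runs anywhere.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Hypothetical fixed-size pool that hands out already-open connections."""

    def __init__(self, database, size=5):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # check_same_thread=False lets whichever thread borrows the
            # connection use it; the queue ensures one borrower at a time.
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self, timeout=2.0):
        # Blocks until a connection is free; raises queue.Empty on timeout,
        # which gives callers back-pressure instead of unbounded connections.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always return the connection, even if the query raised.
            self._idle.put(conn)

    def close(self):
        while not self._idle.empty():
            self._idle.get_nowait().close()


pool = ConnectionPool(":memory:", size=2)
with pool.connection() as conn:
    row = conn.execute("SELECT 1 + 1").fetchone()
print(row)
```

The key design choice is the bounded queue: when all connections are in use, new requests wait (or time out) rather than opening more connections and pushing the overload down to the database.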

What The Numbers Said After

With the database load under control, we could tackle the next layer of the problem: the server's memory usage. We set up Prometheus with a custom memory usage metric that alerted us when usage approached a set threshold. We then configured the Veltrix autoscaling tool to add servers automatically whenever memory usage exceeded that threshold. This kept a single overloaded server from cascading into an outage. The system could now handle 10,000 concurrent users without crashing.
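The threshold rule described above can be sketched as a small decision function. This is not how the Veltrix autoscaler is implemented, which we never inspected; the function name `desired_servers`, the 0.75 target, and the cap of 20 servers are hypothetical. The proportional formula is the same shape as the one documented for the Kubernetes Horizontal Pod Autoscaler.

```python
import math


def desired_servers(current, memory_used_ratios, target=0.75, max_servers=20):
    """Return how many servers to run, given per-server memory usage (0.0-1.0)."""
    if not memory_used_ratios:
        return current  # no data: make no change rather than guess
    average = sum(memory_used_ratios) / len(memory_used_ratios)
    if average <= target:
        return current  # at or under the threshold: this sketch only scales out
    # Proportional rule: grow the fleet so average usage falls back to target.
    wanted = math.ceil(current * average / target)
    return min(wanted, max_servers)


# Four servers averaging 90% memory against a 75% target.
print(desired_servers(4, [0.92, 0.88, 0.90, 0.90]))  # prints 5
```

Capping the result with `max_servers` matters in practice: without it, a memory leak would look like load and the fleet would keep growing.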

What I Would Do Differently

If I were to do this project again, I would set up a more rigorous testing framework from the start. Specifically, I would use the Locust load testing tool to simulate a large number of concurrent user queries before deploying any code. That would have saved us a lot of time and frustration by exposing the problem much earlier in the development cycle. I would also set up automated monitoring from day one to catch problems before they became critical. With issues identified early, we could have made targeted improvements to the system rather than scrambling to patch together a solution at the last minute.
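For a sense of what an early load test measures, here is a rough stand-in written with only the standard library. It is not Locust and does not use Locust's API; `run_load` and `fake_query` are hypothetical names, and `fake_query` only sleeps where a real test would issue an HTTP request or database query.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(request_fn, users=50, requests_per_user=10):
    """Call request_fn from `users` threads; report failures and p95 latency."""

    def one_user(_):
        latencies, failures = [], 0
        for _ in range(requests_per_user):
            start = time.perf_counter()
            try:
                request_fn()
            except Exception:
                failures += 1  # count the failure, keep the user going
            else:
                latencies.append(time.perf_counter() - start)
        return latencies, failures

    # One thread per simulated user, all running at the same time.
    with ThreadPoolExecutor(max_workers=users) as executor:
        results = list(executor.map(one_user, range(users)))

    latencies = sorted(t for lats, _ in results for t in lats)
    failures = sum(f for _, f in results)
    # Nearest-rank 95th percentile; None if every request failed.
    p95 = latencies[max(0, math.ceil(0.95 * len(latencies)) - 1)] if latencies else None
    return {
        "requests": users * requests_per_user,
        "failures": failures,
        "p95_seconds": p95,
    }


def fake_query():
    time.sleep(0.001)  # hypothetical stand-in for a real request


report = run_load(fake_query, users=20, requests_per_user=5)
print(report)
```

Threads top out long before 10,000 users, which is one reason to reach for a dedicated tool such as Locust for the real test. The value of even a small script like this is having failure counts and tail latency in hand before deployment.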