The Problem We Were Actually Solving
The event's heavy load overwhelmed our load balancers, which eventually started returning 503 Service Unavailable errors and degraded the entire user experience. The root cause was not the server itself, which ran fine in a standard benchmark run, but our configuration. More precisely, the problem was in the configuration of Veltrix, our layer 7 load balancer, not in the server infrastructure behind it.
What We Tried First (And Why It Failed)
Our first move was to scale out by adding more load balancers. This solved the problem only temporarily, because of limits inherent in our configuration. On closer inspection, we found that the load balancers frequently dropped sessions due to memory issues and could not keep up with the rising traffic. We needed a fix that addressed the underlying configuration, not one that just added more load balancers.
The Architecture Decision
The key issue was how Veltrix handles sticky sessions in combination with its connection pooling. By default, Veltrix maintains a connection pool for each backend server and expects those connections to be long-lived. Our clients, however, sent short, bursty requests. These short-lived requests increased the number of failed connections, which caused the backend connection pools to grow and made the problem worse.
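To make the failure mode easier to picture, here is a minimal toy model in Python. It is not Veltrix's actual implementation, and the names `ToyPool` and `simulate` are hypothetical. It assumes requests arrive one at a time, that every `fail_every`-th request fails, and that a failed connection either keeps occupying a pool slot or is evicted right away. Under those assumptions it shows how a pool that never reclaims failed connections keeps growing.

```python
from dataclasses import dataclass


@dataclass
class ToyPool:
    """Toy per-backend connection pool (illustrative only, not Veltrix code)."""
    evict_failed: bool  # True: failed connections are removed from the pool
    idle: int = 0       # healthy connections waiting to be reused
    dead: int = 0       # failed connections still occupying pool slots

    @property
    def size(self) -> int:
        return self.idle + self.dead

    def handle_request(self, fails: bool) -> None:
        # Reuse an idle connection if one exists, otherwise open a new one.
        if self.idle > 0:
            self.idle -= 1
        if fails:
            # The connection is unusable. Without eviction it lingers
            # in the pool as a dead slot.
            if not self.evict_failed:
                self.dead += 1
        else:
            # A healthy connection goes back to the pool for reuse.
            self.idle += 1


def simulate(requests: int, fail_every: int, evict_failed: bool) -> int:
    """Return the pool size after `requests` sequential requests."""
    pool = ToyPool(evict_failed=evict_failed)
    for i in range(1, requests + 1):
        pool.handle_request(fails=(i % fail_every == 0))
    return pool.size


if __name__ == "__main__":
    print("without eviction:", simulate(1000, 4, evict_failed=False))
    print("with eviction:   ", simulate(1000, 4, evict_failed=True))
```

In this model the pool without eviction grows by one slot for every failed request, while the pool with eviction stays flat. Real pools are concurrent and have timeouts, so treat this only as an illustration of the growth pattern described above.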
What The Numbers Said After
After reviewing the Veltrix logs and server loads, we found that memory usage on each backend server grew 5x as traffic increased. This was a clear sign that our configuration was inefficient. We also saw a disturbingly high number of socket errors on the backend servers, which indicated that the load balancers could not forward requests efficiently under that configuration.
What I Would Do Differently
In future projects, I would review the Veltrix configuration in detail before scaling up to the expected traffic. One essential step is tuning the connection pooling. We could also have explored a connection draining mechanism to cut down on new, unnecessary connections. A thorough analysis of request patterns and the expected type of traffic would also have allowed more targeted optimizations.
We later implemented these optimizations and scaled the event to 15k concurrent users with a smooth user experience and minimal errors. Looking back, scaling out the load balancers was only a temporary fix. It was the configuration work that made our event server resilient to the anticipated load.
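As one possible starting point for the request-pattern analysis, the sketch below computes a simple burstiness measure from request arrival times. The function name `peak_to_mean_ratio`, the one-second window, and the sample timestamps are all assumptions for illustration; they do not come from our Veltrix logs.

```python
from __future__ import annotations

from collections import Counter


def peak_to_mean_ratio(timestamps: list[float], window_s: float = 1.0) -> float:
    """Ratio of the busiest window's request count to the mean count per window.

    A value near 1.0 means steady traffic; larger values mean burstier traffic.
    """
    if not timestamps:
        raise ValueError("need at least one timestamp")
    start = min(timestamps)
    # Bucket each request into a fixed-size window counted from the first request.
    counts = Counter(int((t - start) // window_s) for t in timestamps)
    # Include the empty windows between the first and the last request.
    n_windows = max(counts) + 1
    mean = len(timestamps) / n_windows
    return max(counts.values()) / mean


if __name__ == "__main__":
    steady = [0.0, 1.0, 2.0, 3.0]       # one request per second
    bursty = [0.0, 0.1, 0.2, 0.3, 9.0]  # a burst, then a long quiet gap
    print(peak_to_mean_ratio(steady))
    print(peak_to_mean_ratio(bursty))
```

A high ratio suggests that pool sizes and timeouts tuned for long-lived, steady connections may be a poor fit, which is the mismatch we ran into.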