This post examines one of the less traditional use cases of a modern APM with end-to-end tracing capabilities. Recently a Plumbr customer was facing increased demand for a service whose infrastructure was not really ready for it. Their customers started to experience poor performance and occasional timeouts from the API our customer was providing.
The first response was to throw more hardware at the issue. Indeed, after the infrastructure was scaled up, the service was restored to match the SLA. Being a responsible team, they were not satisfied with just putting out the fire, but conducted a post-mortem to get rid of the root cause as well.
The challenge was to make sure such issues would not happen again. They needed the means to dynamically provision more resources (and to scale down during low-usage periods), as well as the ability to predict how many resources would be required, so as to avoid unnecessary infrastructure spend.
So a project was summoned into existence with the goal of building and verifying a model that predicts resource requirements according to the actual demand for the service. This post describes how the model was built and how it was verified, first using a simulated load test and then production traffic.
Service description
The service under scrutiny was rather simple, consisting of the following elements:
- Synchronous HTTP requests arriving to nginx, acting as a load-balancer
- Requests being routed to a cluster of JVM-based backends using a round-robin algorithm
- The JVMs running a dedicated microservice accepting the HTTP requests, parsing them and persisting the data in an Apache Kafka topic.
- After all of the above completes, an HTTP response is sent back to the client (a minimal sketch of such a handler follows below)
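To make the flow concrete, here is a minimal sketch of such a request handler: a plain JDK HTTP server that reads the payload and blocks on the Kafka send before responding. The endpoint, topic name and configuration values are assumptions for illustration, not the customer's actual code.

```java
import com.sun.net.httpserver.HttpServer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.net.InetSocketAddress;
import java.util.Properties;

public class IngestService {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/ingest", exchange -> {
            try {
                // Parse the inbound payload (here simply read as raw bytes)
                byte[] payload = exchange.getRequestBody().readAllBytes();
                // Persist to the Kafka topic and wait for the ack before answering
                producer.send(new ProducerRecord<>("ingest-topic", payload)).get();
                exchange.sendResponseHeaders(202, -1);
            } catch (Exception e) {
                exchange.sendResponseHeaders(500, -1);
            } finally {
                exchange.close();
            }
        });
        server.start();
    }
}
```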
To be able to conduct the modelling experiment, the monitoring solution had to keep an eye on both the system metrics and the key performance indicators. The system metrics (CPU, memory, network, etc.) were monitored using Prometheus. The performance indicators were captured using Plumbr APM, which traced every API call arriving at the service. From the captured traces, Plumbr extracted the throughput, error rate and latency metrics.
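The post does not go into the monitoring setup itself. As one possible (assumed) arrangement, host-level metrics would come from a Prometheus exporter such as node_exporter, while process-level JVM metrics can be exposed straight from the service with the Prometheus Java client. A minimal sketch:

```java
import io.prometheus.client.exporter.HTTPServer;
import io.prometheus.client.hotspot.DefaultExports;

public class MetricsEndpoint {
    public static void main(String[] args) throws Exception {
        // Register the stock JVM collectors (memory pools, GC, threads, ...)
        DefaultExports.initialize();
        // Expose them on /metrics for the Prometheus server to scrape;
        // the port is an arbitrary choice for this sketch
        HTTPServer metricsServer = new HTTPServer(9091);
    }
}
```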
Building the model
The model was built with the following assumptions:
- CPU usage should be proportional to the amount of inbound traffic;
- RAM usage should be around [number of in-flight requests] multiplied by [average data batch size] + constant overhead;
- inbound (WAN) and outbound (LAN) network traffic should be correlated, and disk usage should be constant.
We laid out the following steps:
- Feed the model with resource utilization (CPU, RAM, disk I/O, LAN, WAN), weighted by the throughput (# of requests per unit of time).
- Add up the weighted resource demands per request to obtain the resource requirements, and divide by the resource capacity to obtain the resource loads (see the sketch after this list).
- Use the model to predict changes in response time & error rate.
- Set up provisioning rules to (dynamically) scale out and back in based on the inbound throughput.
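In code, the first two steps boil down to two small formulas: a per-request cost for each resource derived from measured utilisation and throughput, and a predicted load obtained by scaling that cost back up to a target throughput and dividing by capacity. A minimal sketch, with illustrative names only and no claim to match the actual implementation:

```java
/** Minimal sketch of steps 1-2 of the model; names and structure are illustrative. */
public class ResourceModel {

    /** Per-request demand for one resource: measured utilisation weighted by throughput. */
    static double demandPerRequest(double measuredUtilisation, double requestsPerSecond) {
        return measuredUtilisation / requestsPerSecond;
    }

    /** Predicted load: aggregate demand at a given throughput divided by the resource capacity. */
    static double predictedLoad(double demandPerRequest, double requestsPerSecond,
                                double resourceCapacity) {
        return (demandPerRequest * requestsPerSecond) / resourceCapacity;
    }
}
```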
A naïve approach expected the system metrics to correlate clearly with the inbound trace volume, predicting monitoring results similar to the following.
If reality indeed matched this model, we would be able to claim that every additional 100 requests/second of inbound traffic would:
- require 200MB RAM;
- increase CPU utilization by 4.8%;
- use 1Mbit/s WAN & 0.8Mbit/s LAN bandwidth;
- and not have any impact on disk I/O.
Additionally, we expected the error rate to stay close to zero, with a slight increase in median and tail-end latency as the increased throughput brought the CPU closer to saturation.
Using this approach, we could claim that the resource that would turn into a bottleneck the fastest would be the CPU: at 2,000 requests/second a node would be consuming ~96% of the available CPU cycles, with plenty of I/O and memory resources remaining. If this model indeed matched reality, we could then reduce the memory requirements for the instances and provision new instances whenever the load approached the critical 2,000 requests/second limit.
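Plugging the per-100-requests/second coefficients above into that model shows why the CPU was expected to saturate first. The per-unit costs below come from the list above; the node capacities (8 GB of RAM, 1 Gbit/s links) are assumptions for illustration only.

```java
public class BottleneckCheck {
    public static void main(String[] args) {
        // Per additional 100 requests/second, as predicted by the naive model
        double cpuPct   = 4.8;   // % of node CPU
        double ramMb    = 200;   // MB of RAM
        double wanMbit  = 1.0;   // Mbit/s of WAN bandwidth
        double lanMbit  = 0.8;   // Mbit/s of LAN bandwidth

        // Assumed node capacities, for illustration only (not from the original post)
        double cpuCapacityPct  = 100;
        double ramCapacityMb   = 8 * 1024;
        double wanCapacityMbit = 1000;
        double lanCapacityMbit = 1000;

        // Throughput (requests/second) at which each resource saturates
        System.out.printf("CPU saturates at ~%.0f req/s%n", 100 * cpuCapacityPct / cpuPct);
        System.out.printf("RAM saturates at ~%.0f req/s%n", 100 * ramCapacityMb / ramMb);
        System.out.printf("WAN saturates at ~%.0f req/s%n", 100 * wanCapacityMbit / wanMbit);
        System.out.printf("LAN saturates at ~%.0f req/s%n", 100 * lanCapacityMbit / lanMbit);
        // With these numbers the CPU is the first bottleneck, at roughly 2,000 req/s
    }
}
```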
Reality check
Reality, however, proved to be somewhat different from the expectations. When the Gatling-simulated load was sent to the nodes, we saw response time latencies start to increase significantly at around 1,000 requests/second, and by 2,000 requests/second more than 10% of the requests timed out.
Using the bottlenecks exposed via Plumbr traces, it was clear to us that the root cause was garbage collection in the JVM. After optimizing the data structures for memory efficiency, the tests started to match the model, and we were able to get close to 2,500 requests/second without significant impact on latency before the CPU again started to become a bottleneck.
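The post does not detail the exact data-structure changes, so the snippet below is purely illustrative of the general technique: replacing short-lived boxed objects with primitives so that each request produces less garbage for the collector to clean up.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the actual optimisation in the project is not described here.
public class ParsingExample {

    // Before: every request allocates one boxed Integer per field plus a resizable list
    static List<Integer> parseFieldsBoxed(String payload) {
        List<Integer> fields = new ArrayList<>();
        for (String part : payload.split(",")) {
            fields.add(Integer.valueOf(part));
        }
        return fields;
    }

    // After: one primitive array per request, no boxing and no list-resizing garbage
    static int[] parseFields(String payload) {
        String[] parts = payload.split(",");
        int[] fields = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            fields[i] = Integer.parseInt(parts[i]);
        }
        return fields;
    }
}
```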
Reality check in production
It was then time to get the real verification from the production environment. After the improved version of the microservice was deployed in production, the following became evident:
- The model was matching the reality until 1,000 requests/second.
- When the load increased beyond 1,000 requests/second, the median and especially the tail-end latency started to degrade.
- At 2,000 requests/second, timeouts started to appear.
At the same time, system-level resources were not saturated: the CPU and network usage were low, memory pools had enough capacity, and even the previously guilty GC did not show any increased pausing.
So, something was still off. Again, tracing helped. Exploring the bottlenecks by their impact, another culprit was discovered, this time a lock contention issue in the Apache Kafka client library. Apparently, the buffer and batch sizes were too small for the given throughput, so the client was forced to do a network round-trip more often than expected. This, of course, increased the processing time and increased the wait time for the other threads queuing to claim the lock.
After the buffer and batch sizes were increased and the test was re-run, the model started to match reality up to a throughput of 2,200 requests/second.
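The exact values used are not given in the post, but on the Kafka producer the relevant knobs are `batch.size`, `linger.ms` and `buffer.memory`. A hedged sketch of what such a change might look like, with placeholder numbers rather than the customer's values:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class ProducerConfigExample {
    // Illustrative producer settings; the concrete numbers are placeholders.
    static KafkaProducer<String, byte[]> createProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        // Larger batches mean fewer network round-trips per partition at high throughput
        props.put("batch.size", Integer.toString(256 * 1024));          // default is 16 KB
        // Give the producer a few milliseconds to fill a batch before sending it
        props.put("linger.ms", "5");
        // More buffer memory so send() does not block while batches are in flight
        props.put("buffer.memory", Long.toString(128L * 1024 * 1024));  // default is 32 MB

        return new KafkaProducer<>(props);
    }
}
```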
But why was the buffer/batch size issue not caught in the simulation? We had some hypotheses, one of which proved to be correct: the simulation was generating uniform traffic, whereas in production there were multiple spikes during which the throughput increased to 3-4x the usual volume within a few hundred milliseconds.
After the simulation was updated to include random intervals between requests, the lock contention started to reproduce in the tests as well.
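For reference, here is a sketch of what such an injection profile can look like with Gatling's Java DSL; the endpoint, rates and durations are made up for illustration, and the bursts only approximate the production spikes described above.

```java
import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;

public class SpikyLoadSimulation extends Simulation {

    // Assumed local endpoint; replace with the real service URL
    HttpProtocolBuilder httpProtocol = http.baseUrl("http://localhost:8080");

    ScenarioBuilder ingest = scenario("ingest")
            .exec(http("post payload").post("/ingest").body(StringBody("{\"k\":\"v\"}")));

    {
        setUp(
            ingest.injectOpen(
                // Baseline: ~1,000 req/s with randomized inter-arrival times
                // instead of a perfectly uniform stream
                constantUsersPerSec(1000).during(120).randomized(),
                // Short spike towards 3-4x the usual volume, mimicking production bursts
                rampUsersPerSec(1000).to(3500).during(1),
                rampUsersPerSec(3500).to(1000).during(1),
                constantUsersPerSec(1000).during(120).randomized()
            )
        ).protocols(httpProtocol);
    }
}
```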
Take-away
This case study serves as an excellent example from both the incident and problem management standpoints, as well as of how the root cause was resolved using state-of-the-art solutions.
- During incident management, the impact was mitigated by scaling up the infrastructure.
- After the impact was mitigated, the problem was analyzed, and the root cause identified.
- When solving the root cause, a performance model was built and verified.
So, whenever your services start to stumble under increased load, use the above example to model your own performance. As a result, you will make sure that increased traffic is handled at predictable and controllable infrastructure costs.
Of course, as we saw, good tooling will help you. I can only recommend our very own Plumbr APM, with its tracing to bottlenecks and errors, for this. Coupled with system monitoring by Prometheus, it will give you the tools that you need.