The previous article looked at whether changing the performance envelope for an application with a memory leak was effective. This article answers the question: "Should you horizontally scale your application based on its response time?"
- Horizontal scaling
- The endpoint under test
- Running tests
- Should you horizontally scale your application based on response times?
Horizontal scaling
Since this is the first article that deals with horizontal scaling, a quote from Nathan reminds us what horizontal scaling is.
Horizontal scaling is when you spread your workload across a larger number of application containers. It is based on aggregate resource consumption metrics for the service. For example you can look at average CPU resource consumption metric across all copies of your container.
When the aggregate average utilization breaches a high threshold you scale out by adding more copies of the container. If it breaches a low threshold you reduce the number of copies of the container. Source
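To make the quote concrete, here is a minimal sketch of a threshold-based scaling decision. The function name and the 70%/30% thresholds are assumptions chosen for illustration; they do not come from ECS or from the quoted source.

```python
def scaling_decision(utilizations, high=70.0, low=30.0):
    """Return +1 to scale out, -1 to scale in, 0 to hold.

    `utilizations` holds one value per running container (e.g. CPU %).
    The thresholds are illustrative assumptions, not ECS defaults.
    """
    # Aggregate metric: the average across all copies of the container.
    average = sum(utilizations) / len(utilizations)
    if average > high:
        return 1   # breached the high threshold: add a container
    if average < low:
        return -1  # breached the low threshold: remove a container
    return 0       # within bounds: leave the task count alone


print(scaling_decision([85.0, 90.0, 80.0]))  # 1
print(scaling_decision([10.0, 20.0]))        # -1
print(scaling_decision([50.0, 55.0]))        # 0
```

The decision looks only at the aggregate, never at a single container, which is why the choice of metric matters so much later in this article.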
The endpoint under test
Our mock application comes with this REST API endpoint:
- /long_response_time, simulating an increasingly busy database.
When this endpoint is invoked, the application calculates the square root of 64 * 64 * 64 * 64 * 64 * 64 ** 64 and saves the result to a database. Due to an increased load on the database, each INSERT query takes longer and longer to complete.
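The behaviour of this endpoint can be sketched roughly as follows. This is not the mock application's actual code: the function names, the linear latency model and its `base` and `growth` parameters are all assumptions, and no real database is involved.

```python
import math


def compute_value():
    # Same expression as in the article. ** binds tighter than *, so this is
    # 64**5 * 64**64 = 2**414, whose integer square root is 2**207.
    return math.isqrt(64 * 64 * 64 * 64 * 64 * 64 ** 64)


def simulated_insert_seconds(request_number, base=0.3, growth=0.01):
    """Hypothetical latency model: each request makes the next INSERT slower."""
    return base + growth * request_number


def handle_long_response_time(request_number):
    """Return the computed value and the simulated INSERT duration in seconds."""
    value = compute_value()
    latency = simulated_insert_seconds(request_number)
    return value, latency


# The computation itself is cheap; the time is spent waiting on the database.
_, first = handle_long_response_time(0)
_, later = handle_long_response_time(200)
print(first < later)  # True
```

Note that the simulated latency depends only on how many requests have hit the database, not on how many copies of the application are running.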
Running tests
Your morning just started and that coffee is smelling so good. While sipping it, you glance at your application's monitoring dashboard and notice that the average response time went from around 300 ms to more than 2 seconds.
Not good.
You decide to configure application autoscaling based on the response time. The idea is that running more containers will help distribute the workload and bring down the response time to acceptable levels.
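    # CDK v2 import (in CDK v1, Duration lives in aws_cdk.core).
    from aws_cdk import Duration

    # `service` (the ECS service) and `target_group` (its load balancer
    # target group) are defined elsewhere in the stack.
    scaling = service.auto_scale_task_count(max_capacity=5)
    scaling.scale_to_track_custom_metric(
        "responsescaling",
        metric=target_group.metrics.target_response_time(
            period=Duration.minutes(1)
        ),
        target_value=2,  # target response time, in seconds
        scale_in_cooldown=Duration.minutes(1),
        scale_out_cooldown=Duration.minutes(1)
    )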
The above CDK code configures autoscaling for your service running on ECS. It tracks the target group's response time against a target value of 2 seconds: while the metric stays above the target, ECS tasks are added, with a one-minute cooldown between scaling activities, up to a maximum of 5 tasks.
The configuration is applied, and it seems to be working! The response time has dropped below 2 seconds.
That cold coffee is the least of your concerns now because increasing the number of tasks helped only temporarily.
The third and fourth containers start up fairly quickly, but the response time rises relentlessly.
After a few minutes, the service scales up to the maximum of 5 tasks to try and cope with the rising response times...
... but it is completely ineffective, as the response time is still growing.
Why is that?
Well, the ship is taking on water and many sailors are rushing to empty it, but there's only a handful of buckets available 🪣 🪣 🪣.
Should you horizontally scale your application based on response times?
You can, but it won't do much good.
The "Identifying a utilization metric" paragraph has a great explanation of how to choose the metric on which to base autoscaling.
The metric must be correlated with demand. When resources are held steady, but demand changes, the metric value must also change. The metric should increase or decrease when demand increases or decreases.
The metric value must scale in proportion to capacity. When demand holds constant, adding more resources must result in a proportional change in the metric value. So, doubling the number of tasks should cause the metric to decrease by 50%.
The second requirement is the important one because it applies to our use case: the metric value is not scaling in proportion to capacity. Doubling the number of tasks (and doubling them again) did not cause the metric value to decrease by 50%.
An overloaded database can cause response times to skyrocket, and adding more tasks won't help.
In fact, it may actually make it much worse because launching more copies of the application causes more connections to an already overloaded database server. Source
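The proportionality requirement can be checked with a toy model. Everything here is an assumption made for illustration: the split between application time and database time, the numbers, and the 10% tolerance. Only the application share of the response time is assumed to shrink when tasks are added.

```python
def response_time(tasks, app_seconds=0.2, db_seconds=2.0):
    """Hypothetical model: only the application share shrinks with more tasks.

    The database share stays the same no matter how many tasks are running.
    """
    return app_seconds / tasks + db_seconds


def scales_with_capacity(metric_before, metric_after, tolerance=0.1):
    """Check whether doubling capacity roughly halved the metric."""
    return abs(metric_after / metric_before - 0.5) <= tolerance


before = response_time(tasks=2)  # 0.1 s of application time + 2.0 s of database time
after = response_time(tasks=4)   # 0.05 s of application time + 2.0 s of database time

# Response time barely moves when the task count doubles.
print(scales_with_capacity(before, after))  # False

# A metric such as average CPU going from 80% to 40% would pass the check.
print(scales_with_capacity(80.0, 40.0))     # True
```

In this model the database share dominates, so the response time fails the proportionality test however many tasks are added, which matches what the tests above showed.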
Dear reader, thank you for following my journey through practical ECS scaling. We looked into how a CPU-heavy application performs better with more CPU resources, how memory leaks are like monkey wrenches in the machine and the futility of horizontally scaling an application when the database is overloaded (this article).