The previous article looked at how changing the performance envelope for a CPU-heavy application affected its performance. This article shows whether vertically scaling an application with a memory leak is effective.
- The endpoint under test
- Running tests
- Results
- How effective is it to vertically scale an application that has a memory leak?
The endpoint under test
Our mock application comes with this REST API endpoint:
- `/memory_leak`, simulating a memory leak.
When this endpoint is invoked, the application calculates the square root of `64 * 64 * 64 * 64 * 64 * 64 ** 64` and returns the result. Due to a bad code merge, approximately 1 MB of memory is also appended to a Python list on each request.
```python
from flask import Flask

app = Flask(__name__)

# _build_response and do_sqrt are defined elsewhere in the mock application
leaked_memory = []

@app.route("/memory_leak")
def memory_leaky_task():
    global leaked_memory
    # appends approx. 1 MB of data to a list on each request, creating a memory leak
    leaked_memory.append(bytearray(1024 * 1024 * 1))
    return _build_response(do_sqrt())
```
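You don't need a load test to observe this behaviour. Below is a minimal local sketch, not part of the mock application, that mimics the handler's allocation pattern and watches the process RSS grow, assuming `psutil` is installed:

```python
# Minimal sketch (not from the article): reproduce the leak pattern locally
# and watch the resident set size grow by roughly 1 MB per call.
import psutil  # third-party; assumed installed via `pip install psutil`

leaked_memory = []

def leak_once():
    # same pattern as the endpoint: ~1 MB retained per call
    leaked_memory.append(bytearray(1024 * 1024))

if __name__ == "__main__":
    process = psutil.Process()
    for calls in range(1, 1001):
        leak_once()
        if calls % 200 == 0:
            rss_mb = process.memory_info().rss / (1024 * 1024)
            print(f"{calls} calls -> RSS approx. {rss_mb:.0f} MB")
```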
Running tests
> It is better to be roughly right than precisely wrong.
>
> — Alan Greenspan
To load-test this application, I used hey to invoke the endpoint at 5 requests per second (5 concurrent workers, each rate-limited to 1 request per second) for 30 minutes:

```bash
hey -z 30m -q 1 -c 5 $URL/memory_leak
```
To be able to compare results, I ran the same application in two containers with different hardware configurations:
| | CPUs | Memory (GB) |
|---|---|---|
| Container 1 | 0.5 | 1.0 |
| Container 2 | 1.0 | 2.0 |
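For context, here is a hedged sketch of how two such task sizes could be registered with boto3, assuming the containers run on Fargate (where `cpu` is given in CPU units, 1024 units = 1 vCPU, and `memory` in MiB); the family, container, and image names are placeholders, not taken from the article:

```python
# Hedged sketch: register two Fargate task sizes matching the table above.
import boto3

ecs = boto3.client("ecs")

# container 1: 0.5 vCPU (512 CPU units), 1 GB (1024 MiB)
# container 2: 1 vCPU (1024 CPU units), 2 GB (2048 MiB)
sizes = [("512", "1024"), ("1024", "2048")]

for cpu, memory in sizes:
    ecs.register_task_definition(
        family=f"mock-app-{cpu}cpu-{memory}mem",  # placeholder family name
        requiresCompatibilities=["FARGATE"],
        networkMode="awsvpc",
        cpu=cpu,
        memory=memory,
        containerDefinitions=[
            {
                "name": "mock-app",          # placeholder container name
                "image": "mock-app:latest",  # placeholder image
                "essential": True,
                "portMappings": [{"containerPort": 80}],
            }
        ],
    )
```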
Results
Container 1 (1GB of memory)
Looking at the `hey` summary, we notice that not all requests were successful:
```
Summary:
  Total:        1800.0255 secs
  Slowest:      10.1465 secs
  Fastest:      0.0088 secs
  Average:      0.1860 secs
  Requests/sec: 4.3499

Status code distribution:
  [200] 6148 responses
  [502] 341 responses
  [503] 1211 responses
  [504] 130 responses
```
Roughly 21% of all requests had a non-200 status code 😞 This is not a great user experience.
Looking at the ECS task details, we notice that there are 7 tasks in total, only one of which is currently running; the other 6 are stopped.
To get more details, we can describe one of the stopped tasks:
```bash
aws ecs describe-tasks \
  --cluster ecs-scaling-cluster \
  --tasks 7f0872485e6e421e8f83a062a3704303 |
  jq -r '.tasks[0].containers[0].reason'
```

```
OutOfMemoryError: Container killed due to memory usage
```
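Describing tasks one by one gets tedious. A short boto3 sketch, assuming only the cluster name from the article, that prints the stop reason of every stopped task at once:

```python
# Hedged sketch: print the stop reason of every stopped task in the cluster.
import boto3

ecs = boto3.client("ecs")
cluster = "ecs-scaling-cluster"

stopped_arns = ecs.list_tasks(cluster=cluster, desiredStatus="STOPPED")["taskArns"]
if stopped_arns:
    for task in ecs.describe_tasks(cluster=cluster, tasks=stopped_arns)["tasks"]:
        for container in task["containers"]:
            print(task["taskArn"], "->", container.get("reason"))
```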
The cause of the OutOfMemoryError becomes obvious when looking at the memory utilization metric:
The sawtooth pattern reveals the problem: our application is exceeding the performance envelope for the "Memory" dimension! Each request leaks approx. 1 MB of memory, and because the container is given 1 GB of memory, serving roughly 1,000 requests leads to the container running out of memory. At that point the ECS service forcefully stops the out-of-memory container and starts a fresh one.
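If you want to pull the same metric programmatically, here is a hedged boto3 sketch that queries the standard `AWS/ECS` `MemoryUtilization` metric; the cluster name is from the article, the service name is a placeholder:

```python
# Hedged sketch: fetch the service's average MemoryUtilization from CloudWatch.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
datapoints = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="MemoryUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "ecs-scaling-cluster"},
        {"Name": "ServiceName", "Value": "memory-leak-service"},  # placeholder
    ],
    StartTime=end - timedelta(minutes=30),
    EndTime=end,
    Period=60,
    Statistics=["Average"],
)["Datapoints"]

for point in sorted(datapoints, key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1))
```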
Container 2 (2GB of memory)
Running another container with double the memory (1 GB → 2 GB) and load testing it in the same way produces very similar results, though with a lower error rate: approx. 7% of all requests returned a 5xx status code:
```
Summary:
  Total:        1800.0240 secs
  Slowest:      10.0976 secs
  Fastest:      0.0119 secs
  Average:      0.0850 secs
  Requests/sec: 4.7249

Status code distribution:
  [200] 7868 responses
  [502] 177 responses
  [503] 405 responses
  [504] 55 responses
```
In this instance only 4 tasks were started, 3 of which were forcefully stopped:
```bash
aws ecs describe-tasks \
  --cluster ecs-scaling-cluster \
  --tasks c35b7029c38c4383b26e768aec3c77f2 |
  jq -r '.tasks[0].containers[0].reason'
```

```
OutOfMemoryError: Container killed due to memory usage
```
And again, the sawtooth memory utilization pattern reveals that we have a memory leak:
How effective is it to vertically scale an application that has a memory leak?
Not at all.
> A memory leak can not be fixed by scaling. You can’t vertically or horizontally scale yourself out of a memory leak. The only way to fix this is to fix the application code. You cannot have scalability with a memory leak.
>
> — Source
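Some back-of-the-envelope arithmetic using the article's numbers shows why more memory only buys time. This ignores the application's baseline memory footprint and container restart time, so treat it as a rough upper bound:

```python
# Rough estimate: how long each container survives before being OOM-killed.
LEAK_PER_REQUEST_MB = 1   # approx. leak per request
EFFECTIVE_RPS = 4.35      # approx. throughput observed by hey

for name, memory_mb in [("Container 1", 1024), ("Container 2", 2048)]:
    requests_until_oom = memory_mb / LEAK_PER_REQUEST_MB
    minutes_until_oom = requests_until_oom / EFFECTIVE_RPS / 60
    print(f"{name}: ~{requests_until_oom:.0f} requests, "
          f"OOM-killed roughly every {minutes_until_oom:.0f} minutes")

# Container 1: ~1024 requests, OOM-killed roughly every 4 minutes
# Container 2: ~2048 requests, OOM-killed roughly every 8 minutes
```

These estimates line up with the 7 tasks seen for the 1 GB container and the 4 tasks seen for the 2 GB one: doubling the memory roughly doubles the time between OOM kills, nothing more.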
Regardless of its memory configuration, a container with an application that has a memory leak will sooner or later run out of memory.
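In this mock application the fix is simply to stop retaining the per-request data. A sketch of the corrected handler, keeping the mock application's `_build_response` and `do_sqrt` helpers:

```python
@app.route("/memory_leak")
def memory_leaky_task():
    # nothing is retained between requests, so memory usage stays flat
    # regardless of how many requests are served
    return _build_response(do_sqrt())
```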
AWS re:Post has an article on troubleshooting OutOfMemory errors. This blog post explains how containers (in general, and those running on ECS) consume CPU and memory.
Next up: Should you horizontally scale your application based on response times?
You also might want to check how changing the performance envelope for a CPU-heavy application affects its performance.