DEV Community

Shireen Bano A

Understanding AWS Autoscaling with Grafana

Architecture Overview

My application is deployed on AWS as a containerized system:
  • React frontend served by Nginx
  • Node.js backend deployed as a Docker container

The backend relies heavily on:

  1. RDS (reads/writes)
  2. S3 (uploads/downloads)
  3. Gemini API (LLM inference)

This architecture is intentionally realistic: it represents the kind of stack many modern apps use today.

The Goal: High-Stress Scaling Through Load Testing

I wanted to validate autoscaling behavior under pressure. Specifically:

Can ECS scale out when traffic spikes?

How fast does it scale?

Does latency stay stable?

Does error rate increase under stress?

Which dependency becomes the bottleneck first?

Load Testing Strategy (k6)
To make the test realistic, I didn’t just hit a single endpoint repeatedly. Instead, I created a k6 test with two parallel scenarios:

1) Backend Load Scenario (Triggers Scaling)
This scenario generates the high traffic volume needed to push the backend and observe ECS behavior.

  • Warm-up at 20 users
  • Spike instantly to 500 users
  • Hold for 9 minutes
  • Drop back down
  • Observe scale-in behavior

2) UI Monitoring Scenario (Real User Flow)
This scenario runs a small number of browser-based users to monitor actual UI behavior while the system is under stress.

This includes:

  • login
  • navigation to medical history
  • viewing a PDF report
  • adding a condition
  • requesting recipes
  • uploading a report

This helped validate whether the UI stayed usable during the stress event.
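
The two scenarios described above could be declared in a single k6 options object roughly as follows. This is a minimal sketch, not the actual test script: the warm-up and ramp-down durations, the browser user count, and the function names `backendLoad` and `uiFlow` are assumptions, since only the 20-user warm-up, the 500-user spike, and the 9-minute hold are stated here.

```javascript
// Sketch of a k6 options object with two parallel scenarios.
// Values marked "assumed" are not given in the post.
const WARMUP_VUS = 20;
const PEAK_VUS = 500;

const k6Options = {
  scenarios: {
    // Scenario 1: protocol-level load that should trigger ECS scaling.
    backend_load: {
      executor: 'ramping-vus',
      exec: 'backendLoad', // hypothetical exported function name
      startVUs: WARMUP_VUS,
      stages: [
        { duration: '2m', target: WARMUP_VUS }, // warm-up (assumed length)
        { duration: '1s', target: PEAK_VUS },   // near-instant spike
        { duration: '9m', target: PEAK_VUS },   // hold at peak
        { duration: '1m', target: WARMUP_VUS }, // drop back down (assumed length)
      ],
    },
    // Scenario 2: a few real browser sessions walking through the UI flow.
    ui_monitoring: {
      executor: 'constant-vus',
      exec: 'uiFlow', // hypothetical exported function name
      vus: 2,         // assumed "small number" of users
      duration: '13m', // assumed: long enough to cover the whole load profile
      options: { browser: { type: 'chromium' } },
    },
  },
};

// In a k6 script this would be exposed as: export const options = k6Options;
```

Keeping the browser scenario separate from the load scenario means the UI checks measure what a real user sees without contributing meaningfully to the load itself.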

The First Surprise: 500 VUs Did Not Spike CPU
At 500 virtual users, I expected ECS CPU utilization to become the main bottleneck. Instead, CPU stayed surprisingly low, barely crossing 18%, even while the load test pushed close to 25,000 requests through the system. At first this felt wrong, and I genuinely questioned whether my load test was working. But the test was fine; my assumption was not.

After digging deeper, I realized the application workload simply wasn’t CPU-intensive. Most of the request time was spent waiting on external dependencies: RDS reads/writes, S3 uploads/downloads, and Gemini API responses. This made the system primarily I/O-bound, which explains why CPU-based autoscaling did not react strongly, even under heavy traffic.

This was one of the biggest lessons from the experiment:

High traffic does not always mean high CPU.
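
A back-of-the-envelope calculation makes this concrete. The sketch below takes the figures quoted above (about 25,000 requests, a 9-minute hold, CPU peaking near 18%) and estimates the CPU time spent per request. It assumes that all requests landed during the hold, that a single task served them, and that the task has 1 vCPU. None of these is stated in the post, so treat the result as an order-of-magnitude illustration only.

```javascript
// Rough CPU-cost-per-request estimate from the figures quoted in the post.
function estimateCpuMsPerRequest({ requests, windowSeconds, cpuUtilization, vcpus }) {
  const requestsPerSecond = requests / windowSeconds;
  // CPU-milliseconds consumed per wall-clock second across the task.
  const cpuMsPerSecond = cpuUtilization * vcpus * 1000;
  return { requestsPerSecond, cpuMsPerRequest: cpuMsPerSecond / requestsPerSecond };
}

const estimate = estimateCpuMsPerRequest({
  requests: 25000,       // "close to 25,000 requests"
  windowSeconds: 9 * 60, // assumed: all requests fall within the 9-minute hold
  cpuUtilization: 0.18,  // "barely crossing 18%"
  vcpus: 1,              // assumed task size (not stated in the post)
});

console.log(estimate.requestsPerSecond.toFixed(1)); // 46.3 requests per second
console.log(estimate.cpuMsPerRequest.toFixed(1));   // 3.9 ms of CPU per request
```

Under these assumptions each request costs only a few milliseconds of CPU, so the rest of its time in flight is spent waiting on RDS, S3, or Gemini rather than computing.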

Request Count in Grafana

As the graph above shows, the test pushed more than 20,000 requests through the system.

Heartbeat
At the same time, here is my CPU and memory utilization graph, hitting no more than 20%.

Auto-Scaling policy
According to my autoscaling policy, CPU utilization must cross 70% before CloudWatch triggers the alarm. Since my application isn’t naturally CPU-intensive, I wasn’t sure how else to push CPU high enough to test scaling properly.

So I manually generated CPU stress inside the running ECS container by starting two infinite busy loops (one in the background, one in the foreground) with the following command:

aws ecs execute-command --cluster recipe-finder-prod-cluster \
    --task a55518997ca84f24bc2fd614cbc18f20 \
    --container recipe-finder-api \
    --interactive \
    --command "/bin/sh -c 'while true; do :; done & while true; do :; done'"

Within a few minutes, this forced the container CPU to spike aggressively, reaching a consistent ~99% utilization.

Now the real question becomes: how long does scale-out actually take?

Based on the autoscaling configuration, CloudWatch evaluates CPU in 60-second datapoints and requires a sustained breach before the alarm enters the ALARM state. Once the alarm is triggered, ECS detects it and begins launching new tasks.

| Event | Timestamp |
|---|---|
| CPU crossed 70% | 12:09:00 |
| Alarm triggered | 12:13:25 |
| Desired tasks increased | 12:14:00 |
| New task running | 12:15:00 |

CPU crossed the 70% threshold at 12:09, but the CloudWatch alarm didn’t trigger until 12:13. ECS then increased the desired task count at 12:14, and the new task became fully running by 12:15 — meaning the full scale-out process took roughly 6 minutes from threshold breach to a healthy new task.
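
To see where those 6 minutes go, the timestamps from the table can be turned into per-step delays. This is a small illustrative sketch; the helper `toSeconds` and the variable names are mine, and the timestamps are the ones reported above.

```javascript
// Convert "HH:MM:SS" to seconds since midnight.
function toSeconds(hms) {
  const [h, m, s] = hms.split(':').map(Number);
  return h * 3600 + m * 60 + s;
}

// Scale-out timeline as reported in the table above.
const scaleOutEvents = [
  ['CPU crossed 70%', '12:09:00'],
  ['Alarm triggered', '12:13:25'],
  ['Desired tasks increased', '12:14:00'],
  ['New task running', '12:15:00'],
];

// Seconds elapsed between each pair of consecutive events.
const scaleOutDeltas = scaleOutEvents.slice(1).map(([name, time], i) => ({
  step: `${scaleOutEvents[i][0]} -> ${name}`,
  seconds: toSeconds(time) - toSeconds(scaleOutEvents[i][1]),
}));

const totalSeconds = scaleOutDeltas.reduce((sum, d) => sum + d.seconds, 0);

for (const d of scaleOutDeltas) console.log(`${d.step}: ${d.seconds}s`);
console.log(`Total: ${totalSeconds}s`); // Total: 360s
```

Most of the delay (265 of 360 seconds) sits between the threshold breach and the alarm, not in launching the new task.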

So autoscaling doesn’t react the instant CPU crosses the 70% threshold. CloudWatch evaluates CPU in 1-minute datapoints, and my alarm required 3 breaching datapoints within 3 minutes. Only after the alarm entered the ALARM state did ECS trigger scale-out and launch new Fargate tasks.
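
The "3 breaching datapoints within 3 minutes" rule can be modeled in a few lines. The sketch below is a simplified model of the evaluation logic, not CloudWatch's actual implementation, and the CPU series is hypothetical.

```javascript
// Simplified alarm model: fire once the last `evaluationPeriods` datapoints
// contain at least `datapointsToAlarm` values above the threshold.
function firstAlarmIndex(datapoints, { threshold, datapointsToAlarm, evaluationPeriods }) {
  for (let i = evaluationPeriods - 1; i < datapoints.length; i++) {
    const recent = datapoints.slice(i - evaluationPeriods + 1, i + 1);
    const breaches = recent.filter((value) => value > threshold).length;
    if (breaches >= datapointsToAlarm) return i;
  }
  return -1; // never reached ALARM
}

// Hypothetical 1-minute CPU averages: idle, then the stress loop starts.
const cpuPerMinute = [18, 18, 99, 99, 99, 99];

const alarmAt = firstAlarmIndex(cpuPerMinute, {
  threshold: 70,
  datapointsToAlarm: 3,
  evaluationPeriods: 3,
});

console.log(alarmAt); // 4, the third consecutive breaching datapoint
```

The model ignores metric publication delay and the timing of alarm evaluations, which is one plausible reason the observed gap (about 4.5 minutes) was longer than the 3-minute minimum.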

Now let's look at the scale-in process:

| Event | Timestamp | Notes |
|---|---|---|
| CPU fell below scale-in threshold (<63%) | 12:34 | Based on Grafana |
| Low alarm triggered (OK → ALARM) | 12:49 | 15-min evaluation period complete |
| ECS desired tasks decreased | 12:50 | ECS starts stopping tasks |
| Extra task stopped (scale-in complete) | 12:52 | Task fully terminated |

Notice how the alarm now triggers 15 minutes after the CPU fell below threshold, matching the Low alarm rule of 15 datapoints in 15 minutes.
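
The asymmetry between the two alarms can be summarized in a short sketch. The 3-datapoint and 15-datapoint rules and the 63% figure come from the observations above. The assumption that the low threshold is 90% of the 70% target is mine, inferred from 63 = 0.9 × 70 and typical of a target-tracking policy; the post does not state which policy type is in use.

```javascript
// Derive alarm settings from a CPU target, assuming target-tracking-style rules.
function deriveAlarms(targetPercent) {
  return {
    high: { threshold: targetPercent, datapoints: 3, periodMinutes: 1 },
    // Assumed: the low threshold sits at 90% of the target (70% -> 63%).
    low: { threshold: (targetPercent * 90) / 100, datapoints: 15, periodMinutes: 1 },
  };
}

// Add minutes to an "HH:MM" clock time (same-day arithmetic only).
function addMinutes(hhmm, minutes) {
  const [h, m] = hhmm.split(':').map(Number);
  const total = h * 60 + m + minutes;
  const pad = (n) => String(n).padStart(2, '0');
  return `${pad(Math.floor(total / 60) % 24)}:${pad(total % 60)}`;
}

const alarms = deriveAlarms(70);
const lowWindow = alarms.low.datapoints * alarms.low.periodMinutes;

console.log(alarms.low.threshold);           // 63
console.log(addMinutes('12:34', lowWindow)); // 12:49
```

Scale-in needs five times as many datapoints as scale-out (15 versus 3), so the system adds capacity quickly and removes it cautiously.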

Closing Note:

Autoscaling ensures your application can handle spikes, but it comes with temporary performance trade-offs:

During scale-out: When CPU spikes and new Fargate tasks are being launched, your application may briefly return 5xx errors or respond more slowly. In my experiment, I saw 5% errors for a few minutes during the initial warm-up period, before the new tasks fully came online. This “warm-up latency” is an inherent part of reactive autoscaling.

During scale-in: ECS gradually terminates idle tasks once the Low alarm confirms sustained low CPU. This process is intentionally slow to avoid task flapping, ensuring that users aren’t suddenly impacted if traffic spikes again.

Observing CPU, alarm state, and task events together helps understand exactly how long users may experience degraded performance during scaling, and informs decisions about pre-warming, thresholds, and evaluation periods to minimize those user-facing impacts.

Github link:


Application Overview

This application helps users manage their health by securely storing medical history, lab reports, and personal profile information. Based on a patient’s conditions, it generates personalized healthy recipes using a recommendation engine integrated with the Gemini API. The goal is to provide actionable nutrition guidance while maintaining HIPAA compliance, data privacy, and secure storage. It also caches generated recipes for quick retrieval and seamless user experience.

How to install:

    terraform apply
    terraform destroy  # to destroy the infra

Recipe Finder Application:

1) Profile (CRUD + Database Reads/Writes)

Users can view and update profile information.

This workflow represents the most typical web-app traffic pattern: read and write operations against the database.

2) Medical History (Uploads + Processing + AI Pipeline)

Users can upload lab reports and add medical conditions.

Once submitted, the backend processes the medical data and sends it to a recommendation engine, which then forwards structured…
