
The 9 AM Discovery That Saved Our Production: An ECS Fargate Circuit Breaker Story

Hello Devs,

In the world of containerised deployments, small mistakes can have catastrophic consequences. What started as a routine morning API test in our development environment turned into a revelation about production resilience that fundamentally changed how we approach ECS Fargate deployments.

This is the story of how a simple port configuration error taught us the critical importance of ECS Deployment Circuit Breakers – and why every team running workloads on AWS Fargate should consider them essential infrastructure, not optional extras.

The best production incidents, as it turns out, are the ones that never happen.

The Setup

Our Flask API ran smoothly on ECS Fargate with a cost-optimized dev setup — tasks auto-started at 8 AM and stopped after hours using CloudWatch alarms.

We used an Application Load Balancer (ALB) targeting port 5001, with health checks and task definitions perfectly aligned:

# app.py - The way it had always been
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5001, debug=False)

ECS config:

  • Container port: 5001

  • Target group: 5001

  • ALB health checks: 5001

Everything in harmony.
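
For reference, here's roughly what that alignment looks like when expressed in code. This is a minimal sketch using boto3, not our actual infrastructure-as-code; the resource names, VPC ID, image tag, and health check path are placeholders:

# Sketch of the three places the port has to agree. Names are illustrative.
import boto3

ecs = boto3.client('ecs')
elbv2 = boto3.client('elbv2')

APP_PORT = 5001  # must match app.run(port=...) in app.py

# 1. Container port in the task definition (execution role etc. omitted for brevity)
ecs.register_task_definition(
    family='flask-app',
    networkMode='awsvpc',
    requiresCompatibilities=['FARGATE'],
    cpu='256',
    memory='512',
    containerDefinitions=[{
        'name': 'flask-app',
        'image': 'flask-app:v1',  # placeholder image reference
        'portMappings': [{'containerPort': APP_PORT, 'protocol': 'tcp'}],
    }],
)

# 2. Target group port and 3. health check (defaults to the traffic port)
elbv2.create_target_group(
    Name='flask-app-tg',
    Protocol='HTTP',
    Port=APP_PORT,
    VpcId='vpc-0123456789abcdef0',  # placeholder
    TargetType='ip',                # required for Fargate (awsvpc) tasks
    HealthCheckPath='/health',      # placeholder path
)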

The Innocent Change

One of our backend developers was working late on a new feature. They were running multiple services locally and kept hitting port conflicts: port 5001 was already occupied by another service.

"Quick fix," developer thought, and made what seemed like the most logical change:

# app.py - The "harmless" local change
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5201, debug=False)  # Changed to avoid local conflict

The feature worked perfectly in their local environment. Tests passed. Code review looked good. The Docker build succeeded. Everything seemed normal.

But here's where the story takes a turn.

The 9 AM Discovery

The next morning, I opened my machine around 9 AM and decided to run some API tests before diving into feature work. Our automated CloudWatch alarm had dutifully started the ECS Fargate tasks at 8 AM, just as configured. But something was wrong.

Every API call returned the dreaded 502 Bad Gateway error.

I immediately checked the ECS console, and what I saw made my heart sink: Fargate tasks were in a continuous cycle of PENDING → RUNNING → STOPPED. They would start up, run for a few minutes, then get drained and terminated, only for ECS to immediately spin up new ones.

The root cause hit me like a lightning bolt: Our Flask application was now listening on port 5201, but everything else in our infrastructure was still configured for port 5001.

The Downward Spiral

What followed was a textbook example of how a small misconfiguration can cascade into a major incident:

  1. Task Launch: ECS Fargate would start a new task

  2. Health Check Failure: ALB couldn't reach the app on port 5001

  3. Task Termination: ECS marked the task as unhealthy and terminated it

  4. Replacement Attempt: ECS immediately launched a new Fargate task to maintain desired count

  5. Infinite Loop: Steps 1-4 repeated endlessly

Our ECS Fargate cluster was stuck in what we later dubbed "the task death spiral." New Fargate tasks were being created and destroyed every few minutes, consuming compute resources while serving zero traffic.
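
If you ever land in the same spiral, the stopped tasks usually explain themselves. Here's a minimal diagnostic sketch with boto3; the cluster and service names are illustrative:

# Sketch: inspect recently stopped tasks to see why ECS killed them.
import boto3

ecs = boto3.client('ecs')

stopped = ecs.list_tasks(
    cluster='dev-cluster',
    serviceName='flask-app-service',
    desiredStatus='STOPPED',
)['taskArns']

if stopped:
    for task in ecs.describe_tasks(cluster='dev-cluster', tasks=stopped)['tasks']:
        # A port mismatch typically surfaces here as an ELB health check failure.
        print(task['taskArn'], '->', task.get('stoppedReason'))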

The Circuit Breaker Revelation

During our post-incident analysis, I realised something that would change our deployment strategy forever: this entire incident could have been prevented automatically.

Enter the ECS Fargate Deployment Circuit Breaker.

This AWS feature acts like an intelligent safety net for ECS Fargate deployments. When enabled, it monitors your Fargate deployment and can automatically detect when something is going wrong, stopping the deployment and rolling back to the previous stable version.

How ECS Behaves in Different Rollback Scenarios

| Scenario | Desired Count | Task Def Changed | Rollback Triggered | Rollback Time |
|---|---|---|---|---|
| Broken image pushed with latest only | 1 | ❌ No | ❌ No | ❌ Never |
| Broken task def v3 (flask-app:v2) | 1 | ✅ Yes | ✅ Yes | ⏱ ~10–20 min |
| Same failure with desiredCount=5 | 5 | ✅ Yes | ✅ Yes | ⏱ ~3–5 min |
  • Circuit breaker only works if a new task definition is registered.

  • Desired count = 1 leads to slow failure detection, delaying rollback.

  • ECS uses an internal failure threshold (usually 3 failed tasks).

How Circuit Breaker Would Have Saved Us

Let's replay our incident with ECS Fargate deployment circuit breaker enabled:

To better understand how ECS identifies and reacts to a bad deployment, here's the simplified flow we reconstructed from our real incident:

  1. Deployment Start: Mismatched port in new task definition

  2. Monitoring Begins: ECS tracks task health and startup patterns

  3. Failure Detected: Multiple ECS task failures trigger threshold

  4. Automatic Rollback: ECS reverts to previous task definition

  5. Service Restored: Traffic resumes via healthy version

Instead of 75 minutes of downtime, we would have had perhaps 5-10 minutes of degraded performance while the circuit breaker detected and resolved the issue.
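
You can watch this play out from the SDK as well: with the circuit breaker enabled, the primary deployment's rollout state moves from IN_PROGRESS to FAILED when the breaker trips, and ECS begins rolling back if rollback is enabled. A rough sketch with boto3 (cluster and service names are illustrative):

# Sketch: poll the primary deployment's rollout state after a deploy.
import time
import boto3

ecs = boto3.client('ecs')

while True:
    service = ecs.describe_services(
        cluster='dev-cluster', services=['flask-app-service']
    )['services'][0]
    primary = next(d for d in service['deployments'] if d['status'] == 'PRIMARY')
    print(primary.get('rolloutState'), '-', primary.get('rolloutStateReason', ''))
    if primary.get('rolloutState') in ('COMPLETED', 'FAILED'):
        break
    time.sleep(30)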

How ECS Actually Triggers Rollbacks: Behind the Scenes

During our experiments, we noticed some undocumented behaviours:

  • ECS doesn't rollback unless there's a new task definition — pushing a new image to latest doesn’t count.

  • Desired count = 1 (common in off-hours cost optimisation) leads to much slower rollbacks due to staggered failures.

  • ECS seems to use a dynamic failure threshold of 3 (confirmed visually in the console), meaning it waits for 3 failed task launches before triggering rollback. You cannot change either threshold value; this behaviour is documented for the ECS deployment circuit breaker.

    ECS uses the following logic to determine the rollback threshold:

    minimum threshold ≤ 0.5 × desired task count ≤ maximum threshold

    In other words, the failure threshold is 0.5 × the desired task count, clamped between a fixed minimum and maximum.

What this means in practice:

Even if the circuit breaker is enabled, rollback won't happen unless you structure your deployments correctly.
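
To make that formula concrete, here's a small sketch of the threshold logic as we understand it. The minimum of 3 matches what we saw in the console; the maximum of 200 comes from the AWS documentation. Treat both as values to verify, not guarantees:

# Sketch of the circuit breaker failure-threshold logic as we understand it.
# MIN/MAX values are assumptions based on observed console behaviour and the
# AWS docs; verify them against the current documentation.
import math

MIN_THRESHOLD = 3
MAX_THRESHOLD = 200

def failure_threshold(desired_count: int) -> int:
    """Failed task launches ECS tolerates before tripping the breaker."""
    return min(max(math.ceil(0.5 * desired_count), MIN_THRESHOLD), MAX_THRESHOLD)

# With desiredCount=1 the threshold is still 3, which is why our single-task
# dev service took so long to trigger a rollback.
print(failure_threshold(1))   # 3
print(failure_threshold(5))   # 3
print(failure_threshold(20))  # 10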

The Circuit Breaker Implementation

That afternoon, we made a decision that would prove to be one of our best infrastructure investments: enabling ECS Deployment Circuit Breaker across all our services, starting with our most critical production workloads.

The configuration was surprisingly straightforward:

{
  "deploymentCircuitBreaker": {
    "enable": true,
    "rollback": true
  }
}
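
If you manage services with the SDK or CLI rather than CloudFormation or Terraform, the same setting can be applied to an existing service in place. A minimal boto3 sketch, with illustrative cluster and service names:

# Sketch: enable the deployment circuit breaker on an existing service.
import boto3

ecs = boto3.client('ecs')

ecs.update_service(
    cluster='dev-cluster',
    service='flask-app-service',
    deploymentConfiguration={
        'deploymentCircuitBreaker': {'enable': True, 'rollback': True},
    },
)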

What Happens When DesiredCount = 0? A Real Risk Pattern

Our incident happened in a development environment using off-hours scaling — every night, desiredCount = 0, and each morning ECS spins the tasks back up at 8 AM. This helps save cost during non-business hours.

But here’s the hidden danger we uncovered through real experiments:

In our real case, we pushed a new (and broken) image to flask-app:latest overnight.

However, we didn’t register a new task definition — the task definition was unchanged.

So when ECS scaled up in the morning, it pulled the broken image and launched new tasks.

Because ECS had no healthy task running and no “new deployment” to monitor, no rollback happened.

This subtle but critical issue means that:

  • ECS had no baseline healthy task to compare against

  • There was no new task definition, so ECS didn’t consider this a deployment

  • Circuit breaker logic was never triggered

  • ECS just kept retrying the same broken image silently

Even with circuit breaker enabled, rollback only works if ECS sees a new deployment (i.e., new task definition revision). In our case, since we reused flask-app:latest with the same task definition, ECS had nothing to roll back to.

Recommendations (Based on Real-World Failures)

These are not just best practices from the AWS documentation. These are hard-earned lessons from our own experiments and real incident recoveries.

1. Avoid using latest tag in ECS task definitions

ECS won't detect image changes if you're using flask-app:latest and don’t update the task definition. This can silently deploy broken images without triggering a rollback.

Do this instead:

  • Use immutable image tags like v1.2.3, build-20250909, or a full SHA digest

  • Always reference a new task definition revision tied to each deployment
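
One low-friction way to do this is to resolve the tag you just pushed to its immutable digest and reference that digest in the task definition. A rough sketch; the repository name, tag, and registry URI are placeholders:

# Sketch: pin the image by digest instead of a mutable tag.
import boto3

ecr = boto3.client('ecr')

image = ecr.describe_images(
    repositoryName='flask-app',
    imageIds=[{'imageTag': 'build-20250909'}],
)['imageDetails'][0]

# e.g. <account>.dkr.ecr.<region>.amazonaws.com/flask-app@sha256:<digest>
pinned_image = f"<account>.dkr.ecr.<region>.amazonaws.com/flask-app@{image['imageDigest']}"
print(pinned_image)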

2. Register a new task definition with every deployment

The deployment circuit breaker only activates when ECS detects a new deployment. If the task definition remains unchanged (even with a new image), ECS won’t treat it as a deployment, and rollback won’t occur.

Do this instead:

  • Automate task definition registration in your CI/CD pipeline

  • Even if using the same image tag, register a revision to trigger deployment detection
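
Here's a hedged sketch of what that deploy step might look like with boto3. The family, image tag, cluster, and service names are illustrative, and a production version would copy more of the optional task definition fields:

# Sketch: copy the latest task definition, swap the image, register a new
# revision, and point the service at it.
import boto3

ecs = boto3.client('ecs')

current = ecs.describe_task_definition(taskDefinition='flask-app')['taskDefinition']

# Carry over the core registrable fields (a fuller implementation would copy more).
register_keys = {
    'family', 'taskRoleArn', 'executionRoleArn', 'networkMode',
    'containerDefinitions', 'volumes', 'placementConstraints',
    'requiresCompatibilities', 'cpu', 'memory',
}
new_def = {k: v for k, v in current.items() if k in register_keys}
new_def['containerDefinitions'][0]['image'] = 'flask-app:v1.2.3'  # immutable tag

revision_arn = ecs.register_task_definition(**new_def)['taskDefinition']['taskDefinitionArn']

ecs.update_service(
    cluster='dev-cluster',
    service='flask-app-service',
    taskDefinition=revision_arn,  # a new revision means ECS sees a real deployment
)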

3. Use CloudWatch alarms to detect deployment failures early

ECS retries silently when tasks fail during deployment. In non-prod or low-desiredCount environments, this can go unnoticed.

Do this instead:

  • Monitor UnHealthyHostCount (ALB target group metric) and ECS service deployment events

  • Alert on unusual task exit reasons, STOPPED states, or drops in RunningTaskCount
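
As an example, here's a sketch of an alarm on the ALB target group's UnHealthyHostCount metric; the dimension values and SNS topic ARN are placeholders:

# Sketch: alarm when any target behind the ALB reports unhealthy.
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='flask-app-unhealthy-hosts',
    Namespace='AWS/ApplicationELB',
    MetricName='UnHealthyHostCount',
    Dimensions=[
        {'Name': 'TargetGroup', 'Value': 'targetgroup/flask-app-tg/0123456789abcdef'},
        {'Name': 'LoadBalancer', 'Value': 'app/flask-app-alb/abcdef0123456789'},
    ],
    Statistic='Maximum',
    Period=60,
    EvaluationPeriods=3,
    Threshold=0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:111122223333:ecs-deploy-alerts'],
)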

4. Enforce Task Definition Updates in CI/CD

One common issue we saw: devs pushed new images to latest, but forgot to update task definitions. Result? No rollback, no detection, broken app silently running.

Do this instead:

  • Add a CI/CD check: fail the pipeline if task definition revision isn't updated

  • Maintain an audit log: map every deployment to a task definition revision
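
One way to implement that pipeline check, as a sketch (names are illustrative, and the deploy step itself is elided):

# Sketch: fail the pipeline if the service is still on the same task
# definition revision after the deploy step ran.
import sys
import boto3

ecs = boto3.client('ecs')

def current_revision(cluster: str, service: str) -> str:
    svc = ecs.describe_services(cluster=cluster, services=[service])['services'][0]
    return svc['taskDefinition']  # full ARN, ends with :<revision>

before = current_revision('dev-cluster', 'flask-app-service')
# ... deploy step runs here (register new revision + update_service) ...
after = current_revision('dev-cluster', 'flask-app-service')

if before == after:
    print('Deploy did not register a new task definition revision', file=sys.stderr)
    sys.exit(1)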

5. Use higher desiredCount during deployments for faster rollback

In our tests, when desiredCount was set to 1, rollback took over 20 minutes to trigger. With desiredCount set to 5, the circuit breaker detected the failure pattern faster and triggered rollback within 3–5 minutes.

What to do instead:

  • Temporarily increase desiredCount during deployments (e.g., from 1 to 5).

  • Alternatively, tune deploymentConfiguration to use maxPercent = 200 and minHealthyPercent = 50 to allow parallel task launches during updates.
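
A sketch of both options combined in a single update_service call; the values mirror the suggestions above, and the cluster and service names are illustrative:

# Sketch: give ECS room to launch tasks in parallel during a deploy so
# failures (and the circuit breaker) surface faster.
import boto3

ecs = boto3.client('ecs')

ecs.update_service(
    cluster='dev-cluster',
    service='flask-app-service',
    desiredCount=5,  # temporarily raised for the deploy window
    deploymentConfiguration={
        'maximumPercent': 200,
        'minimumHealthyPercent': 50,
        'deploymentCircuitBreaker': {'enable': True, 'rollback': True},
    },
)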

Summary Table

| Problem | Recommendation |
|---|---|
| ECS didn't rollback on broken image | Always register a new task definition |
| ECS used broken latest tag silently | Avoid latest, use immutable image tags |
| Slow rollback when desired count = 1 | Use higher desiredCount during deploys |
| No alert when tasks failed | Add CloudWatch alarms for task health and service events |
| Deployment skipped task def update | Enforce task def registration in pipeline |

Conclusion: The Safety Net We Proactively Built

Even small mistakes — like a port mismatch — can bring down containerized systems. That’s why we treat circuit breakers not as optional features, but as must-have infrastructure. They're not just for rollback — they build resilience into your deployment lifecycle.

That seemingly minor port change could’ve caused hours of downtime — but it didn’t. Because we caught it early, we had the chance to rethink our deployment safety.

We turned that morning’s incident into a proactive defense strategy by enabling ECS Deployment Circuit Breaker across all services. It now gives us confidence that even if a broken deployment slips through, ECS will detect the issue and roll back automatically — without us scrambling at 9 AM.

Our team now deploys with confidence, not caution. And the best part? The incident never reached users.

Sometimes the best production incidents are the ones that never happen.

👉 Have you faced something similar? Let’s talk in the comments.
