Hello Devs,
In the world of containerised deployments, small mistakes can have catastrophic consequences. What started as a routine morning API test in our development environment turned into a revelation about production resilience that fundamentally changed how we approach ECS Fargate deployments.
This is the story of how a simple port configuration error taught us the critical importance of ECS Deployment Circuit Breakers – and why every team running workloads on AWS Fargate should consider them essential infrastructure, not optional extras.
The best production incidents, as it turns out, are the ones that never happen.
The Setup
Our Flask API ran smoothly on ECS Fargate with a cost-optimized dev setup — tasks auto-started at 8 AM and stopped after hours using CloudWatch alarms.
We used an Application Load Balancer (ALB) targeting port 5001, with health checks and task definitions perfectly aligned:
# app.py - The way it had always been
from flask import Flask
app = Flask(__name__)

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5001, debug=False)
ECS config:
Container port: 5001
Target group: 5001
ALB health checks: 5001
Everything in harmony.
The Innocent Change
One of our backend developers was working late on a new feature. They were running multiple services locally and kept hitting port conflicts. Port 5001 was already occupied by another service.
"Quick fix," the developer thought, and made what seemed like the most logical change:
# app.py - The "harmless" local change
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5201, debug=False) # Changed to avoid local conflict
The feature worked perfectly in their local environment. Tests passed. Code review looked good. The Docker build succeeded. Everything seemed normal.
But here's where the story takes a turn.
The 9 AM Discovery
The next morning, I arrived at around 9 AM, opened my machine, and decided to run some API tests before diving into feature work. Our automated CloudWatch alarm had dutifully started the ECS Fargate tasks at 8 AM, just as configured. But something was wrong.
Every API call returned the dreaded 502 Bad Gateway error.
I immediately checked the ECS console, and what I saw made me pause: Fargate tasks were in a continuous cycle of PENDING → RUNNING → STOPPED. They would start up, run for a few minutes, then get drained and terminated, only for ECS to immediately spin up new ones.
The root cause hit me like a lightning bolt: Our Flask application was now listening on port 5201, but everything else in our infrastructure was still configured for port 5001.
The Downward Spiral
What followed was a textbook example of how a small misconfiguration can cascade into a major incident:
Task Launch: ECS Fargate would start a new task
Health Check Failure: ALB couldn't reach the app on port 5001
Task Termination: ECS marked the task as unhealthy and terminated it
Replacement Attempt: ECS immediately launched a new Fargate task to maintain desired count
Infinite Loop: Steps 1-4 repeated endlessly
Our ECS Fargate cluster was stuck in what we later dubbed "the task death spiral." New Fargate tasks were being created and destroyed every few minutes, consuming compute resources while serving zero traffic.
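If you ever find yourself staring at a similar loop, the stopped-task reasons usually tell the story fastest. Here's a minimal diagnostic sketch using boto3; the cluster and service names are placeholders for illustration:
# list_stopped_tasks.py - minimal diagnostic sketch (cluster/service names are placeholders)
import boto3

ecs = boto3.client("ecs")

# Fetch tasks that ECS has already stopped for this service
stopped = ecs.list_tasks(
    cluster="dev-cluster",
    serviceName="flask-api-service",
    desiredStatus="STOPPED",
)

if stopped["taskArns"]:
    details = ecs.describe_tasks(cluster="dev-cluster", tasks=stopped["taskArns"])
    for task in details["tasks"]:
        # For this failure mode, stoppedReason typically reads something like
        # "Task failed ELB health checks in (target-group ...)"
        print(task["taskArn"].split("/")[-1], "->", task.get("stoppedReason", "unknown"))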
The Circuit Breaker Revelation
During our post-incident analysis, I realised something that would change our deployment strategy forever: what if this entire incident could have been prevented automatically?
Enter the ECS Fargate Deployment Circuit Breaker.
This AWS feature acts like an intelligent safety net for ECS Fargate deployments. When enabled, it monitors your Fargate deployment and can automatically detect when something is going wrong, stopping the deployment and rolling back to the previous stable version.
How ECS Behaves in Different Rollback Scenarios
Scenario | Desired Count | Task Def Changed | Rollback Triggered | Rollback Time |
---|---|---|---|---|
Broken image pushed with latest only | 1 | ❌ No | ❌ No | ❌ Never |
Broken task def v3 (flask-app:v2) | 1 | ✅ Yes | ✅ Yes | ⏱ ~10–20 min |
Same failure with desiredCount=5 | 5 | ✅ Yes | ✅ Yes | ⏱ ~3–5 min |
Circuit breaker only works if a new task definition is registered.
Desired count = 1 leads to slow failure detection, delaying rollback.
ECS uses an internal failure threshold (usually 3 failed tasks).
How Circuit Breaker Would Have Saved Us
Let's replay our incident with ECS Fargate deployment circuit breaker enabled:
To better understand how ECS identifies and reacts to a bad deployment, here's the simplified flow based on our real incident:
Deployment Start: Mismatched port in new task definition
Monitoring Begins: ECS tracks task health and startup patterns
Failure Detected: Multiple ECS task failures trigger threshold
Automatic Rollback: ECS reverts to previous task definition
Service Restored: Traffic resumes via healthy version
Instead of 75 minutes of downtime, we would have had perhaps 5-10 minutes of degraded performance while the circuit breaker detected and resolved the issue.
How ECS Actually Triggers Rollbacks: Behind the Scenes
During our experiments, we noticed some undocumented behaviours:
ECS doesn't roll back unless there's a new task definition; pushing a new image to latest doesn't count.
Desired count = 1 (common in off-hours cost optimisation) leads to much slower rollbacks due to staggered failures.
ECS appears to use a dynamic failure threshold of 3 (confirmed visually in the console), meaning it waits for 3 failed task launches before triggering a rollback. You cannot change either of the threshold values; this is noted in the ECS deployment circuit breaker documentation.
ECS uses the following logic to determine the rollback failure threshold:
minimum threshold <= 0.5 × desired task count <= maximum threshold
In other words, the threshold is half the desired task count, clamped between a fixed minimum and maximum that you cannot configure.
What this means in practice:
Even if the circuit breaker is enabled, rollback won't happen unless you structure your deployments correctly.
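As a rough sketch of that formula in code, here's how we reason about the effective threshold. The clamp bounds of 3 and 200 and the rounding are our assumptions based on the AWS documentation and what we observed in the console, not something ECS exposes directly:
# failure_threshold.py - illustrative sketch only; exact rounding and bounds are assumptions
def rollback_failure_threshold(desired_count: int,
                               minimum: int = 3,        # assumed lower bound (matches what we saw)
                               maximum: int = 200) -> int:  # assumed upper bound from the docs
    """Approximate number of failed task launches before the circuit breaker trips."""
    return min(max(round(0.5 * desired_count), minimum), maximum)

print(rollback_failure_threshold(1))   # 3 -> why desiredCount=1 rolls back so slowly
print(rollback_failure_threshold(5))   # 3 -> same threshold, but parallel failures pile up faster
print(rollback_failure_threshold(10))  # 5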
The Circuit Breaker Implementation
That afternoon, we made a decision that would prove to be one of our best infrastructure investments: enabling ECS Deployment Circuit Breaker across all our services, starting with our most critical production workloads.
The configuration was surprisingly straightforward:
{
  "deploymentCircuitBreaker": {
    "enable": true,
    "rollback": true
  }
}
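In practice we apply it as part of a service update. Here's a minimal sketch with boto3; the cluster and service names are placeholders, and the percent values match recommendation 5 below:
# enable_circuit_breaker.py - sketch of enabling the circuit breaker on an existing service
import boto3

ecs = boto3.client("ecs")

ecs.update_service(
    cluster="dev-cluster",            # placeholder names
    service="flask-api-service",
    deploymentConfiguration={
        "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        # allow parallel task launches so failures surface (and roll back) faster
        "maximumPercent": 200,
        "minimumHealthyPercent": 50,
    },
)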
What Happens When DesiredCount = 0? A Real Risk Pattern
Our incident happened in a development environment using off-hours scaling: every night desiredCount drops to 0, and each morning ECS spins the tasks back up at 8 AM. This helps save cost during non-business hours.
But here’s the hidden danger we uncovered through real experiments:
In our real case, we pushed a new (and broken) image to flask-app:latest overnight. However, we didn't register a new task definition; the task definition was unchanged. So when ECS scaled up in the morning, it pulled the broken image and launched new tasks. Because ECS had no healthy task running and no "new deployment" to monitor, no rollback happened.
This subtle but critical issue means that:
ECS had no baseline healthy task to compare against
There was no new task definition, so ECS didn’t consider this a deployment
Circuit breaker logic was never triggered
ECS just kept retrying the same broken image silently
Even with the circuit breaker enabled, rollback only works if ECS sees a new deployment (i.e., a new task definition revision). In our case, since we reused flask-app:latest with the same task definition, ECS had nothing to roll back to.
Recommendations (Based on Real-World Failures)
These are not just best practices from the AWS documentation. These are hard-earned lessons from our own experiments and real incident recoveries.
1. Avoid using the latest tag in ECS task definitions
ECS won't detect image changes if you're using flask-app:latest and don't update the task definition. This can silently deploy broken images without triggering a rollback.
Do this instead:
Use immutable image tags like v1.2.3, build-20250909, or a full SHA digest
Always reference a new task definition revision tied to each deployment
2. Register a new task definition with every deployment
The deployment circuit breaker only activates when ECS detects a new deployment. If the task definition remains unchanged (even with a new image), ECS won’t treat it as a deployment, and rollback won’t occur.
Do this instead:
Automate task definition registration in your CI/CD pipeline
Even if using the same image tag, register a revision to trigger deployment detection (see the sketch below)
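Here's a minimal sketch of such a pipeline step with boto3. The repository URL, family, cluster, and service names are placeholders, and it also uses the immutable tag from recommendation 1:
# deploy.py - sketch: register a new task definition revision, then point the service at it
import boto3

ecs = boto3.client("ecs")

# Immutable tag (placeholder registry/repo); real task defs also need executionRoleArn, logging, etc.
IMAGE = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/flask-app:v1.2.3"

# Registering a revision is what makes ECS treat this as a new deployment
revision = ecs.register_task_definition(
    family="flask-api",
    networkMode="awsvpc",
    requiresCompatibilities=["FARGATE"],
    cpu="256",
    memory="512",
    containerDefinitions=[{
        "name": "flask-app",
        "image": IMAGE,
        "portMappings": [{"containerPort": 5001, "protocol": "tcp"}],
        "essential": True,
    }],
)

# Deploy the new revision; the circuit breaker now has something to monitor and roll back to
ecs.update_service(
    cluster="dev-cluster",
    service="flask-api-service",
    taskDefinition=revision["taskDefinition"]["taskDefinitionArn"],
)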
3. Use CloudWatch alarms to detect deployment failures early
ECS retries silently when tasks fail during deployment. In non-prod or low-desiredCount environments, this can go unnoticed.
Do this instead:
Monitor UnhealthyHostCount (ALB) and ECS service deployment events
Alert on unusual task exit reasons, STOPPED states, or drops in RunningTaskCount (see the alarm sketch below)
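As an example, here's a sketch of one such alarm with boto3. Note the ALB metric is spelled UnHealthyHostCount; the target group, load balancer, and SNS topic values below are placeholders:
# alarm_unhealthy_hosts.py - sketch: alert when the ALB reports unhealthy targets
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="flask-api-unhealthy-targets",
    Namespace="AWS/ApplicationELB",
    MetricName="UnHealthyHostCount",   # note the capital H in the real metric name
    Dimensions=[
        {"Name": "TargetGroup", "Value": "targetgroup/flask-api/0123456789abcdef"},  # placeholder
        {"Name": "LoadBalancer", "Value": "app/dev-alb/0123456789abcdef"},           # placeholder
    ],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:deploy-alerts"],  # placeholder SNS topic
)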
4. Enforce Task Definition Updates in CI/CD
One common issue we saw: devs pushed new images to latest, but forgot to update task definitions. Result? No rollback, no detection, broken app silently running.
Do this instead:
Add a CI/CD check: fail the pipeline if the task definition revision isn't updated (a sketch follows below)
Maintain an audit log: map every deployment to a task definition revision
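A minimal sketch of that check, assuming the pipeline captures the revision ARN before the deploy step and passes it in; cluster and service names are placeholders:
# check_task_def_updated.py - sketch: fail CI if the service still runs the old revision
import sys
import boto3

ecs = boto3.client("ecs")

def current_revision(cluster: str, service: str) -> str:
    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    return svc["taskDefinition"]  # full ARN, includes the revision number

before = sys.argv[1]  # revision ARN captured before the deploy step
after = current_revision("dev-cluster", "flask-api-service")

if before == after:
    print("Task definition revision unchanged - this deployment is not rollback-protected")
    sys.exit(1)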
5. Use a higher desiredCount during deployments for faster rollback
In our tests, when desiredCount was set to 1, rollback took over 20 minutes to trigger. With desiredCount set to 5, the circuit breaker detected the failure pattern faster and triggered rollback within 3–5 minutes.
What to do instead:
Temporarily increase desiredCount during deployments (e.g., from 1 to 5), as sketched below
Alternatively, tune deploymentConfiguration to use maximumPercent = 200 and minimumHealthyPercent = 50 to allow parallel task launches during updates
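For the temporary bump, the change is a one-liner before and after the deploy. A sketch with boto3, again with placeholder names:
# bump_desired_count.py - sketch: raise desiredCount before a deploy, restore it afterwards
import boto3

ecs = boto3.client("ecs")

# Before deploying: more tasks means the failure threshold is reached sooner
ecs.update_service(cluster="dev-cluster", service="flask-api-service", desiredCount=5)

# ... run the deployment here ...

# After the deployment stabilises, scale back down for cost
ecs.update_service(cluster="dev-cluster", service="flask-api-service", desiredCount=1)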
Summary Table
Problem | Recommendation |
---|---|
ECS didn't rollback on broken image | Always register a new task definition |
ECS used broken latest tag silently | Avoid latest, use immutable image tags |
Slow rollback when desired count = 1 | Use higher desiredCount during deploys |
No alert when tasks failed | Add CloudWatch alarms for task health and service events |
Deployment skipped task def update | Enforce task def registration in pipeline |
Conclusion: The Safety Net We Proactively Built
Even small mistakes — like a port mismatch — can bring down containerized systems. That’s why we treat circuit breakers not as optional features, but as must-have infrastructure. They're not just for rollback — they build resilience into your deployment lifecycle.
That seemingly minor port change could’ve caused hours of downtime — but it didn’t. Because we caught it early, we had the chance to rethink our deployment safety.
We turned that morning’s incident into a proactive defense strategy by enabling ECS Deployment Circuit Breaker across all services. It now gives us confidence that even if a broken deployment slips through, ECS will detect the issue and roll back automatically — without us scrambling at 9 AM.
Our team now deploys with confidence, not caution. And the best part? The incident never reached users.
Sometimes the best production incidents are the ones that never happen.
👉 Have you faced something similar? Let’s talk in the comments.