Debugging ECS deployment failures: the complete playbook
Step 1: Check service events
aws ecs describe-services --cluster production --services my-service --query 'services[0].events[:10]'
Step 2: Inspect stopped tasks
aws ecs list-tasks --cluster production --family my-service --desired-status STOPPED
aws ecs describe-tasks --cluster production --tasks TASK_ARN --query 'tasks[0].{stop:stopCode,reason:stoppedReason,containers:containers[*].{name:name,status:lastStatus,exit:exitCode,reason:reason}}'
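The two commands in Step 2 can be chained so you do not have to copy ARNs by hand. This is a minimal sketch: `describe_stopped_tasks` is a hypothetical helper name, the cluster and family are placeholders, and it assumes fewer than 100 stopped tasks (the per-call limit of `describe-tasks`). ECS only keeps stopped tasks visible for a short time after they stop, so run it soon after the failure.

```shell
# Print stop code, stop reason, and per-container exit codes for every
# stopped task in a task family. Helper name and arguments are illustrative.
describe_stopped_tasks() {
  local cluster="$1" family="$2" arns
  # --output text returns the task ARNs separated by tabs.
  arns=$(aws ecs list-tasks --cluster "$cluster" --family "$family" \
    --desired-status STOPPED --query 'taskArns' --output text)
  if [ -z "$arns" ] || [ "$arns" = "None" ]; then
    echo "no stopped tasks found for $family"
    return 0
  fi
  # $arns is deliberately unquoted so each ARN becomes its own argument.
  aws ecs describe-tasks --cluster "$cluster" --tasks $arns \
    --query 'tasks[].{task:taskArn,stop:stopCode,reason:stoppedReason,containers:containers[].{name:name,exit:exitCode,reason:reason}}' \
    --output json
}
```

Usage: `describe_stopped_tasks production my-service`.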
Common patterns and fixes
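Exit 1 → Application error → check the container's CloudWatch logs
Exit 137 → SIGKILL (128 + 9), usually the OOM killer → increase task or container memory
Exit 127 → Command not found → fix CMD/ENTRYPOINT in the Dockerfile or task definition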
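The exit codes above follow the usual shell convention: anything above 128 means the process was killed by signal number `code - 128`. The sketch below turns that rule into a lookup. The function name and the wording of the hints are illustrative, and the hints are starting points, not diagnoses.

```shell
# Map a container exit code to a likely cause.
# Codes above 128 mean "killed by signal (code - 128)".
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "clean exit: an essential container finished, so the task stopped" ;;
    1)   echo "application error: check the container's CloudWatch logs" ;;
    126) echo "command found but not executable: check file permissions" ;;
    127) echo "command not found: check CMD/ENTRYPOINT and PATH" ;;
    137) echo "SIGKILL: usually the OOM killer, or a forced stop after the stop timeout" ;;
    139) echo "SIGSEGV: the process crashed with a segmentation fault" ;;
    143) echo "SIGTERM: the container was asked to stop, for example during a deployment" ;;
    *)
      # Non-numeric input fails the -gt test and falls through to the default.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application-specific exit code: check the logs"
      fi ;;
  esac
}
```

Usage: `explain_exit_code 137`.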
Health check failures:
resource "aws_ecs_service" "service" {
  health_check_grace_period_seconds = 120 # Default is 0
}
resource "aws_lb_target_group" "service" {
  health_check {
    path                = "/health"
    unhealthy_threshold = 5
  }
  deregistration_delay = 30
}
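Before changing thresholds, it helps to see what the load balancer itself reports for each target. A small sketch using `aws elbv2 describe-target-health`; the `target_health` wrapper name is hypothetical and the target group ARN is a value you would look up yourself.

```shell
# Show the health state the load balancer reports for each registered target.
# The reason and detail columns usually say whether the check timed out,
# got a non-matching status code, or hit the wrong port.
target_health() {
  local tg_arn="$1"
  aws elbv2 describe-target-health --target-group-arn "$tg_arn" \
    --query 'TargetHealthDescriptions[].{id:Target.Id,port:Target.Port,state:TargetHealth.State,reason:TargetHealth.Reason,detail:TargetHealth.Description}' \
    --output table
}
```

Usage: `target_health "$TARGET_GROUP_ARN"`, where `TARGET_GROUP_ARN` is a placeholder variable.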
CannotPullContainerError:
resource "aws_iam_role_policy_attachment" "ecr" {
  role       = aws_iam_role.ecs_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
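If the execution role is already correct, the other usual suspect is a tag that was never pushed. The sketch below checks for the tag with `aws ecr describe-images`. The helper name is hypothetical, it assumes your credentials and region are already configured, and it treats any CLI failure (including an auth error) as "not found".

```shell
# Succeed if the tag exists in the ECR repository, fail otherwise.
# Any CLI error, not only a missing image, makes this return non-zero.
image_tag_exists() {
  local repo="$1" tag="$2"
  aws ecr describe-images --repository-name "$repo" \
    --image-ids imageTag="$tag" \
    --query 'imageDetails[0].imageDigest' --output text >/dev/null 2>&1
}
```

Usage: `image_tag_exists my-service v1.2.3 || echo "tag missing in ECR"`, where the repository name and tag are placeholders.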
Enable ECS Exec (live debugging)
resource "aws_ecs_service" "service" { enable_execute_command = true }
aws ecs execute-command --cluster production --task TASK_ARN --container my-service --interactive --command "/bin/sh"
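`execute-command` only works on tasks launched after the flag was enabled. The task role also needs the `ssmmessages` permissions, and the Session Manager plugin must be installed on your machine. A quick sketch to confirm that a given task was started with ECS Exec turned on; `exec_enabled` is a hypothetical helper name.

```shell
# Succeed if the task was launched with ECS Exec enabled.
# With JSON output the boolean field prints as "true" or "false".
exec_enabled() {
  local cluster="$1" task="$2" flag
  flag=$(aws ecs describe-tasks --cluster "$cluster" --tasks "$task" \
    --query 'tasks[0].enableExecuteCommand' --output json)
  [ "$flag" = "true" ]
}
```

Usage: `exec_enabled production TASK_ARN || echo "redeploy the service so new tasks pick up the flag"`.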
Decision tree
Deployment fails
├── Task never starts
│   ├── CannotPullContainerError → ECR permissions or wrong image tag
│   └── ResourceInitializationError → Secrets Manager permissions on the execution role
└── Task starts then stops
    ├── exit 1 → app crash, check logs
    ├── exit 137 → OOM, increase memory
    └── Health check fails → grace period / wrong port
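The top level of the tree can be approximated by pattern-matching the task's `stoppedReason` string. This is a rough sketch: the function name is hypothetical, and the patterns are assumptions based on typical ECS messages, so adjust them to what you actually see.

```shell
# Classify a stoppedReason string into a branch of the decision tree.
# Matching is case-sensitive and based on common ECS message prefixes.
triage_stopped_reason() {
  case "$1" in
    *CannotPullContainerError*)    echo "never started: check ECR permissions and the image tag" ;;
    *ResourceInitializationError*) echo "never started: check Secrets Manager permissions on the execution role" ;;
    *"health check"*)              echo "started then stopped: check the grace period and the health check port" ;;
    *OutOfMemory*)                 echo "started then stopped: increase task memory" ;;
    *)                             echo "started then stopped: check the container exit code and logs" ;;
  esac
}
```

Usage: pass it the `stoppedReason` value returned by `aws ecs describe-tasks`, for example `triage_stopped_reason "$reason"`.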
Step2Dev enables ECS Exec and wires CloudWatch for every project.