Debugging ECS deployment failures: the complete playbook

#aws #ecs #devops #debugging

Step 1: Check service events

aws ecs describe-services --cluster production --services my-service \
  --query 'services[0].events[:10]'

Step 2: Inspect stopped tasks

aws ecs list-tasks --cluster production --family my-service --desired-status STOPPED

aws ecs describe-tasks --cluster production --tasks TASK_ARN \
  --query 'tasks[0].{stop:stopCode,reason:stoppedReason,containers:containers[*].{name:name,exit:exitCode,reason:reason}}'
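The two commands above can be chained so you don't have to copy ARNs by hand. This is a minimal sketch: `stopped_task_report` is a hypothetical helper name, and the defaults simply reuse the `production` cluster and `my-service` family from the examples.

```shell
# Sketch: print ARN, stop code and stop reason for recently stopped tasks.
# stopped_task_report is a hypothetical helper, not an AWS CLI command.
stopped_task_report() {
  cluster="${1:-production}"
  family="${2:-my-service}"

  # list-tasks returns task ARNs; text output prints them whitespace-separated.
  arns=$(aws ecs list-tasks --cluster "$cluster" --family "$family" \
    --desired-status STOPPED --query 'taskArns' --output text)

  # ECS only keeps stopped tasks for a limited time, so this can be empty.
  if [ -z "$arns" ] || [ "$arns" = "None" ]; then
    echo "no stopped tasks found"
    return 0
  fi

  # describe-tasks accepts at most 100 ARNs per call.
  # Word splitting of $arns is intentional: one argument per ARN.
  aws ecs describe-tasks --cluster "$cluster" --tasks $arns \
    --query 'tasks[*].[taskArn,stopCode,stoppedReason]' --output text
}
```

Run it as `stopped_task_report production my-service` right after a failed deployment, before ECS drops the stopped tasks from its history.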

Common patterns and fixes

Exit 1 → application error → check the CloudWatch logs
Exit 137 → SIGKILL (128 + 9), usually an OOM kill → increase task memory
Exit 127 → command not found → wrong CMD or ENTRYPOINT in the Dockerfile
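Exit codes above 128 follow the shell convention of 128 plus the signal number, which is where 137 comes from. The sketch below turns that rule into a lookup; `explain_exit_code` is a hypothetical helper, and the hints are starting points rather than a diagnosis.

```shell
# Sketch: translate a container exit code into a likely cause.
# explain_exit_code is a hypothetical helper, not part of the AWS CLI.
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "clean exit: check whether the process was meant to keep running" ;;
    1)   echo "application error: check CloudWatch logs" ;;
    126) echo "command found but not executable: check file permissions" ;;
    127) echo "command not found: check CMD/ENTRYPOINT and PATH" ;;
    137) echo "SIGKILL (128+9): often OOM, or a forced stop after stopTimeout" ;;
    143) echo "SIGTERM (128+15): the task was asked to stop" ;;
    *)
      # Any other code above 128 means "terminated by signal (code - 128)".
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $(( code - 128 ))"
      else
        echo "application-specific exit code: check CloudWatch logs"
      fi
      ;;
  esac
}
```

For example, `explain_exit_code 139` reports signal 11, a segmentation fault.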

Health check failures:

resource "aws_ecs_service" "service" {
  # Other required arguments omitted for brevity.
  health_check_grace_period_seconds = 120 # Default is 0
}

resource "aws_lb_target_group" "service" {
  health_check {
    path                = "/health"
    unhealthy_threshold = 5
  }

  deregistration_delay = 30
}
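Before tuning the grace period, it helps to see why the load balancer considers a target unhealthy. A minimal sketch, assuming you pass in your own target group ARN; `unhealthy_targets` is a hypothetical helper name.

```shell
# Sketch: list targets that are not healthy, with the load balancer's reason.
# unhealthy_targets is a hypothetical helper; pass your target group ARN.
unhealthy_targets() {
  tg_arn="$1"
  aws elbv2 describe-target-health --target-group-arn "$tg_arn" \
    --query "TargetHealthDescriptions[?TargetHealth.State!='healthy'].[Target.Id,Target.Port,TargetHealth.State,TargetHealth.Reason]" \
    --output text
}
```

A reason of `Target.Timeout` usually points at the port or security group, while `Target.ResponseCodeMismatch` means the health check path answered with an unexpected status code.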

CannotPullContainerError:

resource "aws_iam_role_policy_attachment" "ecr" {
  role       = aws_iam_role.ecs_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
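The policy attachment covers the permissions side. The other common cause, a wrong image tag, can be ruled out from the CLI. This sketch runs with your own credentials, not the execution role, so it only confirms that the tag exists; `image_tag_exists` is a hypothetical helper name.

```shell
# Sketch: check that an image tag exists in an ECR repository.
# image_tag_exists is a hypothetical helper, not an AWS CLI command.
image_tag_exists() {
  repo="$1"
  tag="$2"
  # describe-images exits non-zero when the tag (or repository) is not found.
  if aws ecr describe-images --repository-name "$repo" \
       --image-ids imageTag="$tag" >/dev/null 2>&1; then
    echo "found $repo:$tag"
  else
    echo "cannot find $repo:$tag (missing tag, wrong region, or no ECR read access)"
    return 1
  fi
}
```

Usage: `image_tag_exists my-service v1.2.3`, where both the repository name and the tag are placeholders.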

Enable ECS Exec (live debugging)

resource "aws_ecs_service" "service" { enable_execute_command = true }

aws ecs execute-command --cluster production --task TASK_ARN \
  --container my-service --interactive --command "/bin/sh"
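If `execute-command` is rejected, the task itself may not have ECS Exec turned on: the flag only applies to tasks launched after it was enabled, and the task role also needs the `ssmmessages` permissions. A small sketch for checking the first part; `exec_enabled` is a hypothetical helper name.

```shell
# Sketch: check whether a running task was launched with ECS Exec enabled.
# exec_enabled is a hypothetical helper, not an AWS CLI command.
exec_enabled() {
  cluster="$1"
  task="$2"
  flag=$(aws ecs describe-tasks --cluster "$cluster" --tasks "$task" \
    --query 'tasks[0].enableExecuteCommand' --output text)
  case "$flag" in
    [Tt]rue)
      echo "ECS Exec is enabled on this task"
      ;;
    *)
      echo "ECS Exec is not enabled: redeploy so new tasks pick up the flag"
      return 1
      ;;
  esac
}
```

Usage: `exec_enabled production TASK_ARN`.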

Decision tree

Deployment fails
  ├── Task never starts
  │    ├── CannotPullContainer → ECR permissions or wrong image tag
  │    └── ResourceInitializationError → Secrets Manager permissions or no network route to it
  └── Task starts then stops
       ├── exit 1   → app crash, check logs
       ├── exit 137 → OOM, increase memory
       └── Health check fails → grace period / wrong port
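The tree above can be roughly sketched as a pattern match on the task's `stoppedReason` text. The patterns are assumptions based on the usual wording of ECS messages, and `classify_stop_reason` is a hypothetical helper name.

```shell
# Sketch: map a stoppedReason string to a branch of the decision tree.
# classify_stop_reason is a hypothetical helper; patterns are best-effort.
classify_stop_reason() {
  case "$1" in
    *CannotPullContainer*)
      echo "never started: check ECR permissions and the image tag" ;;
    *ResourceInitializationError*)
      echo "never started: check Secrets Manager permissions and network access" ;;
    *OutOfMemory*)
      echo "started then stopped: OOM, increase memory" ;;
    *"health checks"*)
      echo "started then stopped: check grace period and health check port" ;;
    *"Essential container"*)
      echo "started then stopped: check the container exit code and logs" ;;
    *)
      echo "unrecognized: read the service events and container logs" ;;
  esac
}
```

Feed it the `reason` value from `describe-tasks`, for example `classify_stop_reason "Task failed ELB health checks in (target-group ...)"`.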

Step2Dev enables ECS Exec and wires CloudWatch for every project.

👉 step2dev.com