Modern cloud applications need more than monitoring; they need self-healing infrastructure. Waiting for humans to react to failures prolongs downtime and increases user impact. In this guide, I’ll show you how to build a system that automatically detects ECS service failures, notifies your team on Slack, and restores the service, all provisioned with Terraform.
Why This Project Matters
In containerized environments, services can fail due to application crashes, resource exhaustion, or deployment issues. Traditional monitoring tools detect these failures, but someone still has to notice the alert and fix the service by hand, which is slow.
A self-healing system solves this by:
- Detecting failures automatically
- Restarting services without human intervention
- Sending alerts to teams in real-time
Architecture Overview
Here’s how the system works:
1. ECS service health degrades (task crashes, reduced running count)
2. CloudWatch monitors ECS metrics and triggers an alarm when RunningTaskCount < desired count
3. EventBridge captures the alarm state change
4. Lambda executes:
   - Sends a Slack alert
   - Restarts the ECS service
This creates a closed-loop, event-driven system.
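The Lambda step in the flow above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the function names, the `CLUSTER_NAME` and `SERVICE_NAME` environment variables, and the choice to "restart" the service by forcing a new deployment are all assumptions. The event fields follow the shape of a CloudWatch Alarm State Change event delivered by EventBridge.

```python
import os


def is_alarm_event(event):
    """Return True when an EventBridge event reports a CloudWatch alarm entering ALARM."""
    detail = event.get("detail", {})
    return (
        event.get("detail-type") == "CloudWatch Alarm State Change"
        and detail.get("state", {}).get("value") == "ALARM"
    )


def remediate(event, ecs_client, notify, cluster, service):
    """Alert the team, then force a new deployment of the ECS service."""
    if not is_alarm_event(event):
        return {"action": "ignored"}

    detail = event["detail"]
    alarm_name = detail.get("alarmName", "unknown alarm")
    reason = detail["state"].get("reason", "no reason given")

    # Step 1: alert first, so the team is told even if the restart call fails.
    notify(f":rotating_light: {alarm_name} is in ALARM for {service}: {reason}")

    # Step 2: forceNewDeployment asks ECS to replace the service's tasks.
    ecs_client.update_service(cluster=cluster, service=service, forceNewDeployment=True)
    return {"action": "restarted", "service": service}


def lambda_handler(event, context):
    """Lambda entry point; CLUSTER_NAME and SERVICE_NAME are assumed environment variables."""
    import boto3  # imported here so the helpers above can be tested without AWS

    return remediate(
        event,
        ecs_client=boto3.client("ecs"),
        notify=print,  # placeholder: swap in a function that posts to Slack
        cluster=os.environ["CLUSTER_NAME"],
        service=os.environ["SERVICE_NAME"],
    )
```

Passing the ECS client and the notifier in as arguments keeps the remediation logic testable with simple fakes. Events for alarms returning to OK are ignored, so recovery does not trigger a second restart.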
AWS Services Used
- Amazon ECS (Fargate) – Hosts containerized apps
- CloudWatch – Monitors service health
- EventBridge – Captures CloudWatch alarms and triggers Lambda
- Lambda – Executes remediation logic and sends Slack notifications
- Slack Webhook – Sends alerts to your team
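To make the EventBridge piece more concrete, here is a sketch of an event pattern that matches a CloudWatch alarm entering the ALARM state, written as a Python dictionary and serialized to JSON. The helper name and the alarm name `ecs-running-tasks-low` are hypothetical; in the Terraform setup, a pattern like this would typically be supplied as the rule's event pattern.

```python
import json


def build_alarm_event_pattern(alarm_name):
    """Build an EventBridge pattern matching one alarm's transition into ALARM."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            "alarmName": [alarm_name],
            # Match only transitions into ALARM, not recoveries back to OK.
            "state": {"value": ["ALARM"]},
        },
    }


if __name__ == "__main__":
    # "ecs-running-tasks-low" is a hypothetical alarm name used for illustration.
    print(json.dumps(build_alarm_event_pattern("ecs-running-tasks-low"), indent=2))
```

Filtering on the state value in the rule itself means Lambda is only invoked when there is something to remediate.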
Terraform Implementation
I built the infrastructure using Terraform for repeatable, version-controlled deployment. Key points:
- Modular structure (ecs, lambda, cloudwatch, eventbridge, iam, ssm)
- Slack webhook stored securely in SSM Parameter Store
- Lambda reads the webhook at runtime and sends formatted alerts
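The webhook handling described above might look something like this inside the Lambda function. It is a hedged sketch rather than the repository's code: the function names, the parameter name passed in, and the message layout are assumptions. It uses only the standard library for the HTTP call and expects a boto3 SSM client to be passed in.

```python
import json
import urllib.request


def get_webhook_url(ssm_client, parameter_name):
    """Fetch and decrypt the Slack webhook URL from SSM Parameter Store."""
    response = ssm_client.get_parameter(Name=parameter_name, WithDecryption=True)
    return response["Parameter"]["Value"]


def build_slack_request(webhook_url, service, alarm_name, reason):
    """Build (but do not send) the HTTP POST for a Slack incoming webhook."""
    payload = {
        "text": (
            f":rotating_light: *ECS alert* for `{service}`\n"
            f"*Alarm:* {alarm_name}\n"
            f"*Reason:* {reason}"
        )
    }
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(ssm_client, parameter_name, service, alarm_name, reason):
    """Read the webhook at runtime and post the alert; returns the HTTP status code."""
    url = get_webhook_url(ssm_client, parameter_name)
    request = build_slack_request(url, service, alarm_name, reason)
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Reading the parameter at runtime keeps the webhook URL out of the Lambda package and its environment variables. The function's IAM role needs permission to read that parameter for this to work.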
This project shows how to turn ECS monitoring into a self-healing system. By combining AWS services and Slack integration, you can detect failures, alert your team, and restore services automatically, reducing downtime and improving reliability.
GitHub repo: https://github.com/Copubah/AWS-ecs-monitoring-and-auto-remediation