AWS ECS Service Monitoring
Automated monitoring solution for AWS ECS services that detects service failures, deployment issues, and task placement problems. Sends real-time notifications via SNS and creates custom CloudWatch metrics for monitoring.
Overview
This solution monitors ECS service events and automatically sends notifications when critical issues occur, including:
- Service task placement failures
- Service deployment failures
- Task configuration issues
- Service discovery problems
- VPC Lattice target health issues
Architecture
- Lambda Function: Processes ECS CloudWatch events and sends notifications
- CloudWatch Event Rule: Captures ECS service events and deployment state changes
- SNS Topic: Delivers notifications for critical events
- Custom CloudWatch Metrics: Tracks error event counts for monitoring
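The first step in that pipeline, pulling the alert fields out of an incoming event, can be sketched as follows. This is a minimal illustration, not the repository's actual handler: the function name `parse_ecs_event` and the returned keys are hypothetical, and the event layout assumed here is the standard EventBridge shape for ECS service events (`resources`, `region`, and `detail.eventName` / `detail.reason`).

```python
from typing import Any, Dict


def parse_ecs_event(event: Dict[str, Any]) -> Dict[str, str]:
    """Extract alert fields from an ECS event (hypothetical helper)."""
    detail = event.get("detail", {})
    resources = event.get("resources") or [""]
    # The service ARN's resource part is "service/<cluster>/<service>"
    # (long format) or "service/<service>" (older short format).
    parts = resources[0].split(":")[-1].split("/")
    service = parts[-1] if len(parts) >= 2 else ""
    if len(parts) == 3:
        cluster = parts[1]
    else:
        # Fall back to the cluster ARN carried in the event detail.
        cluster = detail.get("clusterArn", "").split("/")[-1]
    return {
        "cluster": cluster,
        "service": service,
        "region": event.get("region", ""),
        "event_name": detail.get("eventName", ""),
        "event_type": detail.get("eventType", ""),
        "reason": detail.get("reason", ""),
    }
```

A real handler would pass the returned dictionary on to the SNS and CloudWatch steps; keeping the parsing in a pure function makes it easy to unit test with sample events.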
Features
🔍 Event Monitoring
- Service Task Placement Failure: Insufficient CPU/memory or no available container instances
- Service Task Configuration Failure: ARN format or tagging issues
- Service Daemon Placement Constraint Violated: Placement constraint violations
- ECS Operation Throttled: API throttle limit issues
- Service Discovery Operation Throttled: AWS Cloud Map throttle limits
- Service Deployment Failed: Failed deployments with circuit breaker detection
- Service Task Start Impaired: Consistent task startup failures
- Service Discovery Instance Unhealthy: Unhealthy service registry tasks
- VPC Lattice Target Unhealthy: Unhealthy VPC Lattice targets
📊 Monitoring & Alerting
- Real-time SNS notifications with detailed event information
- Custom CloudWatch metrics (ECSServiceErrorEventsCount)
- Structured logging for troubleshooting
- Environment-specific alert subjects
Prerequisites
- AWS CLI configured with appropriate permissions
- Terraform >= 1.14.3
- Python 3.x runtime for Lambda
- ECS clusters and services to monitor
Required AWS Permissions
The Lambda function requires:
- logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents
- sns:Publish on the SNS topic
- cloudwatch:PutMetricData
- ssm:GetParameter
- EC2 network interface permissions for VPC access
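For illustration, the permissions listed above could be expressed as an IAM policy document along these lines. This is a sketch under stated assumptions, not the policy shipped in the repository (which is managed by Terraform): the function name `build_lambda_policy`, the `Sid` values, and the wildcard resource ARNs are hypothetical, and the three `ec2:*NetworkInterface*` actions are the ones Lambda commonly needs for VPC access.

```python
import json


def build_lambda_policy(sns_topic_arn: str) -> dict:
    """Return an IAM policy document covering the permissions listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Logs",
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "arn:aws:logs:*:*:*",  # assumption: narrow in practice
            },
            {
                "Sid": "Publish",
                "Effect": "Allow",
                "Action": "sns:Publish",
                "Resource": sns_topic_arn,  # scoped to the alert topic only
            },
            {
                "Sid": "Metrics",
                "Effect": "Allow",
                "Action": "cloudwatch:PutMetricData",
                "Resource": "*",  # PutMetricData has no resource-level scoping
            },
            {
                "Sid": "Parameters",
                "Effect": "Allow",
                "Action": "ssm:GetParameter",
                "Resource": "arn:aws:ssm:*:*:parameter/*",  # assumption
            },
            {
                "Sid": "VpcNetworking",
                "Effect": "Allow",
                "Action": [
                    "ec2:CreateNetworkInterface",
                    "ec2:DescribeNetworkInterfaces",
                    "ec2:DeleteNetworkInterface",
                ],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    # Placeholder ARN for demonstration only.
    arn = "arn:aws:sns:us-east-1:123456789012:ecs-service-alerts"
    print(json.dumps(build_lambda_policy(arn), indent=2))
```

Scoping `sns:Publish` to the single topic ARN, rather than `*`, is what keeps the role close to least privilege.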
Configuration
1. Environment Configuration
Create configuration files:
- config.yml: Common project settings
- config_default.yml: Environment-specific settings
2. Workspace Configuration
workspace:
  name: "your-workspace"
sns:
  display_name: "ECS Service Monitoring"
  names: "ecs-service-alerts"
lambda:
  app_name: "ecs-service-monitor"
  handler: "lambda_function.lambda_handler"
  runtime: "python3.9"
  timeout: 60
  memory_size: 128
  retention_in_days: 14
cluster:
  cluster-name-1: ["service1", "service2"]
  cluster-name-2: ["service3", "service4"]
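To make the cluster mapping concrete, the sketch below expands a cluster-to-services mapping like the one above into full ECS service ARNs, the form in which services appear in ECS events. How the project itself consumes this mapping is not shown in this document, so treat the helper `service_arns` and its parameters as hypothetical.

```python
from typing import Dict, List


def service_arns(
    clusters: Dict[str, List[str]], region: str, account_id: str
) -> List[str]:
    """Expand {cluster: [services]} into long-format ECS service ARNs."""
    return [
        f"arn:aws:ecs:{region}:{account_id}:service/{cluster}/{service}"
        for cluster, services in clusters.items()
        for service in services
    ]


if __name__ == "__main__":
    # Mirrors the "cluster" section of the example configuration;
    # the region and account ID are placeholders.
    config = {
        "cluster-name-1": ["service1", "service2"],
        "cluster-name-2": ["service3", "service4"],
    }
    for arn in service_arns(config, "us-east-1", "123456789012"):
        print(arn)
```

Printing the expanded ARNs is a quick way to check the configuration against the `resources` field of a real event when alerts are not arriving.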
Deployment
Quick Deploy
# Make scripts executable
chmod +x launch.sh destroy.sh
# Deploy infrastructure
./launch.sh
# Enter environment: default
# Destroy infrastructure
./destroy.sh
# Enter environment: default
Manual Deployment
# Initialize Terraform
terraform init
# Select workspace
terraform workspace select default
# Plan deployment
terraform plan
# Apply changes
terraform apply
Monitored ECS Events
Service Action Events
- SERVICE_TASK_PLACEMENT_FAILURE
- SERVICE_TASK_CONFIGURATION_FAILURE
- SERVICE_DAEMON_PLACEMENT_CONSTRAINT_VIOLATED
- ECS_OPERATION_THROTTLED
- SERVICE_DISCOVERY_OPERATION_THROTTLED
- SERVICE_DEPLOYMENT_FAILED
- SERVICE_TASK_START_IMPAIRED
- SERVICE_DISCOVERY_INSTANCE_UNHEALTHY
- VPC_LATTICE_TARGET_UNHEALTHY
Deployment State Change Events
- SERVICE_DEPLOYMENT_FAILED
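An event rule covering the events above might use a pattern like the one sketched here, together with an equivalent in-code check. The names `MONITORED_EVENTS`, `EVENT_PATTERN`, and `is_monitored` are illustrative and the repository's actual rule may differ (for example, it may also filter on service ARNs); `ECS Service Action` and `ECS Deployment State Change` are the detail types ECS uses for these two event families.

```python
import json
from typing import Any, Dict

# Event names taken from the lists above.
MONITORED_EVENTS = {
    "SERVICE_TASK_PLACEMENT_FAILURE",
    "SERVICE_TASK_CONFIGURATION_FAILURE",
    "SERVICE_DAEMON_PLACEMENT_CONSTRAINT_VIOLATED",
    "ECS_OPERATION_THROTTLED",
    "SERVICE_DISCOVERY_OPERATION_THROTTLED",
    "SERVICE_DEPLOYMENT_FAILED",
    "SERVICE_TASK_START_IMPAIRED",
    "SERVICE_DISCOVERY_INSTANCE_UNHEALTHY",
    "VPC_LATTICE_TARGET_UNHEALTHY",
}

# Assumed rule pattern: match both event families by event name.
EVENT_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Service Action", "ECS Deployment State Change"],
    "detail": {"eventName": sorted(MONITORED_EVENTS)},
}


def is_monitored(event: Dict[str, Any]) -> bool:
    """Apply the same filter in code, e.g. as a guard inside the Lambda."""
    return (
        event.get("source") == "aws.ecs"
        and event.get("detail-type") in EVENT_PATTERN["detail-type"]
        and event.get("detail", {}).get("eventName") in MONITORED_EVENTS
    )


if __name__ == "__main__":
    print(json.dumps(EVENT_PATTERN, indent=2))
```

Repeating the filter inside the function is optional, but it guards against a rule pattern that is broader than intended.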
Notification Format
Subject: [PROJECT_NAME] | [ENV] | ERROR: [Event Type]
Message:
Cluster Name: cluster-name
Service Name: service-name
Region: us-east-1
Event Name: SERVICE_TASK_PLACEMENT_FAILURE
Reason: [Event reason if available]
Message: [Detailed description]
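The format above could be produced by a small helper such as the following sketch. The function name `format_notification` and the field keys are hypothetical, and using the event name as the "Event Type" in the subject is an assumption. The subject is truncated because SNS requires subjects shorter than 100 characters.

```python
from typing import Dict, Tuple


def format_notification(
    project: str, env: str, fields: Dict[str, str]
) -> Tuple[str, str]:
    """Build the SNS subject and message body for one ECS error event."""
    # Keep the subject under SNS's 100-character limit.
    subject = f"{project} | {env} | ERROR: {fields['event_name']}"[:99]
    lines = [
        f"Cluster Name: {fields['cluster']}",
        f"Service Name: {fields['service']}",
        f"Region: {fields['region']}",
        f"Event Name: {fields['event_name']}",
    ]
    # Reason and Message are only included when the event carries them.
    if fields.get("reason"):
        lines.append(f"Reason: {fields['reason']}")
    if fields.get("message"):
        lines.append(f"Message: {fields['message']}")
    return subject, "\n".join(lines)


if __name__ == "__main__":
    subject, body = format_notification(
        "my-project",
        "default",
        {
            "cluster": "cluster-name",
            "service": "service-name",
            "region": "us-east-1",
            "event_name": "SERVICE_TASK_PLACEMENT_FAILURE",
            "reason": "RESOURCE:CPU",
        },
    )
    print(subject)
    print(body)
```

The returned pair maps directly onto the `Subject` and `Message` arguments of an SNS publish call.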
Custom Metrics
The solution creates custom CloudWatch metrics:
- Namespace: AWS/ECS
- Metric Name: ECSServiceErrorEventsCount
- Dimensions: ClusterName, ServiceName
- Unit: Count
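As a rough sketch of how one error event could be recorded with these settings, the helper below builds the arguments for CloudWatch's `PutMetricData` call. The function names are hypothetical and the repository's Lambda may structure this differently; `boto3` is imported lazily so that the payload builder can be exercised without AWS access.

```python
from typing import Any, Dict


def build_metric_payload(cluster: str, service: str, count: int = 1) -> Dict[str, Any]:
    """Build keyword arguments for put_metric_data for one cluster/service."""
    return {
        "Namespace": "AWS/ECS",  # namespace as documented above
        "MetricData": [
            {
                "MetricName": "ECSServiceErrorEventsCount",
                "Dimensions": [
                    {"Name": "ClusterName", "Value": cluster},
                    {"Name": "ServiceName", "Value": service},
                ],
                "Unit": "Count",
                "Value": float(count),
            }
        ],
    }


def publish_metric(cluster: str, service: str) -> None:
    """Send the data point to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # lazy import: only required when actually publishing

    boto3.client("cloudwatch").put_metric_data(
        **build_metric_payload(cluster, service)
    )
```

Note that CloudWatch's documentation advises against custom namespaces beginning with `AWS/`; the value here simply mirrors the settings listed above. Because the data points carry both dimensions, queries and alarms need to specify the same `ClusterName` and `ServiceName` pair.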
Troubleshooting
Common Issues
- Lambda Function Not Triggering
  - Verify CloudWatch Event Rule is enabled
  - Check ECS service ARNs in configuration
  - Ensure Lambda permissions are correct
- SNS Notifications Not Received
  - Verify SNS topic subscription
  - Check Lambda execution logs
  - Validate SNS publish permissions
- Missing Events
  - Confirm ECS services match configured ARN patterns
  - Check CloudWatch Event Rule pattern syntax
  - Verify service names in cluster configuration
Logs and Monitoring
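# Locate the Lambda log group
aws logs describe-log-groups --log-group-name-prefix "/aws/lambda/your-function-name"
# Tail recent Lambda logs (requires AWS CLI v2)
aws logs tail "/aws/lambda/your-function-name" --since 1h
# Check CloudWatch metrics for one cluster/service pair
# (a metric published with dimensions must be queried with the same dimensions)
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name ECSServiceErrorEventsCount \
  --dimensions Name=ClusterName,Value=cluster-name Name=ServiceName,Value=service-name \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-02T00:00:00Z \
  --period 3600 \
  --statistics Sum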
Security Considerations
- No hardcoded credentials in configuration files
- IAM roles follow least privilege principle
- Environment variables used for sensitive configuration
- VPC network interface permissions for secure access
Cost Optimization
- Lambda function uses minimal memory (128MB)
- CloudWatch log retention set to 14 days
- Event-driven architecture minimizes compute costs
- Only one custom metric name is published (per cluster and service), which keeps CloudWatch custom metric charges small
Contributing
- Test changes in non-production environment
- Update configuration examples
- Add new event types to Lambda function
- Update documentation for new features
Support
For issues and questions:
- Check CloudWatch logs for Lambda execution details
- Verify ECS service configurations
- Review SNS topic subscriptions
- Validate IAM permissions
GitHub Link
https://github.com/prashantgupta123/devops-automation/tree/main/aws-ecs-service-monitoring
Note: This monitoring solution is designed for production ECS environments. Test thoroughly before deploying to critical systems.