Prashant Gupta

Posted on Jan 2

AWS ECS Service Task Recycle

#aws #ecs #monitoring


Automated Lambda function that recycles the tasks of an AWS ECS service one by one while keeping the service available. Unlike an ECS force new deployment, which replaces tasks in parallel (within the limits of the service's deployment configuration), this solution stops and replaces tasks sequentially, with a configurable wait between replacements.

Overview

This solution provides controlled task recycling for ECS services by:

Stopping tasks one at a time instead of parallel replacement
Waiting for service stability between each task replacement
Optionally maintaining service state by temporarily increasing capacity
Configurable wait time between task replacements

Features

Sequential Task Recycling: Stops and replaces tasks one by one
Service Stability: Waits for stable state after each task replacement
Capacity Management: Optional temporary capacity increase to maintain availability
Autoscaling Support: Handles services with Application Auto Scaling
Flexible Authentication: Multiple AWS credential methods via AWSSession module
Email Notifications: Optional SMTP notifications on completion
CloudFormation Deployment: Infrastructure as code with automated deployment
Zero Retries: EventInvokeConfig sets MaximumRetryAttempts to 0, so a failed asynchronous invocation is not retried
Comprehensive Logging: Detailed CloudWatch logs for monitoring

Architecture

Lambda Function (Python 3.13)
├── Event-driven execution
├── AWSSession.py (AWS authentication)
├── Notification.py (Email notifications)
└── input.json (Configuration)

Prerequisites

Python 3.13+
AWS CLI configured
IAM permissions for ECS and Application Auto Scaling
SMTP server (optional, for notifications)

Installation

1. Clone Repository

Clone the repository, then change into its directory:

cd aws-ecs-service-task-recycle

2. Configure Settings

Edit input.json with your configuration:

{
  "awsCredentials": {
    "region_name": "us-east-1"
  },
  "smtpCredentials": {
    "host": "smtp.example.com",
    "port": "587",
    "username": "user@example.com",
    "password": "password",
    "from_email": "noreply@example.com"
  },
  "emailNotification": {
    "email_subject": "ECS Service Task Recycle Completed",
    "subject_prefix": "AWS ECS",
    "to": ["admin@example.com"]
  }
}
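The function reads these settings at startup. As a minimal sketch of how such a file could be loaded and checked, assuming the section names shown above (the helper name load_config and its validation rules are hypothetical, not the repository's code):

```python
import json
from pathlib import Path

# Keys present in the sample smtpCredentials section (hypothetical validation rule).
REQUIRED_SMTP_KEYS = ("host", "port", "username", "password", "from_email")


def load_config(path="input.json"):
    """Return (aws_credentials, smtp_credentials, email_notification) from input.json."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    aws = config.get("awsCredentials", {})
    smtp = config.get("smtpCredentials")      # optional section
    email = config.get("emailNotification")   # optional section
    if smtp is not None:
        missing = [key for key in REQUIRED_SMTP_KEYS if key not in smtp]
        if missing:
            raise ValueError(f"smtpCredentials is missing: {', '.join(missing)}")
        # The sample stores the port as a string ("587"); normalize it to an int.
        smtp = {**smtp, "port": int(smtp["port"])}
    return aws, smtp, email
```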

3. Deploy CloudFormation Stack

chmod +x cloudformation_deploy.sh lambda_build.sh
./cloudformation_deploy.sh

Usage

Lambda Event Parameters

{
  "cluster_name": "my-ecs-cluster",
  "service_name": "my-service",
  "maintain_service_state": true,
  "wait_time": 30
}

Parameters:

cluster_name (required): ECS cluster name
service_name (required): ECS service name
maintain_service_state (optional, default: true): Temporarily increase the desired count by 1 while tasks are recycled
wait_time (optional, default: 30): Seconds to wait between task replacements
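A hypothetical sketch of how a handler might validate these parameters and apply the documented defaults; the function name parse_event and the type checks are assumptions, not taken from the repository:

```python
def parse_event(event):
    """Return (cluster_name, service_name, maintain_service_state, wait_time)."""
    # cluster_name and service_name are the two required parameters.
    missing = [key for key in ("cluster_name", "service_name") if not event.get(key)]
    if missing:
        raise ValueError(f"Missing required parameter(s): {', '.join(missing)}")

    # Documented default: true.
    maintain = event.get("maintain_service_state", True)
    if not isinstance(maintain, bool):
        raise ValueError("maintain_service_state must be true or false")

    # Documented default: 30 seconds. bool is a subclass of int, so reject it explicitly.
    wait_time = event.get("wait_time", 30)
    if isinstance(wait_time, bool) or not isinstance(wait_time, int) or wait_time < 0:
        raise ValueError("wait_time must be a non-negative integer (seconds)")

    return event["cluster_name"], event["service_name"], maintain, wait_time
```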

Invoke Lambda Function

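AWS CLI (AWS CLI v2 needs --cli-binary-format raw-in-base64-out to send a raw JSON payload):

aws lambda invoke \
  --function-name ecs-task-recycle-function \
  --cli-binary-format raw-in-base64-out \
  --payload '{"cluster_name":"my-cluster","service_name":"my-service","maintain_service_state":true,"wait_time":30}' \
  response.json

A run can last up to 15 minutes, longer than the CLI's default 60-second read timeout. Add --cli-read-timeout 0 to wait for the result, or pass --invocation-type Event to invoke asynchronously.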

AWS Console:

Navigate to Lambda → Functions → ecs-task-recycle-function
Test tab → Create test event
Add event JSON and invoke
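The function can also be started from Python. This sketch assumes a boto3 Lambda client is passed in (for example boto3.client("lambda")) and uses an asynchronous invocation; start_recycle is a hypothetical helper name:

```python
import json


def start_recycle(lambda_client, cluster_name, service_name,
                  maintain_service_state=True, wait_time=30,
                  function_name="ecs-task-recycle-function"):
    """Queue one recycle run and return the HTTP status code (202 when accepted)."""
    payload = {
        "cluster_name": cluster_name,
        "service_name": service_name,
        "maintain_service_state": maintain_service_state,
        "wait_time": wait_time,
    }
    # InvocationType="Event" returns as soon as Lambda queues the event, so the
    # caller does not hold a connection open for the whole (up to 15-minute) run.
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return response["StatusCode"]
```

Because EventInvokeConfig sets retries to 0, a failed asynchronous run is not retried automatically; check the CloudWatch logs for the outcome.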

How It Works

Process Flow

1. Get Current State: Retrieve the service configuration and running tasks
2. Increase Capacity (if maintain_service_state=true): Add 1 to the desired count
3. Wait for Stability: Ensure the extra task is running
4. Recycle Tasks: For each original task:
   - Stop the task
   - Wait for the replacement task to start
   - Wait for service stability
   - Sleep for the configured wait_time
5. Restore Capacity: Return to the original desired count
6. Send Notification: Email report (if configured)
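The process flow above can be sketched with the boto3 ECS client calls it implies. This is a minimal illustration, not the repository's implementation: recycle_service_tasks is a hypothetical name, the ecs argument is assumed to be a boto3 ECS client, and the sleep parameter exists only so the loop can be tested without waiting. It uses boto3's built-in tasks_stopped and services_stable waiters, and its log messages mirror the ones listed under Monitoring.

```python
import logging
import time

logger = logging.getLogger(__name__)


def recycle_service_tasks(ecs, cluster, service, maintain_service_state=True,
                          wait_time=30, sleep=time.sleep):
    """Stop the service's running tasks one at a time; return how many were recycled."""
    stable = ecs.get_waiter("services_stable")
    stopped = ecs.get_waiter("tasks_stopped")

    svc = ecs.describe_services(cluster=cluster, services=[service])["services"][0]
    original_count = svc["desiredCount"]
    # Snapshot the tasks running before any change; only these are recycled.
    # (list_tasks returns up to 100 ARNs per page, enough for small services.)
    old_tasks = ecs.list_tasks(cluster=cluster, serviceName=service,
                               desiredStatus="RUNNING")["taskArns"]
    logger.info("Original desired count: %s, tasks: %s", original_count, len(old_tasks))

    if maintain_service_state:
        ecs.update_service(cluster=cluster, service=service,
                           desiredCount=original_count + 1)
        stable.wait(cluster=cluster, services=[service])
    try:
        for n, task_arn in enumerate(old_tasks, start=1):
            logger.info("Recycling task %s/%s: %s", n, len(old_tasks), task_arn)
            ecs.stop_task(cluster=cluster, task=task_arn,
                          reason="Sequential task recycle")
            # Wait until the old task is gone, then until the scheduler's
            # replacement is running and the service is steady again.
            stopped.wait(cluster=cluster, tasks=[task_arn])
            stable.wait(cluster=cluster, services=[service])
            logger.info("Task %s recycled, waiting %ss", n, wait_time)
            sleep(wait_time)
    finally:
        # Put the desired count back even if a step above raised.
        if maintain_service_state:
            ecs.update_service(cluster=cluster, service=service,
                               desiredCount=original_count)
    return len(old_tasks)
```

Unlike the documented function, which has no rollback, this sketch restores the original desired count in a finally block if a step fails.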

Example Scenario

Service with 3 tasks:

Initial State: 3 tasks running
↓
Increase to 4 tasks (maintain availability)
↓
Stop task 1 → Wait stable → Sleep 30s
↓
Stop task 2 → Wait stable → Sleep 30s
↓
Stop task 3 → Wait stable → Sleep 30s
↓
Restore to 3 tasks
↓
Complete
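The scenario can be modeled as a small table of steps. This simplified sketch (recycle_plan is a hypothetical name) treats a stopped task as missing until its replacement is running and ignores health-check time. It shows why the temporary extra task keeps the running count at the original level:

```python
def recycle_plan(task_count, maintain_service_state=True, wait_time=30):
    """List each step as (description, desired_count, fewest_tasks_running_during_step)."""
    base = task_count
    desired = base + 1 if maintain_service_state else base
    steps = [("Initial state", base, base)]
    if maintain_service_state:
        # Running count climbs from base to base + 1, so it never drops below base.
        steps.append(("Increase capacity", desired, base))
    for n in range(1, task_count + 1):
        # While task n is stopped and its replacement is starting, one task is missing.
        steps.append((f"Stop task {n}, wait stable, sleep {wait_time}s", desired, desired - 1))
    if maintain_service_state:
        steps.append(("Restore capacity", base, base))
    return steps
```

With maintain_service_state disabled, the same 3-task service would drop to 2 running tasks during each replacement.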

Configuration

AWS Credentials (input.json)

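Multiple authentication methods are supported. The example below lists every available field; in practice you would typically set only the fields for the method you use: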

{
  "awsCredentials": {
    "region_name": "us-east-1",
    "profile_name": "my-profile",
    "role_arn": "arn:aws:iam::123456789012:role/MyRole",
    "access_key": "AKIAIOSFODNN7EXAMPLE",
    "secret_key": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
    "session_token": "token"
  }
}
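How AWSSession.py combines these fields is not documented here. One plausible mapping, sketched below as an assumption, translates the input.json keys into boto3.Session keyword arguments. The names region_name, profile_name, aws_access_key_id, aws_secret_access_key and aws_session_token are real boto3.Session parameters, while session_kwargs and its precedence rule are hypothetical. A role_arn would be handled separately, for example with an STS assume_role call.

```python
# Hypothetical mapping from awsCredentials keys to boto3.Session keyword arguments.
SESSION_KEYS = {
    "region_name": "region_name",
    "profile_name": "profile_name",
    "access_key": "aws_access_key_id",
    "secret_key": "aws_secret_access_key",
    "session_token": "aws_session_token",
}


def session_kwargs(aws_credentials):
    """Translate input.json keys into arguments for boto3.Session(**kwargs)."""
    kwargs = {
        boto_key: aws_credentials[key]
        for key, boto_key in SESSION_KEYS.items()
        if aws_credentials.get(key)
    }
    # Assumed precedence: static keys win over a named profile.
    if "aws_access_key_id" in kwargs:
        kwargs.pop("profile_name", None)
    # role_arn is intentionally not mapped; it needs an STS call after the session exists.
    return kwargs
```

Usage would be boto3.Session(**session_kwargs(aws_credentials)). With only region_name set, boto3 falls back to its default credential chain, which inside Lambda resolves to the function's execution role.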

SMTP Configuration (Optional)

{
  "smtpCredentials": {
    "host": "smtp.gmail.com",
    "port": "587",
    "username": "user@gmail.com",
    "password": "app-password",
    "from_email": "noreply@example.com"
  }
}
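Notification.py's internals are not shown in the post. A minimal sketch of a completion e-mail using only the Python standard library, assuming STARTTLS on port 587 and a "[prefix] subject" subject line (both assumptions; build_report and send_report are hypothetical names):

```python
import smtplib
from email.message import EmailMessage


def build_report(smtp, notification, body):
    """Compose the completion e-mail from the smtpCredentials and emailNotification sections."""
    msg = EmailMessage()
    # Assumed format: "[<subject_prefix>] <email_subject>".
    msg["Subject"] = f"[{notification['subject_prefix']}] {notification['email_subject']}"
    msg["From"] = smtp["from_email"]
    msg["To"] = ", ".join(notification["to"])
    msg.set_content(body)
    return msg


def send_report(smtp, msg):
    """Send over SMTP, upgrading the connection with STARTTLS before logging in."""
    with smtplib.SMTP(smtp["host"], int(smtp["port"]), timeout=30) as server:
        server.starttls()
        server.login(smtp["username"], smtp["password"])
        server.send_message(msg)
```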

IAM Permissions

Required permissions (included in CloudFormation):

{
  "Effect": "Allow",
  "Action": [
    "ecs:DescribeServices",
    "ecs:UpdateService",
    "ecs:ListTasks",
    "ecs:StopTask",
    "ecs:DescribeTasks",
    "application-autoscaling:DescribeScalableTargets",
    "application-autoscaling:RegisterScalableTarget"
  ],
  "Resource": "*"
}
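The application-autoscaling permissions suggest the function inspects and may adjust the service's scalable target. One plausible reason, sketched here as an assumption rather than the author's code: if the temporary desired count exceeds the target's MaxCapacity, raise MaxCapacity first so a scaling policy is less likely to pull the count back. ensure_headroom and resource_id are hypothetical names, and aas is assumed to be a boto3 application-autoscaling client.

```python
def resource_id(cluster, service):
    # Application Auto Scaling identifies ECS services as service/<cluster>/<service>.
    return f"service/{cluster}/{service}"


def ensure_headroom(aas, cluster, service, new_desired):
    """Raise MaxCapacity if new_desired would exceed it.

    Returns the original (MinCapacity, MaxCapacity) so the caller can restore
    them later, or None when the service has no scalable target.
    """
    targets = aas.describe_scalable_targets(
        ServiceNamespace="ecs",
        ResourceIds=[resource_id(cluster, service)],
        ScalableDimension="ecs:service:DesiredCount",
    )["ScalableTargets"]
    if not targets:
        return None
    original = (targets[0]["MinCapacity"], targets[0]["MaxCapacity"])
    if new_desired > original[1]:
        # Updating an existing target: only the parameters passed are changed.
        aas.register_scalable_target(
            ServiceNamespace="ecs",
            ResourceId=resource_id(cluster, service),
            ScalableDimension="ecs:service:DesiredCount",
            MaxCapacity=new_desired,
        )
    return original
```

The returned (min, max) pair lets the caller re-register the original MaxCapacity once recycling finishes.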

CloudFormation Resources

Lambda Function: Python 3.13 runtime, 900s timeout, 256MB memory
IAM Role: Execution role with ECS and Auto Scaling permissions
EventInvokeConfig: MaximumRetryAttempts set to 0
CloudWatch Logs: 7-day retention

Monitoring

CloudWatch Logs

aws logs tail /aws/lambda/ecs-task-recycle-function --follow

Key Log Messages

Starting task recycle for {cluster}/{service}
Original desired count: X, tasks: Y
Recycling task N/M: {task_arn}
Task N recycled, waiting Xs
Task recycle completed successfully

Troubleshooting

Service Not Stabilizing

Increase the waiter MaxAttempts in code (default: 40; boto3's services_stable waiter polls every 15 seconds, so the default allows about 10 minutes)
Check ECS service health and task definitions
Verify target group health checks
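A sketch of overriding the waiter limits with boto3's WaiterConfig argument, assuming the function uses the services_stable waiter (wait_until_stable is a hypothetical name; ecs is assumed to be a boto3 ECS client):

```python
def wait_until_stable(ecs, cluster, service, delay=15, max_attempts=40):
    """Block until the service is stable; the worst case is delay * max_attempts seconds."""
    ecs.get_waiter("services_stable").wait(
        cluster=cluster,
        services=[service],
        # Overrides the botocore defaults for this waiter (15 s delay, 40 attempts).
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
```

Keep the worst case inside the Lambda budget: MaxAttempts=60 at a 15-second delay already equals the full 900 seconds. When the attempts run out, botocore raises botocore.exceptions.WaiterError.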

Timeout Errors

The Lambda timeout is already 900s, the maximum Lambda allows, so it cannot be raised further
Reduce wait_time, or run the function only against services with fewer tasks
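Another option, sketched here as a hypothetical addition rather than a feature of the function, is to check the remaining time from the Lambda context before stopping each task, so a run ends cleanly instead of being cut off mid-replacement. get_remaining_time_in_millis() is part of the Python Lambda context object; enough_time_left and the 60-second margin are assumptions.

```python
def enough_time_left(context, per_task_seconds, safety_margin_seconds=60):
    """Return True if one more task can be recycled before the Lambda timeout.

    `context` is the Lambda context object passed to the handler.
    """
    remaining_seconds = context.get_remaining_time_in_millis() / 1000
    return remaining_seconds >= per_task_seconds + safety_margin_seconds
```

Calling it at the top of each loop iteration lets the function stop early and restore the original desired count while it still has time to do so.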

Authentication Failures

Verify IAM role permissions
Check AWS credentials in input.json
Ensure Lambda execution role is correct

Best Practices

Test in Non-Production: Always test with non-critical services first
Monitor CloudWatch: Watch logs during first execution
Adjust Wait Time: Tune based on application startup time
Use maintain_service_state: Enable it for production services
Schedule Wisely: Run during low-traffic periods

Comparison with Force Deployment

Feature              Force Deployment   Task Recycle
Task Replacement     Parallel           Sequential
Service Disruption   Higher             Lower
Completion Time      Faster             Slower
Control              Limited            Configurable
Wait Between Tasks   No                 Yes

Security Considerations

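Lambda execution role grants only the ECS and Application Auto Scaling actions the function needs; the sample policy still uses "Resource": "*", which can be narrowed where those actions support resource-level permissions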
No hardcoded credentials in code
SMTP credentials stored in input.json (use Secrets Manager in production)
CloudWatch logs for audit trail
EventInvokeConfig prevents retry storms

Cost Optimization

Lambda execution time: roughly number_of_tasks × (wait_time + time for each replacement to stabilize) seconds
CloudWatch Logs: 7-day retention
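No additional AWS service costs beyond Lambda and CloudWatch Logs, apart from the extra task that runs temporarily when maintain_service_state is enabled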
Consider scheduling during off-peak hours
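The duration estimate can be turned into a quick capacity check. This sketch assumes a fixed stabilization time per replacement (stabilize_seconds is a guess you should measure for your service) plus one stabilization wait for the initial scale-up; estimated_runtime and max_tasks are hypothetical names.

```python
LAMBDA_MAX_SECONDS = 900  # Lambda's hard timeout limit


def estimated_runtime(task_count, wait_time=30, stabilize_seconds=60,
                      maintain_service_state=True):
    """Rough run time: each task costs stabilization plus wait_time, plus the scale-up wait."""
    scale_up = stabilize_seconds if maintain_service_state else 0
    return scale_up + task_count * (stabilize_seconds + wait_time)


def max_tasks(wait_time=30, stabilize_seconds=60, maintain_service_state=True,
              budget=LAMBDA_MAX_SECONDS):
    """Largest task count whose estimated run time fits within the budget."""
    scale_up = stabilize_seconds if maintain_service_state else 0
    return (budget - scale_up) // (stabilize_seconds + wait_time)
```

With stabilize_seconds=15 (a single waiter poll), max_tasks(30, 15) gives 19, consistent with the fewer-than-20 guidance under Limitations; at 60 seconds per replacement only 9 tasks fit.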

Limitations

Maximum Lambda execution time: 15 minutes
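Suitable for services with fewer than 20 tasks at a 30s wait_time; slow-starting tasks lower this limit, because stabilization time also counts against the 15-minute budget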
Requires stable service for waiter to succeed
No rollback mechanism on failure

Contributing

Contributions welcome! Please follow the repository structure:

Test changes thoroughly
Update documentation
Follow existing code style
Add error handling
