Building a Multi-Channel AWS EC2 Spot Instance Interruption Alert System
A comprehensive guide to implementing real-time notifications for EC2 Spot Instance interruptions with ECS service impact analysis
Introduction & Problem Statement
AWS EC2 Spot Instances offer significant cost savings—up to 90% compared to On-Demand pricing—making them an attractive option for cost-conscious organizations. However, this cost efficiency comes with a trade-off: Spot Instances can be interrupted with only a 2-minute warning when AWS needs the capacity back.
The Challenge
In production environments running containerized workloads on ECS clusters backed by Spot Instances, these interruptions can cause:
- Service disruptions without proper preparation
- Lost work if applications aren't designed for graceful shutdowns
- Cascading failures when multiple instances are interrupted simultaneously
- Operational blindness when teams aren't notified of impending interruptions
The Solution
This article presents a comprehensive, production-ready solution that:
- Monitors EC2 Spot Instance interruption warnings in real-time
- Identifies running ECS services on affected instances
- Delivers intelligent notifications across multiple channels (SNS, Google Chat, SMTP)
- Provides actionable information for rapid response
Architecture & Design Overview
Our solution leverages AWS's native event-driven architecture to create a robust, serverless notification system:
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ EC2 Spot │ │ EventBridge │ │ Lambda │
│ Interruption │───▶│ Rule │───▶│ Function │
│ Warning │ │ │ │ │
└─────────────────┘ └──────────────────┘ └─────────────────┘
│
▼
┌─────────────────────────────────────────────┐
│ Notification Channels │
│ ┌─────────┐ ┌─────────────┐ ┌───────────┐ │
│ │ SNS │ │ Google Chat │ │ SMTP │ │
│ │ Topic │ │ Webhook │ │ Email │ │
│ └─────────┘ └─────────────┘ └───────────┘ │
└─────────────────────────────────────────────┘
Key Components
- EventBridge Rule: Captures EC2 Spot Instance Interruption Warning events
- Lambda Function: Processes events, discovers ECS services, and orchestrates notifications
- Multi-Channel Notifications: Flexible delivery via SNS, Google Chat, and SMTP
- ECS Integration: Intelligent service discovery and impact assessment
Solution Approach
Event-Driven Architecture
We chose an event-driven approach because:
- Real-time Response: EventBridge delivers events within seconds
- Cost Efficiency: Pay only for actual interruption events
- Scalability: Handles multiple simultaneous interruptions automatically
- Reliability: AWS-managed infrastructure with built-in retry mechanisms
Multi-Channel Notifications
Different teams prefer different communication channels:
- SNS: Integration with existing alerting systems (PagerDuty, OpsGenie)
- Google Chat: Real-time team collaboration
- SMTP Email: Detailed reports with HTML formatting
Service Impact Analysis
Rather than generic alerts, our solution provides:
- Service Identification: Which ECS services are running on the interrupted instance
- Impact Assessment: Filtering out daemon services that auto-restart
- Actionable Information: Instance details for rapid response
Code Walkthrough
Let's examine the key components that make this solution work:
1. Event Processing & Service Discovery
The core logic processes EventBridge events and discovers affected ECS services:
def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
"""Main handler for spot interruption events."""
instance_id = event.get("detail", {}).get("instance-id", "Unknown")
config = _load_configuration()
# Initialize AWS clients
session = get_aws_session(config['awsCredentials'])
aws_clients = {
'ec2': session.client('ec2'),
'ecs': session.client('ecs'),
'sns': session.client('sns')
}
# Discover services on interrupted instance
instance_info = _analyze_instance(aws_clients['ec2'], instance_id)
service_names = _discover_ecs_services(
aws_clients['ecs'], instance_id, instance_info.get('cluster_name')
)
if service_names:
_send_notifications(aws_clients, config, instance_info, service_names)
return {"status": "success", "services_affected": len(service_names)}
2. ECS Service Discovery Logic
The critical logic that identifies which services are running on the interrupted instance:
def _discover_ecs_services(ecs_client, instance_id: str, cluster_name: str) -> List[str]:
"""Find ECS services on the interrupted instance."""
service_names = []
# Get container instances in cluster
container_instances = ecs_client.list_container_instances(cluster=cluster_name)
for arn in container_instances.get('containerInstanceArns', []):
# Check if this container instance matches our EC2 instance
details = ecs_client.describe_container_instances(
cluster=cluster_name, containerInstances=[arn]
)
if details['containerInstances'][0]['ec2InstanceId'] == instance_id:
# Get tasks running on this container instance
tasks = ecs_client.list_tasks(cluster=cluster_name, containerInstance=arn)
for task_arn in tasks.get('taskArns', []):
service_name = _extract_service_from_task(ecs_client, cluster_name, task_arn)
if service_name:
service_names.append(service_name)
return list(set(service_names)) # Remove duplicates
3. Multi-Channel Notification Logic
The notification orchestration that sends alerts via multiple channels:
def _send_notifications(aws_clients, config, instance_info, service_names) -> List[str]:
"""Send notifications via enabled channels."""
message = _format_notification_message(instance_info, service_names)
notifications_sent = []
# SNS Notification
if os.environ.get('ENABLE_SNS', 'false').lower() == 'true':
try:
aws_clients['sns'].publish(
TopicArn=os.environ['SNS_TOPIC_ARN'],
Subject="🚨 EC2 Spot Instance Interruption Alert",
Message=message
)
notifications_sent.append('SNS')
except Exception as e:
logger.error(f"SNS notification failed: {e}")
# Google Chat Notification
if os.environ.get('ENABLE_CHAT', 'false').lower() == 'true':
try:
response = requests.post(
os.environ['GOOGLE_CHAT_WEBHOOK'],
json={"text": message}, timeout=10
)
response.raise_for_status()
notifications_sent.append('Google Chat')
except Exception as e:
logger.error(f"Google Chat notification failed: {e}")
# Email Notification
if os.environ.get('ENABLE_EMAIL', 'false').lower() == 'true':
try:
html_content = f"<html><body><pre>{message}</pre></body></html>"
send_email(config['smtpCredentials'], config['emailNotification'], html_content)
notifications_sent.append('Email')
except Exception as e:
logger.error(f"Email notification failed: {e}")
return notifications_sent
4. Flexible AWS Authentication
The session management supports multiple authentication methods:
def get_aws_session(credentials: Dict[str, Any]) -> boto3.Session:
"""Create AWS session with flexible authentication."""
region = credentials.get("region_name", "us-east-1")
if credentials.get("profile_name"):
return boto3.Session(profile_name=credentials["profile_name"], region_name=region)
elif credentials.get("role_arn"):
return _create_assumed_role_session(credentials["role_arn"], region)
elif credentials.get("access_key"):
return boto3.Session(
aws_access_key_id=credentials["access_key"],
aws_secret_access_key=credentials["secret_key"],
region_name=region
)
else:
return boto3.Session(region_name=region) # Default credential chain
Configuration & Setup Instructions
Prerequisites
Before deploying the solution, ensure you have:
- AWS CLI configured with appropriate permissions
- Python 3.13+ for local development
- IAM permissions for EC2, ECS, SNS, Lambda, and EventBridge
- SMTP credentials if using email notifications
- Google Chat webhook if using chat notifications
Step 1: Configuration Setup
Create your input.json configuration file:
{
"awsCredentials": {
"region_name": "us-east-1",
"profile_name": "your-aws-profile"
},
"smtpCredentials": {
"host": "smtp.gmail.com",
"port": "587",
"username": "alerts@yourcompany.com",
"password": "your-app-specific-password",
"from_email": "alerts@yourcompany.com"
},
"emailNotification": {
"email_subject": "🚨 Spot Instance Interruption Alert",
"subject_prefix": "PRODUCTION",
"to": ["devops-team@yourcompany.com"],
"cc": ["platform-team@yourcompany.com"]
}
}
Step 2: Deployment Scripts
Update the deployment configuration in cloudformation_deploy.sh:
#!/bin/bash
set -e
# Configuration
STACK_NAME="spot-interruption-alerts"
TEMPLATE_FILE="cloudformation-template.yml"
# Notification Settings
SNS_TOPIC_ARN="arn:aws:sns:us-east-1:123456789012:critical-alerts"
GOOGLE_CHAT_WEBHOOK="https://chat.googleapis.com/v1/spaces/AAAA/messages?key=xxx"
ENABLE_SNS="true"
ENABLE_EMAIL="true"
ENABLE_CHAT="true"
echo "🚀 Deploying Spot Interruption Alert System..."
# Build Lambda package
./lambda_build.sh
# Deploy infrastructure
aws cloudformation deploy \
--template-file $TEMPLATE_FILE \
--stack-name $STACK_NAME \
--parameter-overrides \
SNSTopicArn="$SNS_TOPIC_ARN" \
GoogleChatWebhook="$GOOGLE_CHAT_WEBHOOK" \
EnableSNS="$ENABLE_SNS" \
EnableEmail="$ENABLE_EMAIL" \
EnableChat="$ENABLE_CHAT" \
--capabilities CAPABILITY_NAMED_IAM
# Update Lambda code
FUNCTION_NAME=$(aws cloudformation describe-stacks \
--stack-name $STACK_NAME \
--query 'Stacks[0].Outputs[?OutputKey==`LambdaFunctionName`].OutputValue' \
--output text)
aws lambda update-function-code \
--function-name $FUNCTION_NAME \
--zip-file fileb://lambda-deployment.zip
echo "✅ Deployment completed successfully!"
Step 3: Build Script
The lambda_build.sh script packages your Lambda function:
#!/bin/bash
echo "📦 Building Lambda deployment package..."
# Clean previous builds
rm -rf package/ lambda-deployment.zip
# Create package directory
mkdir -p package
# Install dependencies
pip install -r requirements.txt -t package/
# Copy source files
cp *.py package/
cp input.json package/
# Create deployment package
cd package && zip -r ../lambda-deployment.zip . && cd ..
# Cleanup
rm -rf package/
echo "✅ Package created: lambda-deployment.zip"
Step 4: Deploy the Solution
# Make scripts executable
chmod +x cloudformation_deploy.sh lambda_build.sh
# Deploy the complete solution
./cloudformation_deploy.sh
Usage Examples
Example 1: SNS Integration with PagerDuty
Configure SNS topic to integrate with your existing alerting system:
# Create SNS topic
aws sns create-topic --name spot-interruption-alerts
# Subscribe PagerDuty endpoint
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:123456789012:spot-interruption-alerts \
--protocol https \
--notification-endpoint https://events.pagerduty.com/integration/xxx/enqueue
Example 2: Google Chat Integration
Set up a Google Chat webhook:
- Go to Google Chat → Space Settings → Apps & Integrations
- Add Webhook
- Copy the webhook URL
- Update your deployment script with the webhook URL
Example 3: Custom Email Templates
Customize email notifications by modifying the _format_notification_message function:
def _format_notification_message(instance_info, service_names):
"""Custom email template with company branding."""
return f"""
<div style="font-family: Arial; max-width: 600px;">
<h2 style="color: #d32f2f;">🚨 Spot Instance Interruption Alert</h2>
<div style="background: #f5f5f5; padding: 15px; border-radius: 5px;">
<h3>Instance Details</h3>
<ul>
<li><strong>Instance ID:</strong> {instance_info['instance_id']}</li>
<li><strong>Type:</strong> {instance_info['instance_type']}</li>
<li><strong>AZ:</strong> {instance_info['availability_zone']}</li>
</ul>
</div>
<div style="background: #fff3cd; padding: 15px; border-radius: 5px; margin-top: 15px;">
<h3>Affected Services</h3>
<ul>
{''.join(f'<li>{service}</li>' for service in service_names)}
</ul>
</div>
<p><em>Automated alert from AWS Infrastructure Team</em></p>
</div>
"""
Best Practices Followed
1. Code Quality & Maintainability
- PEP 8 Compliance: Consistent code formatting and naming conventions
- Type Hints: Enhanced code readability and IDE support
- Modular Design: Separated concerns across multiple modules
- Comprehensive Logging: Structured logging with contextual information
2. Error Handling & Resilience
- Graceful Degradation: Continue processing even if one notification channel fails
- Retry Logic: Built-in retry mechanisms for transient failures
- Input Validation: Robust validation of configuration and event data
- Exception Handling: Comprehensive error handling with detailed logging
3. Security
- Least Privilege IAM: Minimal required permissions for Lambda execution role
- Credential Management: Secure handling of SMTP and webhook credentials
- No Hardcoded Secrets: All sensitive data passed via environment variables or configuration
- Input Sanitization: Proper validation of event data and configuration
4. Performance & Cost Optimization
- Efficient Resource Usage: Minimal Lambda memory allocation (128MB)
- Fast Execution: Optimized code paths for sub-10-second execution
- Event-Driven: Pay only for actual interruption events
- Connection Reuse: Efficient AWS client initialization
Security & Performance Considerations
Security Measures
- IAM Role Permissions
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:DescribeInstances",
"ecs:ListContainerInstances",
"ecs:DescribeContainerInstances",
"ecs:ListTasks",
"ecs:DescribeTasks",
"ecs:DescribeServices",
"sns:Publish"
],
"Resource": "*"
}
]
}
-
Credential Security
- SMTP passwords stored in configuration files (consider AWS Secrets Manager for production)
- Google Chat webhooks marked as
NoEchoin CloudFormation - No hardcoded credentials in source code
-
Network Security
- Lambda function can be deployed in VPC for additional isolation
- HTTPS-only communication for all external APIs
- Webhook URL validation to prevent SSRF attacks
Performance Optimizations
-
Lambda Configuration
- Memory: 128MB (sufficient for most workloads)
- Timeout: 30 seconds (allows for network retries)
- Runtime: Python 3.13 (latest stable version)
-
Efficient Processing
- Parallel notification sending where possible
- Early exit for instances without services
- Minimal AWS API calls through intelligent filtering
-
Cost Considerations
- Lambda Invocations: Only triggered by actual interruption events
- Data Transfer: Minimal outbound data for notifications
- CloudWatch Logs: Structured logging to minimize log volume
Common Pitfalls & Troubleshooting
Issue 1: Lambda Timeout
Symptoms: Lambda function times out before completing notifications
Solutions:
# Increase timeout in CloudFormation template
Timeout: 60 # Increase from 30 to 60 seconds
# Add timeout handling in code
import signal
def timeout_handler(signum, frame):
raise TimeoutError("Lambda function timeout")
signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(25) # Set alarm for 25 seconds (5 seconds before Lambda timeout)
Issue 2: SMTP Authentication Failures
Symptoms: Email notifications fail with authentication errors
Solutions:
- Gmail: Use App-Specific Passwords instead of regular passwords
- Office 365: Enable SMTP AUTH for the mailbox
- Corporate SMTP: Verify firewall rules allow outbound SMTP traffic
# Enhanced SMTP error handling
def _send_via_smtp(smtp_config, message, recipients):
try:
server = smtplib.SMTP(smtp_config["host"], int(smtp_config["port"]))
server.set_debuglevel(1) # Enable debug output
server.starttls()
server.login(smtp_config["username"], smtp_config["password"])
# ... rest of the code
except smtplib.SMTPAuthenticationError as e:
logger.error(f"SMTP authentication failed: {e}")
# Consider fallback notification method
except Exception as e:
logger.error(f"SMTP error: {e}")
Issue 3: Missing ECS Services
Symptoms: Alerts sent for instances without detected services
Root Causes:
- Incorrect ECS cluster name detection
- Services not properly tagged
- Daemon services incorrectly filtered
Solutions:
# Enhanced cluster detection
def _get_cluster_name_from_instance(ec2_client, instance_id):
"""Multiple methods to detect ECS cluster name."""
try:
response = ec2_client.describe_instances(InstanceIds=[instance_id])
instance = response['Reservations'][0]['Instances'][0]
# Method 1: Auto Scaling Group tag
for tag in instance.get('Tags', []):
if tag['Key'] == 'aws:autoscaling:groupName':
return tag['Value']
# Method 2: ECS Cluster tag
for tag in instance.get('Tags', []):
if tag['Key'] == 'ecs:cluster-name':
return tag['Value']
# Method 3: Parse from instance name
instance_name = next(
(tag['Value'] for tag in instance.get('Tags', [])
if tag['Key'] == 'Name'),
''
)
if 'ecs' in instance_name.lower():
return instance_name.split('-')[0] # Assuming naming convention
except Exception as e:
logger.error(f"Failed to detect cluster name: {e}")
return None
Issue 4: Google Chat Webhook Failures
Symptoms: Google Chat notifications fail silently
Solutions:
def _send_chat_notification(message):
"""Enhanced Google Chat notification with retry logic."""
webhook_url = os.environ.get('GOOGLE_CHAT_WEBHOOK')
if not webhook_url:
logger.warning("Google Chat webhook not configured")
return False
max_retries = 3
for attempt in range(max_retries):
try:
response = requests.post(
webhook_url,
json={"text": message},
timeout=10,
headers={'Content-Type': 'application/json'}
)
response.raise_for_status()
logger.info("Google Chat notification sent successfully")
return True
except requests.exceptions.RequestException as e:
logger.warning(f"Google Chat attempt {attempt + 1} failed: {e}")
if attempt < max_retries - 1:
time.sleep(2 ** attempt) # Exponential backoff
logger.error("Google Chat notification failed after all retries")
return False
Enhancements & Future Improvements
1. Advanced Service Discovery
Enhance ECS service detection with additional metadata:
def _get_enhanced_service_info(ecs_client, cluster_name, service_name):
"""Get detailed service information for better alerting."""
try:
service_details = ecs_client.describe_services(
cluster=cluster_name,
services=[service_name]
)
if service_details and service_details['services']:
service = service_details['services'][0]
return {
'name': service_name,
'desired_count': service.get('desiredCount', 0),
'running_count': service.get('runningCount', 0),
'pending_count': service.get('pendingCount', 0),
'task_definition': service.get('taskDefinition', '').split('/')[-1],
'load_balancers': len(service.get('loadBalancers', [])) > 0
}
except Exception as e:
logger.error(f"Failed to get enhanced service info: {e}")
return {'name': service_name}
2. Intelligent Alert Suppression
Implement smart alerting to reduce noise:
def _should_send_alert(instance_info, service_names):
"""Determine if alert should be sent based on business rules."""
# Skip alerts during maintenance windows
if _is_maintenance_window():
logger.info("Skipping alert during maintenance window")
return False
# Skip alerts for non-critical services
critical_services = os.environ.get('CRITICAL_SERVICES', '').split(',')
if critical_services and not any(
service in critical_services for service in service_names
):
logger.info("No critical services affected, skipping alert")
return False
# Skip alerts if cluster has sufficient capacity
if _has_sufficient_capacity(instance_info.get('cluster_name')):
logger.info("Cluster has sufficient capacity, skipping alert")
return False
return True
3. Metrics and Monitoring
Add CloudWatch custom metrics for better observability:
def _publish_custom_metrics(instance_info, service_names, notifications_sent):
"""Publish custom CloudWatch metrics."""
cloudwatch = boto3.client('cloudwatch')
try:
# Metric: Spot interruptions by instance type
cloudwatch.put_metric_data(
Namespace='SpotInterruption',
MetricData=[
{
'MetricName': 'InterruptionCount',
'Dimensions': [
{
'Name': 'InstanceType',
'Value': instance_info.get('instance_type', 'Unknown')
}
],
'Value': 1,
'Unit': 'Count'
}
]
)
# Metric: Services affected
cloudwatch.put_metric_data(
Namespace='SpotInterruption',
MetricData=[
{
'MetricName': 'ServicesAffected',
'Value': len(service_names),
'Unit': 'Count'
}
]
)
except Exception as e:
logger.error(f"Failed to publish metrics: {e}")
4. Integration with External Systems
Add support for additional notification channels:
def _send_slack_notification(message):
"""Send notification to Slack via webhook."""
webhook_url = os.environ.get('SLACK_WEBHOOK_URL')
if not webhook_url:
return False
slack_message = {
"text": "Spot Instance Interruption Alert",
"attachments": [
{
"color": "danger",
"text": message,
"footer": "AWS Lambda Alert System",
"ts": int(time.time())
}
]
}
try:
response = requests.post(webhook_url, json=slack_message, timeout=10)
response.raise_for_status()
return True
except Exception as e:
logger.error(f"Slack notification failed: {e}")
return False
def _create_jira_ticket(instance_info, service_names):
"""Create JIRA ticket for critical service interruptions."""
if not _is_critical_interruption(service_names):
return
jira_config = {
'url': os.environ.get('JIRA_URL'),
'username': os.environ.get('JIRA_USERNAME'),
'token': os.environ.get('JIRA_TOKEN')
}
ticket_data = {
'fields': {
'project': {'key': 'INFRA'},
'summary': f"Spot Instance Interruption: {instance_info['instance_id']}",
'description': f"Services affected: {', '.join(service_names)}",
'issuetype': {'name': 'Incident'},
'priority': {'name': 'High'}
}
}
# Implementation details...
5. Configuration Management
Move to AWS Systems Manager Parameter Store or Secrets Manager:
def _load_configuration_from_ssm():
"""Load configuration from AWS Systems Manager Parameter Store."""
ssm = boto3.client('ssm')
try:
# Get SMTP configuration
smtp_params = ssm.get_parameters_by_path(
Path='/spot-interruption/smtp',
Recursive=True,
WithDecryption=True
)
# Get email configuration
email_params = ssm.get_parameters_by_path(
Path='/spot-interruption/email',
Recursive=True
)
# Build configuration dictionary
config = {
'smtpCredentials': {},
'emailNotification': {}
}
for param in smtp_params['Parameters']:
key = param['Name'].split('/')[-1]
config['smtpCredentials'][key] = param['Value']
for param in email_params['Parameters']:
key = param['Name'].split('/')[-1]
config['emailNotification'][key] = param['Value'].split(',') if key in ['to', 'cc'] else param['Value']
return config
except Exception as e:
logger.error(f"Failed to load configuration from SSM: {e}")
raise
Conclusion
Building a robust spot instance interruption notification system requires careful consideration of multiple factors: real-time event processing, intelligent service discovery, multi-channel notifications, and operational resilience.
Key Takeaways
Event-Driven Architecture: Leveraging AWS EventBridge provides real-time, scalable event processing with minimal operational overhead.
Multi-Channel Notifications: Different teams prefer different communication methods. Supporting SNS, Google Chat, and SMTP ensures broad coverage.
Intelligent Service Discovery: Rather than generic alerts, providing specific service impact information enables faster response times.
Production-Ready Code: Following Python best practices, comprehensive error handling, and structured logging creates maintainable, reliable systems.
Security First: Implementing least-privilege IAM roles, secure credential handling, and input validation protects against common security vulnerabilities.
Business Impact
This solution provides several business benefits:
- Reduced MTTR: Faster notification and service identification reduces mean time to recovery
- Cost Optimization: Enables confident use of spot instances with proper monitoring
- Operational Excellence: Automated alerting reduces manual monitoring overhead
- Scalability: Event-driven architecture scales automatically with infrastructure growth
Next Steps
Consider implementing the suggested enhancements based on your organization's needs:
- Advanced Metrics: Add CloudWatch dashboards for spot interruption trends
- Integration: Connect with existing ITSM tools (ServiceNow, JIRA)
- Machine Learning: Implement predictive analytics for interruption patterns
- Multi-Region: Extend solution across multiple AWS regions
The complete source code, CloudFormation templates, and deployment scripts are available in the repository. This solution has been tested in production environments and can be adapted to meet specific organizational requirements.
This solution demonstrates how modern cloud-native architectures can solve real operational challenges while maintaining security, performance, and cost-effectiveness. The modular design allows for easy customization and extension as requirements evolve.
Top comments (0)