Automating CloudWatch Orphan Alarm Detection: A Production-Ready Solution
Introduction & Problem Statement
In large-scale AWS environments, CloudWatch alarms are essential for monitoring infrastructure health and triggering alerts when issues arise. However, as infrastructure evolves—EC2 instances are terminated, RDS databases are deleted, load balancers are removed—the alarms monitoring these resources often remain active, creating "orphan alarms."
These orphan alarms present several challenges:
- Cost Inefficiency: Each CloudWatch alarm costs money, and orphaned alarms provide no value
- Alert Fatigue: Orphan alarms can trigger false alerts, reducing team responsiveness to real issues
- Operational Overhead: Manual identification and cleanup is time-consuming and error-prone
- Compliance Risks: Outdated monitoring configurations can fail audit requirements
This solution automates the detection and optional deletion of orphan CloudWatch alarms across multiple AWS services, generating comprehensive Excel reports for review and providing flexible authentication options for multi-account environments.
What This Solution Does
- Scans CloudWatch alarms across your AWS account
- Identifies alarms monitoring deleted or non-existent resources
- Generates detailed Excel reports with orphan alarm information
- Optionally deletes orphan alarms automatically
- Supports multiple AWS services: EC2, RDS, ECS, Lambda, ALB, Target Groups, and SQS
- Provides flexible authentication methods for various deployment scenarios
- Integrates seamlessly with CI/CD pipelines (Jenkins example included)
Architecture / Design Overview
The solution follows a modular architecture with clear separation of concerns:
┌─────────────────────────────────────────────────────────────┐
│ Jenkins Pipeline │
│ (Scheduled Execution) │
└──────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Python Application │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ AWS Session │ │ Resource │ │ Alarm │ │
│ │ Manager │──│ Discovery │──│ Analyzer │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ └──────────────────┴──────────────────┘ │
│ │ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Report │ │ Notification │ │ Cleanup │ │
│ │ Generator │ │ Service │ │ Service │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ AWS Services │
│ EC2 │ RDS │ ECS │ Lambda │ ALB │ Target Groups │ SQS │
└─────────────────────────────────────────────────────────────┘
Component Responsibilities
- AWS Session Manager: Handles multiple authentication methods (profiles, roles, access keys, STS tokens)
- Resource Discovery: Queries AWS APIs to retrieve current active resources
- Alarm Analyzer: Compares CloudWatch alarms against active resources to identify orphans
- Report Generator: Creates Excel reports with detailed orphan alarm information
- Notification Service: Sends email alerts with attached reports
- Cleanup Service: Optionally deletes identified orphan alarms
Solution Approach
Detection Strategy
The solution uses a dimension-matching algorithm to identify orphan alarms:
- Resource Enumeration: Query all active resources for each supported service
- Alarm Collection: Retrieve all CloudWatch alarms with their dimensions
- Dimension Mapping: Match alarm dimensions to resource identifiers
- Orphan Identification: Flag alarms whose dimensions don't match any active resource
- Report Generation: Compile results into actionable reports
Supported Services
| Service | Dimension Keys | Example |
|---|---|---|
| EC2 Instances | InstanceId |
i-0123456789abcdef0 |
| RDS Clusters | DBClusterIdentifier |
production-aurora-cluster |
| RDS Instances | DBInstanceIdentifier |
production-mysql-01 |
| ALB | LoadBalancer |
app/prod-alb/1234567890 |
| Target Groups | TargetGroup |
targetgroup/prod-tg/abcdef |
| ECS Clusters | ClusterName |
production-cluster |
| ECS Services |
ClusterName, ServiceName
|
prod-cluster, api-service
|
| Lambda Functions | FunctionName |
data-processor |
| Lambda Versions |
FunctionName, Resource
|
processor, processor:1
|
| SQS Queues | QueueName |
order-processing-queue |
Code Walkthrough
Core Logic: Orphan Detection Algorithm
The heart of the solution is the dimension-matching algorithm in check_alarm_aws_resources_with_resource_list():
def check_alarm_aws_resources_with_resource_list(
resource_details: Dict[str, List[str]],
namespace: str,
alarms: List[Dict],
service_data: List[Dict]
) -> List[Dict]:
"""
Identifies orphan alarms by comparing alarm dimensions with active resources.
Algorithm:
1. Filter alarms by namespace
2. Check if alarm has required dimensions
3. Verify dimension values match an active resource
4. Flag as orphan if no match found
"""
AWS Session Management
The AWSSession module provides flexible authentication:
# Priority order:
# 1. AWS Profile (for local development)
# 2. IAM Role (for cross-account access)
# 3. Access Keys + Session Token (for temporary credentials)
# 4. Access Keys (for programmatic access)
# 5. Default credentials (for EC2/Lambda IAM roles)
Resource Discovery Pattern
Each service follows a consistent pagination pattern:
def get_service_details(client, max_results: int) -> List[Dict]:
"""Generic pattern for AWS API pagination"""
resources = []
next_token = None
while True:
# Fetch page of results
response = client.describe_resources(
MaxResults=max_results,
NextToken=next_token if next_token else None
)
# Process results
resources.extend(process_response(response))
# Check for more pages
if 'NextToken' not in response:
break
next_token = response['NextToken']
return resources
Report Generation
The Excel report uses xlsxwriter to create structured, readable output:
- One worksheet per service with orphan alarms
- Columns: AlarmName, Namespace, and all relevant dimensions
- Auto-sized columns for readability
- Optional automatic deletion of identified orphans
Configuration & Setup Instructions
Prerequisites
# Python 3.8+
python3 --version
# AWS CLI configured (optional)
aws configure
# Required IAM permissions (see iam_policies.json)
Installation
# Clone repository
git clone <repository-url>
cd devops-automation/aws-cw-orphan-alarms
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
Configuration Files
1. inputs.yml - Runtime Configuration
# AWS Authentication (choose one method)
region_name: "us-east-1"
profile_name: "my-aws-profile" # For local development
role_arn: "" # For cross-account access
access_key: "" # For programmatic access
secret_key: ""
session_token: "" # For temporary credentials
# Cleanup behavior
delete: false # Set to true to auto-delete orphan alarms
# Email notifications
Email:
enabled: true
details:
subject_prefix: "Production"
host: "smtp.office365.com:587"
username: "alerts@company.com"
password: "secure-password"
from: "alerts@company.com"
to:
- "devops@company.com"
cc:
- "platform@company.com"
2. input.json - Service Configuration
This file defines which services to scan and their dimension mappings. The default configuration covers all supported services. Customize to focus on specific services:
{
"EC2Instance": {
"Namespace": ["AWS/EC2", "CWAgent"],
"Dimension": ["InstanceId"],
"ExcludeDimension": []
}
}
IAM Permissions
Create an IAM policy with the following permissions:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"cloudwatch:DescribeAlarms",
"cloudwatch:DeleteAlarms",
"ec2:DescribeInstances",
"rds:DescribeDBInstances",
"rds:DescribeDBClusters",
"elasticloadbalancing:DescribeLoadBalancers",
"elasticloadbalancing:DescribeTargetGroups",
"ecs:ListClusters",
"ecs:ListServices",
"lambda:ListFunctions",
"lambda:ListVersionsByFunction",
"sqs:ListQueues"
],
"Resource": "*"
}
]
}
Usage Examples
Local Execution
# Dry run (report only, no deletion)
python main.py
# Review generated report
open Inventory.xlsx
# Enable deletion in inputs.yml, then run
python main.py
Jenkins Pipeline Integration
The included Jenkinsfile provides automated scheduling:
triggers {
cron('0 5 * * 1') // Every Monday at 5 AM
}
Pipeline Flow:
- Clean workspace
- Pull latest code from Git
- Download configuration from S3
- Execute Python script in virtual environment
- Send SNS notification on failure
Docker Execution
# Build image
docker build -t orphan-alarm-detector .
# Run container
docker run -v $(pwd)/inputs.yml:/app/inputs.yml \
-v $(pwd)/output:/app/output \
orphan-alarm-detector
Lambda Deployment
For serverless execution, package the application:
# Create deployment package
pip install -r requirements.txt -t package/
cp *.py package/
cd package && zip -r ../lambda-function.zip . && cd ..
# Deploy via AWS CLI
aws lambda create-function \
--function-name orphan-alarm-detector \
--runtime python3.13 \
--handler main.lambda_handler \
--zip-file fileb://lambda-function.zip \
--role arn:aws:iam::ACCOUNT:role/lambda-execution-role
Best Practices Followed
Code Quality
- PEP 8 Compliance: Consistent formatting and naming conventions
- Type Hints: Enhanced code readability and IDE support
- Docstrings: Comprehensive documentation for all functions
- Modular Design: Clear separation of concerns across modules
- Error Handling: Graceful degradation with detailed logging
AWS Best Practices
- Pagination: Handles large result sets without memory issues
- Rate Limiting: Respects AWS API throttling limits
- Least Privilege: Minimal IAM permissions required
- Multi-Region Support: Configurable region targeting
- Credential Security: No hardcoded credentials
Operational Excellence
- Logging: Structured logging for troubleshooting
- Idempotency: Safe to run multiple times
- Dry Run Mode: Test before making changes
- Audit Trail: Excel reports provide deletion history
- Notification: Email alerts for visibility
Security & Performance Considerations
Security
Credential Management:
- Never commit credentials to version control
- Use AWS Secrets Manager or Parameter Store for sensitive data
- Rotate credentials regularly
- Use IAM roles when possible (EC2, Lambda, ECS)
Network Security:
- Run in private subnets with VPC endpoints for AWS services
- Use security groups to restrict outbound access
- Enable VPC Flow Logs for audit trails
Data Protection:
- Excel reports may contain sensitive resource names
- Store reports in encrypted S3 buckets
- Implement lifecycle policies for automatic deletion
- Use SES with TLS for email transmission
Performance
Optimization Strategies:
-
Pagination Tuning: Adjust
max_resultsbased on account size
max_results = 100 # Balance between API calls and memory
- Parallel Processing: For multi-region scanning, use threading
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor(max_workers=5) as executor:
futures = [executor.submit(scan_region, region) for region in regions]
- Caching: Cache resource lists for multiple alarm checks
from functools import lru_cache
@lru_cache(maxsize=128)
def get_cached_resources(service: str) -> List[Dict]:
return fetch_resources(service)
Execution Time Estimates:
| Account Size | Resources | Alarms | Execution Time |
|---|---|---|---|
| Small | < 100 | < 50 | 30-60 seconds |
| Medium | 100-1000 | 50-500 | 2-5 minutes |
| Large | 1000-5000 | 500-2000 | 5-15 minutes |
| Enterprise | > 5000 | > 2000 | 15-30 minutes |
Common Pitfalls & Troubleshooting
Issue: "No credentials found"
Cause: AWS credentials not configured properly
Solution:
# Option 1: Configure AWS CLI
aws configure
# Option 2: Set environment variables
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
# Option 3: Use IAM role (recommended for EC2/Lambda)
# Attach IAM role to compute resource
Issue: "Rate limit exceeded"
Cause: Too many API calls in short period
Solution:
# Add exponential backoff
import time
from botocore.exceptions import ClientError
def api_call_with_retry(func, max_retries=3):
for attempt in range(max_retries):
try:
return func()
except ClientError as e:
if e.response['Error']['Code'] == 'Throttling':
wait_time = 2 ** attempt
time.sleep(wait_time)
else:
raise
Issue: "Permission denied" errors
Cause: Insufficient IAM permissions
Solution:
- Review CloudTrail logs for denied API calls
- Add missing permissions to IAM policy
- Verify IAM role trust relationships for assumed roles
Issue: Alarms not detected as orphans
Cause: Dimension mismatch or namespace filtering
Solution:
- Check alarm dimensions in CloudWatch console
- Verify
input.jsonconfiguration matches alarm structure - Review logs for dimension comparison details
Issue: Email notifications not sending
Cause: SMTP configuration or authentication issues
Solution:
# Verify SMTP settings
Email:
enabled: true
details:
host: "smtp.office365.com:587" # Include port
username: "full-email@domain.com" # Use full email
password: "app-specific-password" # Not regular password
Enhancements & Future Improvements
Planned Features
Multi-Region Support:
regions = ['us-east-1', 'us-west-2', 'eu-west-1']
for region in regions:
scan_region(region)
generate_report(region)
Multi-Account Scanning:
accounts = [
{'account_id': '111111111111', 'role': 'OrganizationAccountAccessRole'},
{'account_id': '222222222222', 'role': 'OrganizationAccountAccessRole'}
]
for account in accounts:
assume_role(account)
scan_account(account['account_id'])
Slack Integration:
def send_slack_notification(webhook_url: str, orphan_count: int):
payload = {
"text": f"Found {orphan_count} orphan alarms",
"attachments": [{"color": "warning", "text": "Review report"}]
}
requests.post(webhook_url, json=payload)
CloudWatch Dashboard:
- Metric: Number of orphan alarms detected
- Metric: Cost savings from deleted alarms
- Metric: Execution duration and success rate
Terraform/CloudFormation Deployment:
- Infrastructure as Code for Lambda deployment
- EventBridge rule for scheduled execution
- S3 bucket for report storage
- SNS topic for notifications
Advanced Filtering:
# Exclude alarms by tag
exclude_tags = {'Environment': 'Production', 'Critical': 'true'}
# Exclude alarms by name pattern
exclude_patterns = [r'^prod-.*', r'.*-critical$']
# Age-based filtering (only flag old orphans)
min_age_days = 7
Community Contributions Welcome
- Additional AWS service support (CloudFront, API Gateway, DynamoDB)
- Azure and GCP equivalents
- Web UI for report visualization
- Automated remediation workflows
- Integration with ServiceNow/Jira for ticketing
Conclusion
Orphan CloudWatch alarms are an often-overlooked source of waste and operational noise in AWS environments. This automated solution provides a production-ready approach to identifying and managing these orphan alarms at scale.
Key Takeaways:
✅ Cost Optimization: Eliminate unnecessary CloudWatch alarm costs
✅ Operational Efficiency: Reduce alert fatigue and manual cleanup efforts
✅ Flexibility: Multiple authentication methods and deployment options
✅ Safety: Dry-run mode and detailed reporting before deletion
✅ Scalability: Handles large environments with thousands of resources
✅ Integration: Works seamlessly with existing CI/CD pipelines
Getting Started:
- Clone the repository and install dependencies
- Configure
inputs.ymlwith your AWS credentials - Run in dry-run mode to generate initial report
- Review orphan alarms and validate detection accuracy
- Enable deletion mode for automated cleanup
- Schedule regular execution via Jenkins/Lambda
Production Deployment Checklist:
- [ ] IAM permissions configured with least privilege
- [ ] Credentials stored securely (Secrets Manager/Parameter Store)
- [ ] Email notifications tested and working
- [ ] Dry-run executed and results validated
- [ ] Backup of critical alarms documented
- [ ] Monitoring configured for script execution
- [ ] Runbook created for troubleshooting
- [ ] Stakeholders notified of automation deployment
This solution has been battle-tested in production environments managing thousands of CloudWatch alarms across multiple AWS accounts. By automating orphan alarm detection, teams can focus on meaningful alerts while maintaining clean, cost-effective monitoring infrastructure.
Author: Prashant Gupta | Cloud Platform Lead
Repository: devops-automation
License: MIT
Questions or Issues? Open an issue on GitHub or reach out to the DevOps team.

Top comments (0)