Introduction
Why do companies migrate from on-premises servers to the cloud?
Simple reasons:
- High maintenance overhead
- Expensive infrastructure
- Poor scalability
- Operational inefficiency
Cloud platforms like AWS, Azure, and GCP promise:
- Pay-as-you-go pricing
- Elastic scalability
- Managed services
- Faster innovation
Sounds perfect, right?
But here's the reality check:
Migrating to the cloud does not automatically mean your costs will go down.
Cloud cost optimization is a shared responsibility.
Cloud providers give you powerful tools.
It's your job to use them responsibly.
As a DevOps Engineer, one of your core responsibilities is:
Managing resources efficiently and cleaning up unused or stale infrastructure
And that's exactly what this project is about.
The Real Problem?
Let's take a very common scenario.
You spin up an EC2 instance to host an application.
It runs in a Dev/Test environment.
The workday ends...
But the instance keeps running overnight.
And the next night.
And the next week.
Money is burning silently.
Now multiply this by:
- 20 developers
- Multiple AWS regions
- Multiple projects
- Multiple environments
Manually stopping instances?
Not scalable.
Not reliable.
Not realistic.
Solution Overview
I built a serverless AWS cost optimization system that:
- Automatically detects non-critical EC2 instances
- Checks business hours
- Analyzes CPU usage
- Stops underutilized machines
- Logs every action
- Sends email alerts
- Runs on a schedule
All powered by:
- AWS Lambda
- Boto3 (Python SDK)
- CloudWatch Metrics
- DynamoDB
- SNS
- EventBridge
Why Serverless (Why Lambda?)
Lambda is perfect for this use case because:
- No server management
- Pay only for execution time
- Event-driven automation
- Auto scales
- Secure with IAM
Instead of running a VM 24/7 just to check EC2 usage...
We let AWS run code only when needed.
That's peak cloud efficiency.
Architecture
Here's how the system works:
EventBridge (Schedule)
↓
AWS Lambda (Cost Optimizer)
↓
EC2 Instances ←→ CloudWatch Metrics
↓
DynamoDB Logs
↓
SNS Email Alerts
Flow Explanation:
- EventBridge triggers Lambda on a schedule
- Lambda scans EC2 instances using tags
- Lambda checks:
- Business hours
- CPU utilization
- If eligible -> stops EC2
- Logs action into DynamoDB
- Sends email alert using SNS
Tag-Based Safety Layer
One of the smartest design choices was tag-based filtering.
Only EC2 instances with all of these tags are processed:
| Tag Key | Required Value |
| --- | --- |
| AutoStop | Yes |
| Environment | Dev or Test |
| Critical | No |
This ensures:
- Production systems are never touched.
- Critical workloads are protected.
- Only disposable environments are optimized.
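For illustration, here is a minimal boto3 sketch of opting an instance in; the instance ID is a placeholder:

import boto3

ec2 = boto3.client('ec2')

# Opt a Dev instance into the optimizer (instance ID is hypothetical)
ec2.create_tags(
    Resources=['i-0abc12345'],
    Tags=[
        {'Key': 'AutoStop', 'Value': 'Yes'},
        {'Key': 'Environment', 'Value': 'Dev'},
        {'Key': 'Critical', 'Value': 'No'}
    ]
)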
Time-Based Optimization
Business hours rules:
If current time >= 8 PM OR < 8 AM
→ instance is eligible for stopping
Time is calculated using a configurable time zone, so the system works globally.
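A minimal sketch of that check, using pytz with an example time zone value:

import datetime
import pytz

TIMEZONE = "Asia/Kolkata"  # example value; any IANA zone name works

hour = datetime.datetime.now(pytz.timezone(TIMEZONE)).hour

# Outside 8 AM - 8 PM local time -> eligible for stopping
after_hours = hour >= 20 or hour < 8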
CPU-Based Optimization
Even during business hours, some machines are just... idle.
Lambda fetches CPU metrics from CloudWatch:
Average CPU < CPU_THRESHOLD
→ instance is eligible for stopping
This eliminates waste from zombie EC2 instances doing absolutely nothing.
DRY_RUN Mode
One of the most important features:
DRY_RUN = true
In this mode:
- Lambda does NOT stop EC2
- Logs and alerts still run
- You can safely test logic
When ready:
DRY_RUN = false
Now Lambda actually stops EC2.
This prevents accidental disasters.
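One way to flip the flag without redeploying is via boto3; the function name below is a placeholder:

import boto3

lam = boto3.client('lambda')

# Read the current env vars, flip DRY_RUN, and write them back.
# update_function_configuration replaces the whole variable map,
# so merge instead of overwriting. Function name is hypothetical.
cfg = lam.get_function_configuration(FunctionName='ec2-cost-optimizer')
env = cfg['Environment']['Variables']
env['DRY_RUN'] = 'false'

lam.update_function_configuration(
    FunctionName='ec2-cost-optimizer',
    Environment={'Variables': env}
)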
Core Lambda Code
import boto3
import datetime
import os
import pytz

# AWS service clients
ec2 = boto3.client('ec2')
cloudwatch = boto3.client('cloudwatch')
sns = boto3.client('sns')
ddb = boto3.resource('dynamodb')

# Configuration comes from Lambda environment variables
TABLE_NAME = os.environ['TABLE_NAME']
SNS_TOPIC_ARN = os.environ['SNS_TOPIC_ARN']
DRY_RUN = os.environ['DRY_RUN'].lower() == "true"
CPU_THRESHOLD = int(os.environ['CPU_THRESHOLD'])
TIMEZONE = os.environ['TIMEZONE']

table = ddb.Table(TABLE_NAME)


def get_cpu_utilization(instance_id):
    """Return the most recent average CPU over the last 30 minutes, or None."""
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=datetime.datetime.utcnow() - datetime.timedelta(minutes=30),
        EndTime=datetime.datetime.utcnow(),
        Period=300,
        Statistics=['Average']
    )
    if not response['Datapoints']:
        return None  # safe fallback: no data means no CPU-based decision
    latest = sorted(response['Datapoints'], key=lambda x: x['Timestamp'])[-1]
    return latest['Average']


def lambda_handler(event, context):
    tz = pytz.timezone(TIMEZONE)
    now = datetime.datetime.now(tz)
    hour = now.hour

    # Only running instances that opted in via tags are considered
    instances = ec2.describe_instances(Filters=[
        {'Name': 'tag:AutoStop', 'Values': ['Yes']},
        {'Name': 'tag:Environment', 'Values': ['Dev', 'Test']},
        {'Name': 'tag:Critical', 'Values': ['No']},
        {'Name': 'instance-state-name', 'Values': ['running']}
    ])

    for res in instances['Reservations']:
        for inst in res['Instances']:
            instance_id = inst['InstanceId']
            cpu = get_cpu_utilization(instance_id)
            reason = None

            # Rule 1: outside business hours (8 AM - 8 PM local time)
            if hour >= 20 or hour < 8:
                reason = "After hours"
            # Rule 2: idle even during business hours
            elif cpu is not None and cpu < CPU_THRESHOLD:
                reason = f"Low CPU: {cpu:.1f}%"

            if reason:
                if not DRY_RUN:
                    ec2.stop_instances(InstanceIds=[instance_id])
                # Audit trail in DynamoDB
                table.put_item(Item={
                    "InstanceId": instance_id,
                    "Timestamp": str(now),
                    "Reason": reason
                })
                # Human-readable alert
                sns.publish(
                    TopicArn=SNS_TOPIC_ARN,
                    Subject="EC2 Optimization Alert",
                    Message=f"Stopped {instance_id} due to {reason}"
                )

    return {"statusCode": 200, "body": "Optimization complete"}
DynamoDB Logging
Each stop action is logged:
{
  "InstanceId": "i-0abc12345",
  "Timestamp": "2026-01-22 20:15:04",
  "Reason": "Low CPU: 4.2%"
}
This gives:
- Audit history
- Compliance data
- Debugging visibility
- Cost optimization reports
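To turn those logs into a quick report, you can query the table back. A sketch, assuming InstanceId is the partition key and Timestamp the sort key; the table name is a placeholder:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('EC2OptimizerLogs')  # placeholder name

# All logged actions for one instance, oldest first
resp = table.query(KeyConditionExpression=Key('InstanceId').eq('i-0abc12345'))
for item in resp['Items']:
    print(item['Timestamp'], item['Reason'])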
SNS Notifications
Every action triggers an email alert:
Subject:
EC2 Optimization Alert
Message:
Stopped i-0abc12345 due to After hours
This ensures:
- Transparency
- Human awareness
- Manual override if needed
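Wiring up a recipient is a single boto3 call. A sketch with a placeholder topic ARN and address; the recipient must confirm the subscription email:

import boto3

sns = boto3.client('sns')

# Topic ARN and email address are placeholders
sns.subscribe(
    TopicArn='arn:aws:sns:us-east-1:123456789012:ec2-optimizer-alerts',
    Protocol='email',
    Endpoint='devops-team@example.com'
)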
Automation with EventBridge
EventBridge schedules Lambda:
| Rule | Purpose |
| --- | --- |
| EC2AutoStopRule | Runs every hour or at 8 PM |
| EC2AutoStartRule | Starts instances at 8 AM |
Now the system is fully hands-free.
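For reference, a sketch of creating the stop rule with boto3. The cron expression and function ARN are illustrative, and EventBridge also needs permission (via lambda add_permission) to invoke the function:

import boto3

events = boto3.client('events')

# Fire every day at 8 PM UTC (cron expression is illustrative)
events.put_rule(
    Name='EC2AutoStopRule',
    ScheduleExpression='cron(0 20 * * ? *)'
)

# Point the rule at the optimizer Lambda (ARN is a placeholder)
events.put_targets(
    Rule='EC2AutoStopRule',
    Targets=[{
        'Id': 'ec2-cost-optimizer',
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:ec2-cost-optimizer'
    }]
)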
Challenges I Faced
- Runtime mismatch: a Python 3.14 Lambda with a Python 3.10 layer crashed. Fix: matched the runtime versions.
- IAM permission hell: missing permissions broke EC2 stop, DynamoDB logs, and SNS alerts. Fix: attached least-privilege IAM policies.
- Environment variable bugs: wrong variable names caused KeyErrors. Fix: standardized the env vars (TABLE_NAME, SNS_TOPIC_ARN).
- Lambda working but EC2 not stopping: turns out DRY_RUN was still true. Fix: flipped it to false.
- No CPU metrics available: CloudWatch had no datapoints. Fix: added safe fallback logic.
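As a reference point, here is a sketch of the kind of least-privilege inline policy that fixed the permission issues. The role and policy names are placeholders, and a production policy should scope Resource more tightly than "*":

import json
import boto3

iam = boto3.client('iam')

# Only the actions the optimizer actually calls
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["ec2:DescribeInstances", "ec2:StopInstances"],
         "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["cloudwatch:GetMetricStatistics"],
         "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["dynamodb:PutItem"],
         "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["sns:Publish"],
         "Resource": "*"}
    ]
}

iam.put_role_policy(
    RoleName='ec2-cost-optimizer-role',        # placeholder
    PolicyName='ec2-optimizer-least-privilege',
    PolicyDocument=json.dumps(policy)
)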
Conclusion
Cloud cost optimization is not just about cutting bills - it’s about building responsible, automated, and scalable systems.
In this project, we proved that with the right mix of AWS serverless services and Python automation, it’s possible to create a production-grade cost optimization system that:
- Automatically stops non-essential EC2 instances
- Protects critical workloads using tags
- Makes smart decisions using time and CPU metrics
- Logs every action for audit and visibility
- Sends real-time alerts for transparency
- Runs fully hands-free using EventBridge
- Includes a DRY_RUN safety mode to prevent accidents
This approach eliminates manual intervention, reduces human error, and ensures that cloud resources are used only when they are truly needed.
By combining Lambda, Boto3, CloudWatch, DynamoDB, SNS, and EventBridge, we created a lightweight yet powerful solution that can be easily extended to:
- Auto-start instances in the morning
- Optimize EBS volumes and snapshots
- Integrate Slack or Teams notifications
- Track cost trends using AWS Cost Explorer
- Manage multi-account AWS environments
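As a taste of the first extension, a minimal sketch of a companion morning-start Lambda, reusing the same opt-in tags:

import boto3

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    # Find stopped instances that opted in to automation
    resp = ec2.describe_instances(Filters=[
        {'Name': 'tag:AutoStop', 'Values': ['Yes']},
        {'Name': 'instance-state-name', 'Values': ['stopped']}
    ])
    ids = [i['InstanceId']
           for r in resp['Reservations']
           for i in r['Instances']]
    if ids:
        ec2.start_instances(InstanceIds=ids)
    return {"statusCode": 200, "body": f"Started {len(ids)} instances"}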
