Introduction
Why do companies migrate from on-premises servers to the cloud?
Simple reasons:
- High maintenance overhead
- Expensive infrastructure
- Poor scalability
- Operational inefficiency
Cloud platforms like AWS, Azure, and GCP promise:
- Pay-as-you-go pricing
- Elastic scalability
- Managed services
- Faster innovation
Sounds perfect, right?
But here's the reality check:
Migrating to the cloud does not automatically mean your costs will go down.
Cloud cost optimization is a shared responsibility.
Cloud providers give you powerful tools.
It's your job to use them responsibly.
As a DevOps Engineer, one of your core responsibilities is:
Managing resources efficiently and cleaning up unused or stale infrastructure
And that's exactly what this project is about.
The Real Problem?
Let's take a very common scenario.
You spin up an EC2 instance to host an application.
It runs in a Dev/Test environment.
The workday ends...
But the instance keeps running overnight.
And the next night.
And the next week.
Money is burning silently.
Now multiply this by:
- 20 developers
- Multiple AWS regions
- Multiple projects
- Multiple environments
Manually stopping instances?
Not scalable.
Not reliable.
Not realistic.
Solution Overview
I built a serverless AWS cost optimization system that:
- Automatically detects non-critical EC2 instances
- Checks business hours
- Analyzes CPU usage
- Stops underutilized machines
- Logs every action
- Sends email alerts
- Runs on a schedule
All powered by:
- AWS Lambda
- Boto3 (Python SDK)
- CloudWatch Metrics
- DynamoDB
- SNS
- EventBridge
Why Serverless (Why Lambda?)
Lambda is perfect for this use case because:
- No server management
- Pay only for execution time
- Event-driven automation
- Auto scales
- Secure with IAM
Instead of running a VM 24/7 just to check EC2 usage...
We let AWS run code only when needed.
That's peak cloud efficiency.
Architecture
Here's how the system works:
EventBridge (Schedule)
↓
AWS Lambda (Cost Optimizer)
↓
EC2 Instances ←→ CloudWatch Metrics
↓
DynamoDB Logs
↓
SNS Email Alerts
Flow Explanation:
- EventBridge triggers Lambda on a schedule
- Lambda scans EC2 instances using tags
- Lambda checks:
- Business hours
- CPU utilization
- If eligible -> stops EC2
- Logs action into DynamoDB
- Sends email alert using SNS
Tag-Based Safety Layer
One of the smartest design choices was tag-based filtering.
Only EC2 instances with all of these tags are processed:
| Tag Key | Required Value |
| --- | --- |
| AutoStop | Yes |
| Environment | Dev or Test |
| Critical | No |
This ensures:
- Production systems are never touched.
- Critical workloads are protected.
- Only disposable environments are optimized.
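For illustration, here is a minimal boto3 sketch of opting an instance in; the instance ID is a placeholder:

import boto3

ec2 = boto3.client('ec2')

# Opt a Dev instance into the optimizer (instance ID is hypothetical)
ec2.create_tags(
    Resources=['i-0abc12345'],
    Tags=[
        {'Key': 'AutoStop', 'Value': 'Yes'},
        {'Key': 'Environment', 'Value': 'Dev'},
        {'Key': 'Critical', 'Value': 'No'}
    ]
)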
Time-Based Optimization
Business hours rules:
If current time >= 8 PM OR < 8 AM
→ instance is eligible for stopping
Time is calculated using a configurable time zone, so the system works globally.
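A minimal sketch of that check, using pytz with an example time zone value:

import datetime
import pytz

TIMEZONE = "Asia/Kolkata"  # example value; any IANA zone name works

hour = datetime.datetime.now(pytz.timezone(TIMEZONE)).hour

# Outside 8 AM - 8 PM local time -> eligible for stopping
after_hours = hour >= 20 or hour < 8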
CPU-Based Optimization
Even during business hours, some machines are just... idle.
Lambda fetches CPU metrics from CloudWatch:
Average CPU < CPU_THRESHOLD
→ instance is eligible for stopping
This eliminates waste from zombie EC2 instances doing absolutely nothing.
DRY_RUN Mode
One of the most important features:
DRY_RUN = true
In this mode:
- Lambda does NOT stop EC2
- Logs and alerts still run
- You can safely test logic
When ready:
DRY_RUN = false
Now Lambda actually stops EC2.
This prevents accidental disasters.
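One way to flip the flag without redeploying is via boto3; the function name below is a placeholder:

import boto3

lam = boto3.client('lambda')

# Read the current env vars, flip DRY_RUN, and write them back.
# update_function_configuration replaces the whole variable map,
# so merge instead of overwriting. Function name is hypothetical.
cfg = lam.get_function_configuration(FunctionName='ec2-cost-optimizer')
env = cfg['Environment']['Variables']
env['DRY_RUN'] = 'false'

lam.update_function_configuration(
    FunctionName='ec2-cost-optimizer',
    Environment={'Variables': env}
)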
Core Lambda Code
import boto3
import datetime
import os
import pytz

# AWS service clients
ec2 = boto3.client('ec2')
cloudwatch = boto3.client('cloudwatch')
sns = boto3.client('sns')
ddb = boto3.resource('dynamodb')

# Configuration comes from Lambda environment variables
TABLE_NAME = os.environ['TABLE_NAME']
SNS_TOPIC_ARN = os.environ['SNS_TOPIC_ARN']
DRY_RUN = os.environ['DRY_RUN'].lower() == "true"
CPU_THRESHOLD = int(os.environ['CPU_THRESHOLD'])
TIMEZONE = os.environ['TIMEZONE']

table = ddb.Table(TABLE_NAME)


def get_cpu_utilization(instance_id):
    """Return the most recent average CPU over the last 30 minutes, or None."""
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=datetime.datetime.utcnow() - datetime.timedelta(minutes=30),
        EndTime=datetime.datetime.utcnow(),
        Period=300,
        Statistics=['Average']
    )
    if not response['Datapoints']:
        return None  # safe fallback: no data means no CPU-based decision
    latest = sorted(response['Datapoints'], key=lambda x: x['Timestamp'])[-1]
    return latest['Average']


def lambda_handler(event, context):
    tz = pytz.timezone(TIMEZONE)
    now = datetime.datetime.now(tz)
    hour = now.hour

    # Only running instances that opted in via tags are considered
    instances = ec2.describe_instances(Filters=[
        {'Name': 'tag:AutoStop', 'Values': ['Yes']},
        {'Name': 'tag:Environment', 'Values': ['Dev', 'Test']},
        {'Name': 'tag:Critical', 'Values': ['No']},
        {'Name': 'instance-state-name', 'Values': ['running']}
    ])

    for res in instances['Reservations']:
        for inst in res['Instances']:
            instance_id = inst['InstanceId']
            cpu = get_cpu_utilization(instance_id)
            reason = None

            # Rule 1: outside business hours (8 AM - 8 PM local time)
            if hour >= 20 or hour < 8:
                reason = "After hours"
            # Rule 2: idle even during business hours
            elif cpu is not None and cpu < CPU_THRESHOLD:
                reason = f"Low CPU: {cpu:.1f}%"

            if reason:
                if not DRY_RUN:
                    ec2.stop_instances(InstanceIds=[instance_id])
                # Audit trail in DynamoDB
                table.put_item(Item={
                    "InstanceId": instance_id,
                    "Timestamp": str(now),
                    "Reason": reason
                })
                # Human-readable alert
                sns.publish(
                    TopicArn=SNS_TOPIC_ARN,
                    Subject="EC2 Optimization Alert",
                    Message=f"Stopped {instance_id} due to {reason}"
                )

    return {"statusCode": 200, "body": "Optimization complete"}
DynamoDB Logging
Each stop action is logged:
{
  "InstanceId": "i-0abc12345",
  "Timestamp": "2026-01-22 20:15:04",
  "Reason": "Low CPU: 4.2%"
}
This gives:
- Audit history
- Compliance data
- Debugging visibility
- Cost optimization reports
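To turn those logs into a quick report, you can query the table back. A sketch, assuming InstanceId is the partition key and Timestamp the sort key; the table name is a placeholder:

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource('dynamodb').Table('EC2OptimizerLogs')  # placeholder name

# All logged actions for one instance, oldest first
resp = table.query(KeyConditionExpression=Key('InstanceId').eq('i-0abc12345'))
for item in resp['Items']:
    print(item['Timestamp'], item['Reason'])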
SNS Notifications
Every action triggers an email alert:
Subject:
EC2 Optimization Alert
Message:
Stopped i-0abc12345 due to After hours
This ensures:
- Transparency
- Human awareness
- Manual override if needed
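Wiring up a recipient is a single boto3 call. A sketch with a placeholder topic ARN and address; the recipient must confirm the subscription email:

import boto3

sns = boto3.client('sns')

# Topic ARN and email address are placeholders
sns.subscribe(
    TopicArn='arn:aws:sns:us-east-1:123456789012:ec2-optimizer-alerts',
    Protocol='email',
    Endpoint='devops-team@example.com'
)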
Automation with EventBridge
EventBridge schedules Lambda:
| Rule | Purpose |
| --- | --- |
| EC2AutoStopRule | Runs every hour or at 8 PM |
| EC2AutoStartRule | Starts instances at 8 AM |
Now the system is fully hands-free.
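For reference, a sketch of creating the stop rule with boto3. The cron expression and function ARN are illustrative, and EventBridge also needs permission (via lambda add_permission) to invoke the function:

import boto3

events = boto3.client('events')

# Fire every day at 8 PM UTC (cron expression is illustrative)
events.put_rule(
    Name='EC2AutoStopRule',
    ScheduleExpression='cron(0 20 * * ? *)'
)

# Point the rule at the optimizer Lambda (ARN is a placeholder)
events.put_targets(
    Rule='EC2AutoStopRule',
    Targets=[{
        'Id': 'ec2-cost-optimizer',
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:ec2-cost-optimizer'
    }]
)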
Challenges I Faced
- Runtime mismatch: a Python 3.14 Lambda with a Python 3.10 layer crashed. Fix: matched the runtime versions.
- IAM permission hell: missing permissions broke EC2 stop, DynamoDB logs, and SNS alerts. Fix: attached least-privilege IAM policies.
- Environment variable bugs: wrong variable names caused KeyErrors. Fix: standardized the env vars (TABLE_NAME, SNS_TOPIC_ARN).
- Lambda working but EC2 not stopping: turns out DRY_RUN was still true. Fix: flipped it to false.
- No CPU metrics available: CloudWatch had no datapoints. Fix: added safe fallback logic.
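As a reference point, here is a sketch of the kind of least-privilege inline policy that fixed the permission issues. The role and policy names are placeholders, and a production policy should scope Resource more tightly than "*":

import json
import boto3

iam = boto3.client('iam')

# Only the actions the optimizer actually calls
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["ec2:DescribeInstances", "ec2:StopInstances"],
         "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["cloudwatch:GetMetricStatistics"],
         "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["dynamodb:PutItem"],
         "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["sns:Publish"],
         "Resource": "*"}
    ]
}

iam.put_role_policy(
    RoleName='ec2-cost-optimizer-role',        # placeholder
    PolicyName='ec2-optimizer-least-privilege',
    PolicyDocument=json.dumps(policy)
)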
Conclusion
Cloud cost optimization is not just about cutting bills - it’s about building responsible, automated, and scalable systems.
In this project, we proved that with the right mix of AWS serverless services and Python automation, it’s possible to create a production-grade cost optimization system that:
- Automatically stops non-essential EC2 instances
- Protects critical workloads using tags
- Makes smart decisions using time and CPU metrics
- Logs every action for audit and visibility
- Sends real-time alerts for transparency
- Runs fully hands-free using EventBridge
- Includes a DRY_RUN safety mode to prevent accidents
This approach eliminates manual intervention, reduces human error, and ensures that cloud resources are used only when they are truly needed.
By combining Lambda, Boto3, CloudWatch, DynamoDB, SNS, and EventBridge, we created a lightweight yet powerful solution that can be easily extended to:
- Auto-start instances in the morning
- Optimize EBS volumes and snapshots
- Integrate Slack or Teams notifications
- Track cost trends using AWS Cost Explorer
- Manage multi-account AWS environments
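As a taste of the first extension, a minimal sketch of a companion morning-start Lambda, reusing the same opt-in tags:

import boto3

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    # Find stopped instances that opted in to automation
    resp = ec2.describe_instances(Filters=[
        {'Name': 'tag:AutoStop', 'Values': ['Yes']},
        {'Name': 'instance-state-name', 'Values': ['stopped']}
    ])
    ids = [i['InstanceId']
           for r in resp['Reservations']
           for i in r['Instances']]
    if ids:
        ec2.start_instances(InstanceIds=ids)
    return {"statusCode": 200, "body": f"Started {len(ids)} instances"}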
