Amit Kushwaha
Cloud Cost Optimization Using Boto3: Automating EC2 Management with AWS Lambda

Introduction

Why do companies migrate from on-premises servers to the cloud?

Simple reasons:

  • High maintenance overhead
  • Expensive infrastructure
  • Poor scalability
  • Operational inefficiency

Cloud Platforms like AWS, Azure, and GCP promise:

  • Pay-as-you-go pricing
  • Elastic scalability
  • Managed services
  • Faster innovation

Sounds perfect, right?
But here's the reality check:

Migrating to the cloud does not automatically mean your costs will go down.

Cloud cost optimization is a shared responsibility.

Cloud providers give you powerful tools.
It's your job to use them responsibly.

As a DevOps Engineer, one of your core responsibilities is:

Managing resources efficiently and cleaning up unused or stale infrastructure

And that's exactly what this project is about.


The Real Problem

Let's take a very common scenario.

You spin up an EC2 instance to host an application.
It runs in a Dev/Test environment.
The workday ends...
But the instance keeps running overnight.
And the next night.
And the next week.

Money is burning silently.

Now multiply this by:

  • 20 developers
  • Multiple AWS regions
  • Multiple projects
  • Multiple environments

Manually stopping instances?
Not scalable.
Not reliable.
Not realistic.


Solution Overview

I built a serverless AWS cost optimization system that:

  • Automatically detects non-critical EC2 instances
  • Checks business hours
  • Analyzes CPU usage
  • Stops underutilized machines
  • Logs every action
  • Sends email alerts
  • Runs on a schedule

All powered by:

  • AWS Lambda
  • Boto3 (Python SDK)
  • CloudWatch Metrics
  • DynamoDB
  • SNS
  • EventBridge

Why Serverless (Why Lambda?)

Lambda is perfect for this use case because:

  • No server management
  • Pay only for execution time
  • Event-driven automation
  • Auto scaling
  • Secure with IAM

Instead of running a VM 24/7 just to check EC2 usage...

We let AWS run code only when needed.

That's peak cloud efficiency.


Architecture

Here's how the system works:


EventBridge (Schedule)
        ↓
AWS Lambda (Cost Optimizer)
        ↓
EC2 Instances  ←→  CloudWatch Metrics
        ↓
DynamoDB Logs
        ↓
SNS Email Alerts


Flow Explanation:

  1. EventBridge triggers Lambda on a Schedule
  2. Lambda scans EC2 instances using tags
  3. Lambda checks:
    • Business hours
    • CPU utilization
  4. If eligible -> stops EC2
  5. Logs action into DynamoDB
  6. Sends email alert using SNS

Tag-Based Safety Layer

One of the smartest design choices was tag-based filtering.

Only EC2 instances with these tags are processed:

| Tag Key     | Required Value |
| ----------- | -------------- |
| AutoStop    | Yes            |
| Environment | Dev or Test    |
| Critical    | No             |

This ensures:

  • Production systems are never touched
  • Critical workloads are protected
  • Only disposable environments are optimized
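As a sketch, the tag filter the Lambda passes to `describe_instances` can be built as plain data, and opting an instance in is a single `create_tags` call (the region default and helper names here are illustrative, not from the original project):

```python
def build_optimizer_filters():
    """The filter set the Lambda uses: all three tags must match,
    and only running instances are considered."""
    return [
        {"Name": "tag:AutoStop", "Values": ["Yes"]},
        {"Name": "tag:Environment", "Values": ["Dev", "Test"]},
        {"Name": "tag:Critical", "Values": ["No"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]

def tag_for_autostop(instance_id, region="us-east-1"):
    """Opt a Dev instance in to the optimizer by applying the required tags."""
    import boto3  # imported lazily so the pure helper above has no dependencies
    ec2 = boto3.client("ec2", region_name=region)
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[
            {"Key": "AutoStop", "Value": "Yes"},
            {"Key": "Environment", "Value": "Dev"},
            {"Key": "Critical", "Value": "No"},
        ],
    )
```

Because the safety layer is opt-in (an instance must carry `AutoStop=Yes`), forgetting to tag something fails safe: the untagged instance is simply ignored.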

Time-Based Optimization

Business hours rules:

If current time >= 8 PM OR < 8 AM
→ instance is eligible for stopping

Time is calculated using a configurable time zone, so the system works globally.
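The business-hours rule above boils down to a tiny pure function. The project's Lambda uses pytz; as a sketch, the stdlib `zoneinfo` module (Python 3.9+) does the same job here, and the time zone name is just an example:

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # stdlib alternative to pytz on Python 3.9+

def is_after_hours(now, start_hour=8, end_hour=20):
    """Eligible to stop when local time is >= 8 PM or < 8 AM."""
    return now.hour >= end_hour or now.hour < start_hour

def local_now(tz_name="Asia/Kolkata"):
    """Current wall-clock time in the configured zone (zone name is an example)."""
    return datetime.now(ZoneInfo(tz_name))
```

Keeping the check pure (a `datetime` in, a `bool` out) makes it trivial to unit-test without mocking AWS.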


CPU-Based Optimization

Even during business hours, some machines are just... idle.

Lambda fetches CPU metrics from CloudWatch:

Average CPU < CPU_THRESHOLD
→ instance is eligible for stopping


This catches waste from zombie EC2 instances doing absolutely nothing.
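Combining the two rules (after-hours first, then idle CPU) is only a few lines. This is a sketch of the same decision the handler below makes, with a 10% threshold as an example value:

```python
def stop_reason(hour, cpu, threshold=10):
    """Return why the instance should stop, or None to leave it running.
    After-hours wins over CPU, and missing metrics never trigger a stop."""
    if hour >= 20 or hour < 8:
        return "After hours"
    if cpu is not None and cpu < threshold:
        return f"Low CPU: {cpu}%"
    return None
```

Note the `cpu is not None` guard: if CloudWatch returns no datapoints, the safe default is to do nothing rather than stop a machine on missing data.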


DRY_RUN Mode
One of the most important features:

DRY_RUN = true

In this mode:

  • Lambda does NOT stop EC2
  • Logs and alerts still run
  • You can safely test logic

When ready:

DRY_RUN = false

Now Lambda actually stops EC2.
This prevents accidental disasters.
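A defensive way to parse the flag, sketched below: anything other than an explicit "true"-like value falls back to safe mode, and the stop call itself is gated in one place (the helper names are mine, not from the project):

```python
import os

def read_flag(name, default="true"):
    """Parse a boolean env var; unrecognized values default to safe mode."""
    return os.environ.get(name, default).strip().lower() in ("true", "1", "yes")

def maybe_stop(ec2, instance_id, dry_run):
    """Stop the instance only when DRY_RUN is off; always report the action."""
    if dry_run:
        return f"[DRY_RUN] would stop {instance_id}"
    ec2.stop_instances(InstanceIds=[instance_id])
    return f"stopped {instance_id}"
```

Funneling every stop through one guarded helper means there is exactly one line in the codebase that can actually touch an instance.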


Core Lambda Code

import boto3
import datetime
import os
import pytz

ec2 = boto3.client('ec2')
cloudwatch = boto3.client('cloudwatch')
sns = boto3.client('sns')
ddb = boto3.resource('dynamodb')

TABLE_NAME = os.environ['TABLE_NAME']
SNS_TOPIC_ARN = os.environ['SNS_TOPIC_ARN']
DRY_RUN = os.environ['DRY_RUN'].lower() == "true"
CPU_THRESHOLD = int(os.environ['CPU_THRESHOLD'])
TIMEZONE = os.environ['TIMEZONE']

table = ddb.Table(TABLE_NAME)

def get_cpu_utilization(instance_id):
    response = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
        StartTime=datetime.datetime.utcnow() - datetime.timedelta(minutes=30),
        EndTime=datetime.datetime.utcnow(),
        Period=300,
        Statistics=['Average']
    )

    if not response['Datapoints']:
        return None

    latest = sorted(response['Datapoints'], key=lambda x: x['Timestamp'])[-1]
    return latest['Average']

def lambda_handler(event, context):
    tz = pytz.timezone(TIMEZONE)
    now = datetime.datetime.now(tz)
    hour = now.hour

    instances = ec2.describe_instances(Filters=[
        {'Name': 'tag:AutoStop', 'Values': ['Yes']},
        {'Name': 'tag:Environment', 'Values': ['Dev', 'Test']},
        {'Name': 'tag:Critical', 'Values': ['No']},
        {'Name': 'instance-state-name', 'Values': ['running']}
    ])

    for res in instances['Reservations']:
        for inst in res['Instances']:
            instance_id = inst['InstanceId']
            cpu = get_cpu_utilization(instance_id)

            reason = None
            if hour >= 20 or hour < 8:
                reason = "After hours"
            elif cpu is not None and cpu < CPU_THRESHOLD:
                reason = f"Low CPU: {cpu}%"

            if reason:
                if not DRY_RUN:
                    ec2.stop_instances(InstanceIds=[instance_id])

                table.put_item(Item={
                    "InstanceId": instance_id,
                    "Timestamp": str(now),
                    "Reason": reason
                })

                sns.publish(
                    TopicArn=SNS_TOPIC_ARN,
                    Subject="EC2 Optimization Alert",
                    Message=f"Stopped {instance_id} due to {reason}"
                )

    return {"statusCode": 200, "body": "Optimization complete"}


DynamoDB Logging

Each stop action is logged:

{
  "InstanceId": "i-0abc12345",
  "Timestamp": "2026-01-22 20:15:04",
  "Reason": "Low CPU: 4.2%"
}

This gives:

  • Audit history
  • Compliance data
  • Debugging visibility
  • Cost optimization reports
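As a sketch, building the record and reading the audit trail back could look like this. The table's key schema isn't shown in the article, so the query below assumes `InstanceId` is the partition key:

```python
from datetime import datetime, timezone

def build_log_item(instance_id, reason, now=None):
    """Shape one audit record the same way the Lambda writes it."""
    now = now or datetime.now(timezone.utc)
    return {
        "InstanceId": instance_id,
        "Timestamp": now.strftime("%Y-%m-%d %H:%M:%S"),
        "Reason": reason,
    }

def recent_actions(table_name, instance_id, region="us-east-1"):
    """Fetch the audit trail for one instance (assumes InstanceId
    is the table's partition key)."""
    import boto3
    from boto3.dynamodb.conditions import Key
    table = boto3.resource("dynamodb", region_name=region).Table(table_name)
    resp = table.query(KeyConditionExpression=Key("InstanceId").eq(instance_id))
    return resp["Items"]
```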

SNS Notifications

Every action triggers an email alert:

Subject:

EC2 Optimization Alert

Message:

Stopped i-0abc12345 due to After hours

This ensures:

  • Transparency
  • Human awareness
  • Manual override if needed

Automation with EventBridge

EventBridge schedules Lambda:

| Rule             | Purpose                      |
| ---------------- | ---------------------------- |
| EC2AutoStopRule  | Runs every hour or at 8 PM   |
| EC2AutoStartRule | Starts instances at 8 AM     |

Now the system is fully hands-free.
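The rules above can also be created with Boto3. A sketch, assuming a daily 8 PM trigger (EventBridge cron hours are UTC, so adjust for your time zone; the target ID and Lambda ARN are placeholders):

```python
def daily_cron(hour_utc):
    """EventBridge cron expression for once a day at the given UTC hour."""
    return f"cron(0 {hour_utc} * * ? *)"

def create_stop_rule(lambda_arn, region="us-east-1"):
    """Wire EC2AutoStopRule to the optimizer Lambda."""
    import boto3
    events = boto3.client("events", region_name=region)
    events.put_rule(
        Name="EC2AutoStopRule",
        ScheduleExpression=daily_cron(20),  # 20:00 UTC every day
        State="ENABLED",
    )
    events.put_targets(
        Rule="EC2AutoStopRule",
        Targets=[{"Id": "ec2-cost-optimizer", "Arn": lambda_arn}],
    )
```

One gotcha worth remembering: the Lambda also needs a resource-based permission allowing `events.amazonaws.com` to invoke it, or the rule will fire into the void.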


Challenges I Faced

  1. Runtime mismatch
    Python 3.14 Lambda + Python 3.10 layer -> crash

    Fix: matched runtime versions

  2. IAM permission hell
    Missing permissions broke EC2 stop, DynamoDB logs, SNS alerts

    Fix: attached least-privilege IAM policies

  3. Environment variable bugs
    Wrong variable names caused KeyErrors

    Fix: standardized env vars (TABLE_NAME, SNS_TOPIC_ARN)

  4. Lambda working but EC2 not stopping
    Turns out DRY_RUN was still true

    Fix: flipped it to false

  5. No CPU metrics available
    CloudWatch had no datapoints
    Fix: added safe fallback logic

Conclusion

Cloud cost optimization is not just about cutting bills - it’s about building responsible, automated, and scalable systems.

In this project, we proved that with the right mix of AWS serverless services and Python automation, it’s possible to create a production-grade cost optimization system that:

  • Automatically stops non-essential EC2 instances
  • Protects critical workloads using tags
  • Makes smart decisions using time and CPU metrics
  • Logs every action for audit and visibility
  • Sends real-time alerts for transparency
  • Runs fully hands-free using EventBridge
  • Includes a DRY_RUN safety mode to prevent accidents

This approach eliminates manual intervention, reduces human error, and ensures that cloud resources are used only when they are truly needed.

By combining Lambda, Boto3, CloudWatch, DynamoDB, SNS, and EventBridge, we created a lightweight yet powerful solution that can be easily extended to:

  • Auto-start instances in the morning
  • Optimize EBS volumes and snapshots
  • Integrate Slack or Teams notifications
  • Track cost trends using AWS Cost Explorer
  • Manage multi-account AWS environments
