Solved: EC2 Cost Optimization


🚀 Executive Summary

TL;DR: Unmanaged EC2 instances often lead to unexpected AWS costs due to forgotten resources and poor visibility. This guide covers three strategies: immediate CLI termination of untagged instances, automated lifecycle management via tagging and Lambda, and the drastic aws-nuke tool for full account resets.

🎯 Key Takeaways

Manual CLI commands, specifically aws ec2 describe-instances and aws ec2 terminate-instances, can be used for an emergency ‘Find and Destroy’ blitz to stop untagged, running EC2 instances and prevent immediate cost escalation.
Implementing a strict tagging policy (e.g., owner, project, ttl-hours) combined with a scheduled Lambda function allows for automated termination of expired EC2 instances, creating a self-cleaning cloud environment.
For non-critical, ephemeral accounts, aws-nuke offers a powerful, high-risk method to programmatically delete all resources except those explicitly whitelisted, effectively resetting the account to a clean state.

Struggling with a runaway AWS bill from forgotten EC2 instances? This guide offers three practical, real-world strategies for finding and terminating those costly ghost servers, from a quick CLI command to full automation.

Woke Up to a $5,000 AWS Bill? Let’s Talk EC2 Ghost Hunting.

I still remember the Monday morning. I grabbed my coffee, logged in, and saw a Slack alert from our finance bot. The dev account’s forecasted spend for the month was up 400%. My first thought? “Someone got hacked.” My second, more realistic thought? “Someone forgot to turn something off.” It turned out a junior engineer, trying to impress everyone, had spun up a cluster of p3.16xlarge instances for a “quick” machine learning test on Friday afternoon and promptly forgot about them. That weekend cost us more than my first car. This isn’t just a hypothetical; it’s a rite of passage in the cloud world, and it’s why we need to talk about taming the EC2 beast.

The “Why”: More Than Just Forgetfulness

Look, it’s easy to blame the junior dev. But the real problem is a lack of guardrails and visibility. In the rush to empower developers with “move fast and break things” autonomy, we often forget to give them tools to clean up after themselves. The root cause of a surprise AWS bill is almost always one of these:

Lack of Ownership: An instance like dev-test-07b has no owner tag. Who spun it up? Is it important? Nobody knows, so nobody dares to touch it.
Temporary Becomes Permanent: A “quick test” server for a proof-of-concept becomes a critical part of a pre-prod workflow, but it’s still running on an expensive instance type with no lifecycle management.
Hidden Parasites: It’s not just the EC2 instance. A terminated instance often leaves behind detached EBS volumes and maybe even an unassociated Elastic IP address, all quietly adding to your monthly bill.
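To make those leftovers visible, here is a minimal Python sketch that filters the lists returned by boto3’s describe_volumes and describe_addresses calls. The function names are my own (hypothetical); the only assumptions about AWS are that an unattached volume reports a State of "available" and that an Elastic IP with no AssociationId is not attached to anything.

```python
def unattached_volumes(volumes):
    """Return IDs of EBS volumes that are not attached to any instance."""
    return [v["VolumeId"] for v in volumes if v.get("State") == "available"]


def unassociated_addresses(addresses):
    """Return allocation IDs of Elastic IPs that are not associated with anything."""
    return [
        a["AllocationId"]
        for a in addresses
        if "AllocationId" in a and "AssociationId" not in a
    ]


# Typical use, assuming boto3 is installed and credentials are configured:
#   ec2 = boto3.client("ec2")
#   unattached_volumes(ec2.describe_volumes()["Volumes"])
#   unassociated_addresses(ec2.describe_addresses()["Addresses"])
# Note that describe_volumes is paginated, so large accounts need a paginator.
```

This only reports candidates; review the list before deleting anything, since an unattached volume may still hold data someone wants.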

So, how do we fix it? We don’t take away the keys. We build a better car with automatic brakes. Here are three approaches I’ve used, from the quick-and-dirty to the enterprise-grade.

Solution 1: The ‘Find and Destroy’ Blitz (The Quick Fix)

This is your emergency stop button. It’s Monday morning, the bill is climbing, and you need to stop the bleeding right now. This is a manual but effective approach using the AWS CLI and a little JMESPath magic via its --query option. The goal is to find running instances that don’t have a “Project” tag, then check their launch times for anything that has been up for more than, say, 24 hours.

First, let’s find the culprits. This command will list the Instance ID, Launch Time, and Type of running instances that are missing a ‘Project’ tag. Tag keys are case-sensitive, so ‘Project’ and ‘project’ count as different tags.

aws ec2 describe-instances --filters "Name=instance-state-name,Values=running" --query 'Reservations[].Instances[?!not_null(Tags[?Key==`Project`].Value)] | [].[InstanceId, LaunchTime, InstanceType]' --output text
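The CLI query lists launch times but does not filter on them, so the 24-hour cut-off still has to be applied by eye or in code. A minimal Python sketch of that filter follows; stale_untagged is a hypothetical helper name, and it assumes instance dictionaries shaped like boto3’s describe_instances output, where LaunchTime is a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def stale_untagged(instances, tag_key="Project", max_age_hours=24, now=None):
    """Return IDs of instances older than max_age_hours that lack tag_key."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)
    stale = []
    for inst in instances:
        # Tags is absent entirely when an instance has no tags at all.
        keys = {t["Key"] for t in inst.get("Tags", [])}
        if tag_key not in keys and inst["LaunchTime"] < cutoff:
            stale.append(inst["InstanceId"])
    return stale
```

Passing now explicitly keeps the function easy to test; in normal use you can leave it out and it falls back to the current UTC time.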

Once you’ve reviewed that list and confirmed they aren’t critical (and you’ve yelled on Slack to see if anyone claims them), you can use a modified version to terminate them. Be careful with this!

# WARNING: This terminates instances. Double-check your query.
# The filter matches instances with no tags at all as well as instances tagged without 'Project'.
INSTANCE_IDS_TO_TERMINATE=$(aws ec2 describe-instances --filters "Name=instance-state-name,Values=running" --query 'Reservations[].Instances[?!not_null(Tags[?Key==`Project`].Value)] | [].InstanceId' --output text)

if [ -z "$INSTANCE_IDS_TO_TERMINATE" ]; then
  echo "No untagged instances to terminate."
else
  echo "Terminating the following instances: $INSTANCE_IDS_TO_TERMINATE"
  aws ec2 terminate-instances --instance-ids $INSTANCE_IDS_TO_TERMINATE
fi

This is a hacky, reactive solution. It works in an emergency, but if you’re doing this every week, you’re doing it wrong. It’s time for a real process.

Solution 2: The ‘Tag and Tidy’ Strategy (The Permanent Fix)

This is where we move from being firefighters to being architects. The goal here is to create a system that automatically cleans up after itself. The core principle? No instance lives forever without a reason. This revolves around a strict tagging policy.

Step 1: Enforce a Tagging Policy

Your new rule is simple: every EC2 instance must have these tags at a minimum.


Tag Key	Purpose	Example Value
`owner`	Who is responsible for this resource?	`darian.vance`
`project`	What is this resource for?	`billing-api-refactor`
`ttl-hours`	Time-To-Live. How many hours should this exist?	`8`
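As a rough illustration of what “compliant” means here, the sketch below checks one instance against the three tags in the table. tag_problems and REQUIRED_TAGS are hypothetical names, and the rule that ttl-hours must be a positive whole number is my assumption rather than something AWS enforces.

```python
REQUIRED_TAGS = ("owner", "project", "ttl-hours")


def tag_problems(instance):
    """Return a list of human-readable tagging problems for one instance."""
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
    # A tag that is present but empty is treated the same as a missing tag.
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    ttl = tags.get("ttl-hours")
    if ttl and not (ttl.isdecimal() and int(ttl) > 0):
        problems.append(f"ttl-hours is not a positive integer: {ttl!r}")
    return problems
```

An empty list means the instance passes; anything else is a message you can post to the owner or to a shared channel.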

You can enforce this with Service Control Policies (SCPs), which deny the launch of instances that lack the required tags, and use AWS Config rules such as required-tags to detect non-compliant instances that already exist.
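For the preventive side, here is a minimal sketch that generates an SCP document denying ec2:RunInstances when a required tag is not supplied at launch. It assumes you use AWS Organizations, and require_tags_scp is a hypothetical helper name. Each tag gets its own statement because multiple keys inside one Null condition are ANDed, which would only deny requests missing all of the tags.

```python
import json


def require_tags_scp(tag_keys):
    """Build an SCP that denies ec2:RunInstances when a required tag is absent."""
    statements = []
    for key in tag_keys:
        # Sid values must be alphanumeric, so "ttl-hours" becomes "TtlHours".
        suffix = "".join(part.capitalize() for part in key.split("-"))
        statements.append({
            "Sid": f"DenyRunInstancesWithout{suffix}",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            # Null == "true" matches requests where the tag was not supplied.
            "Condition": {"Null": {f"aws:RequestTag/{key}": "true"}},
        })
    return {"Version": "2012-10-17", "Statement": statements}


print(json.dumps(require_tags_scp(["owner", "project", "ttl-hours"]), indent=2))
```

Try a policy like this on a single test account first; a deny on RunInstances also affects any automation that launches instances without passing tags in the same request.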

Step 2: Automate the Cleanup

Next, you create a simple Lambda function that runs on a schedule (e.g., every hour via EventBridge). This function scans for all running instances, checks the ttl-hours tag, and compares it to the instance’s launch time. If the instance has expired, the Lambda terminates it.

Here’s a conceptual Python/Boto3 snippet for what that Lambda code might look like:

import datetime

import boto3


def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    now = datetime.datetime.now(datetime.timezone.utc)

    # Paginate so that accounts with many instances are scanned completely.
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )

    for page in pages:
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                instance_id = instance['InstanceId']
                launch_time = instance['LaunchTime']  # timezone-aware datetime

                # Instances without a numeric ttl-hours tag are skipped, not terminated.
                ttl_tag = next((tag['Value'] for tag in instance.get('Tags', []) if tag['Key'] == 'ttl-hours'), None)

                if ttl_tag and ttl_tag.isdecimal():
                    ttl_hours = int(ttl_tag)
                    expiration_time = launch_time + datetime.timedelta(hours=ttl_hours)

                    if now > expiration_time:
                        print(f"Terminating expired instance: {instance_id}")
                        # ec2.terminate_instances(InstanceIds=[instance_id]) # Uncomment to activate!

This creates a self-cleaning environment. Developers can spin up resources freely, as long as they tag them with a shelf-life.
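The hourly schedule itself can also be set up in code. The sketch below wires an EventBridge rule to the cleanup function using the boto3 calls put_rule, add_permission, and put_targets. The helper name schedule_hourly, the rule name, and the target Id are all placeholders of mine, and the two clients are passed in so the function can be exercised with stand-ins.

```python
def schedule_hourly(events, lambda_client, function_arn, rule_name="ec2-ttl-cleanup"):
    """Create an hourly EventBridge rule that invokes the cleanup Lambda."""
    rule_arn = events.put_rule(
        Name=rule_name,
        ScheduleExpression="rate(1 hour)",
        State="ENABLED",
    )["RuleArn"]

    # Allow EventBridge to invoke the function, scoped to this one rule.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    # Point the rule at the function.
    events.put_targets(Rule=rule_name, Targets=[{"Id": "cleanup", "Arn": function_arn}])
    return rule_arn
```

In a real account you would call it as schedule_hourly(boto3.client("events"), boto3.client("lambda"), function_arn). Running it a second time will fail at add_permission because the statement ID already exists, so treat it as a one-off setup step or move the same wiring into your infrastructure-as-code tool.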

Solution 3: The ‘Nuke It From Orbit’ Option (The Nuclear Option)

Sometimes, a developer sandbox or a proof-of-concept account gets so cluttered that cleaning it up manually is impossible. You have hundreds of EC2 instances, RDS databases, S3 buckets, and IAM roles. You don’t know what’s important and what’s garbage. For these scenarios, there’s a tool I love and fear: aws-nuke.

The concept is terrifyingly simple: you write a configuration file that specifies what to KEEP. This can be specific IAM roles, a production VPC, or a particular S3 bucket. Then, you run the tool, and it programmatically deletes EVERYTHING ELSE in the account.

EXTREME WARNING: This is a weapon of mass destruction for AWS accounts. Never, EVER run this on a production account without extensive testing and a multi-person review of the config file. I’m serious. You can wipe out your entire company with a single command. Use this only for non-critical, ephemeral accounts that you want to reset to a clean state.

Using it involves creating a config file that looks something like this:

regions:
- "us-east-1"

account-blocklist:
- "999999999999" # Production Account ID

accounts:
  "123456789012": # Dev Sandbox Account ID
    presets:
    - "common" # Assumes a "common" preset defined under a top-level presets: key (not shown)
    filters:
      IAMRole:
      - "AWSServiceRoleForOrganizations" # Keep this essential role
      - type: exact
        value: "MyBaseAdminRole"
      EC2VPC:
      - "vpc-012345abcdef" # Keep the default VPC
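For the run itself, the sketch below builds the command line rather than executing it. It assumes the command-line flags of the original aws-nuke v2 releases, where a run only lists what would be deleted unless --no-dry-run is passed; newer forks may use a different invocation, so check the help output of your version. nuke_command, the config file name, and the profile name are placeholders.

```python
def nuke_command(config_path, profile, really_delete=False):
    """Build the aws-nuke argument list; a dry run unless really_delete is True."""
    cmd = ["aws-nuke", "--config", config_path, "--profile", profile]
    if really_delete:
        # Assumption: v2-style flag that switches from listing to deleting.
        cmd.append("--no-dry-run")
    return cmd


# Dry run first, and read the full listing before going any further:
#   subprocess.run(nuke_command("nuke-config.yml", "dev-sandbox"), check=True)
```

Keeping the destructive flag behind an explicit argument means the default path through any wrapper script stays a dry run.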

When you have an account that needs a hard reset, this is the fastest way. But again, respect the power you’re wielding. It’s the ultimate solution for cloud clutter, but it comes with ultimate risk.