Built a Cost Spike Alert System After AWS Charged Me $800 for a Forgotten EC2 Instance

#aws #devops #cloudcomputing #finops

So there I was, checking my AWS bill like I do every month, and boom - $800 more than expected. Turns out I left a beefy EC2 instance running in ap-south-1 after a client demo three weeks ago. Classic mistake. We've all done it.

That night I decided enough was enough. I needed something that would ping me the moment costs started looking weird, not after the damage was done.

Here's what I built and how you can set it up too.

The Problem with Regular Budget Alerts

AWS Budgets exists, yeah. But here's the thing - it only tells you when you've crossed a line you drew yourself. If I set a $500 budget and I hit $499, silence. Hit $501, alert. But what if my normal spend is $200 and suddenly it jumps to $400? That's a 100% increase and Budgets won't say a word because I'm still "under budget."

What I needed was something smarter. Something that knows what normal looks like and yells when things go sideways.
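
To make the difference concrete, here is a toy comparison of the two kinds of check, using the numbers from above. This is only an illustration: Cost Anomaly Detection uses ML rather than a fixed percentage, and `budget_alert`, `spike_alert`, and the 50% cutoff are made-up names and values.

```python
def budget_alert(spend: float, budget: float) -> bool:
    """Fixed-line check: fires only once spend exceeds the budget."""
    return spend > budget


def spike_alert(spend: float, baseline: float, max_increase: float = 0.5) -> bool:
    """Relative check: fires when spend is more than max_increase above baseline."""
    if baseline <= 0:
        # No usable history: treat any spend as worth a look.
        return spend > 0
    return (spend - baseline) / baseline > max_increase


# $500 budget, $200 normal spend, $400 this month.
print(budget_alert(400, 500))  # False: still "under budget"
print(spike_alert(400, 200))   # True: a 100% jump over the usual $200
```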

Enter Cost Anomaly Detection

AWS has this service called Cost Anomaly Detection that does exactly this. It uses ML to figure out your spending patterns and alerts you when something breaks the pattern. And get this - it's free. No extra charges for the detection itself.

Let me walk you through setting this up.

Part 1: Basic Setup (10 minutes)

Go to Billing Console > Cost Anomaly Detection. If you created your AWS account recently, there might already be a default monitor there.
Click Create monitor and pick AWS Services as the type. This watches all your services together. Name it something useful - I called mine all-services-prod.

Now create an alert subscription:

Name: daily-cost-alerts
Frequency: Individual alerts (I want to know immediately)
Threshold: $50 (adjust this based on your typical spend)

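For notification, there are two delivery options:

Direct email - simple, but only offered for daily and weekly summaries
SNS topic - more flexible, and what individual alerts are delivered through

I went with SNS because individual alerts require it, and because I wanted to send alerts to Slack and also format the messages eventually.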
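
If you'd rather script this than click through the console, the same monitor and subscription can be sketched with the boto3 Cost Explorer client. Treat this as a rough outline, not a tested deployment script: the helper names are made up, `topic_arn` is the SNS topic from Part 2, and the $50 threshold is expressed through `ThresholdExpression` as I understand the API.

```python
def monitor_request(name: str) -> dict:
    """Request body for a monitor that watches all AWS services together."""
    return {
        "MonitorName": name,
        "MonitorType": "DIMENSIONAL",
        "MonitorDimension": "SERVICE",
    }


def subscription_request(name: str, monitor_arn: str, topic_arn: str, threshold_usd) -> dict:
    """Request body for an individual-alert subscription delivered to SNS."""
    return {
        "SubscriptionName": name,
        "MonitorArnList": [monitor_arn],
        "Frequency": "IMMEDIATE",  # shown as "Individual alerts" in the console
        "Subscribers": [{"Type": "SNS", "Address": topic_arn}],
        "ThresholdExpression": {
            "Dimensions": {
                "Key": "ANOMALY_TOTAL_IMPACT_ABSOLUTE",
                "MatchOptions": ["GREATER_THAN_OR_EQUAL"],
                "Values": [str(threshold_usd)],
            }
        },
    }


def create_monitor_and_subscription(ce, topic_arn: str) -> str:
    """ce is a Cost Explorer client, e.g. boto3.client("ce")."""
    monitor_arn = ce.create_anomaly_monitor(
        AnomalyMonitor=monitor_request("all-services-prod")
    )["MonitorArn"]
    resp = ce.create_anomaly_subscription(
        AnomalySubscription=subscription_request(
            "daily-cost-alerts", monitor_arn, topic_arn, 50
        )
    )
    return resp["SubscriptionArn"]
```

Keeping the request bodies in plain functions makes it easy to eyeball or unit-test them before any real API call happens.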

Part 2: Setting Up SNS

Head to SNS Console and create a standard topic. I named mine cost-alerts.
Add your email as a subscriber. You'll get a confirmation email - click the link or nothing will work. I've forgotten this step more times than I'd like to admit.
Go back to Cost Anomaly Detection and update your subscription to use this SNS topic.
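
One thing worth checking if alerts never show up: the topic's access policy has to let the Cost Anomaly Detection service principal (costalerts.amazonaws.com) publish to it. Here is a small sketch of adding that statement with boto3. The helper names and the Sid are made up for illustration.

```python
import json


def anomaly_publish_statement(topic_arn: str) -> dict:
    """Policy statement that lets Cost Anomaly Detection publish to the topic."""
    return {
        "Sid": "AllowCostAnomalyDetectionPublish",  # hypothetical Sid
        "Effect": "Allow",
        "Principal": {"Service": "costalerts.amazonaws.com"},
        "Action": "SNS:Publish",
        "Resource": topic_arn,
    }


def add_statement(policy_json: str, statement: dict) -> str:
    """Return the policy with the statement appended, skipping duplicate Sids."""
    policy = json.loads(policy_json)
    statements = policy.setdefault("Statement", [])
    if not any(s.get("Sid") == statement["Sid"] for s in statements):
        statements.append(statement)
    return json.dumps(policy, indent=2)


def allow_anomaly_detection(sns, topic_arn: str) -> None:
    """sns is an SNS client, e.g. boto3.client("sns")."""
    current = sns.get_topic_attributes(TopicArn=topic_arn)["Attributes"]["Policy"]
    updated = add_statement(current, anomaly_publish_statement(topic_arn))
    sns.set_topic_attributes(
        TopicArn=topic_arn, AttributeName="Policy", AttributeValue=updated
    )
```

The same statement can be pasted into the topic's access policy in the console if you prefer.
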

At this point, you've got a working system. AWS will detect weird spending and email you. Done.

But the emails look terrible. Just a blob of JSON. So I added a Lambda function to make them readable.

Part 3: Making Alerts Actually Useful

Here's the kind of message that lands in your inbox:

{"anomalyId":"abc123","accountId":"1234567890","impact":{"totalImpact":127.45},"rootCauses":[{"service":"Amazon EC2","region":"us-east-1"}]}

Not exactly scannable when you're checking your phone at dinner.

Here's a Lambda function that turns this mess into something human:

import json
import boto3
import os

sns = boto3.client('sns')

def lambda_handler(event, context):
    for record in event.get('Records', []):
        try:
            msg = json.loads(record['Sns']['Message'])

            impact = msg.get('impact', {}).get('totalImpact', 0)
            causes = msg.get('rootCauses', [])

            # Build readable message
            text = f"Cost spike detected: ${impact:.2f}\n\n"
            text += "What's causing it:\n"

            for c in causes:
                svc = c.get('service', 'Unknown')
                region = c.get('region', 'Unknown')
                text += f"- {svc} in {region}\n"

            text += "\nCheck it out: https://console.aws.amazon.com/cost-management/home#/anomaly-detection"

            sns.publish(
                TopicArn=os.environ['ALERT_TOPIC'],
                Subject=f'AWS Cost Alert: ${impact:.2f} spike',
                Message=text
            )

        except Exception as e:
            print(f"Error processing: {e}")
            raise

    return {'statusCode': 200}

Create this function with the Python 3.11 runtime. Add an environment variable ALERT_TOPIC set to the ARN of the cost-alerts topic (the one your email is subscribed to).
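
Before deploying, it can help to run the formatting logic locally against the sample message. The sketch below pulls the formatting into a standalone helper. `format_alert` is a made-up name, not part of the function above, and the payload is just the sample from this post.

```python
import json

# The sample alert from earlier in this post.
SAMPLE = (
    '{"anomalyId":"abc123","accountId":"1234567890",'
    '"impact":{"totalImpact":127.45},'
    '"rootCauses":[{"service":"Amazon EC2","region":"us-east-1"}]}'
)


def format_alert(msg: dict) -> tuple[str, str]:
    """Return (subject, body) for a parsed anomaly message."""
    impact = float(msg.get("impact", {}).get("totalImpact", 0))
    lines = [f"Cost spike detected: ${impact:.2f}", "", "What's causing it:"]
    for cause in msg.get("rootCauses", []):
        service = cause.get("service", "Unknown")
        region = cause.get("region", "Unknown")
        lines.append(f"- {service} in {region}")
    subject = f"AWS Cost Alert: ${impact:.2f} spike"
    return subject, "\n".join(lines)


subject, body = format_alert(json.loads(SAMPLE))
print(subject)
print(body)
```

Running it prints the subject line followed by the readable body, so you can tweak the wording without waiting for a real anomaly.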

The function needs permission to publish to SNS. Add this to the execution role:

{
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "sns:Publish",
        "Resource": "arn:aws:sns:ap-south-1:YOUR_ACCOUNT:cost-alerts"
    }]
}

Now you need to wire it up. Create another SNS topic called cost-anomaly-raw. Set your Lambda as a subscriber to this topic. Then update Cost Anomaly Detection to send to cost-anomaly-raw instead of directly to cost-alerts.

The flow is now:
Anomaly detected → cost-anomaly-raw → Lambda formats it → cost-alerts → Your email
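
If you want to do the wiring from code, a minimal boto3 sketch could look like this. The console normally adds the invoke permission for you when you attach the trigger, so `add_permission` only matters when scripting. The function name, the statement ID, and the same-topic guard are my own additions.

```python
def wire_lambda_to_topic(sns, lam, raw_topic_arn: str, alert_topic_arn: str, function_arn: str) -> str:
    """Subscribe the formatter Lambda to the raw topic.

    sns and lam are boto3 clients for "sns" and "lambda".
    """
    if raw_topic_arn == alert_topic_arn:
        # Publishing to the topic that triggers the function would loop forever.
        raise ValueError("raw and alert topics must be different")

    # Allow SNS (this topic only) to invoke the function.
    lam.add_permission(
        FunctionName=function_arn,
        StatementId="allow-sns-cost-anomaly-raw",  # hypothetical statement ID
        Action="lambda:InvokeFunction",
        Principal="sns.amazonaws.com",
        SourceArn=raw_topic_arn,
    )

    resp = sns.subscribe(
        TopicArn=raw_topic_arn, Protocol="lambda", Endpoint=function_arn
    )
    return resp["SubscriptionArn"]
```

The guard is there because pointing ALERT_TOPIC at cost-anomaly-raw by mistake would make the function feed its own output back into itself.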

Part 4: Catching Expensive Stuff in Real-Time

Here's the thing about Cost Anomaly Detection - it relies on billing data, which can be delayed by hours. If someone spins up a p4d.24xlarge (those GPU monsters that cost $30+/hour), I don't want to find out 6 hours later.

EventBridge lets us catch these events as they happen.

Create an EventBridge rule with this pattern:

{
  "source": ["aws.ec2"],
  "detail-type": ["EC2 Instance State-change Notification"],
  "detail": {
    "state": ["running"]
  }
}
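
For anyone scripting the setup, here is a rough boto3 sketch that creates the rule with the pattern above and targets a Lambda. `function_arn` is the ARN of the function described in this section. The rule name, target ID, and statement ID are placeholders I picked.

```python
import json

PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    "detail": {"state": ["running"]},
}


def create_rule(events, lam, function_arn: str, rule_name: str = "expensive-ec2-launch") -> str:
    """events and lam are boto3 clients for "events" and "lambda"."""
    rule_arn = events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(PATTERN),
        State="ENABLED",
    )["RuleArn"]

    # Allow EventBridge (this rule only) to invoke the function.
    lam.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "expensive-ec2-lambda", "Arn": function_arn}],
    )
    return rule_arn
```
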

Point it at this Lambda:

import boto3
import os

sns = boto3.client('sns')
ec2 = boto3.client('ec2')

# Instance types that'll wreck your budget
EXPENSIVE = ['p4d', 'p3', 'p2', 'x2', 'u-', 'dl1', 'inf1', 'g5', 'g4dn']

def lambda_handler(event, context):
    instance_id = event['detail']['instance-id']

    resp = ec2.describe_instances(InstanceIds=[instance_id])
    instance_type = resp['Reservations'][0]['Instances'][0]['InstanceType']

    # Check if it's something expensive
    if any(e in instance_type for e in EXPENSIVE):
        msg = f"Heads up: {instance_type} just launched\n"
        msg += f"Instance: {instance_id}\n"
        msg += f"Region: {event['region']}\n"
        msg += "\nThese instances are expensive. Make sure this was intentional."

        sns.publish(
            TopicArn=os.environ['ALERT_TOPIC'],
            Subject=f'Expensive EC2 launched: {instance_type}',
            Message=msg
        )

    return {'statusCode': 200}

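Now you get pinged within seconds when someone launches a GPU instance or any other budget-killer. Two things to watch: the function's execution role needs ec2:DescribeInstances on top of sns:Publish, and EventBridge rules are regional, so create the rule in every region you want covered.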
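
To try the check without launching anything, the matching logic can be pulled into plain functions and fed a hand-written event. This is a sketch with made-up helper names. It also matches on the start of the instance family rather than anywhere in the string, which is a slightly stricter variant of the check above.

```python
EXPENSIVE = ("p4d", "p3", "p2", "x2", "u-", "dl1", "inf1", "g5", "g4dn")

# Hand-written event with only the fields the handler reads.
SAMPLE_EVENT = {
    "region": "ap-south-1",
    "detail": {"instance-id": "i-0123456789abcdef0", "state": "running"},
}


def is_expensive(instance_type: str) -> bool:
    """True when the instance family starts with one of the flagged prefixes."""
    family = instance_type.split(".")[0]
    return family.startswith(EXPENSIVE)


def build_message(event: dict, instance_type: str) -> str:
    """Alert text for a flagged launch."""
    return (
        f"Heads up: {instance_type} just launched\n"
        f"Instance: {event['detail']['instance-id']}\n"
        f"Region: {event['region']}\n"
        "\nThese instances are expensive. Make sure this was intentional."
    )


for itype in ("p4d.24xlarge", "g4dn.xlarge", "t3.medium"):
    print(itype, is_expensive(itype))
```

Anything not on the list passes silently, so extend the tuple with whatever counts as expensive for your account.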

What This Actually Catches

After running this for about two months, here's what it's caught for me:

A Lambda function that got stuck in a retry loop (spotted within 2 hours instead of end of month)
A teammate who launched an m5.4xlarge instead of a t3.medium for testing
An unexpected data transfer spike when a client started hammering our API
S3 request costs jumping 3x after a deployment (turned out we had a logging misconfiguration)

A Few Things I Learned

Start with a higher threshold. I initially set $20 and got way too many alerts for normal fluctuations. Bumped it to $50 and the noise dropped significantly.

The ML needs time. Cost Anomaly Detection takes about 10 days to learn your patterns for new services. Don't expect accurate alerts on day one.

Daily summaries exist. If individual alerts feel like too much, you can switch to daily digest emails. I use individual for production accounts and daily for dev.

Tag your resources. You can create monitors based on cost allocation tags. If you tag by team or project, you can send alerts to the right people automatically.
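
As a rough idea of what a tag-based monitor looks like through the API, here is a sketch of the request body for a custom monitor scoped to one tag value. The shape of `MonitorSpecification` is my understanding of the Cost Explorer API, and the tag key `team`, the value `payments`, and the helper name are all hypothetical. The tag also has to be activated as a cost allocation tag first.

```python
def tag_monitor_request(name: str, tag_key: str, tag_values: list[str]) -> dict:
    """Request body for a custom monitor limited to resources with the given tag."""
    return {
        "MonitorName": name,
        "MonitorType": "CUSTOM",
        "MonitorSpecification": {
            "Tags": {"Key": tag_key, "Values": list(tag_values)},
        },
    }


def create_tag_monitor(ce, name: str, tag_key: str, tag_values: list[str]) -> str:
    """ce is a Cost Explorer client, e.g. boto3.client("ce")."""
    resp = ce.create_anomaly_monitor(
        AnomalyMonitor=tag_monitor_request(name, tag_key, tag_values)
    )
    return resp["MonitorArn"]


print(tag_monitor_request("payments-team", "team", ["payments"]))
```

Pair each tag monitor with its own subscription and the alerts go straight to the team that owns the spend.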

Total Cost to Run This

Cost Anomaly Detection: Free
Lambda invocations: Maybe $0.10/month if that
SNS notifications: Cents

The whole thing costs practically nothing to run.
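
For a back-of-the-envelope check on the Lambda line, here is the arithmetic under some loudly stated assumptions: 1,000 invocations a month, 200 ms each at 128 MB, and list prices of $0.20 per million requests and $0.0000166667 per GB-second. Those prices and the workload are assumptions, so check current pricing for your region. The free tier is ignored.

```python
# Assumed workload (not measured from the setup in this post).
INVOCATIONS_PER_MONTH = 1_000
DURATION_SECONDS = 0.2
MEMORY_GB = 128 / 1024

# Assumed list prices; verify against current AWS pricing.
PRICE_PER_MILLION_REQUESTS = 0.20
PRICE_PER_GB_SECOND = 0.0000166667

request_cost = INVOCATIONS_PER_MONTH / 1_000_000 * PRICE_PER_MILLION_REQUESTS
compute_cost = INVOCATIONS_PER_MONTH * DURATION_SECONDS * MEMORY_GB * PRICE_PER_GB_SECOND
total = request_cost + compute_cost

print(f"requests: ${request_cost:.6f}")
print(f"compute:  ${compute_cost:.6f}")
print(f"total:    ${total:.6f}")
```

Even at a thousand alerts a month, which would be a very noisy account, the total stays far below the $0.10 estimate.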

Conclusion

Look, cloud cost management isn't glamorous. Nobody's going to pat you on the back for setting up billing alerts. But when you catch a runaway service before it racks up a four-figure bill, you'll be glad you spent the hour setting this up.

The basic setup (Parts 1 and 2) takes maybe 10 minutes. That alone will save you from most surprises. Add the Lambda formatting if you want nicer alerts. Add the EventBridge rule if you're paranoid about expensive instances like I am now.