Rick Wise

AWS Cost Anomaly Detection: When machine learning meets your billing

Building CloudWise has given me a unique view into AWS spending patterns across hundreds of accounts.

The Problem

Many teams struggle to detect anomalous spending, leading to unexpected bills.

What the Data Shows

After analyzing $10M in AWS spend, here's what we discovered:

Key Findings

  • 30% of overspend due to unused EBS volumes
  • 25% variance linked to idle EC2 instances
  • 20% from overlooked S3 storage classes

The Most Common Mistake

Teams often prioritize compute resources but neglect storage costs, which can accumulate rapidly.
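The "passive waste" in that 30% figure is mostly unattached EBS volumes, and flagging those is a pure state check. Here's a minimal sketch of the idea, operating on records shaped loosely like EC2's `DescribeVolumes` response; the field names and sample data are illustrative, not CloudWise's actual code:

```python
# Static scan sketch: a volume with no attachments is billed storage
# doing nothing. Record shape mimics (a simplified) EC2 DescribeVolumes.

def find_unattached_volumes(volumes):
    """Return volumes that have no attachments -- passive storage waste."""
    return [v for v in volumes if not v.get("Attachments")]

volumes = [
    {"VolumeId": "vol-01", "Size": 100, "Attachments": [{"InstanceId": "i-abc"}]},
    {"VolumeId": "vol-02", "Size": 500, "Attachments": []},  # forgotten volume
]

for v in find_unattached_volumes(volumes):
    print(f"{v['VolumeId']}: {v['Size']} GiB unattached")
```

In a real pipeline you'd feed this from the API instead of a literal list, but the rule itself stays this simple.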

Quick Fix That Works

Implement alerts for spending thresholds in AWS Budgets.
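The logic behind a budget threshold alert is just a percentage check against actual spend. A toy sketch of that rule (the 80%/100% thresholds and dollar figures are made-up examples, not AWS defaults):

```python
# Threshold-alert sketch: report which fractions of the budget the
# actual spend has crossed, so each crossing can trigger a notification.

def breached_thresholds(actual, budget, thresholds=(0.8, 1.0)):
    """Return the budget fractions that actual spend has met or exceeded."""
    return [t for t in thresholds if actual >= t * budget]

print(breached_thresholds(actual=850.0, budget=1000.0))   # 80% alert fires
print(breached_thresholds(actual=1200.0, budget=1000.0))  # both alerts fire
```

AWS Budgets evaluates essentially this on your behalf; the value of writing it out is seeing that an alert is cheap to define and should exist for every account.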

Implementation Steps

  1. Audit Phase: Use AWS Cost Explorer to identify spending spikes.
  2. Quick Wins: Start with the easiest savings first, like deleting idle resources.
  3. Monitor: Set up alerts to catch anomalies before they escalate.
  4. Scale: Apply learnings across all accounts and environments.
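For step 3, one simple way to "catch anomalies before they escalate" is to compare each day's cost against a trailing window. This is a sketch of that idea using a z-score over daily totals, not CloudWise's actual model; the cost series is invented:

```python
# Spike-detection sketch: flag days whose cost deviates strongly from
# the mean/stdev of the trailing window.

from statistics import mean, stdev

def find_spikes(daily_costs, window=7, z_threshold=3.0):
    """Return indices of days far above their trailing-window baseline."""
    spikes = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (daily_costs[i] - mu) / sigma > z_threshold:
            spikes.append(i)
    return spikes

costs = [100, 102, 98, 101, 99, 103, 100, 340]  # last day is a spike
print(find_spikes(costs))  # -> [7]
```

Real services use richer models than this, but the structure (baseline, deviation, threshold) is the same.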

Your Experience?

What patterns have you noticed in your AWS bills? Drop a comment - I love learning from other developers' experiences.


I'm building CloudWise to help developers get clarity on AWS costs. Always happy to share insights from our data analysis.

Top comments (4)

Vikas Tripathi

Really valuable data — the finding that 30% of overspend comes from unused EBS volumes matches exactly what I've been seeing while building a similar tool.

The storage neglect pattern is real. Teams obsess over right-sizing EC2 but completely ignore the passive waste accumulating in unattached volumes and forgotten snapshots.

Interesting that you analyzed $10M in spend — at that scale the patterns must be very consistent across different company sizes and industries.

One question: did you find that the anomaly detection approach works better for catching sudden spikes, while static scanning works better for chronic passive waste? I've been thinking about how these two approaches complement each other.

Rick Wise

Spot on, Vikas. You nailed the distinction perfectly.

We found exactly that: Anomaly detection is great for 'incidents' (e.g., a loop spinning up 100 instances or a massive data transfer spike), but it's terrible at catching 'chronic' waste.
The problem with using anomaly detection for things like unattached EBS volumes is that if you don't catch it on Day 1, the model quickly learns that this higher spend is just your 'new normal.' It stops flagging it as an anomaly.

That's why we built CloudWise with a hybrid engine:

  • Static scanners for resource state (e.g., 'Is this volume attached? No? -> Flag it').
  • ML models for usage patterns (e.g., 'Is this database traffic 500% higher than last Tuesday? -> Flag it').

You really can't solve cloud cost optimization with just one or the other. Great to hear you're seeing the same patterns in your build!
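The two halves of that hybrid engine can be sketched as two tiny rules side by side — a state check that never drifts, and a pattern check against a comparable baseline. This is a toy illustration of the split described above, not CloudWise's implementation; field names and numbers are invented:

```python
# Hybrid-engine sketch: one state-based rule, one pattern-based rule.

def static_flag(volume):
    """State rule: an unattached volume is waste regardless of history,
    so the 'new normal' problem can't hide it."""
    return not volume.get("Attachments")

def usage_flag(today, same_day_last_week, ratio=5.0):
    """Pattern rule: usage far above the comparable baseline (e.g. the
    '500% higher than last Tuesday' case)."""
    return same_day_last_week > 0 and today / same_day_last_week >= ratio

print(static_flag({"VolumeId": "vol-02", "Attachments": []}))  # chronic waste
print(usage_flag(today=5_500, same_day_last_week=1_000))       # sudden spike
```

The point of the split: the static rule keeps firing forever on chronic waste, while the pattern rule only fires on deviations — which is exactly why neither alone is enough.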

Vikas Tripathi

Appreciate that, Rick.

The “new normal” problem is exactly what worried us while designing DAL.

We’re exploring an additional layer where historical baseline shifts are compared not just statistically but contextually — like tagging infrastructure intent (temporary scaling vs structural growth).

Curious — in CloudWise, how are you preventing baseline drift from masking slow-burn inefficiencies over 30–60 day windows?

Would love to exchange notes sometime if you’re open to it.

Rick Wise

That 'slow-burn' drift is the hardest part to solve, and it's exactly what we're tackling in our design right now.

We're moving away from just raw cost history and focusing on Unit Economics as the solution. The goal is to correlate spend with a business metric (like API requests or active users).

If EC2 spend grows 5% but traffic is flat, that's drift. If both grow 5%, that's scale. Normalizing cost against usage (e.g., 'Cost per 1k Requests') should make those slow burns pop out as a rising trend line, even if the daily variance is small.
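That normalization is easy to make concrete. Here's a minimal sketch of the "cost per 1k requests" idea with a deliberately crude drift check (unit cost rising every period); the numbers are invented and the real system would use a proper trend test:

```python
# Unit-economics sketch: normalize spend by a business metric so that
# slow-burn drift (cost up, traffic flat) separates from healthy scale
# (cost and traffic up together).

def cost_per_1k(costs, requests):
    """Cost per 1,000 requests for each period."""
    return [c / (r / 1000) for c, r in zip(costs, requests)]

def is_drifting(series):
    """Crude drift check: unit cost rises every single period."""
    return all(b > a for a, b in zip(series, series[1:]))

costs    = [1000, 1050, 1100, 1160]  # spend grows ~5% per period
requests = [2_000_000] * 4           # traffic is flat -> this is drift

print(is_drifting(cost_per_1k(costs, requests)))  # -> True
```

With the same cost growth but matching request growth, the unit cost stays flat and the check correctly stays quiet.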

We're still early in building this out, so I'd love to exchange notes and hear how your 'contextual layer' approach is working! Here's my LinkedIn profile so we can connect and swap notes.