Muskan

Posted on Jun 19

Spot AWS cost anomalies before they wreck your budget

#finops #aws #cloud #devops

Quick take

AWS bill spikes are almost never random. They follow four predictable signals: a service line that grew faster than your traffic, a region that was not in the plan, a usage type that was unused last month, and a percentage delta that crosses the 30% threshold. Catch all four early, and the next budget incident becomes a Slack notification, not a Monday-morning fire.

If you only have 60 seconds, this is the shape:

AWS Cost Anomaly Detection is free and a fine baseline, but it lags by 24 to 48 hours.
Real-time anomaly detection is the 2026 standard, and several tools now ship with auto-remediation.
The four signals to set up alerts on are service spike, region drift, usage-type creep, and percentage delta.

Why cost anomalies hit harder in 2026

I get pulled into post-incident reviews where a single weekend cost the team $14,000 in surprise spend. The shape of these incidents has shifted twice in the last year.

AI workloads create spiky bills. A GPU instance that a training job launched and never terminated runs you roughly $24 per hour on a p5.48xlarge. Over a 48-hour weekend that is about $1,150. Most teams discover it Monday.

Multi-account complexity hides the source. Org-level Cost Explorer rolls spend up across accounts. A dev account that ran a $5,000 misconfigured Bedrock workload looks like a 3% bump at the org level and gets missed.
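To make the dilution concrete, here is a minimal sketch that compares the org-level delta with per-account deltas. The account names and dollar figures are hypothetical, chosen so the dev account's $5,000 jump lands near the 3% org-level bump described above.

```python
def percent_change(before: float, after: float) -> float:
    """Percentage change from before to after (inf when growing from zero)."""
    if before == 0:
        return float("inf") if after > 0 else 0.0
    return (after - before) / before * 100.0


def account_deltas(previous: dict[str, float], current: dict[str, float]) -> dict[str, float]:
    """Per-account percentage change between two periods."""
    accounts = set(previous) | set(current)
    return {
        name: percent_change(previous.get(name, 0.0), current.get(name, 0.0))
        for name in accounts
    }


# Hypothetical spend per account for two consecutive periods, in USD.
previous = {"prod": 150_000.0, "shared": 12_000.0, "dev": 4_000.0}
current = {"prod": 150_000.0, "shared": 12_000.0, "dev": 9_000.0}

org_delta = percent_change(sum(previous.values()), sum(current.values()))
per_account = account_deltas(previous, current)

print(f"org-level change: {org_delta:.1f}%")
for name, delta in sorted(per_account.items(), key=lambda item: -item[1]):
    print(f"{name}: {delta:.1f}%")
```

The org total moves by about 3%, while the dev account on its own more than doubles. Alerting per account, not only on the rollup, is what surfaces it.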

FOCUS billing changed the data model. AWS now exports billing in the FOCUS standard, which is great for portability but breaks every dashboard that hard-coded lineItem/UsageAmount. Half the anomaly alerts I see were tuned against the old schema and silently stopped firing in 2025.

The teams that catch anomalies fast have all moved away from "monthly bill review" toward streaming detection.

The four signals of a real cost anomaly

Not every cost increase is an anomaly. I use a four-signal framework to filter noise from real incidents. When two or more fire on the same service in the same day, that is a real anomaly.
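A minimal sketch of that "two or more" rule, assuming each detector emits a (day, service, signal) tuple. The signal names and the sample firings are hypothetical.

```python
from collections import defaultdict


def confirmed_anomalies(firings, min_signals=2):
    """Return (day, service) pairs where at least min_signals distinct signals fired.

    firings is an iterable of (day, service, signal_name) tuples.
    Repeated firings of the same signal count once.
    """
    seen = defaultdict(set)
    for day, service, signal in firings:
        seen[(day, service)].add(signal)
    return {
        key: sorted(signals)
        for key, signals in seen.items()
        if len(signals) >= min_signals
    }


# Hypothetical firings collected from the individual signal checks.
firings = [
    ("2026-06-20", "AmazonS3", "percentage_delta"),
    ("2026-06-20", "AmazonS3", "usage_type_creep"),
    ("2026-06-20", "AmazonEC2", "percentage_delta"),
    ("2026-06-21", "AmazonS3", "percentage_delta"),
]

confirmed = confirmed_anomalies(firings)
print(confirmed)
```

Only S3 on the first day is confirmed here, because it is the only service-day pair with two distinct signals.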

1. Service line growth outpaces traffic

Compare cost growth to a known usage proxy: requests per second, active users, jobs run. If S3 cost grew 40% while request volume grew 5%, something is off. Treat this as the floor: the minimum signal to wire up.
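A minimal sketch of this comparison. The `max_gap` parameter (how far cost growth may run ahead of usage growth before the signal fires) is my assumption, not a standard value. Tune it per service.

```python
def growth(previous: float, current: float) -> float:
    """Fractional growth between two periods, e.g. 0.40 for +40%."""
    if previous <= 0:
        raise ValueError("previous value must be positive to compute growth")
    return (current - previous) / previous


def outpaces_traffic(
    cost_prev: float,
    cost_curr: float,
    usage_prev: float,
    usage_curr: float,
    max_gap: float = 0.15,  # assumed tolerance: 15 percentage points
) -> bool:
    """True when cost grew more than max_gap faster than the usage proxy."""
    return growth(cost_prev, cost_curr) - growth(usage_prev, usage_curr) > max_gap


# The S3 example from the text: cost up 40%, request volume up 5%.
print(outpaces_traffic(1_000.0, 1_400.0, 2_000_000, 2_100_000))  # gap of 35 points
# Cost and traffic growing roughly together.
print(outpaces_traffic(1_000.0, 1_060.0, 2_000_000, 2_100_000))  # gap of 1 point
```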

2. A region appears that was not in the plan

Look at cost grouped by region. If ap-southeast-3 shows $200 yesterday and you do not operate there, you have either misconfigured a deployment or someone is mining. Both are urgent.
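As a sketch, the region check is a comparison against an allowlist. The approved set, the `min_cost` noise floor, and the sample figures are assumptions for illustration. Global or region-less line items may need to be added to the allowlist in practice.

```python
def unexpected_regions(
    cost_by_region: dict[str, float],
    approved_regions: set[str],
    min_cost: float = 1.0,  # assumed noise floor in USD
) -> dict[str, float]:
    """Regions with spend at or above min_cost that are not on the approved list."""
    return {
        region: cost
        for region, cost in cost_by_region.items()
        if region not in approved_regions and cost >= min_cost
    }


# Hypothetical daily cost grouped by region.
yesterday = {"us-east-1": 1_840.0, "eu-west-1": 620.0, "ap-southeast-3": 200.0}
approved = {"us-east-1", "eu-west-1"}

drift = unexpected_regions(yesterday, approved)
print(drift)
```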

3. A usage type that was zero last month

DataTransfer-Inter-Region-Out going from $0 to $400 in a week usually means a misconfigured cross-region replication. EBS-Snapshots doubling overnight means a backup script that never deletes. These usage-type creeps are the most expensive to ignore.
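A minimal sketch of the zero-to-nonzero check, assuming you already have cost per usage type for both months. The `min_cost` floor and the sample numbers are hypothetical. The usage-type labels are the ones used in this post.

```python
def new_usage_types(
    last_month: dict[str, float],
    this_month: dict[str, float],
    min_cost: float = 10.0,  # assumed floor to ignore trivial new charges
) -> dict[str, float]:
    """Usage types with no spend last month and meaningful spend this month."""
    return {
        usage_type: cost
        for usage_type, cost in this_month.items()
        if last_month.get(usage_type, 0.0) == 0.0 and cost >= min_cost
    }


last_month = {"DataTransfer-Inter-Region-Out": 0.0, "EBS-Snapshots": 120.0}
this_month = {
    "DataTransfer-Inter-Region-Out": 400.0,
    "EBS-Snapshots": 240.0,
    "Requests-Tier1": 3.0,
}

creep = new_usage_types(last_month, this_month)
print(creep)
```

This check only catches usage types that appear from nothing. The snapshot line that doubles is left to the percentage-delta signal.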

4. A percentage delta that crosses 30%

The rule of thumb. A daily cost on a single service that crosses 30% above the trailing 7-day average is an anomaly. Below 30% is usually traffic seasonality. Above 30% is something you should look at within the hour.
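The rule translates almost directly into code. This is a sketch under the assumption that you have one cost figure per day for a single service, oldest first. The function name and the sample series are mine.

```python
from statistics import mean


def exceeds_trailing_average(
    daily_costs: list[float],
    threshold: float = 0.30,
    window: int = 7,
) -> bool:
    """True when the latest day is more than threshold above the trailing average.

    daily_costs is ordered oldest first; the last entry is the day under test
    and is excluded from the baseline.
    """
    if len(daily_costs) < window + 1:
        raise ValueError(f"need at least {window + 1} days of data")
    baseline = mean(daily_costs[-(window + 1):-1])
    latest = daily_costs[-1]
    if baseline == 0:
        return latest > 0
    return (latest - baseline) / baseline > threshold


print(exceeds_trailing_average([100.0] * 7 + [135.0]))  # 35% above baseline
print(exceeds_trailing_average([100.0] * 7 + [125.0]))  # 25% above baseline
```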

Detection: native AWS versus commercial tools

There are two paths to spotting anomalies in 2026. The free path works for small accounts. The commercial path is mandatory above roughly $50,000 per month of spend.

AWS Cost Anomaly Detection

Free, native, integrated with SNS and Slack. Three weaknesses to know. First, it polls billing data on a 24 to 48 hour delay, so a Saturday spike alerts on Monday. Second, the detection groups can be coarse, alerting on "EC2" rather than the specific usage type. Third, it cannot take action, only notify.

For small accounts and dev environments this is enough. For production, it is the floor, not the ceiling.

Commercial real-time tools

The commercial tier reads the AWS Cost and Usage Report stream and surfaces anomalies within minutes. The top trade-off is between detection accuracy, response time, and how aggressively the tool can act on the anomaly.

The 2026 anomaly detection tool comparison

Here are the tools I see most teams evaluating, with what each actually catches.

Tool	Detection latency	Auto-remediation	Multi-cloud
AWS Cost Anomaly Detection	24 to 48 hours	No	AWS only
CloudZero	Near real-time	No (alerts only)	AWS, GCP, Azure
Vantage	Near real-time	Limited	AWS, GCP, Azure
Datadog Cost Mgmt	Near real-time	No	AWS, GCP, Azure
ZopNight	Real-time	Yes, with guardrails	AWS, GCP, Azure
Harness CCM	Hourly	Recommendation	AWS, GCP, Azure
nOps	Real-time	Karpenter actions	AWS-focused

The column that matters most is auto-remediation: whether the tool will actually do something about the anomaly once detected. Most still stop at the notification step. ZopNight and nOps are the two I have seen that will, with permission, terminate a runaway resource within minutes. That is the difference between a $200 incident and a $14,000 one.

How to act once an anomaly is detected

Detection is half the job. The other half is the runbook.

Tag the anomaly with a severity within 5 minutes. Is it a security event, a misconfig, or expected growth? The right action depends on this.
Quarantine the resource if it is a security event or a misconfig. Stop the EC2 instance, deactivate the IAM access key, disable the cross-region replication rule. Move fast, blame later.
Open an incident channel if the spend rate is above $1,000/day. Treat it the same as a P2 production incident.
File the root cause in your incident tracker within 24 hours. A pattern that fires twice becomes an automated guardrail.

This is where the commercial tools earn their fee. The good ones let you preview the blast radius of a proposed remediation before running it, so you do not accidentally kill a production workload while chasing a cost spike.

Where anomaly detection still falls short

The honest part. Three cases break even the best tools.

Slow-burn anomalies. A 5% daily increase compounds fast: the bill doubles in about two weeks and is more than 18 times higher after 60 days. None of the threshold-based tools catch this because no single day crosses 30% above the trailing average. The fix is a separate longitudinal trend check that runs weekly.
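Here is a sketch of why the daily rule stays quiet and what a weekly trend check could look like. The synthetic series grows 5% per day. The `weeks_back` and `max_growth` parameters are assumptions to tune, not recommended values.

```python
from statistics import mean


def daily_threshold_fires(costs: list[float], threshold: float = 0.30, window: int = 7) -> bool:
    """The 30%-over-trailing-average rule, applied to the last day in costs."""
    baseline = mean(costs[-(window + 1):-1])
    return (costs[-1] - baseline) / baseline > threshold


def weekly_trend_fires(costs: list[float], weeks_back: int = 4, max_growth: float = 0.50) -> bool:
    """Compare the latest 7-day total with the 7-day total weeks_back weeks earlier."""
    if weeks_back < 1:
        raise ValueError("weeks_back must be at least 1")
    needed = 7 * (weeks_back + 1)
    if len(costs) < needed:
        raise ValueError(f"need at least {needed} days of data")
    latest = sum(costs[-7:])
    earlier = sum(costs[-needed:-needed + 7])
    return (latest - earlier) / earlier > max_growth


# Synthetic slow burn: $100 on day 0, growing 5% every day for 60 days.
costs = [100.0 * 1.05 ** day for day in range(60)]

# Run the daily rule on every day that has a full 7-day baseline behind it.
daily_alerts = [daily_threshold_fires(costs[:n]) for n in range(8, len(costs) + 1)]

print("daily rule ever fired:", any(daily_alerts))
print("weekly trend check fired:", weekly_trend_fires(costs))
```

With steady 5% growth each day sits about 21% above its trailing 7-day average, so the daily rule never trips, while the week-over-week comparison does.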

Reserved instance and Savings Plan distortion. When commitments apply, on-demand cost drops and may look like an anomaly going the other way. Anomaly tools that do not understand commitments fire false alarms here. Verify the tool reads your commitment schedule.

Shared service attribution. A spike in NAT Gateway traffic is real, but which team caused it is not in the billing data. You need a Kubernetes cost allocation layer on top to map the spike to a team.

Frequently asked questions

Is AWS Cost Anomaly Detection enough for production?
For workloads under $50,000 per month of AWS spend, it is usually enough as long as you accept the 1 to 2 day lag. Above that threshold, the lag itself costs more than a commercial tool.

How is FOCUS billing changing this?
FOCUS gives you a portable schema across AWS, GCP, and Azure. Anomaly tools built on FOCUS detect across clouds in one query. The trade-off is that AWS-only tools have richer per-service detail.

Should the alert go to Slack or PagerDuty?
Slack for under $500 per day. PagerDuty for above $1,000 per day or any security-related anomaly. Anything in between depends on your team's on-call discipline.
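As a sketch, that routing rule fits in one function. The `page_in_between` flag is a hypothetical stand-in for the on-call judgment call, and the returned strings are just labels, not calls to either service.

```python
def route_alert(
    daily_spend_rate: float,
    security_related: bool = False,
    page_in_between: bool = False,  # hypothetical team-level setting
) -> str:
    """Pick a destination for a cost anomaly alert based on its daily spend rate."""
    if security_related or daily_spend_rate > 1_000:
        return "pagerduty"
    if daily_spend_rate < 500:
        return "slack"
    # Between $500 and $1,000 per day: depends on the team's on-call discipline.
    return "pagerduty" if page_in_between else "slack"


print(route_alert(200))
print(route_alert(1_500))
print(route_alert(50, security_related=True))
print(route_alert(750, page_in_between=True))
```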

Do anomaly tools work with Bedrock and SageMaker?
The native AWS tool covers both. Most commercial tools support Bedrock as of mid-2026. Confirm SageMaker training job detection separately, since it bills differently from inference.

Can I write my own anomaly detection?
Yes. The Cost and Usage Report is delivered to S3 at least once a day, with hourly line-item granularity available, and a basic 30% threshold check is about 40 lines of Python. The commercial tools justify themselves at the auto-remediation and multi-cloud join layers, not detection alone.
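A minimal sketch of such a check, assuming a CSV export with the legacy Cost and Usage Report columns `lineItem/UsageStartDate`, `lineItem/ProductCode`, and `lineItem/UnblendedCost`. A FOCUS export uses different column names, so the lookups would need to change. The sample rows are synthetic and stand in for a file downloaded from S3.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def daily_cost_by_service(rows):
    """Sum unblended cost per service per calendar day from CUR-style rows."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        day = row["lineItem/UsageStartDate"][:10]  # keep YYYY-MM-DD
        totals[row["lineItem/ProductCode"]][day] += float(row["lineItem/UnblendedCost"])
    return totals


def find_spikes(totals, threshold=0.30, window=7):
    """Flag service-days that exceed the trailing average by more than threshold.

    Assumes every day has at least one row; missing days are skipped,
    not treated as zero.
    """
    spikes = []
    for service, by_day in totals.items():
        days = sorted(by_day)
        for i in range(window, len(days)):
            baseline = mean(by_day[d] for d in days[i - window:i])
            cost = by_day[days[i]]
            if baseline > 0 and (cost - baseline) / baseline > threshold:
                spikes.append((service, days[i], round(cost, 2)))
    return spikes


# Synthetic sample: EC2 is flat, S3 jumps from $50 to $80 on the eighth day.
lines = ["lineItem/UsageStartDate,lineItem/ProductCode,lineItem/UnblendedCost"]
for day in range(1, 9):
    stamp = f"2026-06-{day:02d}T00:00:00Z"
    lines.append(f"{stamp},AmazonEC2,100.0")
    lines.append(f"{stamp},AmazonS3,{80.0 if day == 8 else 50.0}")

rows = csv.DictReader(io.StringIO("\n".join(lines)))
spikes = find_spikes(daily_cost_by_service(rows))
print(spikes)
```

In a real setup the rows would come from the report files in your bucket, and the spikes would feed an alerting channel instead of a print statement.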

What was your last anomaly incident?

If you remember a Monday morning where the AWS bill ate a sprint, the question worth asking is which of the four signals would have caught it on Saturday. Drop your incident in the comments. I will tell you which signal I would have wired up first.
