Matt

Posted on • Originally published at fortem.dev

AWS Cost Anomaly Detection for ECS Teams: What It Catches, What It Misses, and How to Set It Up

AWS Cost Anomaly Detection for ECS: Setup Guide 2026

Originally published at https://fortem.dev/blog/aws-cost-anomaly-detection-ecs
Set up AWS Cost Anomaly Detection for ECS Fargate fleets with per-environment tag monitors. Includes Terraform config, threshold strategy, and what the 24h delay means for your team.


Guide

aws-cost-anomaly-detection, ecs-cost-monitoring, fargate-cost-alerts

AWS Cost Anomaly Detection is free, ships with ML-based pattern detection, and can catch ECS spend spikes automatically. The catch: it runs on billing data that's up to 24 hours old, and the default setup monitors all ECS spend as one pooled number — not per environment. A spike in staging looks identical to a spike in prod. This guide covers how CAD actually works, how to wire it to your environment tags for per-environment alerts, what Terraform to drop in, and where the tool has real blind spots.

TL;DR

  • CAD is free and uses ML: no static thresholds to maintain, no per-service configuration by default.
  • The default AWS service monitor pools all ECS spend together. Set up a tag-based monitor to get per-environment alerts.
  • Detection can take up to 24 hours after the anomalous usage happens, because billing data lags. Sub-12-hour spikes often go undetected.
  • IMMEDIATE alerts require an SNS topic, not an email address. Email-only subscriptions get a ValidationException.
  • CAD is your monthly fire detector. It won't catch a runaway task that was killed before the billing data arrived.

Ready to use — drop this into your Terraform today

Tag-based monitor on your environment key, SNS topic with correct IAM policy, and an IMMEDIATE subscription with combined $ + % threshold. Replace alerts@yourcompany.com with your on-call address.

# SNS topic for cost anomaly alerts
resource "aws_sns_topic" "cost_anomaly" {
  name = "ecs-cost-anomaly-alerts"
}

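# Current account ID, used by the aws:SourceAccount condition below
data "aws_caller_identity" "current" {}

# Required: grant CAD permission to publish to SNS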
data "aws_iam_policy_document" "cost_anomaly_sns" {
  statement {
    sid     = "AllowCostAnomalyDetection"
    effect  = "Allow"
    actions = ["SNS:Publish"]
    principals {
      type        = "Service"
      identifiers = ["costalerts.amazonaws.com"]
    }
    resources = [aws_sns_topic.cost_anomaly.arn]
    condition {
      test     = "StringEquals"
      variable = "aws:SourceAccount"
      values   = [data.aws_caller_identity.current.account_id]
    }
  }
}

resource "aws_sns_topic_policy" "cost_anomaly" {
  arn    = aws_sns_topic.cost_anomaly.arn
  policy = data.aws_iam_policy_document.cost_anomaly_sns.json
}

resource "aws_sns_topic_subscription" "oncall_email" {
  topic_arn = aws_sns_topic.cost_anomaly.arn
  protocol  = "email"
  endpoint  = "alerts@yourcompany.com"
}

# Tag-based monitor — one ML baseline per environment tag value
resource "aws_ce_anomaly_monitor" "env_monitor" {
  name         = "ecs-per-environment-monitor"
  monitor_type = "CUSTOM"

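  monitor_specification = jsonencode({
    Tags = {
      Key          = "environment"      # must match your cost allocation tag key
      MatchOptions = ["EQUALS"]
      # The Terraform registry example for CUSTOM monitors also sets
      # Values = [...]; list your tag values here if apply rejects this spec.
    }
  })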
}

# Subscription: IMMEDIATE via SNS (email-only = ValidationException)
resource "aws_ce_anomaly_subscription" "env_alerts" {
  name      = "ecs-environment-anomaly-alerts"
  frequency = "IMMEDIATE"

  monitor_arn_list = [aws_ce_anomaly_monitor.env_monitor.arn]

  subscriber {
    type    = "SNS"
    address = aws_sns_topic.cost_anomaly.arn
  }

  depends_on = [aws_sns_topic_policy.cost_anomaly]

  # AND logic: both conditions must be met to reduce alert noise
  threshold_expression {
    and {
      dimension {
        key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
        values        = ["30"]           # $30 minimum impact
        match_options = ["GREATER_THAN_OR_EQUAL"]
      }
    }
    and {
      dimension {
        key           = "ANOMALY_TOTAL_IMPACT_PERCENTAGE"
        values        = ["25"]           # 25% above expected
        match_options = ["GREATER_THAN_OR_EQUAL"]
      }
    }
  }
}

How AWS Cost Anomaly Detection works

CAD uses ML to model your normal spend per dimension, runs approximately 3× daily on billing data that's up to 24 hours old, and alerts when actual spend deviates from expected by more than your configured threshold.

The service launched in 2020 and has been updated significantly since. The November 2025 update switched from calendar-day batches to rolling 24-hour windows — meaning the model now compares your current spend against the same time of day in previous periods, rather than against a full-day total. For ECS workloads with business-hours patterns, this reduces false positives on Monday mornings when spend jumps from a quiet weekend.

| Monitor dimension | What it tracks | ECS use |
| --- | --- | --- |
| AWS services | All ECS spend pooled across all envs | Default; too coarse for fleets |
| Linked accounts | Per AWS account spend | Useful for account-per-env setups |
| Cost allocation tags | Per tag value (e.g. per environment) | Best for ECS fleets using env tags |
| Cost categories | Per business unit or product | Useful for multi-product orgs |

The ML model adjusts for trends and seasonality automatically. You don't set a fixed budget cap — the model learns what "normal" looks like for your specific spend pattern and alerts only when that pattern breaks. The tradeoff: the model needs at least 10 days of history per dimension before it can fire. A brand-new ECS environment with zero history gets no anomaly alerts until day 11.

What ECS cost spikes CAD actually catches

CAD catches ECS spend anomalies at AWS service level by default, meaning all your environments are pooled together. It reliably catches sustained scale-out events, forgotten running environments, and Fargate Spot-to-On-Demand fallback spikes that outlast the roughly 24-hour billing data lag.

The "sustained" qualifier matters. Per-environment cost visibility on ECS is already hard, and CAD makes it harder when all environments share one anomaly baseline. A $200 spike in your dev environment looks like noise when your prod environment spends $2,000.

| Scenario | Spike type | CAD (default) | Delay |
| --- | --- | --- | --- |
| Dev env running 24/7 after sprint ends | Sustained (3+ days) | Catches it | 24–48 hours |
| Fargate Spot falls back to On-Demand for 8 hours | Sustained (8+ hours) | Usually catches it | 24 hours |
| Runaway task scales to 50 replicas for 3 hours | Short spike (<6h) | Often misses | Task gone before detection |
| NAT Gateway burst from one env's batch job | Single-env spike | Misses (pools with others) | No per-env alert without tag monitor |

KEY INSIGHT: The 24-hour delay is the biggest constraint for ECS teams. A Fargate task that scaled out at 9am and was killed by 3pm often generates no anomaly alert: the spend lands in a single billing period, and CAD reads billing data with a lag of up to 24 hours. By the time the data arrives, the task is gone.

Setting up a per-environment tag monitor

Create a Cost Allocation Tag monitor on your environment key. Each tag value (dev, staging, prod) gets its own ML baseline and can fire independently without one environment's spend pattern polluting another's alert.

Before creating a tag-based monitor, your cost allocation tags must be activated. Tags only appear in Cost Explorer — and therefore in CAD — after activation. This catches teams off guard: you've been tagging your ECS tasks for months, but none of that data flows into CAD until you flip the switch.

Prerequisite — activate cost allocation tags

  1. Open the AWS Billing and Cost Management console
  2. Navigate to Cost allocation tags → "AWS-generated tags" and "User-defined tags" tabs
  3. Find your environment tag key (e.g. "environment") → click Activate
  4. Wait up to 24 hours for historical data to appear in Cost Explorer
  5. Then create your CAD monitor; do not create it before activation
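
If you would rather manage activation in Terraform than click through the console, the AWS provider has an aws_ce_cost_allocation_tag resource. A minimal sketch, assuming the tag key is literally "environment" and that AWS Billing has already seen the key on at least one billed resource (activation of a key it has never seen is likely to fail):

```hcl
# Activates the user-defined "environment" key as a cost allocation tag.
# Assumption: applied from the management (payer) account when the account
# belongs to an AWS Organization.
resource "aws_ce_cost_allocation_tag" "environment" {
  tag_key = "environment" # must match the tag key used on your ECS resources
  status  = "Active"
}
```

Terraform returns as soon as the status flips, so the wait of up to 24 hours for data to reach Cost Explorer still applies before the monitor is worth creating.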

Choose an AWS managed monitor (not customer managed) for your environment tag. Managed monitors automatically discover new tag values as you add environments: if you spin up a new "staging-eu" environment next month, the monitor picks it up without any config change. The trade-off: all tag values share one alert threshold. If you need different thresholds for prod vs dev, use customer managed monitors, but note that they cap at 10 tag values per monitor.

New tag values need 10 days of billing history before CAD can model normal spend and fire alerts. Plan for this when spinning up a new environment — don't expect anomaly alerts in the first two weeks.

Terraform: the full config

Two core resources: aws_ce_anomaly_monitor (tag-based) and aws_ce_anomaly_subscription (SNS, IMMEDIATE). Email-only subscriptions cannot use IMMEDIATE frequency, so you need an SNS topic with the correct topic access policy first.

The Terraform block at the top of this article is the complete production config. Two things to get right:

SNS topic policy

The SNS topic must explicitly grant costalerts.amazonaws.com permission to publish. Without this policy, CAD sends no error — alerts fail silently and you get nothing. The aws:SourceAccount condition limits the permission to your own account only.

CUSTOM vs DIMENSIONAL monitor type

Tag-based monitors use monitor_type = "CUSTOM" with a monitor_specification JSON block. Service-level monitors use monitor_type = "DIMENSIONAL" with monitor_dimension = "SERVICE". These are different resource shapes in the Terraform provider — using the wrong type will error at apply time.
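
For contrast with the CUSTOM tag monitor, here is a minimal sketch of the service-level shape. The resource label and name are placeholders, not something the article prescribes:

```hcl
# Service-level monitor: one pooled baseline per AWS service.
# No monitor_specification block; the dimension is a plain argument.
resource "aws_ce_anomaly_monitor" "service_monitor" {
  name              = "ecs-fleet-service-monitor" # hypothetical name
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}
```

If the account already has an AWS services monitor (the console creates one in many accounts), you will probably need to import that monitor into state instead of creating a second one.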

For teams using consistent cost allocation tagging across environments, the tag-based monitor can also be created as an AWS managed monitor (not customer managed), which auto-discovers new environments. The Terraform resource for a managed TAG monitor looks slightly different: omit the monitor_specification block and instead use monitor_type = "DIMENSIONAL" with monitor_dimension = "TAG". Check the Terraform AWS provider documentation for aws_ce_anomaly_monitor to confirm that your provider version accepts this syntax.

Threshold strategy for ECS fleets

For a 10-environment fleet, start with $30 absolute AND 25% relative (AND logic). Production alone warrants a lower absolute threshold — $20 with 20% catches real incidents without drowning in dev noise. The AWS default (40% + $100) is too blunt for ECS environments with variable baselines.

The problem with the $100 default: a dev environment spending $40/month on idle Fargate tasks can spike to $120, a 200% increase, and never trigger an alert because the $80 impact stays under the $100 absolute threshold. For small environments, percentage-based thresholds catch what dollar thresholds miss.

| Environment | Absolute ($) | Percentage (%) | Logic |
| --- | --- | --- | --- |
| Production | $20 | 20% | AND |
| Staging | $30 | 25% | AND |
| Dev / ephemeral | $15 | 30% | AND |
| All environments (fallback) | $30 | 25% | AND |
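
One way to express per-environment thresholds like these is a customer managed monitor and a subscription per environment, driven by a single map. This is a sketch under assumptions: the tag values prod, staging, and dev are illustrative, local.env_thresholds is a hypothetical name, and aws_sns_topic.cost_anomaly refers to the topic defined in the config at the top of the article.

```hcl
locals {
  # Hypothetical map: tag value => thresholds from the table above
  env_thresholds = {
    prod    = { absolute = "20", percentage = "20" }
    staging = { absolute = "30", percentage = "25" }
    dev     = { absolute = "15", percentage = "30" }
  }
}

# One customer managed monitor per environment tag value
resource "aws_ce_anomaly_monitor" "per_env" {
  for_each     = local.env_thresholds
  name         = "ecs-${each.key}-monitor"
  monitor_type = "CUSTOM"

  monitor_specification = jsonencode({
    Tags = {
      Key          = "environment"
      Values       = [each.key]
      MatchOptions = ["EQUALS"]
    }
  })
}

# One subscription per environment, each with its own AND threshold
resource "aws_ce_anomaly_subscription" "per_env" {
  for_each  = local.env_thresholds
  name      = "ecs-${each.key}-anomaly-alerts"
  frequency = "IMMEDIATE"

  monitor_arn_list = [aws_ce_anomaly_monitor.per_env[each.key].arn]

  subscriber {
    type    = "SNS"
    address = aws_sns_topic.cost_anomaly.arn # topic from the main config
  }

  threshold_expression {
    and {
      dimension {
        key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
        values        = [each.value.absolute]
        match_options = ["GREATER_THAN_OR_EQUAL"]
      }
    }
    and {
      dimension {
        key           = "ANOMALY_TOTAL_IMPACT_PERCENTAGE"
        values        = [each.value.percentage]
        match_options = ["GREATER_THAN_OR_EQUAL"]
      }
    }
  }
}
```

The cost of this layout is that new environments are not discovered automatically; each one needs an entry in the map.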

Use AND, not OR. OR logic on a percentage threshold fires every time a tiny environment has any activity after a quiet weekend, because even a few dollars above a near-$0 baseline is an enormous percentage increase. AND requires both the dollar amount and the percentage to be exceeded simultaneously, which dramatically reduces noise from small environments with variable usage.

After the first few weeks, mark detected anomalies as "Accurate anomaly" or "Not an issue" in the console. CAD uses this feedback to tune the model. A model trained on your team's feedback converges on your actual noise floor faster than one running without it.

Where CAD falls short for ECS teams

CAD won't catch a Fargate task that scales to 50 replicas, runs for 6 hours, and is killed before billing data arrives. It also can't alert on per-service cost within an environment — only on per-environment total spend.

Three hard limits to plan around:

No real-time detection

Cost Explorer has up to a 24-hour data lag. CAD runs 3× per day on that data. A Fargate task that spends $300 between 8am and 5pm on a Tuesday won't appear in CAD until Wednesday at the earliest, and only if the spending pattern looks anomalous relative to your history. Faster detection requires CloudWatch metrics and billing alarms, which operate on estimated charges that refresh several times a day.
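
A billing alarm of the kind mentioned above might look like the sketch below. Assumptions: billing alerts are enabled in the account's billing preferences, the 500 USD threshold is a placeholder, and var.billing_alarm_topic_arn is a hypothetical variable pointing at an SNS topic in us-east-1, the region where AWS publishes billing metrics.

```hcl
variable "billing_alarm_topic_arn" {
  description = "Hypothetical: ARN of an SNS topic in us-east-1 for billing alarms"
  type        = string
}

# Billing metrics only exist in us-east-1, so pin a provider alias there
provider "aws" {
  alias  = "billing"
  region = "us-east-1"
}

resource "aws_cloudwatch_metric_alarm" "estimated_charges" {
  provider            = aws.billing
  alarm_name          = "estimated-charges-over-threshold"
  namespace           = "AWS/Billing"
  metric_name         = "EstimatedCharges"
  statistic           = "Maximum"
  period              = 21600 # 6 hours, in seconds
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 500 # placeholder: month-to-date USD

  dimensions = {
    Currency = "USD"
  }

  alarm_actions = [var.billing_alarm_topic_arn]
}
```

EstimatedCharges is a month-to-date account total, so this is a coarse backstop against a static number and carries no per-environment signal.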

No service-level granularity within an environment

The tag-based monitor fires when total spend for the "dev" tag value deviates from normal. It cannot tell you which ECS service within "dev" caused the spike. Root cause analysis surfaces up to 10 contributing factors (service, region, account, usage type) — but these are dimensions in Cost Explorer, not ECS service names. You still need Cost Explorer or a per-service tagging strategy to narrow it down.

Scheduled environments create false anomalies

If you schedule non-prod environments to stop outside business hours, CAD sees a cost of $0 at night and a spike every morning when they restart. The ML model learns this pattern over time — but the first 2–4 weeks after introducing scheduling will generate false positive alerts. Disable alerts during the model warm-up period or set a higher absolute threshold temporarily.

KEY INSIGHT: CAD is your monthly fire detector. It catches sustained burns: a forgotten environment left running, a Spot fallback that held for three days. Fortem's per-environment cost tracking is your smoke alarm: it sees what's happening now, before it becomes a billing-cycle problem.

"AWS Cost Anomaly Detection runs approximately three times a day after your billing data is processed. Anomaly detection relies on the data from Cost Explorer which has a latency of up to 24 hours. Therefore, it can take up to 24 hours to detect an anomaly after the anomalous usage happens."

AWS Cost Anomaly Detection FAQ, verified June 2026

If you read this, you might also want to know

Can I use CAD with a multi-account ECS setup?

Yes. In a management account, create a linked account monitor to track per-member-account spend. Combine it with a tag-based monitor per account if you want both dimensions. Member accounts can only create an AWS service monitor — linked account and tag monitors require the management account.

What if my ECS tasks don't have environment tags yet?

Tag-based monitoring only works on costs that are tagged. Untagged ECS tasks appear in the 'no tag value' bucket. The fastest path: add a default_tags block to your Terraform AWS provider, so every taggable resource that provider manages gets the environment tag automatically without changes to individual resource configs.
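
A minimal sketch of that provider block, assuming one Terraform workspace or root module per environment. The variable name environment and the region are placeholders:

```hcl
variable "environment" {
  description = "Environment name applied as a tag (assumed values: dev, staging, prod)"
  type        = string
}

provider "aws" {
  region = "us-east-1" # assumption: replace with your region

  default_tags {
    tags = {
      environment = var.environment # key must match the activated cost allocation tag
    }
  }
}
```

One caveat worth checking: tasks launched by an ECS service only inherit tags when the aws_ecs_service resource sets propagate_tags to "SERVICE" or "TASK_DEFINITION", so confirm that setting if task-level costs still show up untagged.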

Does CAD replace AWS Budgets?

No — they answer different questions. Budgets: 'alert me when I cross $X.' CAD: 'alert me when I'm abnormally above my historical pattern, even if I haven't hit a fixed cap.' Use Budgets for hard financial limits and CAD for pattern deviation. A $50 spike in a normally-$10 environment is an anomaly even if it's well below your budget cap.
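
For the Budgets half of that pairing, a hedged sketch of a fixed monthly cap scoped to one environment tag. The 500 USD limit, the budget name, and the dev tag value are placeholders; the cost filter assumes the user-defined tag format "user:<key>$<value>" that AWS Budgets uses for TagKeyValue filters.

```hcl
# Static monthly cap for the dev environment, complementing CAD's pattern detection
resource "aws_budgets_budget" "dev_monthly" {
  name         = "ecs-dev-monthly" # hypothetical name
  budget_type  = "COST"
  limit_amount = "500" # placeholder cap
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Restrict the budget to costs tagged environment=dev
  cost_filter {
    name   = "TagKeyValue"
    values = ["user:environment$dev"]
  }

  # Email when actual spend passes 100% of the limit
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["alerts@yourcompany.com"]
  }
}
```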

Common questions

Is AWS Cost Anomaly Detection free?

Yes. CAD itself is free — no charge for monitors, alert subscriptions, or email/SNS delivery. The underlying data comes from Cost Explorer, which charges $0.01 per API request for programmatic access. Console use is free. For most ECS teams, the total cost is $0.

How long does it take for Cost Anomaly Detection to start working?

CAD needs at least 10 days of historical billing data per monitored dimension before it can model 'normal' spend. After setup, alerts can take up to 24 hours to fire — Cost Explorer (which CAD reads) has a built-in delay of up to 24 hours. A Fargate task that scales out and is killed within 12 hours may never trigger an alert.

Can AWS Cost Anomaly Detection detect ECS Fargate cost spikes?

Yes, with caveats. The default AWS service monitor sees all ECS spend pooled together — a spike in one environment dilutes across others. For per-environment detection, create a tag-based monitor on your 'environment' cost allocation tag. Each tag value gets its own ML baseline and can fire independently.

What is the difference between AWS Cost Anomaly Detection and AWS Budgets?

Budgets use static thresholds: 'alert me when I spend more than $500.' CAD uses ML to detect deviation from your historical pattern — it catches a 200% spike even if you're only at $50 total. Use both: Budgets for hard caps, CAD for pattern deviation. They answer different questions.

How do I get immediate alerts from Cost Anomaly Detection?

Set alerting frequency to 'Individual alerts' — but this requires an SNS topic, not an email address. Attempting to use IMMEDIATE frequency with an email-only subscription triggers a ValidationException. Create an SNS topic, grant costalerts.amazonaws.com permission to publish, then subscribe your email to the SNS topic.

What triggers a cost anomaly alert?

By default: spend 40% above expected AND at least $100 above expected. Both conditions must be met. You can customize both thresholds — for ECS environments with variable usage, combine an absolute threshold ($20–$50) AND a percentage threshold (20–30%) using AND logic to avoid false positives on small environments.



See your real per-env cost: fortem.dev/ecs-cost-calculator
