DEV Community: Matt

Why Headroom Breaks on AWS Bedrock — and How to Fix All Four Failures

Matt — Tue, 07 Jul 2026 11:15:00 +0000

You're on AWS Bedrock because compliance won't let your data leave AWS. Your agent workflow multiplies tokens, the bill climbs, and you found Headroom — the open-source compression proxy that promises 60–95% fewer tokens (their benchmark). You point it at Bedrock, and it silently does almost nothing.

I wrote the full guide with the exact error strings, the fixes, and measured numbers. Here's the TL;DR.

The four ways it silently breaks on Bedrock

404 on native routes — the proxy 404s /model/{id}/invoke and /inference-profiles, so Claude Code's auto-mode classifier fails closed (issue #1589).
InvalidSignatureException — SigV4 signs a hash of the request body; compression rewrites the body, so the signature no longer matches (PR #1220).
Dead prompt cache — in --mode cache it prints prefix_frozen but never injects a cachePoint, so Bedrock reports savings_pct: 0.0 (issue #1345). Bedrock also caps you at 4 cache markers; exceed that and you get "A maximum of 4 blocks with cache_control may be provided. Found 5."
ModuleNotFoundError: botocore — crashes on temporary STS credentials because the image doesn't bundle botocore (issue #1551).

The savings math (a formula, not a promise)

Bedrock bills input + 1.25×write + 0.1×read. Caching attacks the read term (0.1× — a 10× discount on the repeated prefix); compression attacks the input term. They stack.

Measured on a 6-turn agentic session (us-east-1, July 2026, Claude Sonnet 4.5):

No optimization: 619,865 token-equivalents
Cache only (fixed): 257,250 — −58.5%
Cache + compression: 183,672 — −70.4%
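
As a quick check on those figures, the billing formula and the two savings percentages can be written out. The symbols I, W, and R (uncached input, cache-write, and cache-read tokens) and T (token-equivalents) are notation introduced here for illustration, not the guide's.

```latex
\begin{aligned}
T &= I + 1.25\,W + 0.1\,R \\
\text{savings} &= 1 - \frac{T_{\text{optimized}}}{T_{\text{baseline}}} \\
\text{cache only:}\quad 1 - \frac{257{,}250}{619{,}865} &\approx 0.585 \\
\text{cache + compression:}\quad 1 - \frac{183{,}672}{619{,}865} &\approx 0.704
\end{aligned}
```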

One trap: compact JSON (the default from json.dumps / JSON.stringify / boto3) compresses 0%. The same data with whitespace compresses 43% — token count tracks whitespace.

The full guide covers the step-by-step fix, how to verify the cache actually hits (cacheReadInputTokens > 0 on the second request), and the whole error-string reference.

👉 Read the complete guide →

Caprock is a managed distribution of Headroom that ships all four Bedrock fixes and runs entirely inside your VPC, so nothing leaves your AWS account — caprock.dev.

How to Put an ALB in Front of ECS Fargate

Matt — Tue, 07 Jul 2026 11:14:14 +0000

Originally published at https://fortem.dev/blog/ecs-load-balancer-guide
Attach an Application Load Balancer to ECS Fargate: the awsvpc ip target-type rule, health checks that pass, the Terraform to wire it, and shared-vs-per-service ALB cost at fleet scale.

The task is RUNNING, but the target group says unhealthy and the URL 503s. Attaching an Application Load Balancer to a Fargate service has three real traps: the awsvpc "ip target type" rule, a health check that never passes, and — at 10+ services — one ALB per service quietly costing more than the compute. Here's the Terraform-native version, with the numbers to debug the 503 and to decide whether to share one ALB.

TL;DR

Fargate uses the awsvpc network mode, so the target group MUST use target type "ip", not "instance" — tasks are ENIs, not EC2 instances. This is the #1 first-time error.
ECS registers and deregisters tasks with the target group automatically through its service-linked role. You attach the target group to the service; you don't script registration.
An unhealthy target is, 90% of the time, a security-group gap or a health-check path that doesn't return 200. Fix the SG rule from the ALB to the task port first.
One ALB per service is simplest and costs ~$16/month each. At 20 services that's ~$330/month — a shared ALB with host/path routing collapses it to one base charge.
Tune deregistration delay (300s→5s) and stop timeout (30s→2s), and trap SIGTERM, for fast zero-downtime deploys.

Ready to use — ALB + target group + listener for a Fargate service

# The load balancer and its security group
resource "aws_lb" "app" {
  name               = "app-alb"
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids
  security_groups    = [aws_security_group.alb.id]
}

resource "aws_security_group" "alb" {
  name   = "app-alb-sg"
  vpc_id = var.vpc_id
  # Multi-line blocks: HCL rejects comma-separated arguments in a one-line block
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Target group — MUST be target_type = "ip" for Fargate (awsvpc)
resource "aws_lb_target_group" "app" {
  name        = "app-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  health_check {
    path                = "/health"
    matcher             = "200"
    interval            = 15
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }

  # Drain fast for short-lived requests (default is 300s)
  deregistration_delay = 5
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.app.arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = var.acm_certificate_arn
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.app.arn
  }
}

# The task's SG must allow the ALB's SG on the container port
resource "aws_security_group_rule" "alb_to_task" {
  type                     = "ingress"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  security_group_id        = aws_security_group.task.id
  source_security_group_id = aws_security_group.alb.id
}

# Attach the target group to the service — ECS registers tasks for you
resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = var.cluster_arn
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  health_check_grace_period_seconds = 60

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "app"
    container_port   = 8080
  }

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [aws_security_group.task.id]
  }
}

The two lines that break most first attempts: target_type = "ip" and the alb_to_task security-group rule.

Why Fargate changes how you attach a load balancer

Fargate tasks use the awsvpc network mode, so each task is an elastic network interface with its own IP — not an EC2 instance. That's why the target group must use target type ip.

On EC2-backed ECS, a task maps to a host and a port, so the classic target type of instance works — the load balancer routes to an instance ID and a dynamically-mapped port. Fargate has no host you own. Every task gets its own ENI and private IP, and the load balancer has to route to that IP directly. Register a Fargate service against an instance target group and nothing registers at all — no error you'd expect, just zero targets and a 503.

"For services with tasks using the awsvpc network mode, when you create a target group for your service, you must choose ip as the target type, not instance. This is because tasks that use the awsvpc network mode are associated with an elastic network interface, not an Amazon EC2 instance."

— AWS ECS documentation, verified July 2026

One upside of the ip target type: it supports cross-VPC connectivity, whereas the instance target type requires the load balancer and tasks to be in the same VPC.

The Terraform to wire it up (target type ip)

One ALB, one target group with target_type = ip, one listener, and a security-group rule from the ALB to the task port. ECS registers tasks automatically through its service-linked role.

The Ready-to-use block above is the whole thing. The part people miss isn't the ALB — it's the wiring between it and the task. Two security groups: one on the ALB that lets the public in on 443, and a rule on the task's security group that lets the ALB's security group reach the container port. Reference the ALB's SG as the source, not a CIDR — the task should only accept traffic from the load balancer, never the open internet.

You never write registration logic. Amazon ECS holds a service-linked IAM role whose whole job is to register a task with the target group when it starts and deregister it when it stops. You declare the load_balancer block on the service, and ECS keeps the target group in sync with the running tasks as they cycle through deploys and scaling.

Why your target is unhealthy (the 503 debug)

An unhealthy target is almost always one of three things: the security group blocks the ALB on the task port, the health-check path returns non-200, or the app starts slower than the grace period.

AWS maintains an entire knowledge-center article on this one failure, which tells you how common it is. Work the causes in order — security group first, because it's both the most frequent and the easiest to overlook. If the task's SG doesn't allow the ALB's SG on the container port, the health check can't even reach the app, and the target sits unhealthy from the moment it launches.

Root cause	Symptom	Fix
Security group gap	Target stuck unhealthy from launch	Allow the ALB's SG inbound on the container port
Health path returns non-200	Target flaps or never passes	Point the check at a path that returns 200 (e.g. /health)
Grace period too short	Task killed and restarted in a loop	Set healthCheckGracePeriodSeconds above real startup time
Wrong target type	No targets register at all	Use target_type = "ip" for awsvpc / Fargate
Health-check port mismatch	Unhealthy despite app being up	Match the target-group port to the container port

A recurring "unhealthy" that clears and comes back is different from one that never passes — it's usually load or a slow dependency, and that's where metrics earn their keep. Pairing target health with task-level signals is exactly what monitoring ECS Fargate across a fleet is for: a flapping target with a memory alarm behind it tells you the health check isn't the bug, the app is.

KEY INSIGHT: A newly-registered target only needs one passing health check to go healthy — the healthy-threshold count only applies when a target recovers from unhealthy back to healthy. So a target that never goes healthy on first launch is almost never a threshold problem; it's a reachability or path problem. Chase the security group and the path, not the thresholds.

ALB health checks vs container health checks

Different systems. The ALB health check decides if a target gets traffic; the container health check decides if ECS restarts the task. Run both — keep the grace period longer than startup.

This is a genuinely confusing overlap — practitioners ask on AWS re:Post whether they should use one, the other, or both. The answer is both, because they answer different questions. The target group's health check governs traffic: fail it and the ALB stops routing to that task. The container health check in the task definition governs lifecycle: fail it and ECS kills and replaces the container. A task can be healthy to ECS but pulled from the ALB, or the reverse.

The one setting that ties them together is health_check_grace_period_seconds on the service. It tells ECS to ignore failing ALB health checks for the first N seconds after a task starts, so a slow-booting app isn't killed before it's ready. Set it above your real cold-start time — a JVM service that takes 45 seconds to warm up needs a grace period well past that, or every deploy turns into a restart loop.
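
To make the two checks concrete, here is a minimal sketch of a Fargate task definition carrying its own container health check, alongside the target-group check shown earlier. The family name, CPU and memory sizes, var.image, var.execution_role_arn, and the curl-based command are all assumptions for illustration; the command only works if curl is present in your image.

```hcl
# Sketch only: names, sizes, and variables below are placeholders.
resource "aws_ecs_task_definition" "app" {
  family                   = "app"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = var.execution_role_arn # hypothetical variable

  container_definitions = jsonencode([
    {
      name         = "app"
      image        = var.image # hypothetical variable
      essential    = true
      portMappings = [{ containerPort = 8080, protocol = "tcp" }]

      # Container health check: governs lifecycle (ECS replaces the task
      # when it fails), independent of the ALB target-group check.
      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"]
        interval    = 15
        timeout     = 5
        retries     = 3
        startPeriod = 60 # seconds of startup during which failures are not counted
      }
    }
  ])
}
```

Here startPeriod plays the same role for the container check that health_check_grace_period_seconds plays for the ALB check, so it makes sense to size both against the same measured cold-start time.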

One ALB per service, or a shared ALB?

One ALB per service is simplest but costs about $16/month each. A shared ALB with host- or path-based listener rules serves many services on one base charge — the default past a handful of services.

Every ALB carries an hourly base charge whether it's serving one request or a million. At roughly $16.43/month per load balancer, the per-service pattern looks free at three services and turns into a line item at twenty. A single ALB can front dozens of services through listener rules — route api.example.com to one target group and /admin to another — so you pay one base charge for the whole fleet.

Services	One ALB each (base)	One shared ALB (base)
5 services	$82/mo	$16/mo
10 services	$164/mo	$16/mo
20 services	$329/mo	$16/mo

Base ALB charge only ($0.0225/hr, us-east-1, verified July 2026). LCU charges ($0.008/LCU-hr) add on top for both and scale with traffic.
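
The table's figures can be reproduced from the hourly rate. The 730-hour month is an assumption here, chosen because it is what the $16.43 figure implies.

```latex
\begin{aligned}
\text{base per ALB} &= \$0.0225/\text{hr} \times 730\ \text{hr} = \$16.425 \approx \$16.43/\text{month} \\
5 \times \$16.425 &= \$82.125 \approx \$82 \\
10 \times \$16.425 &= \$164.25 \approx \$164 \\
20 \times \$16.425 &= \$328.50 \approx \$329
\end{aligned}
```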

KEY INSIGHT: The shared ALB isn't free savings — it trades dollars for coupling. One listener config now affects every service behind it, and a bad rule change can 503 the whole fleet instead of one app. The rule of thumb: per-service ALBs while you have a handful and want blast-radius isolation; a shared ALB once the base charges outweigh the isolation, with each service still on its own unique target group.
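
A shared ALB might be wired roughly like this: one listener, one rule per service, each forwarding to that service's own target group. This is a sketch that reuses the aws_lb_listener.https resource from the Ready-to-use block; the hostname, path, priorities, and the api and admin target groups are illustrative assumptions.

```hcl
# Host-based rule: api.example.com goes to the api service's target group
resource "aws_lb_listener_rule" "api" {
  listener_arn = aws_lb_listener.https.arn
  priority     = 10

  condition {
    host_header {
      values = ["api.example.com"]
    }
  }

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.api.arn # hypothetical target group
  }
}

# Path-based rule: /admin goes to the admin service's target group
resource "aws_lb_listener_rule" "admin" {
  listener_arn = aws_lb_listener.https.arn
  priority     = 20

  condition {
    path_pattern {
      values = ["/admin", "/admin/*"]
    }
  }

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.admin.arn # hypothetical target group
  }
}
```

Rules are evaluated in priority order, lowest number first, and anything that matches no rule falls through to the listener's default action. Keeping each service's rule in that service's own Terraform limits how far a bad rule change can reach.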

Zero-downtime deploys: deregistration delay and SIGTERM

The default 300-second deregistration delay makes every deploy drain slowly. For sub-second services, drop it to 5s, set ECS_CONTAINER_STOP_TIMEOUT to 2s, and trap SIGTERM to drain in-flight requests.

When ECS replaces a task, the ALB stops sending it new requests and waits out the deregistration delay before cutting existing connections. The default is 300 seconds — five minutes of a draining task lingering per deploy. AWS's own guidance: if your responses finish in under a second, set deregistration_delay.timeout_seconds to 5. Don't do this for long-lived requests like slow uploads or streaming — those need the room.

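The other half is graceful shutdown. ECS sends SIGTERM, waits out the stop timeout (default 30 seconds), then SIGKILLs. On EC2-backed ECS that timeout is the ECS_CONTAINER_STOP_TIMEOUT agent setting; on Fargate, where there is no agent for you to configure, it is the stopTimeout parameter in the container definition. If your app ignores SIGTERM, it eats the full wait on every deploy. Trap SIGTERM to stop accepting new connections and finish in-flight ones, and a task that drains in 500ms exits cleanly instead of waiting out the timeout.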

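// Minimal server so the handler below has something to drain (stand-in for your app)
const http = require("node:http");
const server = http.createServer((req, res) => res.end("ok")).listen(8080);

// Trap SIGTERM so in-flight requests finish before the task exits
process.on("SIGTERM", () => {
  server.close(() => process.exit(0)); // stop new conns, drain, then exit
});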
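
On Fargate, the stop-timeout half of this is set per container in the task definition rather than through an agent variable. A minimal sketch, assuming a placeholder family name, sizes, and a hypothetical var.image; the value 2 simply mirrors the figure above, so check the allowed range for your platform version before copying it.

```hcl
# Sketch only: family, sizes, and var.image are placeholders.
resource "aws_ecs_task_definition" "fast_stop" {
  family                   = "app-fast-stop"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512

  container_definitions = jsonencode([
    {
      name      = "app"
      image     = var.image # hypothetical variable
      essential = true

      # Seconds ECS waits between SIGTERM and SIGKILL for this container.
      # Raise it if in-flight requests can run longer than this.
      stopTimeout = 2
    }
  ])
}
```

A short stop timeout only helps if the SIGTERM handler really does drain quickly; with long-lived requests it cuts them off instead.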

The traps that bite at fleet scale

Load balancer config can't be changed in the console after the service is created — only via CLI or CloudFormation. Use a unique target group per service; sharing one breaks deployments.

Two AWS-documented rules cause most of the fleet-scale pain. First: once a service exists, you cannot change its load-balancer configuration from the console — you have to go through the CLI, CloudFormation, or the SDK, and the change forces a new deployment that re-registers every task. Discover this mid-incident and you'll waste time hunting for a console button that isn't there.

Second: give every service its own target group. Sharing one target group across services "might lead to issues during service deployments," per AWS: the two services fight over the same registration set and deploys go sideways. This is the ingress side of ECS networking. Service-to-service (East-West) traffic inside the fleet is a separate decision, covered in which ECS service discovery you should use: Cloud Map, Service Connect, or an internal ALB.

If you read this, you might also want to know

Do I need an NLB instead of an ALB for gRPC or raw TCP?

For HTTP/HTTPS and gRPC, the ALB is right — it speaks layer 7 and supports gRPC target groups. For raw TCP/UDP, non-HTTP protocols, or when you need a static IP and ultra-low latency, use a Network Load Balancer. The Fargate wiring is nearly identical: target_type = ip, and ECS registers tasks the same way.
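
For comparison, a Network Load Balancer for a raw TCP service might look like the sketch below. The resource names and port 9000 are assumptions for illustration; the variables match the ones used in the ALB example.

```hcl
resource "aws_lb" "tcp" {
  name               = "app-nlb"
  load_balancer_type = "network"
  subnets            = var.public_subnet_ids
}

# Same rule as the ALB: Fargate (awsvpc) needs target_type = "ip"
resource "aws_lb_target_group" "tcp" {
  name        = "app-tcp-tg"
  port        = 9000 # placeholder port
  protocol    = "TCP"
  vpc_id      = var.vpc_id
  target_type = "ip"
}

resource "aws_lb_listener" "tcp" {
  load_balancer_arn = aws_lb.tcp.arn
  port              = 9000
  protocol          = "TCP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.tcp.arn
  }
}
```

The aws_ecs_service load_balancer block is unchanged; it just points at this target group instead.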

How do I front multiple Fargate services on one domain?

Use one ALB with path- or host-based listener rules. /api forwards to the api target group, /billing to the billing target group; or api.example.com and app.example.com split by host. Each service keeps its own unique target group — you're only sharing the load balancer and its listeners, not the target group.

Can the ALB and the Fargate tasks live in different VPCs?

Yes, if the target group uses target type ip (which Fargate requires anyway). The ip target type supports cross-VPC connectivity; the instance target type does not. You still need routing and security-group rules that let the ALB reach the task IPs across the VPC boundary.

Frequently asked questions

Why is my ECS Fargate target unhealthy in the ALB?

Almost always one of three things: the task's security group doesn't allow inbound traffic from the load balancer's security group on the container port, the health-check path returns a non-200 status, or the application takes longer to start than the health-check grace period allows. Check the security group rule first — it's the most common cause.

Should I use ALB health checks, container health checks, or both?

They do different jobs, so run both. The ALB (target group) health check decides whether a task receives traffic. The container health check in the task definition decides whether ECS restarts the container. Keep the ALB health-check grace period longer than your container's startup time so a slow boot doesn't get killed before it's ready.

Why must Fargate use ip target type instead of instance?

Fargate tasks run in the awsvpc network mode, so each task gets its own elastic network interface and IP address rather than sharing an EC2 instance. The target group must use target type "ip" to register those task IPs. "instance" target type only works for EC2-backed tasks that map to an instance ID.

How much does an Application Load Balancer cost?

In us-east-1, $0.0225 per hour (about $16.43/month) plus $0.008 per LCU-hour for capacity used (verified July 2026). The hourly base charge is per ALB, so running one ALB per service at 20 services is roughly $330/month in base charges alone — a shared ALB with host/path routing collapses that to one base charge.

Can one ALB serve multiple ECS services?

Yes. A single ALB routes to many services using host-based or path-based listener rules — api.example.com to one target group, /admin to another. Give each service its own unique target group; sharing one target group across services causes problems during deployments.

Every ALB, target group, and service — mapped

At 10+ environments, nobody can say which ALB fronts which service, or which target groups are draining money on dead tasks. Fortem maps the whole ECS fleet — load balancers, target groups, and the services behind them — on one screen. Book a 20-minute call and we'll walk yours.

Worth reading

Guide: ECS Service Discovery: Cloud Map vs Service Connect. The other half of ECS networking: East-West service-to-service traffic. Cloud Map, Service Connect, or an internal ALB, with a decision table.
Landing: AWS ECS Fargate: What It Is, How It Works, What It Costs. The head-term reference for ECS Fargate: where networking, the awsvpc mode, and load balancing sit in the bigger picture.

Map every ALB and target group across your fleet: fortem.dev/book

How Do You Monitor ECS Fargate Across 10+ Environments?

Matt — Tue, 07 Jul 2026 11:14:04 +0000

How to Monitor ECS Fargate at Scale

Originally published at https://fortem.dev/blog/ecs-fargate-monitoring
Monitor ECS Fargate across 10+ environments: which metrics to alarm on, what Container Insights actually costs per metric, and a Terraform for_each fleet alarm pattern.

A CPU alarm on one service is a five-minute job. The same three alarms on thirty services in five accounts — consistently, and without a Container Insights bill that quietly grows into four figures — is the actual work. This guide covers which metrics are worth paying for, the exact Container Insights cost math at fleet scale, and a Terraform pattern that alarms your whole fleet from one module.

TL;DR

Fargate needs no monitoring agent. Container Insights collects CPU, memory, network, and ephemeral-storage metrics with no sidecar — the catch is the per-metric bill.
Enhanced observability bills $0.07/metric/month across cluster, service, task-def, task, and container. Enabled everywhere on a 30-service fleet, the modeled estimate below comes to roughly $336 a month, about $4,000 a year.
Turn enhanced observability ON in prod, OFF in dev. Dev doesn't need container-level metrics — that one split roughly halves the bill.
Alarm as code with a Terraform for_each: one module, N services, zero copy-paste. Hand-writing three alarms per service is 90 blocks at 30 services.
The metric that catches most incidents isn't CPU — it's RunningTaskCount below desired.

Ready to use — fleet alarms from one Terraform module

# One module, every service. Add a service to the map, get its 3 alarms.
variable "services" {
  type = map(object({
    cluster       = string
    desired_count = number
  }))
  # Example:
  # {
  #   api      = { cluster = "prod", desired_count = 4 }
  #   worker   = { cluster = "prod", desired_count = 2 }
  #   payments = { cluster = "prod", desired_count = 3 }
  # }
}

resource "aws_sns_topic" "alerts" {
  name = "ecs-fleet-alerts"
}

# CPU > 90% for 5 min
resource "aws_cloudwatch_metric_alarm" "cpu" {
  for_each            = var.services
  alarm_name          = "ecs-${each.key}-cpu-high"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 1
  threshold           = 90
  comparison_operator = "GreaterThanThreshold"
  dimensions          = { ClusterName = each.value.cluster, ServiceName = each.key }
  alarm_actions       = [aws_sns_topic.alerts.arn]
}

# Memory > 80% for 5 min
resource "aws_cloudwatch_metric_alarm" "mem" {
  for_each            = var.services
  alarm_name          = "ecs-${each.key}-mem-high"
  namespace           = "AWS/ECS"
  metric_name         = "MemoryUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 1
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"
  dimensions          = { ClusterName = each.value.cluster, ServiceName = each.key }
  alarm_actions       = [aws_sns_topic.alerts.arn]
}

# Running tasks below desired — the incident signal that matters most
resource "aws_cloudwatch_metric_alarm" "running" {
  for_each            = var.services
  alarm_name          = "ecs-${each.key}-tasks-low"
  namespace           = "ECS/ContainerInsights"
  metric_name         = "RunningTaskCount"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 3
  threshold           = each.value.desired_count
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  dimensions          = { ClusterName = each.value.cluster, ServiceName = each.key }
  alarm_actions       = [aws_sns_topic.alerts.arn]
}

CPU/memory come from the AWS/ECS namespace (free vended metrics). RunningTaskCount comes from ECS/ContainerInsights — it needs Container Insights on the cluster.

Why Fargate monitoring is different from EC2

On Fargate you can't SSH to a host or run node_exporter. Every metric comes from CloudWatch or an in-task sidecar — Container Insights is the no-agent default, and it's metered per metric.

With EC2-backed ECS you own the instance, so you can install the CloudWatch agent, Prometheus node exporter, or any host-level tooling you like, and you pay for the instance either way. Fargate takes the host away. You get no shell, no daemon set, no privileged sidecar reading the host cgroup. What you get instead is CloudWatch: the free vended metrics (CPUUtilization, MemoryUtilization at the service level) and, when you turn it on, Container Insights.

This is genuinely convenient — no agent to patch, no version drift across a hundred tasks — but it changes the cost model. On EC2 your monitoring is bundled into the instance you already pay for. On Fargate, deeper visibility is a separate line item that scales with the number of tasks and containers you run. The rest of this guide is about spending that line item where it earns its keep and not where it doesn't.

KEY INSIGHT: The free service-level CPU and memory metrics in the AWS/ECS namespace are enough to alarm on. You do not need Container Insights to know a service is hot. You need Container Insights when you want to know which task or container inside that service is hot — and for RunningTaskCount and ephemeral-storage metrics.

The metrics that actually matter (and which to skip)

RunningTaskCount below desired catches more incidents than CPU. Alarm on running-task deficit, memory above 80%, and CPU above 90% — leave network and storage on a dashboard.

Most Fargate monitoring guides open with CPU and memory because those are the metrics everyone recognizes. But CPU is a slow-degradation signal — a throttled service is slow, not down. The signal that actually correlates with "customers are seeing errors" is a service that can't keep its desired number of tasks running: a crash loop, a failed health check, an image that won't pull. That shows up as RunningTaskCount dropping below DesiredTaskCount, long before CPU says anything.

Metric	Why it matters	Action
RunningTaskCount < desired	Service can't stay up — the #1 real incident signal	Alarm always
MemoryUtilization > 80%	OOM kills the task; no graceful degradation	Alarm always
CPUUtilization > 90%	Throttling — slow, not down	Alarm in prod
EphemeralStorageUtilized	Disk fills → task fails silently	Alarm if you write to disk
NetworkRx/TxBytes	Diagnostic, rarely actionable as an alarm	Dashboard only
DeploymentCount / TaskSetCount	Useful during blue/green, noisy otherwise	Dashboard only

Metrics are only half of observability; the other half is logs, and the two get correlated during an incident. If your log setup isn't solid, alarms just tell you something broke without telling you what. It's worth nailing down how to set up ECS logging the right way before you tune alarm thresholds, because a memory alarm with no readable logs behind it is a page you can't act on.

"EphemeralStorageReserved and EphemeralStorageUtilized ... are only available for tasks that run on Fargate Linux platform version 1.4.0 or later."

— AWS Container Insights ECS metrics, verified June 2026

What Container Insights actually costs at fleet scale

Enhanced observability bills $0.07 per metric per month across cluster, service, task-def, task, and container. At 30 services that's thousands of metrics — real money, not a rounding error.

Here's the arithmetic nobody shows you. AWS bills Container Insights metrics as custom metrics, and enhanced observability (released December 2, 2024) reports them at every level — a handful per cluster and per service, and then, critically, a set per task and per container. A count in the low teens per container feels harmless in isolation, until you multiply it by a real fleet. Containers are the multiplier that hurts: a 30-service fleet with a couple of containers per task is already at 180+ containers, each reporting its own metric set. The table below models that — the exact number moves with how many containers you pack per task, so treat it as an order-of-magnitude estimate, not a quote.

Fleet	Metrics reported	Cost / month	Cost / year
5 services	475	$33	$399
15 services	1,820	$127	$1,529
30 services	4,800	$336	$4,032

Modeled estimate. Enhanced observability at $0.07/metric/month (AWS CloudWatch pricing, verified June 2026); per-resource metric counts derived from the AWS enhanced-observability metric table, GPU metrics excluded. Your count varies with containers-per-task.
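
The cost columns are just the metric count times the per-metric rate, times twelve for the annual figure:

```latex
\begin{aligned}
475 \times \$0.07 &= \$33.25/\text{month} &\Rightarrow\ \$33.25 \times 12 &= \$399.00/\text{year} \\
1{,}820 \times \$0.07 &= \$127.40/\text{month} &\Rightarrow\ \$127.40 \times 12 &= \$1{,}528.80/\text{year} \\
4{,}800 \times \$0.07 &= \$336.00/\text{month} &\Rightarrow\ \$336.00 \times 12 &= \$4{,}032.00/\text{year}
\end{aligned}
```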

And that's just the metric bill. It's separate from log ingestion and storage, which the logging side of observability owns — CloudWatch Logs is $0.50/GB ingested and $0.03/GB stored, and the default Never-Expire retention means storage never stops growing. Watching that side is a different exercise; the mechanics of controlling CloudWatch Logs costs on ECS covers the retention and ingestion levers that stack on top of these metric numbers.

KEY INSIGHT: Standard and enhanced Container Insights bill the same $0.07 per metric. The difference is entirely how many metrics each reports — enhanced adds per-task and per-container granularity, which is exactly where the count (and the bill) explodes. So the question is never "standard or enhanced?" in the abstract. It's "on which resources is per-container granularity worth $0.07 times a dozen-odd metrics each?"

Turn enhanced observability ON in prod, OFF in dev

Dev environments don't need container-level metrics. Enable enhanced observability on prod clusters only; leave dev on standard or off. That one split roughly halves the Container Insights bill.

When something breaks in a dev environment, you redeploy it — you don't run a forensic investigation into which container held memory for three seconds too long. The per-container metrics that justify their cost in production are pure waste in dev, where the tasks are often idle or scaled to one. Yet the most common Container Insights mistake is flipping it on cluster-wide, in every account, and never revisiting it.

Because Container Insights is a per-cluster setting, the split is trivial to express — set it in the Terraform that every environment shares, keyed off the environment name:

resource "aws_ecs_cluster" "this" {
  name = "${var.environment}-cluster"

  setting {
    name = "containerInsights"
    # enhanced only in prod; dev/staging get standard (cheaper) or "disabled"
    value = var.environment == "prod" ? "enhanced" : "enabled"
  }
}

If your dev environments are genuinely throwaway, "disabled" is a defensible value for them — the free AWS/ECS service-level CPU and memory metrics still exist without Container Insights, so you keep your CPU and memory alarms. You lose RunningTaskCount and ephemeral-storage metrics in dev, which is usually an acceptable trade.

Alarm as code — the fleet for_each pattern

Hand-writing three alarms per service means 90 blocks at 30 services. A Terraform for_each over a service map creates CPU, memory, and running-task alarms for the whole fleet from one module.

The Ready-to-use block above is the whole pattern: a map of services, three for_each alarm resources, one SNS topic. Adding the thirty-first service is a one-line map entry, not three copied-and-tweaked alarm blocks that drift out of sync the moment someone edits one and forgets the rest. That drift is the real failure mode of click-ops and copy-paste monitoring — not that the alarms are wrong on day one, but that they're inconsistent by month six.

A few things the pattern gets right that hand-written alarms usually miss. The running-task alarm sets treat_missing_data = "breaching" — if the metric stops reporting entirely (a service deleted itself, or Container Insights got turned off), that should page you, not silently resolve. It uses evaluation_periods = 3 at 60-second periods so a single blip during a deploy doesn't fire. And every alarm points at one SNS topic, so routing to PagerDuty or Slack is one subscription, not thirty.

None of this is exotic — the CloudPosse ecs-cloudwatch-sns-alarms module wraps the same idea, and it's popular precisely because per-service alarm sprawl is a problem enough people hit to want it solved. Whether you use a module or the raw resources above, the principle is the same: the alarm definition lives in one place and applies to the fleet.

The newly-created-environment coverage gap

Alarms defined per service don't cover the environment someone spins up next week. Either enforce monitoring in the module every environment uses, or scope a Lambda to auto-alarm any tagged cluster.

Here's the failure that a service map quietly introduces: it only monitors the services in the map. When a developer spins up a preview environment, or a new team stands up a service in a fresh account, that workload is invisible until someone remembers to add it — and nobody remembers. You find the gap during the incident, when you go looking for the alarm that should have fired and it was never created.

There are two honest fixes. The first is to make monitoring non-optional in the shared module every environment is built from — if you can't create an ECS service without also creating its alarms, there's no gap to forget. The second, for teams whose environments aren't all built the same way, is AWS's own tag-scoped pattern: a Lambda that watches for new clusters and attaches a standard alarm set to any cluster carrying a given tag. AWS ships this because the gap is real enough to need a system, not discipline.
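
The first fix might look like the sketch below: a shared module where the service and its running-task alarm are declared side by side, so a caller cannot create one without the other. The module layout and every variable name are assumptions for illustration, and the service is trimmed to its required arguments.

```hcl
# modules/ecs-service/main.tf (hypothetical shared module)
variable "name" {
  type = string
}

variable "cluster_name" {
  type = string
}

variable "cluster_arn" {
  type = string
}

variable "task_definition_arn" {
  type = string
}

variable "desired_count" {
  type = number
}

variable "subnet_ids" {
  type = list(string)
}

variable "security_group_ids" {
  type = list(string)
}

variable "alerts_topic_arn" {
  type = string
}

resource "aws_ecs_service" "this" {
  name            = var.name
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = var.desired_count
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.subnet_ids
    security_groups = var.security_group_ids
  }
}

# Declared next to the service: there is no flag to turn it off.
resource "aws_cloudwatch_metric_alarm" "tasks_low" {
  alarm_name          = "ecs-${var.name}-tasks-low"
  namespace           = "ECS/ContainerInsights"
  metric_name         = "RunningTaskCount"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 3
  threshold           = var.desired_count
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  dimensions          = { ClusterName = var.cluster_name, ServiceName = aws_ecs_service.this.name }
  alarm_actions       = [var.alerts_topic_arn]
}
```

The design choice is the absence of an enable_alarms toggle: the moment monitoring is optional in the module, the coverage gap is back.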

KEY INSIGHT: Coverage is a fleet property, not a per-service one. The right question isn't "does this service have alarms?" — it's "can a service exist in this org without alarms?" If the answer is yes, your monitoring has a hole shaped like every environment you haven't manually added yet.

Beyond metrics: events and per-environment cost

Metrics tell you a task is unhealthy; EventBridge tells you why it stopped. Route ECS task-state-change events to SNS, and tag every service so per-environment cost shows up in Cost Explorer.

A metric alarm says RunningTaskCount dropped. It doesn't say the task was killed by an OOM, failed its health check, or couldn't pull its image. That reason lives in the ECS Task State Change event, which ECS emits to EventBridge for free. A single EventBridge rule matching stoppedReason and routing to the same SNS topic turns "a task stopped" into "a task stopped because OutOfMemoryError" — the difference between a page you can act on and one you have to investigate.
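As a rough illustration, here is one possible event pattern and a small formatter for the message you might publish to SNS. The pattern shown matches every stopped task; the field names (`lastStatus`, `stoppedReason`, `group`, `containers[].exitCode`, `containers[].reason`) follow the ECS Task State Change event shape, while the function name and the message layout are assumptions made for this sketch.

```python
# Event pattern for the EventBridge rule: every ECS task that reaches STOPPED.
EVENT_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {"lastStatus": ["STOPPED"]},
}


def summarize_stopped_task(event: dict) -> str:
    """Turn a Task State Change event into a short, actionable alert message."""
    detail = event["detail"]
    task_id = detail["taskArn"].rsplit("/", 1)[-1]
    lines = [
        f"Task {task_id} in {detail.get('group', 'unknown group')} stopped: "
        f"{detail.get('stoppedReason', 'no reason given')}"
    ]
    for container in detail.get("containers", []):
        # Only report containers that exited non-zero or carry an explicit reason.
        if container.get("exitCode") not in (None, 0) or container.get("reason"):
            lines.append(
                f"  container {container['name']}: "
                f"exit={container.get('exitCode')} "
                f"reason={container.get('reason', '-')}"
            )
    return "\n".join(lines)
```

A Lambda target that calls this formatter is one way to wire it up; pointing the rule straight at SNS with an input transformer would carry the same fields without any code.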

The other blind spot at fleet scale is cost per environment. CloudWatch shows you utilization, not dollars, and AWS bills Fargate at the account level — so a staging environment burning money looks identical to a busy prod one until you've tagged everything. Consistent Environment and Service tags on every task, activated as cost-allocation tags, are what make per-environment spend visible in Cost Explorer — and they're the same tags the Lambda coverage pattern keys off. Monitoring and cost attribution end up being the same tagging discipline.

If you read this, you might also want to know

Do I need Datadog if I already have Container Insights?

Not for infrastructure metrics — Container Insights covers CPU, memory, network, storage, and task counts across the fleet. You reach for Datadog (or an OpenTelemetry sidecar) when you need custom application metrics, distributed traces, or a single pane across non-AWS systems. Many teams run Container Insights for infra and a sidecar only on the handful of services that need APM.

How do I monitor a Fargate task that exits immediately on startup?

Metrics won't help — a task that dies in seconds never reports a meaningful data point. The signal is the ECS Task State Change event in EventBridge, which carries the stoppedReason (image pull failure, essential container exited, OOM). Route that event to SNS and read the reason; the logs, if the log driver was configured, hold the stack trace.

Can I alarm on a metric across all environments at once?

Yes, with a metric-math or aggregate alarm — for example, total RunningTaskCount across a cluster versus total DesiredTaskCount. It's useful for a fleet-health top-line, but keep the per-service alarms too: an aggregate stays green while one critical service is fully down, because the healthy services mask it.
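The masking effect is easy to see with numbers. The sketch below uses a made-up three-service fleet and a hypothetical 90% health threshold; it is plain arithmetic, not a CloudWatch metric-math expression.

```python
def aggregate_healthy(services, min_ratio=0.9):
    """Fleet-level check: total running tasks versus total desired tasks."""
    running = sum(s["running"] for s in services)
    desired = sum(s["desired"] for s in services)
    return desired == 0 or running / desired >= min_ratio


def services_down(services):
    """Per-service check: anything that wants tasks and has none."""
    return [s["name"] for s in services if s["desired"] > 0 and s["running"] == 0]


fleet = [
    {"name": "api", "running": 20, "desired": 20},
    {"name": "worker", "running": 30, "desired": 30},
    {"name": "billing", "running": 0, "desired": 2},   # fully down
]

print(aggregate_healthy(fleet))  # 50 of 52 tasks running, so the aggregate passes
print(services_down(fleet))      # the per-service view still names billing
```

Fifty of fifty-two tasks is about 96%, so the aggregate stays green while billing serves nothing.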

Frequently asked questions

How much does Container Insights cost for ECS?

$0.07 per metric per month with enhanced observability (verified June 2026). Metrics are reported at the cluster, service, task-definition, task, and container level, so the count scales with your fleet — a 30-service fleet with a couple of containers per task runs into thousands of metrics, pushing the bill into four figures a month if you enable it everywhere.
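To put numbers on that, here is the arithmetic at the $0.07 rate quoted above. The metric counts passed in are hypothetical inputs; your real count comes from the CloudWatch metrics console or your bill.

```python
import math

PRICE_CENTS_PER_METRIC = 7  # $0.07 per metric per month, as quoted above


def monthly_cost_dollars(metric_count: int) -> float:
    """Monthly Container Insights metric charge for a given metric count."""
    return metric_count * PRICE_CENTS_PER_METRIC / 100


def metrics_to_reach(dollars: int) -> int:
    """Smallest metric count whose monthly charge is at least `dollars`."""
    return math.ceil(dollars * 100 / PRICE_CENTS_PER_METRIC)


print(monthly_cost_dollars(5000))   # 350.0
print(metrics_to_reach(1000))       # 14286
```

So a four-figure bill corresponds to roughly 14,300 metrics or more, which is why the number of clusters you enable it on matters more than the per-metric price.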

Does ECS Fargate need a monitoring agent?

No. On Fargate you can't install a host agent, but you don't need one — Container Insights collects CPU, memory, network, and ephemeral-storage metrics with no sidecar once you enable it on the cluster. A sidecar (OpenTelemetry, Datadog) is only needed for custom application metrics or traces.

What ECS metric should I alarm on first?

RunningTaskCount below DesiredTaskCount. A running-task deficit means a service can't stay up — that catches more real incidents than CPU. Add memory utilization above 80% (out-of-memory kills) and CPU above 90% (throttling) after that.

How do I monitor ephemeral (disk) storage on Fargate?

Use the EphemeralStorageUtilized and EphemeralStorageReserved metrics. They only exist with Container Insights on Fargate platform version 1.4.0 or later — without Container Insights, disk usage on a Fargate task is invisible until the task fails writing to a full filesystem.

Is enhanced observability worth it over standard Container Insights?

In production, yes — enhanced observability adds per-task and per-container metrics that let you find which container in a task is the problem. In dev, no. Both bill $0.07/metric; enhanced just reports far more metrics per resource, so the cost difference is entirely about how many resources you enable it on.

Every ECS environment, one view

Container Insights and per-service alarms tell you about one cluster at a time. Fortem puts every ECS Fargate environment — health, running tasks, and idle spend — on one screen, so the gap-shaped hole isn't there to fall into. Book a 20-minute call and we'll walk your fleet.


Worth reading

Guide · How Should You Set Up ECS Logging? awslogs, FireLens, and the three decisions every ECS team gets wrong: the logs half of observability that your alarms correlate against.

Landing · AWS ECS Fargate: What It Is, How It Works, What It Costs. The head-term reference for ECS Fargate: how tasks, services, and clusters fit together, and where monitoring and cost sit in the bigger picture.

See every ECS environment in one place — book a 20-min call: fortem.dev/book

AWS ECS Fargate Security: What You Actually Configure (and What You Can't)

Matt — Thu, 02 Jul 2026 09:41:19 +0000

AWS ECS Fargate Security: What You Configure

Originally published at https://fortem.dev/blog/ecs-fargate-container-security
ECS Fargate security from the operator's seat: the two IAM roles, read-only root fs, secrets, awsvpc security groups, what Fargate won't let you configure, GuardDuty runtime, and fleet-scale drift.

Every "Fargate security best practices" list is the same checklist — non-root, read-only filesystem, least-privilege IAM. The list isn't wrong; it's just written for one task. If you run 10–40 ECS environments, the hard part isn't setting these knobs — it's keeping them identical everywhere and proving it for SOC 2. This is Fargate security from the operator's seat: the exact config, what Fargate won't let you touch, and where fleets actually break.

TL;DR

Fargate splits security into two IAM roles — the EXECUTION role (agent: pull image, ship logs, fetch secrets) and the TASK role (your app code). Confusing them is the #1 config mistake.
The Fargate hardening set is small and fixed: non-root user, readonlyRootFilesystem, secrets via Secrets Manager/SSM (never plaintext env vars), awsvpc security groups per task. Ephemeral storage is already AES-256 encrypted.
Fargate REMOVES options on purpose: no privileged containers, no host access, no SSH, CAP_SYS_ADMIN/NET_ADMIN blocked — and you can't run your own runtime agent (Falco DaemonSet). GuardDuty's injected sidecar is the only runtime-threat path.
GuardDuty ECS Runtime Monitoring bills per monitored vCPU-hour and won't attach to an already-running task — a real cost and coverage gap at 10+ environments.
At fleet scale the risk isn't the config, it's DRIFT: one env with a too-broad task role, a staging secret nobody rotated, a public-subnet task in environment #39.

Ready to use — copy this today

A hardened task-definition fragment, the writable /tmp mount that makes read-only root actually work, and a least-privilege task role scoped to one environment's resources, with no wildcards:

// 1. Hardened container definition (the fields that matter for security)
{
  "name": "app",
  "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/app:1.4.2",
  "user": "1000:1000",                    // non-root — matches USER in Dockerfile
  "readonlyRootFilesystem": true,          // root FS read-only
  "privileged": false,                     // (not supported on Fargate anyway)
  "secrets": [                             // secret VALUES never live here
    {
      "name": "DB_PASSWORD",
      "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-AbCdEf"
    }
  ],
  "mountPoints": [
    { "sourceVolume": "tmp", "containerPath": "/tmp", "readOnly": false }
  ],
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": { "awslogs-group": "/ecs/app", "awslogs-region": "us-east-1", "awslogs-stream-prefix": "app" }
  }
}

// 2. The writable volume that keeps read-only root from breaking the app.
// Declared at the task level; mounted at /tmp above. Everything else stays RO.
"volumes": [
  { "name": "tmp", "host": {} }
]

// 3. Least-privilege TASK role — scoped to ONE environment's bucket, no wildcards.
// This is the role your app code uses. Give each environment its own.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::acme-prod-uploads/*"
    }
  ]
}

The two IAM roles Fargate security starts with

Fargate uses two IAM roles: the execution role lets the agent pull the image, ship logs, and fetch secrets; the task role is what your app code uses to call AWS. Mixing them up is the top mistake.

The execution role is used by the Fargate agent to set the task up before your container runs — pull the image from ECR, write logs, and inject any secrets the task definition references. The task role is used by your application at runtime to reach S3, DynamoDB, and other services. Secret injection and ECR pulls are execution-role permissions; attaching them to the task role is the single most common config error, and a recurring AWS re:Post question.
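One way to catch the mix-up mechanically is to lint task-role policies for actions that only make sense on the execution role. The sketch below is deliberately narrow: it flags only the ECR pull actions, because an application can legitimately read secrets or write logs at runtime. The function name and the action list are choices made for this example, not an AWS-defined rule set.

```python
from fnmatch import fnmatch

# Actions the Fargate agent needs to pull an image. Application code rarely needs them.
ECR_PULL_ACTIONS = (
    "ecr:GetAuthorizationToken",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
)


def _as_list(value):
    """IAM allows a single string or a list for Statement and Action."""
    return value if isinstance(value, list) else [value]


def misplaced_on_task_role(policy: dict) -> list:
    """Return policy actions (including wildcards) that grant ECR pull permissions."""
    flagged = set()
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        for action in _as_list(statement.get("Action", [])):
            # IAM action matching is case-insensitive and supports * wildcards.
            if any(fnmatch(pull.lower(), action.lower()) for pull in ECR_PULL_ACTIONS):
                flagged.add(action)
    return sorted(flagged)
```

Run against every environment's task role, a non-empty result usually means someone fixed a pull failure on the wrong role.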

One thing worth internalizing: AWS is explicit that "containers are not a security boundary… each task running on Fargate has its own isolation boundary and does not share the underlying kernel, CPU, memory, or elastic network interface with another task." The isolation you rely on is the task, not the container. Scoping a separate least-privilege task role per environment is exactly the per-environment IAM isolation across the fleet that keeps one compromised task from reaching another environment's data.

The Fargate hardening set (the config that matters)

The Fargate hardening set is short: run as a non-root user, set readonlyRootFilesystem, drop unneeded capabilities. Ephemeral storage is already AES-256 encrypted, so that box is checked for you.

By default a container runs as root unless your Dockerfile has a USER directive. AWS's own guidance is to run as non-root and to lint Dockerfiles in CI, failing the build when the USER directive is missing. That turns "we run non-root" from a hope into a gate.

A container's root filesystem is writable by default; you set readonlyRootFilesystem: true explicitly. Ephemeral storage — the 20 GiB scratch space (up to 200 GiB) each task gets — is encrypted with AES-256 using a Fargate-managed key for any task launched since May 2020, so encryption-at-rest for scratch data needs no action from you. The read-only setting is the one that bites people, which is the next section.

Making read-only root filesystem actually work

readonlyRootFilesystem: true breaks any app that writes to /tmp. The fix is a writable volume mounted at /tmp (and any other write path) while the rest of the filesystem stays read-only.

This is the friction point behind the glib "just make the filesystem read-only" advice. The moment you flip it on, an app that writes a session file, a cache, or a temp upload to /tmp starts failing, and it fails as a confusing runtime crash, not as an obvious "permission denied on a read-only filesystem." It's a recurring ECS re:Post thread.

The fix is in the ready-to-use block above: declare an ephemeral volume at the task level and mount it at every path the app writes to (usually /tmp), keeping the rest of the root filesystem read-only. You end up declaring your writable surface explicitly, which is the whole security point — you now know exactly where the container can write.
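If you keep a list of the paths each app writes to, the "declare your writable surface" contract can be checked before deploy. A minimal sketch, assuming you supply that list yourself (`write_paths` is a hypothetical input; nothing in ECS records it for you):

```python
def unmounted_write_paths(container: dict, write_paths: list) -> list:
    """Paths the app writes to that no writable mount covers.

    Returns an empty list when readonlyRootFilesystem is off, since the
    whole root filesystem is writable in that case.
    """
    if not container.get("readonlyRootFilesystem"):
        return []
    writable_mounts = [
        m["containerPath"].rstrip("/")
        for m in container.get("mountPoints", [])
        if not m.get("readOnly", False)
    ]

    def covered(path: str) -> bool:
        # Covered if the path is a mount point or sits below one.
        return any(path == m or path.startswith(m + "/") for m in writable_mounts)

    return [p for p in write_paths if not covered(p)]
```

Anything this returns is a path that will fail at runtime once the root filesystem goes read-only, so it is a reasonable CI gate for task definitions.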

KEY INSIGHT: Read-only root isn't a toggle you flip and forget. It's a contract that forces you to declare every writable path. Test it before prod; a missing /tmp mount looks like a random app crash, not a security setting.

Secrets: the sanctioned path vs plaintext env vars

Never put secrets in a task definition's plaintext environment block. Reference them in the secrets block from Secrets Manager or SSM; the execution role fetches and injects them at launch.

Anything in the task definition's environment block is stored and shown in plaintext in the console and the API — a DB password there is visible to anyone with ecs:DescribeTaskDefinition. The secrets block instead holds a reference (an ARN); the execution role resolves it at launch and injects the value as an env var the app sees, without the value ever living in the task definition.

For that to work, the execution role needs secretsmanager:GetSecretValue (for Secrets Manager) or ssm:GetParameters (for Parameter Store), plus kms:Decrypt if the secret uses a customer-managed key. Note that it's the execution role, not the task role: the agent fetches the secret before your code runs.
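Because the needed permissions follow directly from the secrets block, the execution-role policy can be derived from the task definition instead of hand-written. The sketch below assumes every `valueFrom` is a full ARN (a bare Parameter Store name is rejected) and that at most one customer-managed KMS key is involved; the function names are made up for this example.

```python
def _service_and_base_arn(value_from: str):
    """Split a secrets-block reference into (service, resource ARN)."""
    parts = value_from.split(":")
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"expected a full ARN, got: {value_from}")
    service = parts[2]
    if service == "secretsmanager":
        # Drop any trailing :json-key:version-stage:version-id suffix.
        return service, ":".join(parts[:7])
    if service == "ssm":
        return service, ":".join(parts[:6])
    raise ValueError(f"unsupported secret source: {service}")


def execution_role_secret_policy(container_defs: list, kms_key_arn: str = None) -> dict:
    """Build a least-privilege policy covering only the secrets actually referenced."""
    resources = {"secretsmanager": set(), "ssm": set()}
    for container in container_defs:
        for secret in container.get("secrets", []):
            service, arn = _service_and_base_arn(secret["valueFrom"])
            resources[service].add(arn)

    statements = []
    if resources["secretsmanager"]:
        statements.append({"Effect": "Allow",
                           "Action": ["secretsmanager:GetSecretValue"],
                           "Resource": sorted(resources["secretsmanager"])})
    if resources["ssm"]:
        statements.append({"Effect": "Allow",
                           "Action": ["ssm:GetParameters"],
                           "Resource": sorted(resources["ssm"])})
    if kms_key_arn and statements:
        statements.append({"Effect": "Allow",
                           "Action": ["kms:Decrypt"],
                           "Resource": [kms_key_arn]})
    return {"Version": "2012-10-17", "Statement": statements}
```

Generating the policy per environment keeps each execution role scoped to that environment's secrets and nothing else.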

Network isolation with awsvpc and security groups

Fargate only runs in awsvpc mode, so every task gets its own ENI and IP and can carry its own security group. Scope security groups per task and keep tasks in private subnets with no public IP.

awsvpc is the only network mode available on Fargate, and it's the mode that gives each task a dedicated elastic network interface with its own private IP. That means a security group attaches to the task itself: you can scope ingress and egress per service instead of sharing one host's rules across everything, the way bridge mode forces on EC2.

The operational rule: tasks go in private subnets, reach ECR and Secrets Manager over a NAT gateway or VPC endpoints, and never get an auto-assigned public IP. The failure that undoes all of this is a single environment stood up from an older module that drops the task in a public subnet with a permissive security group. It's easy to miss when environments are built one at a time — the fleet-drift problem below.

What Fargate won't let you configure (vs EC2)

Fargate removes low-level controls on purpose: no privileged containers, no host access, no SSH, CAP_SYS_ADMIN and CAP_NET_ADMIN restricted. The only capability you can add is CAP_SYS_PTRACE.

A lot of "container security" advice assumes host-level control you simply don't have on Fargate. Privileged containers aren't supported, so Docker-in-Docker patterns don't run. Additional Linux capabilities like CAP_SYS_ADMIN and CAP_NET_ADMIN are restricted to prevent privilege escalation; only CAP_SYS_PTRACE can be added, for observability and security tooling inside the task. And there's no host access at all: ECS exec is the only sanctioned way to get a shell into a running container.

Control | EC2 launch type | Fargate
Privileged containers | Allowed | Not supported
Docker-in-Docker | Possible | Not possible
Host access / SSH | Yes (host + ECS exec) | No host; ECS exec only
Custom runtime agent (Falco) | DaemonSet on the host | None; GuardDuty sidecar
Linux capabilities | Add most caps | Only CAP_SYS_PTRACE addable
Kernel modules / sysctls | Host-level control | Locked down
Read-only root filesystem | Configurable | Configurable
Non-root user | Configurable | Configurable

The bottom two rows are the ones you still own. Everything above them is decided for you — which is the point of Fargate. The consequence that surprises people: you can't run your own runtime sensor, so runtime threat detection works differently.

Runtime threat detection = GuardDuty (there's no alternative)

Because you can't run your own agent on Fargate, GuardDuty ECS Runtime Monitoring is the only managed runtime-threat path. It injects a sidecar into each task and bills per monitored vCPU-hour.

On EC2 you'd run a Falco DaemonSet or a vendor agent on the host to watch process execution, file access, and network connections at runtime. On Fargate there's no host to put it on. AWS closes that gap with GuardDuty ECS Runtime Monitoring: when a task starts, GuardDuty attaches a managed security sidecar container to it. AWS is explicit that on Fargate you cannot manage that agent manually — GuardDuty is the only supported runtime path.

Two operational gotchas. First, a Fargate task is immutable, so GuardDuty won't attach the sidecar to a task that's already running: you stop and restart the task to bring it under monitoring. Second, it's billed per monitored vCPU-hour on a tiered rate (with a 30-day free trial), so across 10+ always-on environments it's a real, and often unpredictable, line item. Check the GuardDuty pricing page for your region's exact rate before you assume it's free.

Image supply chain — scanning and immutable tags

Secure the image before it runs: enable ECR scan-on-push, set tag immutability so a tag can't be silently overwritten, and fail the build on HIGH or CRITICAL CVEs. Basic scanning is free.

Runtime hardening only matters if the image itself is sound. ECR basic scanning is free, uses an AWS-native CVE database (the older Clair-based basic scanning was retired on October 1, 2025), and checks each image on push; enhanced scanning via Amazon Inspector goes deeper into OS and language packages. Tag immutability stops a second push of :v1.4.2 from silently replacing the bytes you already reviewed and deployed.

The registry is its own security surface — pull IAM, cross-account access, and lifecycle all matter at fleet scale. That's covered in depth in how ECR works for ECS Fargate teams; for security specifically, scan-on-push plus immutable tags is the baseline every repo should be born with.

Where fleets actually break — security drift at 10+ environments

At fleet scale the config is correct per-env but unaudited across envs. The real risks: one env with a too-broad task role, a secret rotated in prod but not staging, a public-subnet task missed.

Every control above is easy to set on one task. The problem is that a fleet of 10–40 environments is edited one task definition at a time, in isolation, and nothing shows you all of them at once. So the failures are failures of uniformity:

Task-role drift. Prod's role got scoped to least-privilege during the SOC 2 push; the six-month-old sandbox and demo-eu environments still carry a copy-pasted role with s3:* or a wildcard Resource. Nothing flags it, because each env's task def is edited on its own.
Secrets rotated unevenly. The DB credential is rotated in prod, but staging and qa still reference the old secret ARN — or worse, still carry the value in a plaintext environment var from before the migration. The secret is "handled" in one environment.
The public-subnet task nobody noticed. awsvpc and security groups are correct in 38 environments; #39 was stood up from an older Terraform module that drops the task in a public subnet with a permissive SG. It passes every single-env checklist, because that checklist only ever looks at one env.
Uneven read-only / non-root. readonlyRootFilesystem and the USER directive are enforced in the environments the platform team built, and skipped in the ones a product squad self-served. The hardening exists as a policy but not as a fact across the fleet.
GuardDuty on 30 of 40 clusters. Runtime Monitoring is on account-wide but excluded via tag on the clusters someone muted during a noisy-alert incident and never re-enabled. Ten environments have zero runtime detection, and no dashboard shows the gap.

Container security on Fargate is not hard to configure — it's hard to keep uniform and prove. The reframing from "is this task hardened?" to "is every environment hardened the same way?" is the whole job at 10+ environments, and it's the thing a per-task checklist structurally can't answer.
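The fleet-level question can be phrased as a diff against one baseline. This is a sketch under assumptions: the control names and the per-environment inventory are hypothetical, and collecting that inventory (from task definitions, service network config, IAM, and GuardDuty coverage) is the part left out.

```python
# The one hardening profile every environment is supposed to match (names are illustrative).
BASELINE = {
    "readonly_root_fs": True,
    "non_root_user": True,
    "public_ip": False,
    "wildcard_task_role": False,
    "guardduty_runtime": True,
}


def find_drift(fleet: dict, baseline: dict = BASELINE) -> dict:
    """Map each drifting environment to the controls where it differs from baseline.

    A control missing from an environment's record counts as drift: unknown
    is treated the same as wrong.
    """
    drift = {}
    for env, config in sorted(fleet.items()):
        failed = [control for control, expected in baseline.items()
                  if config.get(control) != expected]
        if failed:
            drift[env] = failed
    return drift


fleet = {
    "prod": dict(BASELINE),
    "sandbox": {**BASELINE, "wildcard_task_role": True},
    "env-39": {**BASELINE, "public_ip": True, "guardduty_runtime": False},
}
print(find_drift(fleet))
```

The output lists only the environments that deviate and which controls they miss, which is also the shape of evidence an auditor tends to ask for.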

How this ties into your SOC 2 controls

SOC 2 doesn't ask if you CAN secure a task — it asks you to prove the control holds in every environment. Fleet-wide uniformity is the audit evidence, and it breaks when envs are configured by hand.

Each item in the hardening set maps to a control an auditor will ask about: least-privilege access (the task role), encryption (ephemeral storage and secrets), change monitoring (who edited a task definition). The gap auditors find isn't that you can't do these things. It's the one environment where the control isn't applied. Drift is the finding they circle in red, and proving uniformity across the fleet is the evidence that closes it.

If you read this, you might also want to know

Do I need GuardDuty if I already scan images in ECR?

Yes — they cover different phases. ECR scanning finds known CVEs in the image before it runs (build-time / supply chain). GuardDuty Runtime Monitoring watches behavior while the container runs — process execution, file access, outbound connections — and catches things a clean image can still do at runtime, like credential theft or crypto-mining. Scanning is prevention; runtime monitoring is detection.

Can I run Falco or my own security agent on Fargate?

No. Falco and similar tools need host-level access (a DaemonSet or kernel module) that Fargate doesn't give you — there's no host to install them on. GuardDuty ECS Runtime Monitoring, which injects an AWS-managed sidecar, is the supported substitute. If a vendor claims Fargate runtime coverage, they're either using GuardDuty's feed or running in-task with the limited CAP_SYS_PTRACE capability.

Is the task role or the execution role the one that reads my secrets?

The execution role. The Fargate agent uses the execution role to fetch a secret referenced in the task definition's secrets block and inject it before your container starts, so the execution role needs secretsmanager:GetSecretValue or ssm:GetParameters. The task role is only for AWS calls your application makes at runtime.

Map every environment's security config in 5 min: fortem.dev/audit

AWS ECR: How Container Registry Works for ECS Fargate Teams

Matt — Tue, 30 Jun 2026 21:51:26 +0000

AWS ECR Guide for ECS Fargate Teams

Originally published at https://fortem.dev/blog/aws-ecr-guide
AWS ECR from the ECS Fargate operator's seat: how pulls work, the execution-role IAM, why private-subnet tasks fail, real pricing, and the lifecycle policy that cuts the bill.

Every ECS Fargate deploy pulls an image from ECR — and ECR is the part nobody owns until it breaks. A task in a private subnet throws ResourceInitializationError, or five years of untagged images quietly push the bill to $400/month. This is ECR from the ECS operator's seat: how pulls actually work, the IAM the execution role needs, what it costs at fleet scale, and the lifecycle, scanning, and replication settings that matter at 10+ environments — with the AWS-verified pricing nobody else itemizes.

TL;DR

ECR is AWS's managed container registry — the default image store for ECS and EKS. Registry → repository → image, with IAM-based access and a short-lived auth token per pull.
The #1 ECR failure on Fargate is a private-subnet task that can't pull: it needs either a NAT gateway or three ECR VPC endpoints, plus AmazonECSTaskExecutionRolePolicy on the execution role.
ECR storage is $0.10/GB-month; same-region pulls to Fargate are free. The hidden bill is old images — one team went from $400/mo to ~$15/mo with a 30-day lifecycle policy.
At fleet scale three settings matter: lifecycle policies (cost), scan-on-push (security), and cross-account replication (multi-account image distribution).
For ECR-heavy fleets in private subnets, VPC interface endpoints are often cheaper than routing every pull through a NAT gateway.

Ready to use — copy this today

Push an image, then a lifecycle policy that keeps the bill flat, then the exact networking + IAM a private-subnet Fargate task needs to pull:

# 1. Authenticate Docker to your private ECR registry, then push
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin \
      123456789012.dkr.ecr.us-east-1.amazonaws.com

docker tag my-app:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest

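// 2. Lifecycle policy: kill the hidden storage bill.
// Rule 1: drop untagged images after 1 day. Rules 2-4: drop non-prod after 30 days.
// One rule per prefix, because ECR selects only images that match ALL prefixes
// listed in a single tagPrefixList.
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images after 1 day",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 1
      },
      "action": { "type": "expire" }
    },
    {
      "rulePriority": 2,
      "description": "Expire dev images after 30 days",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["dev"],
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 30
      },
      "action": { "type": "expire" }
    },
    {
      "rulePriority": 3,
      "description": "Expire staging images after 30 days",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["staging"],
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 30
      },
      "action": { "type": "expire" }
    },
    {
      "rulePriority": 4,
      "description": "Expire PR images after 30 days",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["pr-"],
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 30
      },
      "action": { "type": "expire" }
    }
  ]
}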

# 3. The exact set a PRIVATE-subnet Fargate task needs to pull from ECR.
# Two interface endpoints (ecr.api, ecr.dkr) + an S3 gateway endpoint
# (ECR stores image layers in S3). No NAT gateway required.
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]   # allow TCP 443 from tasks
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.endpoints.id]
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = var.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = var.private_route_table_ids
}

# The task EXECUTION role needs the ECR pull actions (AmazonECSTaskExecutionRolePolicy covers these):
#   ecr:GetAuthorizationToken, ecr:BatchGetImage, ecr:GetDownloadUrlForLayer

What ECR actually is (for an ECS team)

ECR is AWS's managed container registry — the default image store for ECS and EKS. ECS pulls images at task launch using the execution role's IAM and a short-lived auth token.

AWS calls ECR "an extension of both services" — meaning if you run ECS Fargate, you already run ECR whether you think about it or not. Every task definition references an image, and that image almost always lives in a private ECR repository in your account. ECR is the boring dependency in the deploy path: invisible when it works, a production incident when it doesn't.

The generic "what is a container registry" explanation — it's a managed Docker registry, you push and pull over HTTPS, IAM controls access — is true and covered everywhere. The rest of this guide is the part that isn't: how ECR behaves from the seat of someone operating an ECS Fargate fleet, where the failures and the costs actually live.

Registry vs repository vs image — the model

One registry per account per region holds many repositories; each repository holds the tagged versions of one image, addressed as account.dkr.ecr.region.amazonaws.com/repo:tag.

The three nouns trip people up because AWS reuses "registry." Your registry is the per-account, per-region namespace. A repository holds one logical image — my-app — with all its tags and versions. An image is one immutable build, addressed by tag (:latest, :v2.3.1) or by digest.
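The address format maps cleanly onto those three nouns. A small parser makes the split explicit; the regex is a simplified sketch for private ECR URIs in the standard partition, not a complete validator of registry naming rules.

```python
import re
from typing import NamedTuple, Optional

_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[a-z0-9._/-]+?)"
    r"(?::(?P<tag>[\w.-]+))?"
    r"(?:@(?P<digest>sha256:[0-9a-f]{64}))?$"
)


class EcrImageRef(NamedTuple):
    account: str              # registry = account + region
    region: str
    repository: str           # one logical image, e.g. my-app
    tag: Optional[str]        # mutable pointer unless tag immutability is on
    digest: Optional[str]     # content address of one specific build


def parse_ecr_uri(uri: str) -> EcrImageRef:
    """Split a private ECR image URI into registry, repository, and image parts."""
    match = _ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"not a private ECR image URI: {uri}")
    return EcrImageRef(**match.groupdict())


ref = parse_ecr_uri("123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:v2.3.1")
print(ref.account, ref.region, ref.repository, ref.tag)
```

Parsing the image field of every task definition this way is a quick check for environments that pull from the wrong account or region.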

Repositories are private by default. Public repos exist (ECR Public, for distributing images to anyone), but an ECS fleet pulls from private repos in its own account — which is why every pull needs both a network path and IAM, the two things the next sections fix.

How an ECS Fargate task pulls an image

At launch, the Fargate agent uses the task EXECUTION role to pull the image before your container starts. If that pull can't reach ECR or lacks IAM, the task dies with ResourceInitializationError.

The single most useful distinction in ECS: the execution role and the task role are different identities. The execution role is what the Fargate agent uses to set the task up — pull the image from ECR, fetch secrets, write logs — all before your code runs. The task role is what your application uses at runtime to reach S3, DynamoDB, and other services. ECR pulls are an execution-role concern; putting ECR permissions on the task role is a common dead end.

The pull is the very first thing that happens. That's why an ECR problem shows up as a task that never starts, not an app error — and why it's worth knowing the execution role's exact permissions, which sit alongside every task-definition field including the execution role.

Why private-subnet tasks fail to pull (the #1 ECR error)

A private-subnet Fargate task has no path to ECR by default. It needs a NAT gateway or three VPC endpoints (ecr.api, ecr.dkr, S3 gateway) — missing them is the #1 ResourceInitializationError cause.

This is the failure that fills AWS re:Post: a task in a public subnet pulls fine, you move it to a private subnet for security, and deploys start dying with ResourceInitializationError: unable to pull image. The image didn't change; the network path disappeared. ECR lives on the public AWS network, and a private subnet has no route to it without help.

Two ways to give it a path. A NAT gateway routes the task's outbound traffic to the internet, where it reaches ECR: simple, one resource. Or three VPC endpoints: an interface endpoint for ecr.api (the API), an interface endpoint for ecr.dkr (the Docker registry), and a gateway endpoint for S3, because ECR stores the actual image layers in S3 and the pull still fails if the task can't reach S3 too. The endpoint security group must allow TCP 443 from the task's security group.

KEY INSIGHT: The forgotten third endpoint is S3. Teams add the two ECR interface endpoints, see pulls still fail, and assume ECR is broken. ECR hands the agent a pre-signed S3 URL for the layers — no S3 endpoint, no layers, ResourceInitializationError. All three or none.

The execution-role IAM ECR needs

The task execution role needs AmazonECSTaskExecutionRolePolicy — or the equivalent three ecr: pull actions. Without it the pull is denied even with correct networking.

Three actions do the whole pull. ecr:GetAuthorizationToken gets the short-lived token Docker uses to log in. ecr:BatchGetImage fetches the image manifest. ecr:GetDownloadUrlForLayer gets the pre-signed S3 URLs for each layer. AWS's managed AmazonECSTaskExecutionRolePolicy bundles all three plus the CloudWatch Logs permissions a task needs. Attach it to the execution role and the IAM half is done.

The diagnostic rule of thumb: if the pull fails the same way from every subnet, it's IAM; if it fails only from private subnets, it's networking. The two failures look identical in the task event log, so check both. Most wasted hours come from fixing the wrong half.
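Written as a decision function, the rule of thumb looks like this. It assumes you have tried the same task definition in both a public subnet (with a working route to ECR) and a private one; the labels are just strings chosen for this sketch.

```python
def classify_pull_failure(fails_in_public_subnet: bool, fails_in_private_subnet: bool) -> str:
    """Apply the subnet rule of thumb to an image-pull failure."""
    if fails_in_public_subnet and fails_in_private_subnet:
        return "iam"            # same failure everywhere: check the execution role
    if fails_in_private_subnet:
        return "networking"     # only private subnets: NAT or VPC endpoints missing
    if fails_in_public_subnet:
        return "inconclusive"   # unusual; check the public subnet's route and public IP
    return "no failure"
```

The value is in forcing the second test run: one failing subnet alone cannot tell you which half to fix.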

NAT gateway vs VPC endpoints — the cost decision

A NAT gateway is simplest but bills $0.045/hr plus $0.045/GB. For a fleet pulling constantly across many private subnets, three ECR VPC endpoints are often cheaper and keep pulls private.

The naive choice is a NAT gateway — it's one resource and it fixes the pull. But a NAT gateway is a per-environment fixed cost (roughly $32/month each before data), and image pulls are data-heavy: every task launch drags layers through it at $0.045/GB. A fleet that scales up and down all day, pulling on every launch, can run a surprising NAT data bill that's really just ECR traffic.

VPC interface endpoints have their own hourly cost, but pull traffic over them stays on the AWS network and avoids the NAT per-GB charge. For an ECR-heavy fleet — many environments, frequent launches — the endpoints usually win, and they remove image pulls as a reason your tasks ever touch the internet. This is the same fixed-vs-usage tradeoff that runs through the real per-environment cost of Fargate including NAT.
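A rough way to compare the two for one environment. The NAT rates are the ones quoted above; the interface-endpoint rates ($0.01 per endpoint per AZ-hour and $0.01 per GB) are assumptions for this sketch, so check the PrivateLink pricing page for your region. The model also ignores that layer downloads go through the S3 gateway endpoint, which has no hourly charge, so it likely overstates endpoint data cost.

```python
HOURS_PER_MONTH = 730

NAT_HOURLY = 0.045        # from the figures above
NAT_PER_GB = 0.045        # from the figures above
ENDPOINT_HOURLY = 0.01    # ASSUMPTION: per interface endpoint, per AZ
ENDPOINT_PER_GB = 0.01    # ASSUMPTION: per GB processed by an interface endpoint


def nat_monthly(gb: float) -> float:
    """Monthly cost of one NAT gateway carrying `gb` of pull traffic."""
    return NAT_HOURLY * HOURS_PER_MONTH + NAT_PER_GB * gb


def endpoints_monthly(gb: float, azs: int = 2, interface_endpoints: int = 2) -> float:
    """Monthly cost of the ecr.api + ecr.dkr interface endpoints across `azs` zones."""
    fixed = interface_endpoints * azs * ENDPOINT_HOURLY * HOURS_PER_MONTH
    return fixed + ENDPOINT_PER_GB * gb


def break_even_gb(azs: int = 2, interface_endpoints: int = 2) -> float:
    """Monthly GB above which endpoints are cheaper than one NAT gateway."""
    fixed_gap = endpoints_monthly(0, azs, interface_endpoints) - nat_monthly(0)
    if fixed_gap <= 0:
        return 0.0  # endpoints already cheaper before any traffic
    return fixed_gap / (NAT_PER_GB - ENDPOINT_PER_GB)
```

Under these assumed rates, two endpoints in two AZs cost less than one NAT gateway even with no traffic, while a three-AZ setup breaks even at roughly 313 GB of pulls per month. A NAT gateway you need anyway for other outbound traffic changes the comparison, since its fixed cost is then already paid.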

What ECR actually costs

ECR storage is $0.10/GB-month; same-region pulls to Fargate are free; cross-region transfer is $0.09/GB. New accounts get 500 MB free for a year. The real bill is accumulated old images.

Line item	Cost	Notes
Private storage	$0.10 / GB-month	The line that grows with old images
Pull to Fargate/ECS (same region)	free	No transfer charge in-region
Cross-region transfer out	$0.09 / GB	Replication / multi-region pulls
Free tier (new accounts)	500 MB / mo	Private storage, first 12 months

Verified against the AWS ECR pricing page (July 2026). Note what's NOT here: pulling to your Fargate tasks in the same region costs nothing. So the bill isn't your deploys; it's storage that only ever grows. Which is the next section.

The hidden ECR bill — old images, and the fix

Untagged and stale images pile up invisibly. One team paid $400/month for five years of old images; a lifecycle policy (untagged after 1 day, non-prod after 30 days) dropped it to ~$15/month.

Every CI run pushes a new image. Every push of :latest orphans the previous one as an untagged layer. None of it is ever deleted unless you say so, and nobody opens the ECR console to look. So storage compounds quietly: a few GB a month becomes hundreds of GB over a few years, and the only signal is a bill line that's slowly climbed.

Lifecycle policies fix it for free. You write rules (by tag, age, or count) and ECR deletes the rest automatically. The pair in the ready-to-use block above covers most teams: expire untagged images after 1 day (the orphaned :latest layers nobody references), and expire dev/staging/PR images after 30 days (so a paused project stops billing). That single pair is the whole $400-to-$15 swing in the documented case above. Test rules in the console's dry-run preview before applying: a too-aggressive prod rule that deletes an image a service still references is its own incident.
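To see what those bill figures imply, the arithmetic can be sketched with the $0.10/GB-month storage rate quoted earlier. The steady-state model is a simplification that assumes a constant push rate and a fixed amount of new layer data per push; both parameters are hypothetical.

```python
STORAGE_PER_GB_MONTH = 0.10  # private storage rate quoted in the text


def stored_gb_for_bill(monthly_bill: float) -> float:
    """How many GB a given monthly storage bill implies."""
    return monthly_bill / STORAGE_PER_GB_MONTH


def steady_state_gb(pushes_per_day: float, new_gb_per_push: float,
                    retention_days: int) -> float:
    """GB held once a lifecycle rule expires images after `retention_days`."""
    return pushes_per_day * new_gb_per_push * retention_days


def monthly_cost(gb: float) -> float:
    """Monthly storage cost for `gb` of images."""
    return gb * STORAGE_PER_GB_MONTH


if __name__ == "__main__":
    print(stored_gb_for_bill(400), stored_gb_for_bill(15))
    print(monthly_cost(steady_state_gb(10, 0.2, 30)))
```

At that rate the $400 bill corresponds to roughly 4,000 GB of images and the $15 bill to roughly 150 GB, which is why retention, not push volume, decides the number.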

Image scanning — catching vulnerabilities on push

ECR scan-on-push checks each new image for known CVEs as a per-repository setting. It's the cheapest first line of container vulnerability detection — one toggle per repo, free on the basic tier.

Turn on scan-on-push and every image gets checked against a CVE database the moment it lands, with results you can read in the console or via the API. Basic scanning is free and historically used the open-source Clair CVE database (AWS now describes it as native technology over the same CVE data); enhanced scanning (powered by Amazon Inspector) goes deeper into OS and language packages and bills per image. For a team heading into a SOC 2 audit, scan-on-push is the cheapest box to tick — it turns "do you scan container images" from a project into a per-repo toggle.

Distributing images across accounts

Multi-account ECS fleets need the same image in every account. Two clean paths: ECR cross-account replication (push once, pull locally) or a shared repository policy scoped to named accounts.

Once prod, staging, and dev live in separate AWS accounts, one image built in a CI account has to reach all of them. Replication copies the image into a local repo in each destination account, so every pull is in-account and in-region (and free). A shared repository keeps one copy and grants pull access via a repository policy naming the specific accounts. The shortcut to avoid is granting access to your whole Organization with aws:PrincipalOrgID — it works, but it opens the repo to every account in the Org, which is a finding an auditor circles in red. The full cross-account image-distribution mechanics live in the multi-account operating model.
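A shared-repository policy scoped to named accounts might look like the sketch below. The account IDs, `Sid`, and function name are placeholders; the action list is the usual set of ECR pull actions rather than anything specified in this article.

```python
import json

# Pull-only actions granted to the consuming accounts.
PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def repo_pull_policy(account_ids: list[str]) -> dict:
    """Build a repository policy that names each consuming account."""
    if not account_ids:
        raise ValueError("name at least one account instead of using a wildcard")
    for account_id in account_ids:
        if not (account_id.isdigit() and len(account_id) == 12):
            raise ValueError(f"not a 12-digit account ID: {account_id!r}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowNamedAccountsPull",  # hypothetical name
                "Effect": "Allow",
                "Principal": {
                    "AWS": [f"arn:aws:iam::{a}:root" for a in account_ids]
                },
                "Action": PULL_ACTIONS,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account IDs for staging and prod.
    print(json.dumps(repo_pull_policy(["111111111111", "222222222222"]), indent=2))
```

Naming accounts explicitly means adding an environment is a deliberate policy change, which is the property the Organization-wide grant gives up.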

Lifecycle, immutability, pull-through cache — fleet settings

Three registry settings matter at scale: lifecycle policies (cost), tag immutability (no silent :latest overwrites), and pull-through cache (mirror public images to dodge rate limits).

Tag immutability stops a second push of :v1.2.0 from silently replacing the first, so a tag always points at the exact bytes you deployed, which matters for rollbacks and audits. Pull-through cache mirrors an upstream public registry (Docker Hub, the ECR Public gallery) into your private registry on first pull, then keeps it fresh. It dodges Docker Hub rate limits that randomly fail deploys and keeps base-image pulls in-account. And lifecycle policies, from the cost section, are the third: the one setting whose absence shows up on the bill.

The fleet default

Set all three as defaults in your account baseline (or a repository-creation template) so every new repo is born with a lifecycle policy, immutable tags, and the pull-through cache rule — instead of each team rediscovering the $400 bill on its own.

If you read this, you might also want to know

Do I need a NAT gateway if I use ECR VPC endpoints?

Not for ECR pulls — the ecr.api + ecr.dkr + S3 endpoints give a private-subnet task everything it needs to pull. You still need a NAT gateway (or other endpoints) if the task itself reaches other internet services at runtime. Many teams drop NAT to endpoints-only and cut both cost and internet exposure.

What's the difference between the task role and the execution role for ECR?

The execution role is used by the Fargate agent to set the task up — pull the image from ECR, fetch secrets, write logs — before your container runs. The task role is used by your application code at runtime. ECR pulls are always an execution-role permission; putting ecr:* on the task role does nothing for the pull.

Does scan-on-push cost extra?

Basic scanning (Clair-based, OS packages) is free. Enhanced scanning via Amazon Inspector — which adds OS and programming-language package CVEs and continuous re-scanning — bills per image scanned. Most teams start with free basic scanning and upgrade specific repos to enhanced when compliance requires it.

Can two AWS accounts share one ECR repository?

Yes — attach a repository policy that grants the ECR pull actions to the specific account IDs that need it, and they pull cross-account. The cleaner pattern at scale is cross-account replication (each account pulls from its own local copy, free and in-region). Avoid granting access to the whole Organization via aws:PrincipalOrgID.

Map your fleet's ECR + NAT spend in 5 min: fortem.dev/audit

ECS Fargate Autoscaling: Target Tracking, Step, and Why It Doesn't Scale When You Expect

Matt — Tue, 30 Jun 2026 11:17:00 +0000

ECS Fargate Autoscaling: Target Tracking & Step Scaling

Originally published at https://fortem.dev/blog/ecs-fargate-autoscaling
ECS Fargate autoscaling explained: target tracking, step scaling, the right cooldowns, and the five reasons it doesn't scale when you expect — per the AWS docs.

Guide

You set a CPU target, autoscaling “works” — until a traffic spike it reacts to too slowly, or a service that quiets down won't scale back in. Autoscaling sits on top of the ECS Fargate service and task primitives, and its dynamic scaling follows rules most tutorials skip. This guide covers the three policy types, the settings that matter, and the five reasons it doesn't scale when you expect — each backed by the AWS docs, not by “set the target to 50% and hope.”

TL;DR

Target tracking is the right default: pick one metric (CPU, memory, or ALB requests per target), set one target, and AWS creates and manages the alarms.
It scales out fast and in slow on purpose — the managed alarms need ~3 minutes above target to add tasks, ~15 minutes below to remove them.
Five things break it: scale-in is slow by design, scale-in is off during deployments, ALB request count isn't supported on blue/green, gaps in the metric block scale-in, and hand-edited managed alarms misbehave.
When target tracking is too slow for bursts, add a step scaling policy for the spike and keep target tracking for steady state — they coexist.

Quick answer: For most ECS Fargate services, use target tracking on CPU at 50% (or ALBRequestCountPerTarget for request-driven apps), with a ~60s scale-out and ~300s scale-in cooldown. AWS creates and manages the CloudWatch alarms, so don't edit them. Scale-out happens after ~3 minutes above target; scale-in after ~15 minutes below, and it's turned off entirely during a deployment. If that's too slow for sudden spikes, add a step scaling policy on a steeper alarm and keep target tracking for steady state.

Ready to use — copy this today

Target tracking on CPU at 50% for one Fargate service, with sensible cooldowns. Register a scalable target, attach the policy, set min/max. Drop it into your Terraform and the service scales itself.

# Register the ECS service as a scalable target
resource "aws_appautoscaling_target" "svc" {
  service_namespace  = "ecs"
  resource_id        = "service/${var.cluster}/${var.service}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2     # floor — never below this
  max_capacity       = 20    # ceiling — caps your worst-case cost
}

# Target tracking on average CPU at 50%
resource "aws_appautoscaling_policy" "cpu" {
  name               = "${var.service}-cpu-target"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.svc.service_namespace
  resource_id        = aws_appautoscaling_target.svc.resource_id
  scalable_dimension = aws_appautoscaling_target.svc.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 50.0
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    scale_out_cooldown = 60    # add tasks quickly
    scale_in_cooldown  = 300   # remove tasks slowly — avoids thrashing
  }
}

For request-driven services, swap the metric for ALBRequestCountPerTarget with a resource_label pointing at your ALB target group. Don't hand-edit the CloudWatch alarms this creates; AWS manages them.
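The resource label is the load balancer's ARN suffix joined to the target group's ARN suffix. As a rough sketch, assuming standard ALB and target-group ARNs (the ARNs and the function name below are made-up examples), it can be derived like this:

```python
def alb_resource_label(alb_arn: str, target_group_arn: str) -> str:
    """Join the ALB and target-group ARN suffixes into a resource label."""
    # An ALB ARN ends with ":loadbalancer/app/<name>/<id>".
    alb_suffix = alb_arn.split(":loadbalancer/", 1)[1]
    # A target-group ARN ends with ":targetgroup/<name>/<id>".
    tg_suffix = "targetgroup/" + target_group_arn.split(":targetgroup/", 1)[1]
    return f"{alb_suffix}/{tg_suffix}"


if __name__ == "__main__":
    # Example ARNs with placeholder account and resource IDs.
    alb = ("arn:aws:elasticloadbalancing:us-east-1:123456789012:"
           "loadbalancer/app/my-alb/778d41231b141a0f")
    tg = ("arn:aws:elasticloadbalancing:us-east-1:123456789012:"
          "targetgroup/my-tg/943f017f100becff")
    print(alb_resource_label(alb, tg))
```

In Terraform the same two pieces are typically available as the `arn_suffix` attributes of the load balancer and target group resources, so no string parsing is needed there.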

The three ways ECS scales (and which to use)

ECS has three modes: target tracking (hold a metric at a target), step scaling (tiered per-alarm adjustments), and scheduled (calendar). For load, target tracking is the default; step handles bursts.

All three run on AWS Application Auto Scaling, which adjusts your service's desired task count. They answer different questions. Target tracking asks “keep this metric here.” Step scaling asks “when this alarm breaks by this much, add this many tasks.” Scheduled asks “at this time, set capacity to this.”

Policy	What it does	When to use
Target tracking	Keep a metric (CPU, memory, ALB requests) at one target value	Steady load — the default
Step scaling	Tiered task adjustments per alarm-breach size	Sudden spikes, custom thresholds
Scheduled	Set capacity by date/time (cron)	Known calendar patterns — covered separately

This guide is about scaling to load — target tracking and step. Scheduled scaling is a different job: turning environments off on a calendar to cut idle spend. If that's what you're after, the full mechanics of scheduled scaling to stop environments off-hours live in their own guide — we won't repeat them here.

Target tracking — the default, and its three metrics

Target tracking holds one of three predefined metrics at a target — CPU, memory, or ALB requests per target — and AWS creates and manages the CloudWatch alarms. You set the number; it does the rest.

It works like a thermostat. You pick a number; the auto scaler adds or removes tasks to keep the metric near it. The three metrics fit different services:

CPU (ECSServiceAverageCPUUtilization): the safe default for compute-bound services. Works everywhere, but it's a proxy: CPU can sit low while the service is still slow on I/O.
MEM (ECSServiceAverageMemoryUtilization): for memory-bound workloads. Risky as a sole metric: many apps hold memory flat and never trigger a scale-in.
ALB (ALBRequestCountPerTarget): the best signal for request-driven APIs. It scales on actual load, not a proxy. Caveat below.

The big convenience: target tracking removes the need to define alarms by hand. AWS builds two — a high alarm to scale out and a low alarm to scale in — and tunes them as load shifts.

KEY INSIGHT: Do not edit or delete the CloudWatch alarms that target tracking creates. Service Auto Scaling owns them — it adjusts them as your load changes and deletes them when you delete the policy. Hand-editing them looks fine until the next adjustment silently reverts your change — and then scaling misbehaves with no obvious cause.
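To make the thermostat analogy concrete, here is a simplified proportional model of how a target-tracking policy sizes a service. AWS does not publish its exact algorithm, so treat this as an approximation for building intuition, not a description of the service's internals; the function name is hypothetical.

```python
import math


def proportional_desired(current_tasks: int, metric: float, target: float,
                         min_tasks: int, max_tasks: int) -> int:
    """Estimate the task count that would bring `metric` back to `target`.

    Assumes load spreads evenly, so the metric scales inversely with
    the number of tasks.
    """
    raw = math.ceil(current_tasks * metric / target)
    # Never leave the range set on the scalable target.
    return max(min_tasks, min(max_tasks, raw))


if __name__ == "__main__":
    # 4 tasks at 75% CPU with a 50% target, bounded to 2..20 tasks.
    print(proportional_desired(4, 75.0, 50.0, 2, 20))
```

The clamp at the end is why min_capacity and max_capacity matter as much as the target value: they bound whatever the policy computes.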

Why it doesn't scale when you expect (5 failure modes)

Five reasons: scale-in is blocked during deployments, the scale-out/scale-in timing asymmetry, ALB request count isn't supported on blue/green, insufficient data never scales in, and editing the managed alarms breaks it.

Most autoscaling problems aren't bugs — they're documented behavior that surprises you at the wrong moment. Here's the catalog, with the symptom you'll see, the cause, and the fix.

1. My service won't scale in

Cause — Scale-in is conservative by design. The managed low alarm typically needs ~15 consecutive minutes below the threshold before removing tasks, while scale-out fires after ~3 minutes above. So a service that quiets down still runs extra tasks for a quarter of an hour.

Fix — Accept it for steady services, or — if you need faster, asymmetric scale-in — disable scale-in on the target tracking policy and add a custom step scaling policy with your own thresholds. Step scaling trades away some of target tracking's churn protection for control.

2. Nothing scaled in during my deployment

Cause — Application Auto Scaling turns off scale-in while an ECS deployment is in progress. Scale-out still happens (unless suspended), but tasks added under load mid-deploy won't be removed until the deployment finishes.

Fix — Expected behavior — let the deployment finish, scaling resumes after. If you also want to suspend scale-out during deploys, set DynamicScalingOutSuspended on the scalable target, then clear it when the deploy completes.

3. ALBRequestCountPerTarget scaling does nothing on blue/green

Cause — ALBRequestCountPerTarget is not supported for the blue/green deployment type. The policy exists but never drives scaling.

Fix — Use CPU or memory target tracking on blue/green services, or scale on request count only on rolling-update services. Don't mix the unsupported metric with blue/green and assume it works.

4. A service with gaps in its metrics never scales in

Cause — Target tracking does not scale in on insufficient data — it refuses to read missing datapoints as 'low utilization', to protect availability. A service with gaps in its metric stream stays at its current task count.

Fix — Make sure the metric reports continuously (a healthy service emits CPU/memory every minute). For request count, ensure the ALB target group is receiving traffic the policy can read.

5. Scaling went weird after someone 'fixed' an alarm

Cause — Someone hand-edited the CloudWatch alarm target tracking manages. The next automatic adjustment reverts or conflicts with the change, and scaling behaves unpredictably.

Fix — Never touch the managed alarms. Change behavior through the policy (target value, cooldowns) instead. If you need custom alarm logic, use step scaling, where you own the alarms outright.
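Three of the behaviors in this catalog (the 3-minute/15-minute asymmetry, scale-in being off during deployments, and missing data never counting as low) can be captured in a toy decision function. It is a model for reasoning, not AWS's implementation: the real low alarm usually sits somewhat below the target, and the function and parameter names are invented here.

```python
def scaling_decision(cpu_by_minute, target, deploying=False,
                     out_minutes=3, in_minutes=15):
    """Return 'out', 'in', or 'hold' for the latest minute.

    cpu_by_minute: one value per minute, most recent last.
    None marks a missing datapoint.
    """
    recent_out = cpu_by_minute[-out_minutes:]
    if len(recent_out) == out_minutes and all(
        v is not None and v > target for v in recent_out
    ):
        return "out"  # scale-out still fires, even mid-deployment

    if deploying:
        return "hold"  # scale-in is turned off during a deployment

    recent_in = cpu_by_minute[-in_minutes:]
    if len(recent_in) < in_minutes or any(v is None for v in recent_in):
        return "hold"  # insufficient data is never read as low utilization
    if all(v < target for v in recent_in):
        return "in"
    return "hold"
```

Feeding it fourteen quiet minutes returns "hold" and fifteen returns "in", which is the quarter-hour wait described in the first failure mode.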

Cooldowns and thrashing — the settings that matter

Scale-out cooldown ~60s keeps you responsive; scale-in ~300s prevents thrashing. Too short a scale-in cooldown thrashes tasks; a CPU target too high (80%) leaves no headroom to warm up.

The cooldown is how long Service Auto Scaling waits for a scaling action to take effect before doing more. The two directions want different values, for different reasons.

Setting	Default	Why
Metric + target	CPU at 50% (or ALB requests/target)	Leave headroom for new tasks to warm up
Scale-out cooldown	~60 sec	Stay responsive under rising load
Scale-in cooldown	~300 sec	Prevent thrashing on dips
Min / max tasks	Set both deliberately	Max caps cost; min holds a floor

Why asymmetric: scaling out should be quick — under rising load you want capacity now, so a short ~60s cooldown is fine. Scaling in should be slow — pull tasks too eagerly and a brief dip removes capacity you need 90 seconds later, so the service adds it back, then removes it again. That cycle is thrashing, and a ~300s scale-in cooldown is what stops it.
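The effect of the scale-in cooldown can be illustrated with a small filter over requested scale-in times. This is a deliberately simplified model (real cooldowns also interact with scale-out activity), and the function name and timestamps are made up.

```python
def apply_scale_in_cooldown(scale_in_requests, cooldown_s):
    """Return the request timestamps (in seconds) that actually execute.

    A request is dropped if it arrives before the cooldown from the
    last executed scale-in has elapsed.
    """
    executed = []
    for t in sorted(scale_in_requests):
        if not executed or t - executed[-1] >= cooldown_s:
            executed.append(t)
    return executed


if __name__ == "__main__":
    # A metric dipping every 90 seconds asks to scale in five times.
    dips = [0, 90, 180, 270, 360]
    print(apply_scale_in_cooldown(dips, 60))
    print(apply_scale_in_cooldown(dips, 300))
```

With a 60-second cooldown every dip removes capacity; with 300 seconds only the first and last do, which is the thrash the longer value is there to absorb.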

On the target value: 50% is a sane default, not 80%. A high target means tasks only get added once the service is already near saturation, and new Fargate tasks take 30–90 seconds to start and warm up. By the time they're ready, the spike has already hurt latency. Lower target, more headroom, smoother scaling.
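The headroom argument is easy to quantify. Assuming CPU rises roughly in proportion to load (an assumption, and not true of every service), the sketch below gives the load multiple a service can absorb before existing tasks saturate.

```python
def spike_multiplier_to_saturation(target_pct: float) -> float:
    """Load multiple at which tasks sitting at `target_pct` reach 100% CPU."""
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in (0, 100]")
    return 100.0 / target_pct


if __name__ == "__main__":
    for target in (50.0, 80.0):
        print(target, spike_multiplier_to_saturation(target))
```

At a 50% target the service can take a doubling of load while new tasks start; at 80% it saturates at 1.25 times the load.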

When target tracking is too slow: add step scaling

Target tracking reacts on ~3-minute datapoints, too slow for sudden spikes. Add a step scaling policy on a steeper alarm to jump capacity fast; keep target tracking for steady state — they coexist.

Target tracking is smooth but deliberate. For a service that goes from quiet to flooded in seconds — a flash sale, a batch kickoff, an SQS backlog — three-minute datapoints mean you're already dropping requests before it reacts. Step scaling fixes that: you define explicit thresholds (“CPU over 70% → add 4 tasks; over 90% → add 8”) and it jumps capacity the moment the alarm breaks.
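The example thresholds above map onto a simple lookup. This is only a sketch of the tiering idea with hypothetical names; in an actual step scaling policy the step bounds are expressed relative to the alarm threshold rather than as absolute CPU values.

```python
# (CPU % that must be exceeded, tasks to add), highest tier first.
STEPS = [(90.0, 8), (70.0, 4)]


def step_adjustment(cpu_pct: float) -> int:
    """Return how many tasks to add for the observed CPU percentage."""
    for lower_bound, add_tasks in STEPS:
        if cpu_pct > lower_bound:
            return add_tasks
    return 0  # below every tier: the step policy does nothing


if __name__ == "__main__":
    for cpu in (60.0, 75.0, 95.0):
        print(cpu, step_adjustment(cpu))
```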

You don't have to choose. A service can run both: target tracking for the steady baseline and a step policy for the spike. When you have multiple policies, Service Auto Scaling prioritizes availability — it scales out if any policy says to, and scales in only if all of them agree. So the aggressive step policy can add capacity fast without the cautious target policy ever fighting it.
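One way to picture the "any policy scales out, all must agree to scale in" rule is to take the largest capacity any policy asks for. The sketch below is a mental model with invented names, not the service's code.

```python
def combined_desired(current: int, policy_targets: list[int]) -> int:
    """Resolve several policies' desired counts in favor of availability.

    Taking the maximum means one policy asking for more tasks wins,
    while a scale-in only happens when every policy is below `current`.
    """
    if not policy_targets:
        return current
    return max(policy_targets)


if __name__ == "__main__":
    # Step policy wants 10, target tracking wants 4, service runs 6.
    print(combined_desired(6, [10, 4]))
```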

KEY INSIGHT: The tradeoff with step scaling: you own the alarms, which means you also own the churn. Target tracking has built-in protections against rapid up-down cycling; step scaling does not. Use step for the burst, keep target tracking carrying the steady state, and you get fast reaction without hand-managing thrash control.

What this looks like across a fleet

One service's autoscaling is a Terraform block. At 10+ services across environments, you maintain scalable targets, policies, and per-service tuning — a surface that grows with every environment.

Autoscaling on one service is easy. The problem is multiplication. Each service needs its own scalable target, its own metric choice, its own cooldowns, and its own min/max — and the right values differ by service and by environment. A dev environment shouldn't scale to 20 tasks; production shouldn't cap at 4. Keeping that tuned by hand across a fleet is the work nobody budgets for.

It compounds with the costs you're already carrying. Autoscaling controls compute, but every environment also pays the fixed overhead each environment already carries — ALB, NAT Gateway, CloudWatch — which autoscaling can't touch. Scaling well is only part of running a fleet economically.

Fortem doesn't replace autoscaling — your policies keep doing their job. It gives you one place to see and tune scaling, scheduling, and cost across every ECS environment, so per-service drift doesn't pile up as your fleet grows.

If you read this, you might also want to know

Should I scale on CPU or memory?

CPU is the safer default — most services are compute-bound, and memory often holds flat (so it never triggers scale-in). Use memory only when you know the service is memory-bound, and even then pair it with a CPU or request-count policy so the service can still scale down. For request-driven APIs, ALBRequestCountPerTarget beats both — it scales on real load, not a proxy.

Can ECS Fargate scale to zero?

Yes, with a target tracking policy and min_capacity = 0. When capacity is 0 and the metric shows demand, Service Auto Scaling waits for one datapoint, scales out by the minimum amount, then resumes normal scaling from the actual running count. It's useful for spiky non-prod or batch services — but cold-start latency on the first request after zero is the tradeoff.

Does autoscaling fight my manual desired-count changes?

Yes. As long as an active scaling policy and alarm exist on the service, Service Auto Scaling can override a desired count you set by hand. If you need to pin capacity temporarily — say, during an incident — suspend scaling on the scalable target rather than fighting it with manual updates.

Common questions

Why won't my ECS Fargate service scale in?

Three common reasons. (1) Scale-in is conservative by design — the managed alarm typically needs ~15 consecutive minutes below the threshold before removing tasks, versus ~3 minutes to scale out. (2) Application Auto Scaling turns off scale-in entirely while an ECS deployment is in progress. (3) Target tracking treats insufficient metric data as 'do not scale in', so a service with gaps in its metric never scales down. If you need faster or asymmetric scale-in, disable scale-in on the target tracking policy and add a custom step scaling policy.

What's the difference between target tracking and step scaling for ECS?

Target tracking keeps a metric at a target value (like CPU at 50%) and AWS creates and manages the CloudWatch alarms for you — it's the easiest mode and the right default for steady load. Step scaling defines explicit alarm thresholds and how many tasks to add or remove at each breach level, so it reacts faster to sudden spikes. They can coexist: target tracking for steady state, a step policy for bursts.

Can ECS Fargate autoscale on ALB request count?

Yes — ALBRequestCountPerTarget is one of the three predefined target tracking metrics, alongside ECSServiceAverageCPUUtilization and ECSServiceAverageMemoryUtilization. It's often the best signal for request-driven services because it scales on actual load, not on a proxy like CPU. One caveat: ALBRequestCountPerTarget is not supported for the blue/green deployment type.

What cooldown should I use for ECS autoscaling?

Sensible defaults: a short scale-out cooldown (~60 seconds) to stay responsive, and a longer scale-in cooldown (~300 seconds) to prevent thrashing — tasks being added and removed repeatedly. Pair that with a CPU target around 50%, not 80%: too high a target leaves no headroom for new tasks to warm up before the metric spikes again.

Does ECS autoscaling work during a deployment?

Partly. Application Auto Scaling turns off scale-in processes while an ECS deployment is in progress, but scale-out continues unless you suspend it. So a service can still add tasks under load mid-deploy, but won't remove them until the deployment finishes. This does not apply to services using an external deployment controller.


Worth reading

Landing: ECS Environment Scheduling. The other half of scaling: stop non-prod environments off-hours on a calendar. Every scheduling approach and what breaks at fleet scale.
Guide: ECS Fargate Best Practices: Running a Fleet of 10+ Environments. Naming, fixed overhead, retention, Spot, quota isolation: the checklist for teams past ten environments, with real numbers.

Map your fleet in 5 min: fortem.dev/audit

How Do You Set Up RBAC on ECS Fargate Without Breaking Prod?

Matt — Tue, 30 Jun 2026 11:16:55 +0000

ECS Fargate RBAC: Scope Developer Access Safely

Originally published at https://fortem.dev/blog/ecs-fargate-rbac
IAM has no concept of an ECS environment. Build per-environment RBAC with ABAC tags — the working policy, the four ways it silently breaks prod, and where AWS-native IAM hits its ceiling.

Use Case · June 30, 2026 · 10 min read

You're the single human gate for ECS ops: developers ship through CI, but they can't restart staging or read a log without pinging you. You want to hand them scoped access — their environments, never prod — and every attempt is an IAM policy that grants too much or too little. This is the working RBAC model: an ABAC policy that scopes developers by environment tag, the four ways AWS-native IAM silently breaks prod, and where the ceiling is.

TL;DR

ECS has no "environment" concept in IAM. You build per-environment RBAC with tags (ABAC): a developer's principal tag must match the resource's Environment tag.
The working policy gates ecs:UpdateService / StopTask / DeleteService on aws:ResourceTag/Environment, and CreateService / RunTask on aws:RequestTag/Environment.
The trap: some ECS List actions ignore tag conditions entirely. Granted open to "*" they leak prod metadata — and PassRole lets a developer escalate past their environment.
On EC2 launch type, ECScape shows a low-privilege task can steal another task's credentials (instance-level isolation). Fargate's micro-VM isolation closes this — a real reason to be on Fargate.
AWS-native ABAC holds to about ten environments, then tag discipline and policy sprawl become the bottleneck. That's where a per-environment RBAC layer earns its keep.

ECS has no idea what an "environment" is

IAM has no native concept of an ECS environment. You simulate it: tag every resource with an Environment key, tag each developer's principal, and write policies that only allow an action when the two tags match.

IAM thinks in ARNs and tags, not environments. There is no ecs:Environment you can grant a developer "staging" access to. So you build the abstraction yourself with attribute-based access control (ABAC): give every cluster, service, and task an Environment tag, give each developer's IAM role a matching principal tag, and condition every policy on the two being equal.

The alternative — listing cluster ARNs in the Resource block — works for three clusters and collapses at thirty. The simplest version of this, letting a developer restart staging with a policy matched to a *-stg-* cluster pattern, gets you started. ABAC is where it goes when you have real environments and real developers — and where it stays safe past the first handful.

Ready to use: the ABAC policy that scopes by environment

Gate ecs:UpdateService, StopTask, and DeleteService on aws:ResourceTag/Environment matching the developer's principal tag; gate RunTask and CreateService on aws:RequestTag/Environment. Add an explicit Deny on prod.

Tag the developer's IAM role with Environment=staging, tag every staging resource the same, and this policy lets them operate staging and nothing else. The ${aws:PrincipalTag/Environment} variable means one policy serves every environment — the match is dynamic.

Ready to use — copy this today

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "OperateOwnedEnvironment",
      "Effect": "Allow",
      "Action": [
        "ecs:UpdateService",
        "ecs:StopTask",
        "ecs:DeleteService"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/Environment": "${aws:PrincipalTag/Environment}"
        }
      }
    },
    {
      "Sid": "CreateTaggedToOwnedEnvironment",
      "Effect": "Allow",
      "Action": [
        "ecs:CreateService",
        "ecs:RunTask"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/Environment": "${aws:PrincipalTag/Environment}"
        }
      }
    },
    {
      "Sid": "PassOnlyOwnedEnvironmentRoles",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::*:role/ecs/${aws:PrincipalTag/Environment}-*"
    },
    {
      "Sid": "NeverProd",
      "Effect": "Deny",
      "Action": "ecs:*",
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/Environment": "prod" }
      }
    }
  ]
}

Why the explicit Deny

The NeverProd statement is a backstop, not the primary control. The Allow statements already scope to the developer's own environment, but an explicit Deny on the prod tag wins over any Allow, anywhere, including a future policy someone attaches by mistake. Defense in depth costs one short statement.
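The interaction between the scoped Allow and the NeverProd Deny can be modeled in a few lines. This is a toy evaluator for a single action on a single resource, with hypothetical names; real IAM evaluation involves more policy types and condition operators than shown here.

```python
def string_equals(actual, expected):
    """IAM-style StringEquals: a missing key (None) never matches."""
    return actual is not None and expected is not None and actual == expected


def is_allowed(principal_env, resource_env, broader_allow=False):
    """Model OperateOwnedEnvironment plus NeverProd for one ECS call.

    principal_env: the caller's Environment principal tag, or None.
    resource_env:  the resource's Environment tag, or None if untagged.
    broader_allow: True if some other attached policy allows the action.
    """
    deny = string_equals(resource_env, "prod")                # NeverProd
    scoped_allow = string_equals(resource_env, principal_env)  # scoped Allow
    # An explicit Deny always beats any Allow.
    return (scoped_allow or broader_allow) and not deny


if __name__ == "__main__":
    print(is_allowed("staging", "staging"))
    print(is_allowed("staging", "prod", broader_allow=True))
    print(is_allowed("staging", None, broader_allow=True))
```

The last call is the case to watch: an untagged resource matches neither condition, so a stray broader Allow goes through with nothing to stop it.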

Which ECS actions actually respect your tags

Not every ECS action honors tag conditions. StopTask, DeleteService, and UpdateService respect aws:ResourceTag; CreateService and RunTask respect both. But List actions like ListClusters and ListServices ignore tags entirely — scope those another way or they leak prod.

This is the table that decides whether your policy is airtight or quietly open. If you gate an action on aws:ResourceTag that the action doesn't populate, the StringEquals condition evaluates to false and the statement simply doesn't grant: it fails closed. The leak is the other direction: to make the tag-blind calls (the List actions, plus any Describe action that doesn't support resource tags) usable at all, you end up granting them in a separate statement with no condition, and that open grant is what reaches every environment:

Action	Resource tag	Request tag	Scope it by
ecs:StopTask	yes	n/a	scope by resource tag
ecs:UpdateService	yes	n/a	scope by resource tag
ecs:DeleteService	yes	n/a	scope by resource tag
ecs:CreateService	yes	yes	tag at create time
ecs:RunTask	yes	yes	tag at create time
ecs:DescribeServices	yes	n/a	scope by resource tag
ecs:ListClusters	no	n/a	no tag conditions; leaks prod metadata
ecs:ListServices	no	n/a	no tag conditions; scope another way

For the actions that ignore tags (the account-level List calls) your only levers are separate AWS accounts or not granting them. A read-only ecs:ListClusters open to * won't let a developer change prod, but it shows them every environment's name and shape, metadata a strict RBAC model is supposed to withhold.

The four ways this breaks prod

RBAC breaks prod four ways: actions that ignore tags, PassRole escalation, untagged resources defaulting open, and — on EC2 — ECScape credential theft. Each is a specific misconfiguration, not bad luck.

1. Actions that ignore tags. List and Describe calls that don't carry an Environment tag can't be scoped by one. Your tag condition fails closed, so to make them usable you grant them with no condition, and that open statement reaches every environment's metadata. Audit which actions you've granted unconditioned against the support table above.
2. PassRole escalation. A developer who can RegisterTaskDefinition and iam:PassRole a powerful role can launch a task that runs as that role, and reach anything that role can. Scope iam:PassRole to only the task and execution roles for their environment, never a wildcard. This is the non-obvious one, and the most dangerous.
3. Untagged resources slip the Deny. A condition on aws:ResourceTag/Environment only matches resources that HAVE the tag, so an untagged cluster matches neither your scoped Allow nor your NeverProd Deny. It won't be reachable through the ABAC policy, but if any broader Allow exists, an untagged prod resource is no longer blocked by the Deny that was supposed to catch it. Enforce tagging at creation with a RequestTag condition or an SCP.
4. ECScape on EC2 launch type. On EC2, a low-privilege task can steal the IAM credentials of a more privileged task on the same instance. Your task-level RBAC is real on paper and bypassed in practice. Fargate closes this, as the next section explains.

None of these throw an error when you deploy the policy. They're visible only when you test — or when the audit happens. The audit trail that proves who did what in CloudTrail is how you catch the escalation after the fact; the policy is how you prevent it.
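Because nothing fails at deploy time, a static check over the policy document is one way to catch the first two failure modes early. The sketch below flags Allow statements that grant ECS actions with no condition and iam:PassRole on a wildcard resource. The function names and sample policy are hypothetical, and a real audit would need to handle more policy shapes (NotAction, multiple attached policies, and so on).

```python
def _as_list(value):
    """IAM allows a single string or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def audit_policy(policy: dict) -> list[str]:
    """Return human-readable findings for risky Allow statements."""
    findings = []
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        sid = stmt.get("Sid", "<no Sid>")
        actions = _as_list(stmt.get("Action", []))
        resources = _as_list(stmt.get("Resource", []))
        if "iam:PassRole" in actions and "*" in resources:
            findings.append(f"{sid}: iam:PassRole on a wildcard resource")
        if any(a.startswith("ecs:") for a in actions) and "Condition" not in stmt:
            findings.append(f"{sid}: ECS actions granted with no condition")
    return findings


# Hypothetical policy with one scoped and two risky statements.
SAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Scoped",
            "Effect": "Allow",
            "Action": ["ecs:UpdateService"],
            "Resource": "*",
            "Condition": {"StringEquals": {
                "aws:ResourceTag/Environment": "${aws:PrincipalTag/Environment}"
            }},
        },
        {"Sid": "OpenList", "Effect": "Allow",
         "Action": ["ecs:ListClusters", "ecs:ListServices"], "Resource": "*"},
        {"Sid": "PassAny", "Effect": "Allow",
         "Action": "iam:PassRole", "Resource": "*"},
    ],
}

if __name__ == "__main__":
    for finding in audit_policy(SAMPLE_POLICY):
        print(finding)
```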

Why Fargate makes this safer than EC2

On EC2 launch type, ECScape showed a low-privilege task can steal another task's IAM credentials — isolation is instance-level, not task-level. Fargate runs each task in its own micro-VM, so task-level RBAC actually holds.

The ECScape research is worth understanding because it undercuts an assumption most RBAC models rest on: that a task's IAM role is isolated to that task. On EC2, it isn't. Multiple tasks share one host, ECS delivers their credentials over a channel on that host, and a compromised low-privilege container can impersonate the agent and intercept the credentials destined for every other task on the instance. Your carefully scoped per-task roles become a shared pool.

On Fargate, each task gets its own micro-VM with isolated credentials and its own metadata endpoint, so there is no co-tenant to steal from. The mitigations on EC2 (block container access to IMDS at 169.254.169.254, run privileged tasks on separate instances) are things you don't have to think about on Fargate. If your RBAC model assumes task-level isolation, Fargate is the launch type that makes the assumption true.

KEY INSIGHT: RBAC is only as strong as the isolation underneath it. A perfect ABAC policy on EC2 launch type can still be bypassed at the credential layer; the same policy on Fargate holds because the micro-VM boundary is real. The access model and the isolation model have to agree.

Where AWS-native RBAC hits its ceiling

ABAC works until tag discipline fails. Past about ten environments, every new env needs the tag applied everywhere, every policy re-audited, every untagged resource hunted down. IAM has no per-environment role concept to lean on.

The ABAC model is correct and it scales — until the thing it depends on, perfect tagging, stops being free. At ten environments, one missing Environment tag is a hole. At thirty, finding the missing tag is its own job. You end up writing SCPs to enforce tagging, AWS Config rules to flag untagged resources, and a runbook for onboarding each new environment into the policy set — maintaining the simulation of something IAM was never designed to model.

That's the point where a per-environment RBAC layer earns its keep: instead of hand-maintaining ABAC tags and policies, you grant a developer a role on the environments they own — restart, redeploy, read logs, run a one-off task — and prod is off-limits by construction, not by a condition key you hope is correct. The AWS-native approach is the right place to start; it's not the right place to be at fleet scale.

If you read this, you might also want to know

Can I scope ECS access by cluster instead of tags?

Yes, with a Resource block listing cluster ARNs or a StringLike condition on the cluster name (e.g. *-stg-*). It's simpler to reason about for a few clusters, but you edit every policy each time you add an environment. ABAC tags avoid that — at the cost of needing every resource tagged.

Does ECS Exec respect the same RBAC?

ECS Exec (the shell-into-a-container feature) is gated by ecs:ExecuteCommand, which supports the same aws:ResourceTag/Environment condition — so you can scope exec to a developer's own environment the same way. Many teams forget to scope it and leave a path into prod containers wide open.

How do I stop developers from escalating via PassRole?

Scope iam:PassRole to a path or ARN pattern that only covers their environment's roles (e.g. role/ecs/staging-*), never a wildcard. Without it, a developer who can register a task definition can pass any role they're allowed to pass and run a task as it — escalating straight past the environment boundary.
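A scoped PassRole statement can be sketched as a small policy builder. This is an illustration under assumptions, not the article's own policy: the account ID, the `ecs/staging-` prefix, and the function name are placeholders, and the `iam:PassedToService` condition is an extra restriction added here so the role can only be handed to ECS tasks.

```python
import json


def passrole_statement(account_id: str, role_path_prefix: str) -> dict:
    """Allow iam:PassRole only for roles under one environment's prefix."""
    return {
        "Sid": "PassOnlyEnvironmentRoles",
        "Effect": "Allow",
        "Action": "iam:PassRole",
        # The only wildcard is the trailing one after the environment prefix.
        "Resource": f"arn:aws:iam::{account_id}:role/{role_path_prefix}*",
        "Condition": {
            "StringEquals": {"iam:PassedToService": "ecs-tasks.amazonaws.com"}
        },
    }


policy = {
    "Version": "2012-10-17",
    "Statement": [passrole_statement("111122223333", "ecs/staging-")],
}
print(json.dumps(policy, indent=2))
```

A developer holding this statement can register a task definition that uses a staging role, but a task definition naming a prod role fails the PassRole check at registration.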

Do I need separate AWS accounts for hard RBAC boundaries?

For the strongest boundary, yes — an account boundary is the one thing a tag typo can't cross, and it's the only way to fully contain the tag-ignoring List/Describe actions. Most teams keep prod in its own account and share a non-prod account, using ABAC tags for per-environment scoping within each.

Book a 20-min fleet walkthrough: fortem.dev/book

How Do You Prepare ECS Fargate for a SOC 2 Audit?

Matt — Tue, 30 Jun 2026 11:16:14 +0000

How to Prepare ECS Fargate for SOC 2 Compliance

Originally published at https://fortem.dev/blog/ecs-compliance-soc2
AWS being SOC 2 certified doesn't make you compliant. The exact ECS Fargate task-definition settings an auditor flags — ECS.4, ECS.5, ECS.8, ECS.20 — and the copy-paste fixes.

Use Case · June 29, 2026 · 10 min read

A prospect won't sign without your SOC 2 Type II report. You bought Vanta, the dashboard lit up red — and half the failing controls point at your ECS task definitions. AWS being SOC 2 certified does not make you compliant — it covers the cloud; you own what runs in it. This is the ECS-specific remediation: the exact Fargate settings an auditor's tooling flags, the copy-paste fixes, and why the real work of Type II is six months of evidence, not a one-day config sprint.

TL;DR

AWS's SOC 2 covers the cloud (Artifact report, under NDA). Your ECS task definitions, IAM, and logging are yours to prove — that's the shared responsibility line.
Security Hub flags specific ECS controls: ECS.4 (non-privileged), ECS.5 (read-only root FS), ECS.8 (no secrets in env vars), ECS.9 (logging), ECS.20 (non-root user), ECS.2 (no public IP).
Most ECS findings map to two Trust Services Criteria: CC6 (logical access) and CC7 (monitoring). Fix the task-def parameters and they clear.
Type II is a ~6-month observation window. The hard part isn't the one-time fix — it's continuous evidence the controls held across every environment, every day.
SOC 2 does NOT require EKS-style admission controllers or a service mesh for ECS. Don't over-build controls the auditor never asked for.

AWS is SOC 2 certified — so why aren't you?

AWS's SOC 2 covers security of the cloud — datacenters, hypervisor, hardware. You own security in the cloud: your ECS task definitions, IAM, and logging. The auditor only tests your half.

The single most common SOC 2 misconception on AWS is that running on a SOC 2-certified platform makes you SOC 2-compliant. It doesn't. The shared responsibility model splits the control set: AWS proves the infrastructure is secure, and you prove that what you run on it is secure. You can download AWS's own SOC 2 report from AWS Artifact (the self-service portal in the console, gated behind an NDA) and hand it to your auditor. That lets them _inherit_ AWS's infrastructure controls and stop looking at the datacenter.

What's left is your half — and for an ECS shop, your half is mostly task definitions, IAM scoping, and logging. The auditor will also want the audit trail of who changed what across your fleet — CloudTrail is the evidence source for that control. The rest of this guide is the ECS-specific part nobody else writes down.

| AWS proves (inherited) | You prove (tested) |
| --- | --- |
| Datacenter physical security | Task-def hardening (ECS.4/.5/.20) |
| Hypervisor & host patching | IAM scoping & least privilege |
| Fargate platform isolation | Secrets handling (SSM / Secrets Manager) |
| Network backbone | Logging & monitoring (ECS.9/.12) |

Ready to use: a SOC 2-clean task definition

This Fargate task definition clears the Security Hub ECS controls that flag a task — the high-severity ones (ECS.4, ECS.5, ECS.8, ECS.9, ECS.2) plus the medium-severity non-root and Container Insights checks. Each hardened line is annotated with the control ID it satisfies.

The whole point: every flagged setting is a task-definition parameter, not an application rewrite. Drop this into your shared ECS module and every environment inherits the same clean baseline.

Ready to use — copy this today

resource "aws_ecs_task_definition" "app" {
  family                   = "use1-prod-main-app"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "512"
  memory                   = "1024"
  execution_role_arn       = aws_iam_role.task_exec.arn
  task_role_arn            = aws_iam_role.task.arn

  container_definitions = jsonencode([{
    name  = "app"
    image = "${aws_ecr_repository.app.repository_url}:${var.image_tag}"

    user                   = "10001"   # ECS.20 — non-root Linux user
    privileged             = false     # ECS.4  — no elevated privileges
    readonlyRootFilesystem = true      # ECS.5  — read-only root filesystem

    # ECS.8 — secrets via valueFrom, never plain environment vars
    secrets = [{
      name      = "DB_PASSWORD"
      valueFrom = aws_secretsmanager_secret.db.arn
    }]

    # ECS.9 — task definition must declare a log configuration
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = "/ecs/use1-prod-main-app"
        "awslogs-region"        = "us-east-1"
        "awslogs-stream-prefix" = "app"
      }
    }
  }])
}

# ECS.2 — never auto-assign a public IP (ECS.16 is the task-set equivalent)
resource "aws_ecs_service" "app" {
  name            = "use1-prod-main-app"
  task_definition = aws_ecs_task_definition.app.arn
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.private_subnet_ids
    assign_public_ip = false
  }
}

If your container needs scratch space

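readonlyRootFilesystem = true breaks apps that write to /tmp. Don't revert the control. Mount a writable volume at the paths that need it: on Fargate that means a bind-mount volume backed by the task's ephemeral storage, because the tmpfs container parameter is not supported there (it is available on the EC2 launch type). The root filesystem stays read-only; the auditor still passes ECS.5; your app still writes its temp files.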
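A minimal sketch of the writable-scratch fix, expressed against the task-definition JSON shape rather than the Terraform above. It assumes Fargate, where the tmpfs container parameter is unavailable and a volume declared with only a name is a bind mount on the task's ephemeral storage. The helper name and the `scratch` volume name are hypothetical.

```python
import copy


def add_scratch_volume(task_def: dict, container_name: str,
                       path: str = "/tmp", volume_name: str = "scratch") -> dict:
    """Return a copy of task_def with a writable volume mounted at path."""
    updated = copy.deepcopy(task_def)
    # A volume with only a name and no host path is a bind mount.
    updated.setdefault("volumes", []).append({"name": volume_name})
    for container in updated["containerDefinitions"]:
        if container["name"] == container_name:
            container.setdefault("mountPoints", []).append(
                {"sourceVolume": volume_name, "containerPath": path, "readOnly": False}
            )
            return updated
    raise ValueError(f"no container named {container_name!r}")


task_def = {
    "family": "use1-prod-main-app",
    "containerDefinitions": [{"name": "app", "readonlyRootFilesystem": True}],
}
hardened = add_scratch_volume(task_def, "app")
print(hardened["containerDefinitions"][0]["mountPoints"])
```

In the Terraform resource the same two pieces would be a `volume` block on the task definition and a `mountPoints` entry inside the container definition; readonlyRootFilesystem stays true throughout.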

The ECS controls an auditor's tooling flags

Security Hub checks specific ECS controls: ECS.4 non-privileged, ECS.5 read-only root filesystem, ECS.8 no credential keys in env vars, ECS.9 logging, ECS.20 non-root user, ECS.2 / ECS.16 no public IP.

These are the AWS-published Security Hub CSPM checks for ECS. An auditor — or your own CSPM dashboard, or Vanta/Drata pulling from Security Hub — surfaces failures by control ID. Each row below is a task-definition parameter and the value that fails it:

| Control | Fails when | The fix | TSC |
| --- | --- | --- | --- |
| ECS.4 | privileged = true | privileged: false | CC6 |
| ECS.5 | readonlyRootFilesystem false / absent | readonlyRootFilesystem: true | CC6 |
| ECS.20 | Linux user is root or unset | user: a non-root UID | CC6 |
| ECS.2 | AssignPublicIp ENABLED | assignPublicIp: DISABLED (assign_public_ip = false in Terraform) | CC6 |
| ECS.8 | AWS credential key in environment vars | secrets via valueFrom (SSM / Secrets Manager) | CC6 |
| ECS.9 | no logConfiguration | logConfiguration with awslogs / FireLens | CC7 |
| ECS.12 | Container Insights off on cluster | enable Container Insights | CC7 |

The TSC column is the Trust Services Criterion each control maps to — explained below. AWS tags these controls against NIST 800-53 and PCI; the SOC 2 mapping is the one your auditor draws, and it's worth handing them pre-drawn.
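The container-level rows of that table can be approximated as a local lint you run before Security Hub does. This is a simplified sketch, not Security Hub's actual evaluation logic: the function name is hypothetical, and the environment-variable names checked for ECS.8 are the ones I understand the control to look for, so treat that list as an assumption. ECS.2 and ECS.12 are service and cluster settings and are out of scope here.

```python
# Assumed list of credential variable names flagged by ECS.8.
CREDENTIAL_ENV_NAMES = {
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "ECS_ENGINE_AUTH_DATA",
}


def failing_controls(container: dict) -> list[str]:
    """Return the control IDs a single container definition would likely fail."""
    failures = []
    if container.get("privileged") is True:
        failures.append("ECS.4")
    if container.get("readonlyRootFilesystem") is not True:
        failures.append("ECS.5")
    # user may be "uid" or "uid:gid"; only the user part matters here.
    user = str(container.get("user", "")).split(":")[0]
    if user in ("", "root", "0"):
        failures.append("ECS.20")
    env_names = {e.get("name") for e in container.get("environment", [])}
    if env_names & CREDENTIAL_ENV_NAMES:
        failures.append("ECS.8")
    if not container.get("logConfiguration"):
        failures.append("ECS.9")
    return failures


hardened = {
    "name": "app",
    "user": "10001",
    "privileged": False,
    "readonlyRootFilesystem": True,
    "logConfiguration": {"logDriver": "awslogs"},
}
print(failing_controls(hardened))
print(failing_controls({"name": "legacy"}))
```

Running a check like this in CI against every rendered task definition catches a regression on the day it is introduced rather than when the finding appears in the dashboard.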

Fix it in the task definition

Every high-severity ECS finding is one task-definition parameter: readonlyRootFilesystem, user, privileged, secrets via valueFrom, logConfiguration, assignPublicIp. Set them once in your module.

None of these are application changes. They're container-definition fields — the same ones covered in every task-definition field and the common mistakes. The remediation that trips teams up isn't writing the values; it's applying them to every task definition across every environment without missing one.

KEY INSIGHT: Set the hardened parameters as module-level defaults, not per-service overrides. A shared ECS module that bakes in readonlyRootFilesystem = true, a non-root user, and assign_public_ip = false means every new environment is compliant the day it's created, and you can't forget the setting on environment number twelve.

Map ECS controls to Trust Services Criteria

Most ECS findings map to two criteria: CC6 logical access (non-root, no privilege, no public IP) and CC7 monitoring (logging, Container Insights, GuardDuty). Your auditor wants the mapping spelled out.

SOC 2's Common Criteria are where ECS lives. CC6 (Logical and Physical Access Controls — the logical half is yours) is the bucket for least privilege: ECS.4 (no privileged containers), ECS.5 (read-only root filesystem), ECS.20 (non-root user), and ECS.2 / ECS.16 (no public IP) all reduce the access an attacker gets if a container is compromised. CC7 (System Operations) is monitoring: ECS.9 (logging configured), ECS.12 (Container Insights), and GuardDuty Runtime Monitoring give you the detection the criterion asks for. CC6 also covers human access — auditors expect per-environment RBAC scoping developer access so no one but the right people can touch a given environment.

Handing your auditor the control-to-criterion mapping pre-drawn — "ECS.5 satisfies our CC6.1 least-privilege control" — turns a back-and-forth into a checkbox. They do this mapping anyway; doing it for them shortens the audit.

The real work is six months of evidence

Type II isn't a one-day fix — it's a ~6-month observation window. The auditor wants proof the controls held every day, across every environment. One drifted task def in month four is a finding.

A first SOC 2 Type II runs roughly months 1–3 to implement controls, then a six-month window where the auditor verifies they operated continuously. Type I is a point-in-time snapshot; Type II is the movie. That distinction is the whole cost: the hardened task definition takes an afternoon, but proving it stayed hardened across eleven environments for six months is the part that consumes the platform team.

This is where multi-environment ECS shops feel it. Vanta and Drata automate the evidence collection from Security Hub and Config — but they collect whatever state exists. If a developer spins up environment twelve from an older module, or someone toggles a setting at 2am during an incident, the drift is real and the evidence captures it. The control that matters most at fleet scale isn't any single ECS.x check — it's knowing the whole fleet's state, continuously, so drift surfaces the day it happens, not in the auditor's sample three months later.

What SOC 2 does NOT require for ECS

SOC 2 is risk-based, not prescriptive. It does not require EKS-style admission controllers, a service mesh, or a specific scanner for ECS. Don't over-build controls the auditor never asked for.

Unlike a prescriptive standard, SOC 2 lets you define your own controls against the Trust Services Criteria; the auditor then tests whether you actually operate them. There is no SOC 2 line item that says "run OPA Gatekeeper" or "deploy a service mesh." Teams coming from Kubernetes sometimes import a control set they don't need (admission webhooks, Pod Security Standards, a sidecar mesh), none of which an ECS auditor asks for.

The trap is scope creep: every extra control you claim in your system description is a control you now have to produce six months of evidence for. Match your stated controls to your actual architecture. For ECS Fargate, the high-severity Security Hub controls plus scoped IAM, logging, and an audit trail cover the Common Criteria an auditor tests. Build those well; skip the rest.

If you read this, you might also want to know

Do I need a separate AWS account for SOC 2?

Not strictly. SOC 2 doesn't mandate account structure — it cares that prod is isolated and access is scoped. A separate prod account is the cleanest way to draw that boundary and makes the auditor's scoping trivial, but a single account with hard IAM separation and tagging can pass. Most teams split prod out anyway, for blast radius as much as for the audit.

Does Fargate make SOC 2 easier than EC2?

Yes, on the infrastructure half. Fargate removes the host from your responsibility — no EC2 patching, no host hardening, no SSH access control to evidence. AWS builds the patched Fargate platform versions; you stay current by running platform version LATEST and redeploying — ECS.10 fails if a service is pinned to an older version. You inherit more of the control set, leaving you the task-definition and IAM half. On EC2 launch type, host-level controls land back on you.

How long does a first SOC 2 Type II take?

Plan on 9-12 months end to end: roughly 1-3 months to implement and document controls, then a 6-month observation window the auditor reviews, then the report. A Type I (point-in-time) can be done in weeks and is sometimes used as a stepping stone to show progress while the Type II window runs.

Can Vanta or Drata collect ECS evidence automatically?

Partly. They integrate with AWS Security Hub and Config to pull control state — including the ECS.x findings — and map them to SOC 2 criteria automatically. What they can't do is fix drift or give you a single operational view of every environment's live state. They report what exists; keeping the fleet itself consistent is still on you.


How Do You Manage ECS Fargate Across Multiple AWS Accounts?

Matt — Tue, 30 Jun 2026 11:16:09 +0000

Managing ECS Fargate Across Multiple AWS Accounts

Originally published at https://fortem.dev/blog/ecs-multi-account-management
How to operate an ECS Fargate fleet across multiple AWS accounts: cross-account IAM, central ECR, Transit Gateway cost, and the single-pane-of-glass AWS doesn't ship.

You already split prod, non-prod, and maybe per-region or per-customer into separate AWS accounts — for isolation, blast radius, and clean billing. Now you're paying the operating tax: logging in and out all day, re-deploying the same image to five accounts, and no single screen that shows what's running where. This is the operating model for an ECS Fargate fleet that already lives across accounts — plus the costs that never make the spreadsheet, and the case where splitting accounts isn't worth it.

TL;DR

A multi-account AWS Org gives you isolation; it takes away a single pane of glass. AWS has no "Lens for ECS" — you stitch one together from IAM, the CLI, and pipelines.
Five operating surfaces to solve: cross-account visibility, IAM access, image distribution (ECR), networking, and deploy.
Going multi-account has a real bill: Transit Gateway alone is ~$36/mo per VPC attachment plus $0.02/GB, before any data transfer.
The aws:PrincipalOrgID ECR shortcut opens your image repo to every account in the Org — convenient, and a security review waiting to happen.
Below ~10 environments, don't split accounts to feel enterprise. The operating tax outweighs the isolation upside until you're at fleet scale.

Why managing ECS across accounts is hard

A multi-account AWS Org buys you isolation and clean billing — but AWS ships no single control plane for ECS. There's no "Lens for ECS", so you operate each account by hand.

This is not a hypothetical complaint. On AWS re:Post, a practitioner running "lots of AWS Accounts" asked the obvious question: is there a way to manage ECS clusters across them "without logging back and forth between the accounts" — ideally a tool like Lens, the Kubernetes IDE. The answer, from an AWS-side responder, was blunt:

"First of all, I don't think there is a tool quite like Lens, for ECS. That is, centrally manage and store credentials of various clusters and provide single plane of glass for cluster management."

— AWS re:Post, "How to manage ECS Clusters across accounts?"

The recommended path is exactly the manual stitching you'd expect: named CLI profiles per account, cross-account IAM roles, and a centralized CI/CD pipeline that assumes into each target account. None of that is wrong. But you — not AWS — are the integration layer, which is the heart of the operations gap that opens up at fleet scale. A "what's running in euw1-stag-main?" question becomes a context switch, an assume-role, and a fresh console tab.

This guide assumes you've already decided to go multi-account. If you're still working out how to structure those accounts in the first place — one account per environment group versus a single shared account — that decision comes first, and it's a different article. The short version: prod gets its own account for blast radius; non-prod usually shares one until Fargate quota or audit scope forces a split. Once that's settled, the question becomes operational: how do you run the fleet day to day?

Ready to use: a cross-account fleet-viewer role

Deploy one read-only IAM role to every member account, assumable from a single ops account. Now one set of credentials can list and describe ECS across the whole Org — no console hopping.

Drop this Terraform into each member account (via your account-baseline module or StackSets). The role trusts only your ops account and grants ECS read plus the cost and tag reads you need for a fleet view. From the ops account you then aws sts assume-role into each account and run aws ecs list-clusters against all of them with one identity.

Ready to use — copy this today

# Deploy in EVERY member account. ops_account_id is your single
# operations/admin account that does the cross-account reading.
variable "ops_account_id" { type = string }

data "aws_iam_policy_document" "trust" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "AWS"
      identifiers = ["arn:aws:iam::${var.ops_account_id}:root"]
    }
    # Require an external ID to block the confused-deputy problem.
    condition {
      test     = "StringEquals"
      variable = "sts:ExternalId"
      values   = ["fortem-fleet-viewer"]
    }
  }
}

data "aws_iam_policy_document" "read" {
  statement {
    sid    = "EcsFleetRead"
    effect = "Allow"
    actions = [
      "ecs:List*",
      "ecs:Describe*",
      "ce:GetCostAndUsage",        # per-account cost for the fleet view
      "tag:GetResources",          # owner / env tags
      "cloudwatch:GetMetricData",  # task counts, CPU/mem
    ]
    resources = ["*"]
  }
}

resource "aws_iam_role" "fleet_viewer" {
  name               = "fortem-fleet-viewer"
  assume_role_policy = data.aws_iam_policy_document.trust.json
}

resource "aws_iam_role_policy" "fleet_viewer" {
  name   = "fleet-read"
  role   = aws_iam_role.fleet_viewer.id
  policy = data.aws_iam_policy_document.read.json
}

Why read-only, why an external ID

The fleet view is for answering "what's running where, who owns it, what does it cost". It never needs write access, so it doesn't get any. The sts:ExternalId condition is the standard guard against the confused-deputy problem: a caller in the ops account must also present the agreed external ID, so knowing the role ARN alone is not enough to assume it. This is the same role shape a control plane like Fortem uses: least privilege, customer-revocable in two clicks.
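From the ops account, using the role might look like the sketch below. It assumes boto3 is installed and that the caller's credentials are allowed to call sts:AssumeRole; the role name and external ID match the Terraform above, while the function names, session name, and default region are hypothetical.

```python
ROLE_NAME = "fortem-fleet-viewer"
EXTERNAL_ID = "fortem-fleet-viewer"


def assume_role_params(account_id: str, session_name: str = "fleet-view") -> dict:
    """Build the sts:AssumeRole arguments for one member account."""
    return {
        "RoleArn": f"arn:aws:iam::{account_id}:role/{ROLE_NAME}",
        "RoleSessionName": session_name,
        "ExternalId": EXTERNAL_ID,
    }


def list_clusters(account_id: str, region: str = "us-east-1") -> list[str]:
    """Assume the fleet-viewer role in account_id and list its ECS cluster ARNs."""
    import boto3  # third-party; imported here so the helper above runs without it

    creds = boto3.client("sts").assume_role(**assume_role_params(account_id))["Credentials"]
    ecs = boto3.client(
        "ecs",
        region_name=region,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    arns = []
    for page in ecs.get_paginator("list_clusters").paginate():
        arns.extend(page["clusterArns"])
    return arns


print(assume_role_params("111122223333")["RoleArn"])
```

Looping `list_clusters` over your account IDs gives a crude fleet inventory from a single identity. Clusters are regional, so a real version would also loop over regions.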

The five surfaces you have to operate

Operating ECS across accounts means solving five surfaces: fleet visibility, cross-account IAM access, image distribution, networking, and deploy. AWS gives you a primitive for each, not a system.

Each surface has a native AWS answer. The work — and the cost — is that none of them connect to each other, and you own the glue. Here's the map the rest of this guide follows:

| Surface | AWS primitive | The gap it leaves |
| --- | --- | --- |
| Fleet visibility | Console, one account at a time | No fleet-wide view of cost, owner, or state |
| IAM access | Cross-account assume-role | You build and maintain the role mesh yourself |
| Image distribution | ECR replication / repo policy | PrincipalOrgID shortcut over-shares the repo |
| Networking | Transit Gateway / VPC peering | ~$36/mo per attachment, before data |
| Deploy | Cross-account pipeline role | One image, N accounts; task defs drift |

Cross-account visibility and IAM access

Cross-account IAM has two halves: a read role for the fleet view, and a per-resource assume-role for tasks that reach into another account. The fleet-viewer role covers the first; an STS trust chain covers the second.

The visibility half is the fleet-viewer role above. The access half is different: it's when a running task in one account needs a resource that lives in another — an S3 bucket, a DynamoDB table, a secret. The pattern is a role in the resource's account whose trust policy names the task role's account, which the task assumes via STS at runtime.

The footgun here is registration, not runtime. ECS validates role ARNs when you register a task definition, and a malformed trust policy throws Role is not valid — a recurring re:Post complaint. The task role must trust ecs-tasks.amazonaws.com, the ARN must be in the right account, and whoever registers the task def needs iam:PassRole for it — scope that PassRole grant per environment or a developer can escalate past their boundary.

KEY INSIGHT: At five accounts, every "what's running where" question is five logins or five assume-role calls. The IAM mesh you build for visibility is the same mesh you build for access — design it once, deploy it to every account through your baseline module, and never wire it per-account by hand.

Distributing images: central ECR done right

Pick one: ECR cross-account replication (push once, pull locally) or a shared repo via repository policy. The aws:PrincipalOrgID shortcut works but exposes the repo to every account in the Org.

Most teams build once in a shared services or CI account, then need that image in every workload account — the cross-account slice of how ECR works for ECS Fargate teams. Two clean answers. Replication: ECR copies the image to a local repo in each destination account, so pulls are in-region, in-account, and cheap — best when you have many accounts. Shared repository: one repo with a repository policy granting pull to specific account IDs — simplest when you have a handful.

The tempting shortcut is to skip the account list and grant pull to your whole Organization with a single condition. It works. It's also exactly what a re:Post answer flagged:

"Beware: this allows all accounts in an Organization to access the ECR repository! Double check with your security team if this is allowed!"

— AWS re:Post, "Central ECR for ECS in multiple accounts"

For a regulated buyer (the kind tracking the fixed overhead every environment carries down to the dollar), "every account in the Org can pull this image" is the kind of finding an auditor circles in red. Name the accounts explicitly, or use replication.
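Naming the accounts explicitly can be sketched as a repository policy builder. This is an illustrative example: the function name and account IDs are placeholders, and the three actions are the ones I would expect a pull-only grant to need.

```python
import json

PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer",
]


def pull_policy(account_ids: list[str]) -> dict:
    """Grant image pull to an explicit list of accounts, not the whole Org."""
    if not account_ids:
        raise ValueError("list at least one account; an empty grant is probably a bug")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowPullFromNamedAccounts",
                "Effect": "Allow",
                "Principal": {
                    "AWS": [f"arn:aws:iam::{a}:root" for a in sorted(account_ids)]
                },
                "Action": PULL_ACTIONS,
            }
        ],
    }


print(json.dumps(pull_policy(["444455556666", "111122223333"]), indent=2))
```

The trade-off against the Org-wide condition is maintenance: each new workload account means one more entry in the list, which is also what keeps the grant reviewable.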

Networking across accounts — and what it costs

AWS recommends Transit Gateway first for cross-account ECS traffic. It's the clean answer and a real line item: ~$36/mo per VPC attachment plus $0.02/GB — multiply by account count.

AWS's own guidance on networking ECS services across accounts says to consider Transit Gateway first, then VPC peering, shared VPC, or PrivateLink. The quick decision:

Transit Gateway — hub-and-spoke for many accounts. Clean routing, but each VPC attachment bills hourly.
VPC peering — one or two account pairs. No hourly fee, but a full mesh of N accounts is N²/2 peerings to manage.
Shared VPC — one VPC, subnets shared into member accounts. Cheapest, but it softens the account boundary you split for.
PrivateLink — expose one service, not full reachability. Per-service, not per-network.

The number nobody puts in the spreadsheet: Transit Gateway charges $0.05 per VPC attachment-hour — about $36.50/month per attachment ($0.05 × 730 hrs) — plus $0.02/GB processed. In a hub-and-spoke, you pay an attachment for the hub VPC and one per spoke:

| Topology | VPC attachments | Monthly (attachments only) |
| --- | --- | --- |
| Hub + 1 spoke | 2 attachments | ~$73/mo |
| Hub + 3 spokes | 4 attachments | ~$146/mo |
| Hub + 5 spokes | 6 attachments | ~$219/mo |

That's before a single gigabyte crosses the gateway. Cross-account data processing at $0.02/GB adds up fast for chatty services. It's not a reason to avoid multi-account — it's a reason to know the bill before the architecture review, not after the invoice.
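The attachment arithmetic is easy to rerun for your own topology. A small sketch using the rates quoted above ($0.05 per attachment-hour, $0.02/GB, 730 hours per month); the function name is hypothetical and the data volume in the last line is an arbitrary example.

```python
TGW_ATTACHMENT_HOURLY = 0.05   # $ per VPC attachment-hour
TGW_PER_GB = 0.02              # $ per GB processed
HOURS_PER_MONTH = 730


def tgw_monthly_cost(spokes: int, gb_processed: float = 0.0) -> float:
    """Monthly Transit Gateway cost for one hub VPC plus `spokes` spoke VPCs."""
    attachments = spokes + 1  # the hub VPC needs its own attachment
    return attachments * TGW_ATTACHMENT_HOURLY * HOURS_PER_MONTH + gb_processed * TGW_PER_GB


for spokes in (1, 3, 5):
    print(f"hub + {spokes} spokes: ${tgw_monthly_cost(spokes):.2f}/mo")

# Same hub + 5 topology with 1 TB crossing the gateway in a month.
print(f"hub + 5 spokes, 1000 GB: ${tgw_monthly_cost(5, 1000):.2f}/mo")
```

The first three lines reproduce the table; the last shows how the per-GB charge stacks on top of the fixed attachment cost.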

Deploying the same service to N accounts

One image, N accounts means one pipeline assuming a deploy role per target account. Parameterize the account ID and role ARN; never copy a pipeline per account — that's how environments drift.

The centralized model is the one AWS recommends and the one that scales: the pipeline lives in a single CI/CD account, and each stage assumes a deploy role in the target account to register the task definition and update the service. The image is built once and pulled from your central or replicated ECR. The task-definition template is one file; the account ID, role ARN, and environment-specific variables are inputs.

The anti-pattern is a copy of the pipeline per account. It feels faster on day one and it guarantees drift by month three — one account gets a CPU bump, another a new env var, a third a hotfix that never makes it back to the template. Multi-account already costs you a single pane of glass; don't also give up a single source of truth for what you deploy.

The drift test

Pick any two accounts running "the same" service and diff their live task definitions: aws ecs describe-task-definition in each, assumed via the fleet-viewer role. If they differ in anything but account-specific values, your per-account pipelines have already drifted — and nothing was watching.
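The drift test can be sketched as a diff over the `taskDefinition` object that describe-task-definition returns in each account. This is a minimal sketch under assumptions: the set of keys treated as volatile and the 12-digit account-ID masking are my choices, not an official list, and the masking would also hide any other 12-digit number.

```python
import re

# Fields that legitimately differ between registrations of the same template.
VOLATILE_KEYS = {
    "taskDefinitionArn", "revision", "registeredAt", "registeredBy",
    "deregisteredAt", "status", "requiresAttributes", "compatibilities",
}
ACCOUNT_ID = re.compile(r"\b\d{12}\b")


def normalize(value):
    """Drop volatile keys and mask account IDs so only real differences remain."""
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items() if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    if isinstance(value, str):
        return ACCOUNT_ID.sub("<account>", value)
    return value


def _diff(a, b, path):
    if isinstance(a, dict) and isinstance(b, dict):
        out = []
        for key in sorted(set(a) | set(b)):
            out += _diff(a.get(key), b.get(key), f"{path}.{key}" if path else key)
        return out
    if isinstance(a, list) and isinstance(b, list) and len(a) == len(b):
        out = []
        for i, (x, y) in enumerate(zip(a, b)):
            out += _diff(x, y, f"{path}[{i}]")
        return out
    return [] if a == b else [path]


def drift(task_def_a: dict, task_def_b: dict) -> list[str]:
    """Return the paths where two task definitions differ after normalization."""
    return _diff(normalize(task_def_a), normalize(task_def_b), "")


# Hypothetical task definitions from two accounts.
prod = {
    "family": "app", "cpu": "512", "revision": 7,
    "taskRoleArn": "arn:aws:iam::111122223333:role/task",
    "containerDefinitions": [
        {"name": "app", "environment": [{"name": "LOG_LEVEL", "value": "info"}]}
    ],
}
stag = {
    "family": "app", "cpu": "1024", "revision": 12,
    "taskRoleArn": "arn:aws:iam::444455556666:role/task",
    "containerDefinitions": [
        {"name": "app", "environment": [{"name": "LOG_LEVEL", "value": "debug"}]}
    ],
}
print(drift(prod, stag))
```

An empty list means the two accounts differ only in account-specific values. Anything else is a path to inspect, and some paths (a deliberate log-level difference, for instance) may be intended and worth allow-listing.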

When multi-account isn't worth it

Below ~10 environments, the operating tax — Transit Gateway cost, IAM sprawl, per-account deploys — usually outweighs the isolation upside. Split prod out for blast radius; keep non-prod together until it hurts.

Multi-account is the right call at scale and a self-inflicted wound at three environments. If you run a prod, a staging, and a dev, splitting them into five accounts to look enterprise buys you an IAM mesh, a Transit Gateway bill, and five places to deploy — to protect blast radius you could get from one account with disciplined tagging and SCPs.

The honest threshold: split prod into its own account early — that boundary is worth it for almost everyone. Hold non-prod in one shared account until something concrete forces a split: a regulated customer who needs hard isolation, a Fargate vCPU quota that dev load tests keep exhausting, or an audit scope you need to draw a clean line around. Add accounts because a requirement demands one, not because the org chart has room.

If you read this, you might also want to know

Can one ECS cluster span multiple AWS accounts?

No. An ECS cluster lives in exactly one account and one region. "Managing across accounts" always means N clusters in N accounts, coordinated from the outside — via cross-account IAM, a central pipeline, and a fleet view you assemble. There is no cluster object that straddles an account boundary.

How do I see total ECS cost across all my accounts in one place?

Enable Cost Explorer at the AWS Organization level from the management account, and tag every environment consistently so you can group by it. That gives you Org-wide cost but not per-environment ECS state. To tie cost to a running service and its owner across accounts, you need a tool that joins billing data with live ECS describe calls — which is what the fleet-viewer role's ce:GetCostAndUsage permission is for.
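The Org-wide half of that answer can be sketched as a Cost Explorer query grouped by linked account. This assumes it runs with management-account credentials, that you pass the request to boto3's `get_cost_and_usage` yourself, and that "Amazon Elastic Container Service" is the SERVICE dimension value your bill uses for ECS and Fargate; check that value against your own Cost Explorer before relying on it. The function names and the sample response are hypothetical.

```python
from collections import defaultdict


def ecs_cost_request(start: str, end: str) -> dict:
    """Build GetCostAndUsage arguments: ECS cost per linked account, by month."""
    return {
        "TimePeriod": {"Start": start, "End": end},  # YYYY-MM-DD, end exclusive
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "DIMENSION", "Key": "LINKED_ACCOUNT"}],
        "Filter": {
            "Dimensions": {
                "Key": "SERVICE",
                "Values": ["Amazon Elastic Container Service"],  # assumed value
            }
        },
    }


def cost_by_account(response: dict) -> dict:
    """Sum UnblendedCost per account ID across all returned periods."""
    totals = defaultdict(float)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            amount = group["Metrics"]["UnblendedCost"]["Amount"]  # returned as a string
            totals[group["Keys"][0]] += float(amount)
    return dict(totals)


# Hypothetical response fragment, not real billing data.
sample_response = {
    "ResultsByTime": [
        {"Groups": [
            {"Keys": ["111122223333"], "Metrics": {"UnblendedCost": {"Amount": "12.50", "Unit": "USD"}}},
            {"Keys": ["444455556666"], "Metrics": {"UnblendedCost": {"Amount": "3.00", "Unit": "USD"}}},
        ]},
        {"Groups": [
            {"Keys": ["111122223333"], "Metrics": {"UnblendedCost": {"Amount": "7.25", "Unit": "USD"}}},
        ]},
    ]
}
print(cost_by_account(sample_response))
```

This gives cost per account, not per environment or service owner; joining it to live ECS state is the part you still have to build or buy.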

Should each customer get its own AWS account for ECS?

Only if isolation is a contractual or regulatory requirement. Per-customer accounts give the hardest blast-radius and data boundary, but they multiply every operating surface in this article by your customer count. Most B2B SaaS use shared accounts with per-tenant isolation inside the application until a specific customer's compliance posture forces a dedicated account.

Does AWS Organizations give me a single ECS dashboard?

No. Organizations handles account creation, consolidated billing, and service control policies — not a cross-account ECS view. AWS has no native single-pane-of-glass for ECS the way Lens works for Kubernetes; the cross-account visibility layer is something you build or buy.


AWS Fargate vs Lambda: When Does Lambda Stop Being Cheaper?

Matt — Tue, 30 Jun 2026 11:15:28 +0000

AWS Fargate vs Lambda: When Lambda Stops Being Cheaper

Originally published at https://fortem.dev/blog/fargate-vs-lambda
AWS Fargate vs Lambda: the cost line is set by execution duration, not traffic. Breakeven math, hidden Lambda costs, and what the June 2026 MicroVMs launch changes.

Versus

Lambda is not categorically cheaper than Fargate, and Fargate is not categorically cheaper than Lambda. There is a crossover point, and it is set mostly by how long each invocation runs — not by how much traffic you get. Most comparisons stop at a feature table. This one gives you the breakeven in dollars, the hidden costs that move it, and what the new Lambda MicroVMs (launched June 22, 2026) change — and what they don't.

TL;DR

Lambda wins on short, spiky, event-driven work. Fargate wins on long-running, steady services. The line is duration × frequency, not raw traffic.
Real breakeven: a 200ms API crosses ~6–8M invocations/mo; a 2s background job crosses ~1M/mo; at 5s+ duration Lambda almost never wins at scale.
Per-request charges are only ~20–40% of a serverless bill. API Gateway ($3.50/M), CloudWatch Logs ($50–150/mo), NAT, and provisioned concurrency are the rest.
June 22, 2026: Lambda MicroVMs lifted the runtime limit from 15 min to 8 hours (16 vCPU / 32 GB, Firecracker) — but they target isolated sandboxes for AI and untrusted code, not always-on web services.

Quick answer — Lambda is cheaper for short, spiky workloads; Fargate is cheaper for long-running, steady ones. The crossover is set by execution duration: a 200ms API endpoint stays cheaper on Lambda up to roughly 6–8M invocations/month (including API Gateway and CloudWatch), while a 2s background job crosses to Fargate at about 1M/month. At 5s+ average duration, Lambda almost never wins at scale. Fargate compute runs $0.04048/vCPU-hr + $0.004445/GB-hr; Fargate Spot is ~68% cheaper. A monthly Lambda bill above ~$1,000 is a strong signal that at least one workload belongs on Fargate.

Ready to use — copy this today

The two cost formulas, side by side. Swap your own numbers in and you get the monthly figure for each service — then compare. Lambda bills GB-seconds plus a per-request fee; Fargate bills allocated vCPU and memory for the hours the task runs.

# ---- Lambda monthly cost ----
#   memory_gb     = function memory / 1024     e.g. 0.5
#   duration_s    = avg execution seconds      e.g. 0.2
#   invocations   = requests per month         e.g. 5_000_000
#
gb_seconds   = memory_gb * duration_s * invocations
compute      = gb_seconds * 0.00001667         # $/GB-second
requests     = invocations * 0.00000020        # $0.20 per 1M requests
api_gateway  = invocations * 0.0000035         # $3.50 per 1M (if used)
lambda_total = compute + requests + api_gateway

# ---- Fargate monthly cost (one always-on task) ----
#   vcpu = 0.5   mem_gb = 1   hours = 730 (24/7) or ~217 (business hrs)
fargate_total = (vcpu * 0.04048 + mem_gb * 0.004445) * hours
# Fargate Spot: multiply the compute rates by ~0.319 (≈68% off)

# ---- Worked example: 0.5 GB, 0.2s, 5M invocations ----
# Lambda  : 500,000 GB-s -> $8.34 compute + $1.00 requests + $17.50 API GW = ~$26.84/mo
# Fargate : (0.5*0.04048 + 1*0.004445) * 730                              = ~$18.02/mo  (24/7)
# At 5M invocations of a 200ms API, one always-on Fargate task already wins.

Rates: Lambda $0.00001667/GB-s + $0.20/1M requests; API Gateway $3.50/1M; Fargate $0.04048/vCPU-hr + $0.004445/GB-hr (Linux/x86, us-east-1, verified June 2026). Real systems need more than one Fargate task for redundancy — adjust hours and task count for your setup.
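The two formulas above translate directly into a few lines of Python. This is a minimal sketch, not an official calculator: the function and constant names are made up for illustration, and the rates are the us-east-1 figures quoted in this article, hard-coded as constants.

```python
# Rates quoted in this article (Linux/x86, us-east-1); adjust for your region.
LAMBDA_GB_SECOND = 0.00001667        # $ per GB-second
LAMBDA_REQUEST = 0.20 / 1e6          # $ per request
API_GATEWAY_REQUEST = 3.50 / 1e6     # $ per request, only if API Gateway is used
FARGATE_VCPU_HOUR = 0.04048          # $ per vCPU-hour
FARGATE_GB_HOUR = 0.004445           # $ per GB-hour


def lambda_monthly_cost(memory_gb, duration_s, invocations, api_gateway=True):
    """Monthly Lambda cost: compute + request fee (+ API Gateway if used)."""
    compute = memory_gb * duration_s * invocations * LAMBDA_GB_SECOND
    requests = invocations * LAMBDA_REQUEST
    gateway = invocations * API_GATEWAY_REQUEST if api_gateway else 0.0
    return compute + requests + gateway


def fargate_monthly_cost(vcpu, mem_gb, hours=730, tasks=1):
    """Monthly Fargate cost for `tasks` identical tasks running `hours` each."""
    return (vcpu * FARGATE_VCPU_HOUR + mem_gb * FARGATE_GB_HOUR) * hours * tasks


if __name__ == "__main__":
    # The worked example: 0.5 GB, 0.2 s, 5M invocations vs one 0.5 vCPU / 1 GB task.
    print(f"Lambda : ${lambda_monthly_cost(0.5, 0.2, 5_000_000):.2f}/mo")
    print(f"Fargate: ${fargate_monthly_cost(0.5, 1):.2f}/mo")
```

Passing `tasks=2` gives a first approximation of a redundant setup; it still leaves out the ALB and other fixed overhead.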

Where Lambda wins, where Fargate wins

Lambda bills per millisecond of execution and fits spiky, event-driven work; Fargate bills for allocated vCPU and memory and wins on long-running, steady services. The split is duration, not app type.

The framing “serverless is cheaper” hides the mechanism underneath. Lambda charges for the time your code runs, rounded to the millisecond, times the memory you assigned. When code runs rarely and briefly, you pay almost nothing between invocations. When it runs constantly, you pay for each of those milliseconds — and there are 2.6 billion of them in a month.

Fargate is the inverse. You pay for a task's vCPU and memory for as long as it exists, whether it serves one request or ten thousand per second. Idle time is wasted money; saturated time is a bargain.

Reach for Lambda

S3-triggered file processing
Webhook and HTTP handlers with bursty traffic
Scheduled (cron) jobs that run briefly
Queue and stream consumers with variable load
Anything that needs to scale from zero instantly

Reach for Fargate

Long-lived microservices and APIs
Services holding connections (WebSocket, gRPC)
Batch and ETL jobs past the 15-minute mark
Steady traffic where utilization stays high
Workloads needing precise CPU/memory control

KEY INSIGHT: A “serverless” service that runs 24/7 under steady load is paying Lambda's premium for elasticity it never uses. Elasticity is only free when your traffic is spiky. If your invocation graph is a flat line, you are buying the wrong abstraction.

The real cost breakeven

Breakeven is set by function duration: a 200ms API crosses around 6–8M invocations a month, a 2s background job around 1M. The longer the function runs, the sooner Fargate wins.

Invocation count is the number most teams reach for, but it is the wrong axis on its own. Duration multiplies it. At the same memory size and invocation count, a 2s function consumes 10× the GB-seconds of a 200ms function, and GB-seconds are what Lambda bills. That is why the breakeven for a longer function lands at a fraction of the invocations.

Workload	Lambda config	Compute only	+ API GW & logs
API endpoint	512 MB · 200 ms	~10M / mo	~6–8M / mo
Background processor	1024 MB · 2 s	~1.5M / mo	~1M / mo
Data pipeline	2048 MB · 500 ms	~5M / mo	~4M / mo

Breakeven = invocations/month above which Fargate is cheaper. Based on a 2026 third-party analysis of production workloads (LeanOps). The longer the function runs, the lower the breakeven.
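To see why duration moves the line, one can solve the Lambda cost formula for the invocation count at which it equals a fixed monthly Fargate figure. The sketch below assumes the rates quoted in this article and a `fargate_monthly` number that you supply yourself (tasks, ALB, NAT share); the $40 used here is a placeholder. It is not the method behind the table above, so its results will not match those thresholds exactly.

```python
def breakeven_invocations(memory_gb, duration_s, fargate_monthly,
                          api_gateway=True):
    """Invocations/month above which Lambda costs more than `fargate_monthly`.

    Uses the article's us-east-1 rates. `fargate_monthly` is whatever fixed
    monthly figure you are comparing against (an assumption you supply).
    """
    per_invocation = memory_gb * duration_s * 0.00001667  # compute, $/GB-second
    per_invocation += 0.20 / 1e6                          # request fee
    if api_gateway:
        per_invocation += 3.50 / 1e6                      # API Gateway fee
    return fargate_monthly / per_invocation


# Same Fargate budget, two durations: the longer function breaks even sooner.
short = breakeven_invocations(0.5, 0.2, fargate_monthly=40.0)
long_ = breakeven_invocations(1.0, 2.0, fargate_monthly=40.0)
print(f"200 ms API: {short:,.0f}/mo   2 s job: {long_:,.0f}/mo")
```

Raising `fargate_monthly` to reflect a second task or an ALB pushes both breakevens up by the same factor; the ratio between them is set by duration and memory alone.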

Put a concrete profile through it. At 5 million invocations of a 200ms function with 512 MB, Lambda's compute plus an API Gateway in front already exceeds the cost of a single always-on Fargate task doing the same work. The Fargate bar is stacked: its own compute, plus the slice of the shared NAT Gateway it actually uses.

Monthly cost, 200ms API at 5M invocations/mo (Fargate −19% vs Lambda + API Gateway):

Option	Configuration	Monthly cost
Lambda + API Gateway	5M × 200ms × 0.5 GB	$27/mo
Fargate (1 always-on task)	0.5 vCPU + 1 GB · 24/7	$22/mo
Fargate Spot	same task · ~68% off	$10/mo

The Fargate figures are stacked: Fargate compute plus the shared NAT share.

The NAT Gateway is a per-VPC resource that almost every AWS account already runs and shares across all tasks and environments in the VPC — so a single service carries only its slice (~$4/mo here), not the full ~$66/mo. Loading the whole NAT onto one Fargate task would overstate its cost. Lambda outside a VPC needs no NAT at all; Lambda inside a VPC shares the same gateway, so this overhead is roughly a wash between the two. Fargate Spot ($0.01291/vCPU-hr + $0.001417/GB-hr) is for fault-tolerant, stateless workloads, not strict-uptime prod APIs — shown for the cost picture, not as a drop-in here.
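The stacked Fargate figures can be reproduced from the rates in the text. A minimal sketch, assuming the ~$4/mo NAT share and the 0.5 vCPU / 1 GB task used in this comparison; the helper name `task_cost` is illustrative.

```python
HOURS_PER_MONTH = 730
NAT_SHARE = 4.00  # this service's slice of the shared NAT Gateway, $/mo (from the text)


def task_cost(vcpu, mem_gb, vcpu_rate, gb_rate, hours=HOURS_PER_MONTH):
    """Compute cost of one Fargate task for the month at the given rates."""
    return (vcpu * vcpu_rate + mem_gb * gb_rate) * hours


# On-demand and Spot rates as quoted in this article.
on_demand = task_cost(0.5, 1, 0.04048, 0.004445) + NAT_SHARE
spot = task_cost(0.5, 1, 0.01291, 0.001417) + NAT_SHARE
print(f"Fargate on-demand: ${on_demand:.2f}/mo, Fargate Spot: ${spot:.2f}/mo")
```

Both results should land close to the $22 and $10 figures in the comparison once rounded.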

That comparison uses one Fargate task for clarity. Production needs at least two for redundancy, plus the rest of the fixed overhead an ECS environment carries — what a real Fargate environment costs once you count the ALB and CloudWatch alongside that NAT share. Fold it in and the crossover shifts, but the direction holds: the longer and busier the workload, the more Fargate pulls ahead.

These thresholds come from a 2026 third-party analysis of production workloads, not from AWS. Treat them as a starting estimate and confirm with your own numbers using the formulas above — your memory size, duration, and whether you front Lambda with API Gateway all move the line.

The hidden costs that move the line

Per-request charges are only 20–40% of a serverless bill. API Gateway ($3.50/M, often more than the Lambda itself), CloudWatch Logs ($50–150/mo at 1M+/day), NAT, and provisioned concurrency make up the rest.

The Lambda line item on your bill is the part most teams model. The rest hides in adjacent services that the function can't run without. A fair comparison has to count them, because Fargate either avoids them or pays them differently.

API Gateway — $3.50 per million requests

Most HTTP Lambdas sit behind API Gateway. At high request volume, that per-request fee routinely exceeds the Lambda compute cost itself. A Fargate service behind an Application Load Balancer pays a flat ~$22/month instead, regardless of request count.
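A quick way to see where the per-request fee overtakes the flat one is to divide the ALB figure by the API Gateway rate. This sketch uses only the two numbers from the paragraph above and deliberately ignores the ALB's usage-based LCU charges, so treat the crossover as a rough lower bound.

```python
API_GATEWAY_PER_REQUEST = 3.50 / 1_000_000  # $ per request (article's rate)
ALB_FLAT_MONTHLY = 22.0                      # approximate flat ALB cost, $/mo (article's figure)


def api_gateway_monthly(requests):
    """API Gateway request charges for one month."""
    return requests * API_GATEWAY_PER_REQUEST


crossover = ALB_FLAT_MONTHLY / API_GATEWAY_PER_REQUEST
for requests in (1_000_000, 5_000_000, 20_000_000):
    fee = api_gateway_monthly(requests)
    print(f"{requests:>12,} req/mo: API Gateway ${fee:.2f} vs ALB ${ALB_FLAT_MONTHLY:.2f}")
print(f"API Gateway passes the flat ALB fee at about {crossover:,.0f} requests/mo")
```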

CloudWatch Logs — $50–150/mo at 1M+ invocations/day

Every invocation writes a log stream. At a million-plus invocations a day, ingestion alone runs $50–150/month. Both platforms log, but Lambda's per-invocation granularity multiplies the line count fast.

Provisioned concurrency — billed whether or not it runs

The standard fix for cold starts keeps warm instances on standby and charges for them around the clock — the always-on cost model you chose Lambda to avoid.

CPU tied to memory

Lambda scales CPU with the memory setting. A CPU-bound function forces you to over-provision memory you don't need to get more cores. Fargate lets you set vCPU and memory independently.

KEY INSIGHT: If your monthly Lambda-related bill clears ~$1,000, moving the heaviest function group to Fargate is likely your highest-ROI infrastructure task this quarter. The savings rarely come from the function line alone — they come from dropping the API Gateway and CloudWatch surcharges that ride along with it.

The twist: Lambda MicroVMs move the boundary (June 2026)

On June 22, 2026 AWS shipped Lambda MicroVMs: up to 8 hours, 16 vCPU, 32 GB, Firecracker isolation. It removes the 15-minute limit — but it targets isolated sandboxes for AI and untrusted code, not always-on web services.

For years, the 15-minute timeout was the cleanest reason to leave Lambda: if a job ran longer, you moved it to Fargate or Batch. MicroVMs change that specific fact. Each session runs in its own dedicated MicroVM — Firecracker virtualization, no shared kernel, no shared resources with other sessions — and can hold state across user interactions for up to eight hours.

Limit	Value	Note
Max runtime	8 hours	vs 15 min standard
Max vCPU	16	per MicroVM
Max memory	32 GB	per MicroVM
Max disk	32 GB	per MicroVM

“Each session runs in its own dedicated MicroVM with no shared kernel and no shared resources between users, so untrusted code supplied by one user is contained to their execution environment.”

— AWS News Blog: Lambda MicroVMs, June 2026

The intended use cases tell you who this is for: AI coding assistants, interactive code environments, data analytics platforms, vulnerability scanners, and game servers that run user-supplied scripts. The common thread is running code you don't trust in a hard isolation boundary, with full lifecycle control over each session.

The part most takes will get wrong: MicroVMs are not Lambda becoming Fargate. A MicroVM is a lifecycle-managed session you launch, use, and tear down — not an always-on listener answering a steady stream of HTTP requests. For a long-lived web service or API, the right tool is still Fargate. What MicroVMs displace is the pattern where teams spun up a Fargate task as a sandbox to run untrusted or AI-generated code — that niche now has a purpose-built home.

If your isolation need is bigger than a single sandbox, such as a full copy of a service with its dependencies, that is still a container problem: closer to cloning a full environment than to a MicroVM session.

On cost, AWS prices MicroVMs across three dimensions: compute (per-second, on baseline and peak usage), snapshot operations and storage, and data transfer. AWS has not published a flat per-second rate, so there is no clean number to drop into the breakeven math yet. MicroVMs are available in US East (N. Virginia and Ohio), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland).

Cold start and latency reality

Lambda cold starts run from tens to hundreds of milliseconds, worse inside a VPC or with large packages. Provisioned concurrency removes them — but then you pay around the clock, which erodes Lambda's main cost edge.

A cold start is the time Lambda needs to spin up a new execution environment when no warm one is available. For a small function it's tens of milliseconds; in a VPC, or with a heavy runtime and large deployment package, it climbs into the hundreds. For a user-facing API that occasionally idles, that tail latency is the complaint that shows up in your dashboards.

Fargate has its own startup cost — 30 to 90 seconds to launch a task — but it pays that once. After that the task stays warm and per-request latency is steady, because there is no per-invocation environment to create.

The catch is what fixing Lambda's cold start does to the bill. Provisioned concurrency keeps instances warm and charges for them whether or not a request arrives. That is the always-on cost model — the one Lambda was supposed to let you skip. Once you're paying for warm capacity full time, you've recreated Fargate's economics without Fargate's resource control.

The decision: a checklist

Pick Lambda for spiky or unpredictable traffic, short tasks under ~1s, and low volume. Pick Fargate for steady load, functions 5s+, fine CPU/memory control, or a Lambda bill above ~$300–1,000/mo.

Dimension	AWS Lambda	AWS Fargate
Billing unit	Per ms of execution (GB-seconds)	Per second of allocated vCPU + GB
Best traffic shape	Spiky, unpredictable, event-driven	Steady, sustained, always-on
Short tasks (<1s)	Wins on cost	Overkill
Long tasks (5s+)	Loses fast at scale	Wins on cost
Runtime ceiling	15 min (8 hr via MicroVMs)	Unbounded
Resource control	CPU tied to memory	vCPU + memory set independently
Cold start	Tens–hundreds of ms	30–90s once, then steady
Scale to zero	Native, instant	Manual (scheduling / desiredCount 0)

Reduced to a few if/then rules:

LAMBDA: Traffic is spiky or unpredictable, and idle periods are real.

LAMBDA: Each invocation is short (under ~1s) and total volume is modest.

FARGATE: Load is steady, or invocations average 5 seconds or more.

FARGATE: You need independent CPU and memory control, or runtime past 15 minutes for a service (not a sandbox).

FARGATE: Your Lambda-related bill (compute + API Gateway + logs) is past ~$300–1,000/mo for one workload.

MICROVM: You need a hard isolation boundary to run untrusted or AI-generated code for up to 8 hours.

Most teams don't pick one and stop. A typical setup keeps spiky glue on Lambda, runs the steady services on Fargate, and now has MicroVMs for the sandbox case. The mistake isn't mixing them — it's leaving a workload on the wrong one after its traffic shape has changed.
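The if/then rules above can be written down as a small function, which is handy for running a list of workloads through the same checklist. This is only an illustrative encoding: the function name, parameter names, the rule ordering, and the "measure" fallback for the 1 to 5 second gap are assumptions, and the bill threshold defaults to the upper end of the ~$300–1,000 range.

```python
def recommend(spiky, avg_duration_s, *, needs_independent_cpu_mem=False,
              service_runs_past_15_min=False, lambda_bill_monthly=0.0,
              bill_threshold=1000.0, untrusted_code_sandbox=False):
    """Return "lambda", "fargate", "microvm", or "measure" (illustrative only)."""
    if untrusted_code_sandbox:
        return "microvm"       # hard isolation for untrusted or AI-generated code
    if (not spiky or avg_duration_s >= 5 or needs_independent_cpu_mem
            or service_runs_past_15_min or lambda_bill_monthly > bill_threshold):
        return "fargate"       # steady, long-running, or already expensive
    if avg_duration_s < 1:
        return "lambda"        # spiky and short
    return "measure"           # spiky, 1 to 5 s: run the cost formulas instead


workloads = {
    "webhook handler": recommend(True, 0.2),
    "steady API": recommend(False, 0.2),
    "report builder": recommend(True, 8.0),
    "code sandbox": recommend(True, 0.5, untrusted_code_sandbox=True),
}
for name, choice in workloads.items():
    print(f"{name}: {choice}")
```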

If you read this, you might also want to know

What if my workload is partly spiky and partly steady?

Split it. Run the steady core — the part that handles baseline traffic 24/7 — on Fargate, and keep the spiky overflow or event-driven glue on Lambda. The mistake is forcing one runtime onto a workload with two traffic shapes. A Fargate service for the baseline plus Lambda for bursts is usually cheaper than either alone scaled to cover both.

Where does EC2 fit between Lambda and Fargate on cost?

EC2 sits below Fargate on raw compute price at high, steady utilization, because Reserved Instances and Savings Plans cut 30–50% — but you take on AMI patching, scaling, and capacity planning. The order at steady state is roughly EC2 < Fargate < Lambda on cost, and roughly Lambda < Fargate < EC2 on operational burden. Fargate is the middle: more expensive than tuned EC2, far less to operate.

Are MicroVMs cheaper than running a Fargate task as a sandbox?

There's no clean answer yet — AWS prices MicroVMs on per-second compute plus snapshot storage and data transfer, with no published flat rate. The likely advantage is operational, not only dollar: MicroVMs give per-session isolation and lifecycle control out of the box, where a Fargate sandbox makes you build task launch, teardown, and isolation yourself. Compare on total effort, not the compute line alone.

Common questions

Is Fargate cheaper than Lambda?

It depends on execution duration, not just traffic. Lambda is cheaper for short, spiky workloads; Fargate is cheaper for long-running, steady ones. For a 200ms API the crossover is around 6–8M invocations/month once you include API Gateway and CloudWatch. For a 2s background job it drops to about 1M/month. At 5s+ average duration, Lambda almost never wins at meaningful scale.

At what point does Lambda become more expensive than Fargate?

By one 2026 analysis of production workloads: a 512MB/200ms API crosses around 6–8M invocations/month, a 1024MB/2s background processor around 1M/month. The longer each invocation runs, the lower the breakeven. A practical signal: if your monthly Lambda-related bill clears $1,000, at least one function group is almost certainly cheaper on Fargate.

Can AWS Lambda run longer than 15 minutes now?

Yes, in a specific form. Standard Lambda functions still cap at 15 minutes. But Lambda MicroVMs, launched June 22 2026, support up to 8 hours of runtime with 16 vCPU, 32 GB memory, and 32 GB disk per MicroVM, isolated via Firecracker. They are built for isolated sandboxes running AI-generated or untrusted code — not as a replacement for always-on web services, which still belong on Fargate.

Do I need API Gateway with Fargate?

No. Fargate services typically sit behind an Application Load Balancer (~$22/month per environment), not API Gateway. Lambda's HTTP path usually requires API Gateway at $3.50 per million requests, which often exceeds the Lambda compute cost itself. Removing that per-request charge is one reason high-traffic APIs get cheaper on Fargate.

Can Fargate Spot beat Lambda on cost?

For fault-tolerant, steady workloads, yes — by a wide margin. Fargate Spot runs at $0.01291/vCPU-hr + $0.001417/GB-hr, about 68% below on-demand. A stateless service that restarts cleanly on a 2-minute interruption notice runs far cheaper on Spot than on Lambda once traffic is steady. Spot is wrong for production APIs with strict uptime needs.


Worth reading

Landing · Fargate vs EC2: When Each Launch Type Makes Sense. The other half of the compute decision: bin-packing math, Spot, and the ~60% utilization line where EC2 overtakes Fargate.
Guide · What Does AWS Fargate Actually Cost Per Environment? AWS says $0.04048/vCPU-hr. Here's the real per-environment cost once you count ALB, NAT Gateway, CloudWatch, and data transfer.

See your real per-env cost: fortem.dev/ecs-cost-calculator

How to Find and Kill Orphaned ECS Environments Before They Drain Your Budget

Matt — Tue, 30 Jun 2026 11:15:22 +0000

How to Find and Kill Orphaned ECS Environments

Originally published at https://fortem.dev/blog/ecs-orphaned-environments
A stopped ECS service costs $0 in compute — but the ALB ($16/mo) and NAT Gateway ($32/mo) keep billing. Here's how to find and delete orphaned environments before they drain your budget.

Use Case

Every team with 10+ ECS environments has at least one nobody uses anymore. The Fargate tasks stopped when the feature shipped — or didn't. But the ALB kept running. The NAT Gateway kept running. Six months later you're looking at a $400 line item on the bill and nobody can explain it.

TL;DR

A stopped ECS environment (desired=0) still costs $48–65/mo in ALB + NAT Gateway overhead.
Fargate is honest — it bills $0 when desired=0. ALB and NAT Gateway don't know and don't care.
3 CLI commands surface every orphan in your account in under 5 minutes.
Kill order matters: tasks → service → target group → ALB → NAT Gateway → log groups.
5 forgotten environments = ~$3,900/year in pure waste, no compute running.

Ready to use — run this today

Find every ECS service at desired=0 across all clusters in the current AWS account:

# For every cluster, print each service whose desiredCount is 0
aws ecs list-clusters --query 'clusterArns[]' --output text | tr '\t' '\n' \
| while read -r cluster; do
    echo "=== $cluster ==="
    # One describe-services call per service, up to 4 in parallel
    aws ecs list-services --cluster "$cluster" \
      --query 'serviceArns[]' --output text | tr '\t' '\n' \
    | xargs -r -P4 -I{} aws ecs describe-services \
        --cluster "$cluster" --services {} \
        --query 'services[?desiredCount==`0`].[serviceName,desiredCount,runningCount]' \
        --output table
  done

Requires: AWS CLI v2, credentials with ecs:ListClusters, ecs:ListServices, ecs:DescribeServices

What makes an ECS environment orphaned

An ECS environment is orphaned when its desired count hits 0 but the supporting infrastructure — ALB, NAT Gateway, log groups — keeps running and billing.

Three patterns cause this. The first is the feature branch that shipped (or got cancelled): someone set desiredCount=0 to "pause" the environment, meant to delete it later, and never did. The ECS console shows 0/0 tasks — looks fine, no alarms fire, nobody notices. This is different from deliberately using a calendar to schedule ECS environments off nights and weekends: a paused-and-forgotten environment never comes back, and nobody is tracking it.

The second is the deprecated microservice. The team migrated to a new service, pointed traffic at it, and left the old one running at zero. It still has an ALB. It still has a NAT Gateway routing its (nonexistent) outbound traffic. The Terraform state still references it.

The third pattern is specific to EC2-backed ECS clusters: an instance fails to register with the cluster — misconfigured IAM role, broken ECS agent, VPC networking issue — and sits in the Auto Scaling group in a healthy state while ECS has no idea it exists. AWS's own documentation describes it: "the instance will just sit there, idling along doing nothing in an unregistered orphaned state."

All three share the same symptom: the ECS console looks clean. No errors. No alerts. Just a steady, invisible charge on the monthly bill.

The real cost of a dead environment

One orphaned Fargate environment with zero running tasks costs $48–65/month: ALB $16.43 + NAT Gateway $32.40, plus CloudWatch log and ECR storage. No compute, but the infrastructure meter runs.

Orphaned environment — monthly cost with 0 running tasks

Item	Monthly cost
ALB (base)	$16.43/mo
NAT Gateway	$32.40/mo
CloudWatch logs	$8.00/mo
ECR storage	$9.00/mo
Total infrastructure overhead	$65.83/mo

Fargate compute: $0.00 — desired count is 0, no tasks run. The infrastructure doesn't care.

KEY INSIGHT: Fargate is honest — it bills $0 when desiredCount is 0, because no tasks are running. ALB and NAT Gateway aren't connected to ECS service state. They bill by the hour, unconditionally. An environment at zero is indistinguishable from an environment at 100 tasks as far as those services are concerned.

The ALB base rate is $0.0225/hr (verified June 2026), which is $16.43/month over 730 hours whether or not a single request passes through it. NAT Gateway is $0.045/hr (verified June 2026), which is $32.40/month per gateway over a 720-hour month (about $32.85 at 730 hours). If your environment runs one NAT Gateway in each of two AZs, that's $64.80/month just in NAT Gateway overhead.

At 5 forgotten environments, that's $3,900/year in pure infrastructure waste. No compute. No traffic. No one using it.
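The arithmetic behind these figures is easy to re-run for your own fleet. A minimal sketch, assuming the hourly rates above; the log and ECR defaults are the example values from the cost breakdown, and the two different hour counts are kept only because they reproduce the $16.43 and $32.40 figures as quoted.

```python
ALB_HOURLY = 0.0225   # $/hr, ALB base rate quoted above
NAT_HOURLY = 0.045    # $/hr per NAT Gateway, quoted above


def idle_overhead_monthly(nat_gateways=1, logs=8.00, ecr=9.00,
                          alb_hours=730, nat_hours=720):
    """Monthly cost of an environment with zero running tasks."""
    alb = ALB_HOURLY * alb_hours
    nat = NAT_HOURLY * nat_hours * nat_gateways
    return alb + nat + logs + ecr


per_env = idle_overhead_monthly()
yearly_for_five = per_env * 12 * 5
print(f"one env: ${per_env:.2f}/mo, five envs: ${yearly_for_five:,.0f}/yr")
```

Setting `nat_gateways=2` models an environment that runs a gateway in each of two AZs.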

The number teams miss when auditing is the fixed overhead per environment, which persists regardless of task count. An environment costs money from the moment you create the ALB and NAT Gateway, not from the moment tasks start running.

How to find orphaned environments

Three AWS CLI commands surface every ECS service with zero desired count and their associated infrastructure across all clusters — no third-party tools, no console clicking.

Command 1 — find all zero-desired services. The script in the ready-to-use block above lists every service at desiredCount=0. Run it in each region you use. Filter by cluster name to narrow the scope.

Command 2 — find ALBs with no healthy targets. A stopped environment's target group shows 0 healthy targets. This is the fastest way to cross-reference which ALBs are attached to dead environments:

aws elbv2 describe-target-groups --query \
  'TargetGroups[*].[TargetGroupName,TargetGroupArn]' --output text \
| while read -r name arn; do
    # Count the targets currently reporting healthy in this target group
    health=$(aws elbv2 describe-target-health --target-group-arn "$arn" \
      --query 'TargetHealthDescriptions[?TargetHealth.State==`healthy`] | length(@)' \
      --output text)
    echo "$health healthy  $name"
  done | sort -n

Target groups with 0 healthy targets are candidates for deletion, but run Command 3 before acting.

Command 3: find log groups with no recent writes. CloudWatch log groups that haven't received a write in 30+ days are orphaned log storage. Stored log data bills at about $0.03/GB/month (the $0.50/GB rate applies to ingestion), and it accumulates silently:

# Sum of log events written to one log group over the last 30 days
# GNU date syntax; on macOS/BSD use: date -u -v-30d +%Y-%m-%dT%H:%M:%SZ
aws cloudwatch get-metric-statistics \
  --namespace AWS/Logs \
  --metric-name IncomingLogEvents \
  --dimensions Name=LogGroupName,Value=/ecs/your-service \
  --start-time $(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 2592000 \
  --statistics Sum \
  --query 'Datapoints[0].Sum'

A return value of null or 0.0 means the log group is dead. Automate this check across all /ecs/* log groups to build a full orphan list.
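Once the metric has been collected for every /ecs/* log group, separating dead from live groups is a small filtering step. The sketch below assumes you have already gathered the `Datapoints` list for each group (for example by running the command above with `--output json` per group); the function name and the sample log group names are hypothetical.

```python
def dead_log_groups(stats_by_group):
    """Return log groups whose IncomingLogEvents sum is missing or zero.

    `stats_by_group` maps a log group name to the `Datapoints` list from
    get-metric-statistics (each datapoint is a dict with a "Sum" key).
    """
    dead = []
    for name, datapoints in stats_by_group.items():
        total = sum(point.get("Sum", 0.0) for point in datapoints)
        if total == 0:
            dead.append(name)
    return sorted(dead)


# Hypothetical results for three log groups over the last 30 days.
sample = {
    "/ecs/payments-v1": [],                  # no datapoints at all
    "/ecs/payments-v2": [{"Sum": 48211.0}],  # still receiving events
    "/ecs/feature-1234": [{"Sum": 0.0}],     # datapoint present, nothing written
}
print(dead_log_groups(sample))
```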

What to check before you delete

Before deleting any environment, verify three things: no scheduled job points at it, no CI/CD pipeline references the cluster name, and the ALB isn't shared between multiple services.

EventBridge scheduled rules. Nightly jobs, weekly reports, scheduled ECS tasks — all reference a cluster and service by name. Check for rules targeting your environment before deleting:

aws events list-rules --query 'Rules[*].[Name,ScheduleExpression,State]' --output table
# Then for each relevant rule:
aws events list-targets-by-rule --rule <rule-name> --query 'Targets[*].EcsParameters'

Terraform state. If the environment was created with Terraform, its state file still references the service, cluster, ALB, and target group. Deleting resources manually without running terraform destroy first will leave Terraform in a broken state on the next plan. Either run terraform destroy -target per resource or remove the state entries manually with terraform state rm.

Shared ALBs. Some teams route multiple environments through a single ALB using listener rules and host-based routing. Check whether your ALB has multiple listener rules before deleting it:

aws elbv2 describe-listeners --load-balancer-arn <alb-arn> \
  --query 'Listeners[*].ListenerArn' --output text \
| tr '\t' '\n' \
| xargs -I{} aws elbv2 describe-rules --listener-arn {} \
  --query 'Rules[*].[Priority,Conditions[0].Values[0]]' --output table

If only one rule exists (the default forward rule), the ALB is dedicated to this environment and safe to delete. Multiple rules mean other services depend on it — remove only the rules and target groups belonging to the orphaned service, leave the ALB intact.
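The same check can be scripted against the JSON form of the describe-rules output, where each rule carries an `IsDefault` flag. A minimal sketch, assuming you pass in the `Rules` list for one listener; the function name and the sample rules are illustrative.

```python
def alb_is_dedicated(rules):
    """True when a listener carries only its default rule.

    `rules` is the `Rules` list from `aws elbv2 describe-rules` for one
    listener; each rule dict includes an `IsDefault` boolean.
    """
    non_default = [rule for rule in rules if not rule.get("IsDefault", False)]
    return len(non_default) == 0


# Hypothetical listeners: one with only the default rule, one with host routing.
dedicated = [{"Priority": "default", "IsDefault": True}]
shared = [
    {"Priority": "10", "IsDefault": False},
    {"Priority": "20", "IsDefault": False},
    {"Priority": "default", "IsDefault": True},
]
print(alb_is_dedicated(dedicated), alb_is_dedicated(shared))
```

An ALB with several listeners is dedicated only if the check passes for every one of them.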

Also check the CloudTrail audit log to see who last touched the environment — and when. An environment last modified 8 months ago by a developer who left the company is safe to kill. One touched last week by a CI/CD pipeline is not.

Kill order: the right sequence

Delete in this order: set desiredCount=0 → drain tasks → delete ECS service → delete ALB listener rule → delete target group → delete ALB → delete NAT Gateway → delete log groups. Wrong order causes dependency errors and leaves billing running.

Step 1: Scale to zero and drain

# Lower the drain timeout first to avoid waiting the default 5 minutes:
aws elbv2 modify-target-group-attributes \
  --target-group-arn <tg-arn> \
  --attributes Key=deregistration_delay.timeout_seconds,Value=30
aws ecs update-service --cluster <cluster> --service <service> --desired-count 0

Step 2: Delete the ECS service

aws ecs delete-service --cluster <cluster> --service <service> --force

Step 3: Delete target group and ALB

# Remove listener rules first, then target group, then ALB
aws elbv2 delete-rule --rule-arn <rule-arn>
aws elbv2 delete-target-group --target-group-arn <tg-arn>
# Disable deletion protection if set:
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn <alb-arn> \
  --attributes Key=deletion_protection.enabled,Value=false
aws elbv2 delete-load-balancer --load-balancer-arn <alb-arn>

Step 4: Delete NAT Gateway and release EIP

aws ec2 delete-nat-gateway --nat-gateway-id <ngw-id>
# Wait for deletion to finish, then release the Elastic IP:
aws ec2 wait nat-gateway-deleted --nat-gateway-ids <ngw-id>
aws ec2 release-address --allocation-id <eip-alloc-id>

Step 5: Delete CloudWatch log groups

# List and delete all log groups for this service:
aws logs describe-log-groups --log-group-name-prefix /ecs/<service-name> \
  --query 'logGroups[*].logGroupName' --output text \
| tr '\t' '\n' \
| xargs -I{} aws logs delete-log-group --log-group-name {}

Common error

Deleting a target group that a listener or listener rule still references fails with ResourceInUse. Always delete rules before target groups, and target groups before the ALB. If the target group is a listener's default action, delete that listener first with aws elbv2 delete-listener. If you see ResourceInUse, run the listener rules describe command above to find what's still attached.

How to prevent orphans from accumulating

Tag every environment at creation with owner, created-by, and ttl. A weekly Lambda that flags services where TTL has passed costs $0 to run and surfaces every stale environment before it accumulates 6 months of charges.

Tagging convention. Apply these tags to every ECS service, ALB, target group, and NAT Gateway at creation time. Without consistent tags, the audit script above has no way to determine ownership or expected lifetime:

# Terraform example — tag every resource at creation
locals {
  env_tags = {
    owner      = "platform-team"
    created-by = "terraform"
    env-type   = "staging"          # feature | staging | prod
    ttl        = "2026-09-01"       # ISO date — when this env expires
    service    = "payments-v2"
  }
}

resource "aws_ecs_service" "this" {
  # ...
  tags = local.env_tags
}

resource "aws_lb" "this" {
  # ...
  tags = local.env_tags
}

Weekly janitor Lambda. An EventBridge rule triggers a Lambda every Monday. The Lambda lists all ECS services, checks the ttl tag against today's date, and posts a Slack message for every service that's past its TTL or has been at desiredCount=0 for more than 7 days. No auto-deletion — just surfacing. The team decides what to kill.
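The flagging logic at the heart of such a janitor is small enough to sketch without any AWS calls. The version below assumes the service list has already been fetched and flattened into dicts; the key names (`name`, `ttl`, `desired_count`, `zero_since`) and the sample services are hypothetical, and fetching the data and posting to Slack are left out.

```python
from datetime import date


def stale_services(services, today, max_zero_days=7):
    """Yield (name, reason) for services past their TTL or idle too long.

    Each service is a dict with hypothetical keys: "name", "ttl" (ISO date
    string from the ttl tag, or None), "desired_count", and "zero_since"
    (date the service was scaled to zero, or None).
    """
    for svc in services:
        ttl = svc.get("ttl")
        if ttl and date.fromisoformat(ttl) < today:
            yield svc["name"], f"ttl {ttl} has passed"
            continue
        zero_since = svc.get("zero_since")
        if svc.get("desired_count") == 0 and zero_since is not None:
            idle_days = (today - zero_since).days
            if idle_days > max_zero_days:
                yield svc["name"], f"at desiredCount=0 for {idle_days} days"


services = [
    {"name": "payments-v2", "ttl": "2026-09-01", "desired_count": 2},
    {"name": "feature-x", "ttl": "2026-12-01", "desired_count": 0,
     "zero_since": date(2026, 8, 1)},
    {"name": "api", "ttl": None, "desired_count": 2},
]
flagged = list(stale_services(services, today=date(2026, 9, 15)))
for name, reason in flagged:
    print(f"{name}: {reason}")
```

Keeping the check as a pure function makes it easy to test before wiring it to a schedule, and it matches the no-auto-deletion approach: the function only reports.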

Fortem does this automatically: the dashboard shows per-environment cost, flags services that have been at zero desired count for more than N days, and lets you kill them from the UI without touching the AWS console. For teams managing 20+ environments, the manual audit above gets expensive in engineer time quickly.

If you read this, you might also want to know

What if my orphaned environment is in a different AWS account?

Run the same CLI commands with --profile <profile-name>. If you use AWS Organizations, the easiest cross-account audit is an AWS Config aggregator, which surfaces resources tagged with ttl across all member accounts without logging into each one.

Does deleting an ECS service also delete the underlying ECR images?

No. ECR images are independent of ECS services. Deleting the service leaves all images in ECR intact. Images cost $0.10/GB/month to store, so they need a separate cleanup. Use aws ecr describe-images --repository-name <repo-name> to list images and aws ecr batch-delete-image to remove old ones.

How do I know if an ALB is shared between multiple ECS environments?

Check listener rules: aws elbv2 describe-rules --listener-arn <listener-arn>. More than one non-default rule means multiple services share the ALB. Count the target groups attached: one per environment. Delete only the rules and target group for the orphaned service, and leave the ALB.

Common questions

Does ECS charge for a service when desired count is 0?

Fargate compute is $0 when desired count is 0 — you only pay for running tasks. But the supporting infrastructure bills regardless: an ALB costs $0.0225/hr ($16.43/mo) and a NAT Gateway costs $0.045/hr ($32.40/mo) whether or not any tasks are running.

How do I find all ECS services with zero desired count across all clusters?

Run: aws ecs list-clusters --query 'clusterArns[]' --output text | tr '\t' '\n' | while read -r c; do aws ecs list-services --cluster "$c" --output text --query 'serviceArns[]' | tr '\t' '\n' | xargs -I{} aws ecs describe-services --cluster "$c" --services {} --query 'services[?desiredCount==`0`].[serviceName,clusterArn]' --output text; done

Does deleting an ECS service also delete its load balancer?

No. Deleting an ECS service does not delete the ALB, target groups, or listener rules. You must delete them separately after the service is gone. Check for deletion protection on the ALB first — it will block deletion with a cryptic error if enabled.

How long does ECS service drain take before I can delete it?

The default deregistration delay is 300 seconds (5 minutes). You can lower it to 30 seconds on the target group before deleting the service: aws elbv2 modify-target-group-attributes --target-group-arn <tg-arn> --attributes Key=deregistration_delay.timeout_seconds,Value=30

Can I automate orphan detection without a third-party tool?

Yes — an EventBridge rule + Lambda that runs weekly, lists all ECS services with desiredCount=0, cross-references their age via tags or CloudTrail, and posts to Slack costs effectively $0 to run. The full pattern is covered in the prevention section above.


Worth reading

Landing · ECS Environment Scheduling. Your environments run 168 hours a week. Your team works 40. See all four scheduling approaches and what breaks at fleet scale.
Use Case · Why Can't You See Per-Environment AWS Costs? Cost Explorer doesn't break down by environment by default. Here's the tagging strategy and Cost Explorer config that fixes it.

Map your fleet in 5 min: fortem.dev/audit

Why Do AWS Staging Environments Cost So Much?

Matt — Sun, 21 Jun 2026 15:01:58 +0000

Why AWS Staging Environments Cost So Much (2026 Guide)

Originally published at https://fortem.dev/blog/aws-staging-environment-cost
AWS staging environments run 168 hours a week. Your team works 40. Here's where the money goes on ECS Fargate — and how to cut it without touching production.

You have 10 ECS environments. Most of them are staging, QA, or dev. No one is using them at 2am on Saturday. But Fargate bills by the second, and by the time the monthly invoice arrives the number is larger than expected. This isn't an infrastructure design problem — it's an idle compute problem. Here's where the money goes, and what moves the needle. For the full per-environment math, see the AWS Fargate pricing breakdown, or model your own fleet in the cost calculator.

TL;DR

01 Non-prod ECS environments run 168 hours a week. Your team works 40. That's 128 hrs/week of idle compute per environment.
02 Fargate compute is ~68% of your ECS bill. The rest (CloudWatch Logs, ALB baseline) doesn't stop when the environment sits idle.
03 NAT Gateway, VPC, and often ALB are shared across environments, so that overhead doesn't multiply. Compute does.
04 Fargate Spot cuts non-prod compute by up to 70% for fault-tolerant tasks. Not suitable for demo environments or shared QA sessions.
05 Business-hours scheduling (Mon–Fri 09:00–19:00) cuts active compute time to ~30% of the 24/7 baseline with zero architecture changes.

Ready to use — drop this into your Terraform today

ECS Application Auto Scaling scheduled actions — stops all tasks at 19:00 and restarts at 09:00, Mon–Fri. No Lambda required. Replace your-cluster and your-service with your values. Repeat the aws_appautoscaling_* blocks for each service.

# Register the ECS service as a scalable target
resource "aws_appautoscaling_target" "staging_svc" {
  max_capacity       = 4
  min_capacity       = 0
  resource_id        = "service/your-cluster/your-service"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

# Stop at 19:00 UTC Mon–Fri
resource "aws_appautoscaling_scheduled_action" "stop_evening" {
  name               = "stop-staging-evening"
  service_namespace  = aws_appautoscaling_target.staging_svc.service_namespace
  resource_id        = aws_appautoscaling_target.staging_svc.resource_id
  scalable_dimension = aws_appautoscaling_target.staging_svc.scalable_dimension
  schedule           = "cron(0 19 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

# Restart at 09:00 UTC Mon–Fri
resource "aws_appautoscaling_scheduled_action" "start_morning" {
  name               = "start-staging-morning"
  service_namespace  = aws_appautoscaling_target.staging_svc.service_namespace
  resource_id        = aws_appautoscaling_target.staging_svc.resource_id
  scalable_dimension = aws_appautoscaling_target.staging_svc.scalable_dimension
  schedule           = "cron(0 9 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 1
    max_capacity = 4
  }
}

# Optional: Fargate Spot capacity provider for non-prod
resource "aws_ecs_service" "staging_svc" {
  # ... your existing service config ...

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 1
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 0
    base              = 0
  }
}

Monthly compute cost — 10 non-prod environments (80 services, 0.5 vCPU each)

us-east-1, Linux x86, on-demand rates June 2026

Scenario	Compute/mo	vs 24/7
24/7 on-demand	$1,442/mo	baseline
Business hours on-demand	$428/mo	-70%
Business hours + Fargate Spot	$128/mo	-91%

Business hours = Mon–Fri 09:00–19:00 (50 hrs/wk, ~217 hrs/mo). Fargate Spot at 70% discount. Shared infrastructure (NAT Gateway, VPC, ALB) not included — shared cost does not multiply per environment.

Why non-prod spend stays invisible

Non-prod costs get lumped into a single “infrastructure” line item with no per-environment breakdown. No one owns the number, so it doesn't get fixed.

Production gets optimized after a big bill. Staging gets the same config it had when the second engineer joined and no one has touched it since. The reason isn't negligence — it's visibility. AWS Cost Explorer shows you ECS as a service total. Without per-environment cost allocation tags, there's no way to see that your staging environment costs more than your QA environment, or that three dev environments have been running since February with no active work behind them.

The result: non-prod spend is invisible in reviews, gets absorbed into the overall AWS bill, and is deferred indefinitely with “it's just staging, we'll fix it later.”

KEY INSIGHT: “Nobody noticed because staging bills get lumped into ‘infrastructure costs’ and nobody questions them.” (practitioner, dev.to)

Where the money goes on Fargate

Fargate compute is ~68% of a typical ECS bill at $0.04048/vCPU-hr and $0.004445/GB-hr. The remaining 32% — CloudWatch Logs at $0.50/GB ingested, ALB baseline at $0.0225/hr — doesn't scale to zero when tasks are idle.

The big number is compute, and compute is the lever. But a few non-obvious charges compound the problem for non-prod environments specifically:

01 CloudWatch Logs: verbose by default

Non-prod environments often run at DEBUG log level. A service generating 1 GB/day of logs costs $15/month in ingestion alone. Multiply by 8 services and 10 environments and you have a meaningful line item that has nothing to do with compute.

02 Container Insights: charged per observation

Container Insights is on by default on many clusters. For non-prod, it adds cost without adding value. Turn it off on dev and staging clusters.

03 ALB dedicated to one environment

If each environment has its own ALB, the $0.0225/hr base charge ($16.43/mo) runs regardless of traffic. Teams running 10 environments with dedicated ALBs pay $164/mo in ALB base charges before a single request is processed.
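The log and ALB figures above can be reproduced with a few lines of arithmetic. This is a rough sketch using the rates quoted in this article; the 30-day month and the worst case where every service logs 1 GB/day are assumptions, and the function names are illustrative.

```python
LOG_INGEST_RATE = 0.50   # USD per GB ingested (rate quoted above)
ALB_HOURLY = 0.0225      # USD per ALB-hour (rate quoted above)
HOURS_PER_MONTH = 730


def log_ingestion_cost(gb_per_day, services, environments, days=30):
    """Monthly CloudWatch Logs ingestion cost in USD."""
    return gb_per_day * days * LOG_INGEST_RATE * services * environments


def alb_base_cost(alb_count):
    """Monthly ALB base charge in USD, before any traffic."""
    return alb_count * ALB_HOURLY * HOURS_PER_MONTH


one_service = log_ingestion_cost(1.0, services=1, environments=1)
fleet_logs = log_ingestion_cost(1.0, services=8, environments=10)
dedicated_albs = alb_base_cost(10)   # one ALB per environment
shared_alb = alb_base_cost(1)        # one ALB with host-based routing

print(f"logs, one service: ${one_service:.2f}/mo")
print(f"logs, 8 services x 10 envs: ${fleet_logs:.2f}/mo")
print(f"ALB base, dedicated vs shared: ${dedicated_albs:.2f} vs ${shared_alb:.2f}/mo")
```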

The 168-hour problem

A non-prod environment running 24/7 runs 168 hours a week. Your team works 40. That gap — 128 hours per week of idle compute per environment — is the real cost driver on Fargate.

Let's do the math on a realistic fleet. Ten non-prod environments, each running 8 services at 0.5 vCPU and 1 GB memory:

Scenario	Hrs/mo active	Compute/mo	vs 24/7
24/7 on-demand	730	$1,442	—
Business hours on-demand	~217	$428	−70%
Business hours + Spot	~217	~$128	−91%

80 services × 0.5 vCPU × $0.04048/hr + 80 × 1 GB × $0.004445/hr. Business hours = Mon–Fri 09:00–19:00 UTC (~217 hrs/mo).
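As a check on the table, here is a small sketch of the same formula. The rates are the ones quoted in this article; the flat 70% Spot discount is the article's best-case figure, and actual Spot pricing varies.

```python
# Rates quoted in the article (us-east-1, Linux x86, on-demand, June 2026).
VCPU_RATE = 0.04048    # USD per vCPU-hour
MEM_RATE = 0.004445    # USD per GB-hour


def monthly_compute_cost(services, vcpu, mem_gb, hours, spot_discount=0.0):
    """Fargate compute cost in USD for `services` identical tasks over `hours`."""
    hourly = services * (vcpu * VCPU_RATE + mem_gb * MEM_RATE)
    return hourly * hours * (1.0 - spot_discount)


always_on = monthly_compute_cost(80, 0.5, 1.0, hours=730)
business = monthly_compute_cost(80, 0.5, 1.0, hours=217)
business_spot = monthly_compute_cost(80, 0.5, 1.0, hours=217, spot_discount=0.70)

for label, cost in [("24/7 on-demand", always_on),
                    ("Business hours on-demand", business),
                    ("Business hours + Spot", business_spot)]:
    saving = 1.0 - cost / always_on
    print(f"{label:26s} ${cost:8.2f}/mo  saving {saving:.0%}")
```

Swap in your own service count, task size, and active hours to model a different fleet.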

KEY INSIGHT: The compute in a non-prod environment doesn't know it's 2am on Sunday. It charges the same rate as a Tuesday afternoon.

Fargate bills by the second, with a one-minute minimum per task. A task stopped at 19:00 pays nothing until it restarts at 09:00. That's not an approximation; it's how the billing model works. The savings from scheduling are immediate and exact.

What shared infrastructure changes (and doesn't change)

NAT Gateway, VPC, and often ALB are shared across environments. That overhead doesn't multiply per environment. What multiplies is compute — one set of running tasks per environment, billed independently.

A well-structured ECS fleet shares:

- NAT Gateway: one per VPC, ~$32.85/mo base. Shared across all environments, so $3.29/env at 10 environments.
- ALB with host-based routing: one ALB routes to all environments via hostname rules. $16.43/mo base total, not per environment.
- VPC, subnets, security groups: no per-environment charge.

What doesn't share: Fargate task hours, CloudWatch Logs ingestion per environment, and ECR image pull data. These are the numbers that multiply at fleet scale — and they're all driven by idle compute.

This is why the fix is scheduling tasks, not redesigning network architecture. Once you understand that shared infra is already cheap per environment, the question becomes: how do you stop paying for 128 idle compute hours per week?
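To put the two kinds of cost side by side, this sketch divides the shared base charges and the 24/7 compute figure from this article across 10 environments. Data-processing and LCU charges are ignored, which is an assumption.

```python
NAT_MONTHLY = 32.85             # USD, one NAT Gateway base charge
ALB_MONTHLY = 16.43             # USD, one shared ALB base charge
FLEET_COMPUTE_MONTHLY = 1442.0  # USD, 24/7 compute for the whole fleet


def per_environment(shared_costs, fleet_compute, environments):
    """Return (shared_usd, compute_usd) attributable to one environment."""
    shared = sum(shared_costs) / environments
    compute = fleet_compute / environments
    return shared, compute


shared, compute = per_environment([NAT_MONTHLY, ALB_MONTHLY],
                                  FLEET_COMPUTE_MONTHLY, environments=10)
print(f"shared infra per env: ${shared:.2f}/mo")
print(f"compute per env:      ${compute:.2f}/mo")
```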

You can set up per-environment cost allocation tags with AWS Cost Anomaly Detection to get alerted when any single environment deviates from its historical spend baseline — useful once you have scheduling in place and want to catch drift.

Fargate Spot for non-prod: when it works, when it doesn't

Fargate Spot runs non-prod tasks on spare AWS capacity at up to 70% off on-demand rates. It works well for dev and QA. Avoid it for environments used for customer demos or with stateful in-memory work that can't tolerate a restart.

The mechanics: AWS gives 2 minutes' warning via SIGTERM before reclaiming Spot capacity. ECS marks the task as SPOT_INTERRUPTION and, if desired count is still > 0, launches a replacement.

Environment type	Fargate Spot?	Reason
Dev environments	✓ Yes	Stateless, restartable, no active users
Feature branch preview	✓ Yes	Ephemeral, restartable on interrupt
CI / integration tests	✓ Yes	Short-lived tasks, retry on failure
QA (automated)	✓ Yes	Tests restart automatically on failure
QA (live session)	✗ Risky	Interrupt kills active QA session
Demo environment	✗ No	Customer impact if interrupted
Staging (production-like)	✗ Usually not	Used for final validation, needs stability

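The capacity provider strategy in the Terraform block above sets FARGATE_SPOT weight=1 and FARGATE weight=0, which is pure Spot. Weights set the ratio of tasks placed on each provider; they are not a fallback order, and ECS does not substitute on-demand capacity when Spot is unavailable. For environments that need some guaranteed capacity, set Spot weight to 3 and on-demand weight to 1 so that roughly one task in four runs on on-demand, or give FARGATE a base of 1 so the first task always lands on on-demand capacity.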
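To see how base and weight divide a service's tasks, here is an approximate model. It is not ECS's actual placement algorithm (which also accounts for tasks already running); it places base tasks first and splits the remainder in proportion to weight with largest-remainder rounding. The function name `split_tasks` is illustrative.

```python
def split_tasks(desired, strategy):
    """strategy: list of (provider, base, weight). Returns {provider: task_count}."""
    counts = {}
    remaining = desired
    # Base tasks are placed before any weighting is applied.
    for name, base, _ in strategy:
        placed = min(base, remaining)
        counts[name] = placed
        remaining -= placed
    total_weight = sum(weight for _, _, weight in strategy)
    if remaining and total_weight:
        shares = [(name, remaining * weight / total_weight)
                  for name, _, weight in strategy]
        for name, share in shares:
            counts[name] += int(share)
        leftover = remaining - sum(int(share) for _, share in shares)
        # Hand leftover tasks to the providers with the largest fractional share.
        by_fraction = sorted(shares, key=lambda p: p[1] - int(p[1]), reverse=True)
        for name, _ in by_fraction[:leftover]:
            counts[name] += 1
    return counts


print(split_tasks(8, [("FARGATE_SPOT", 0, 1), ("FARGATE", 0, 0)]))  # pure Spot
print(split_tasks(8, [("FARGATE_SPOT", 0, 3), ("FARGATE", 0, 1)]))  # 3:1 mix
print(split_tasks(5, [("FARGATE", 1, 1), ("FARGATE_SPOT", 0, 3)]))  # base of 1
```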

Business-hours scheduling: the fastest ROI

Scheduling ECS tasks to stop at 19:00 and restart at 09:00 Mon–Fri cuts active compute time from 730 hours/month to ~217 hours — a 70% reduction with no architecture changes required.
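The ~217-hour figure follows from the schedule itself. A short sketch, assuming an average month of 52/12 weeks and the 730-hour baseline used throughout this article:

```python
WEEKS_PER_MONTH = 52 / 12   # average weeks in a month
HOURS_PER_MONTH = 730       # 24/7 baseline used in the article


def active_hours_per_month(days_per_week, start_hour, stop_hour):
    """Average monthly running hours for a fixed daily window."""
    return days_per_week * (stop_hour - start_hour) * WEEKS_PER_MONTH


hours = active_hours_per_month(5, 9, 19)
reduction = 1 - hours / HOURS_PER_MONTH
print(f"{hours:.0f} active hrs/mo, {reduction:.0%} less than 24/7")
```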

The AWS-native approach uses ECS Application Auto Scaling scheduled actions. No Lambda function, no custom scheduler, no third-party tool — this is a first-class ECS feature. The Terraform block at the top of this article implements it exactly. For the full picture of every approach to scheduling ECS environments off nights and weekends, including where the native path breaks down at fleet scale, see the dedicated guide.

A few operational details worth knowing before you deploy:

- Deregistration delay. ALB target groups have a default 300-second deregistration delay. Reduce this to 30 seconds on non-prod target groups so environments stop promptly at 19:00 instead of draining for 5 minutes.
- Stateful services. RDS and ElastiCache run independently and are not stopped by this config. Data persists across task restarts. EFS mounts reattach on task start.
- Timezone offset. Application Auto Scaling evaluates cron expressions in UTC unless you set a timezone on the scheduled action. Mon–Fri 09:00–19:00 Eastern is 13:00–23:00 UTC during daylight time (EDT) and 14:00–00:00 UTC in winter (EST). Adjust the cron expressions for your team's timezone.
- Override capability. The scheduled action sets the service's minimum and maximum capacity, which drives desired count to 0 in the evening. Any engineer can manually set desired count back to 1 for an after-hours session, and the schedule resumes as normal the next morning.
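The timezone conversion is easy to get wrong when the UTC time crosses midnight, because the day-of-week range shifts too. A minimal sketch, assuming whole-hour offsets and a fixed offset per season; the helper name `utc_cron` is hypothetical:

```python
DAYS = ["SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT"]


def utc_cron(local_hour, utc_offset, first_day="MON", last_day="FRI"):
    """Build a UTC cron string for a local wall-clock hour.

    utc_offset is the local zone's offset from UTC in hours (EDT = -4, EST = -5).
    """
    utc_hour = local_hour - utc_offset
    # divmod yields how many days the time moved and the hour within that day.
    day_shift, utc_hour = divmod(utc_hour, 24)
    first = DAYS[(DAYS.index(first_day) + day_shift) % 7]
    last = DAYS[(DAYS.index(last_day) + day_shift) % 7]
    return f"cron(0 {utc_hour} ? * {first}-{last} *)"


print(utc_cron(9, -4))    # start, 09:00 EDT
print(utc_cron(19, -4))   # stop, 19:00 EDT
print(utc_cron(19, -5))   # stop, 19:00 EST lands on 00:00 UTC the next day
```

In winter the 19:00 Eastern stop becomes midnight UTC, so the stop rule has to run Tuesday through Saturday to cover Monday through Friday evenings.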

At 10+ environments, this math becomes unavoidable

One staging environment running 24/7 is an annoyance. Ten of them is a line item that starts appearing in board decks. The fix doesn't scale manually.

Manual scheduling via the AWS console or one-off Terraform blocks works at 1–2 environments. At 10+, the operational overhead compounds:

- Schedule drift: different engineers set different start/stop times, and no one audits them
- Environment-specific hours: the ML team needs their env at 6am, QA needs theirs until 9pm
- On-demand overrides: “can you keep staging up tonight, we have a client demo” gets sent in Slack and forgotten in Terraform
- New environments inherit no schedule by default: the next dev environment someone spins up runs 24/7 until someone notices

This is where fleet-level tooling pays for itself. Fortem manages scheduling across all non-prod environments from one interface — with override capability per environment, audit log of who changed what, and defaults that apply to new environments automatically.

See which environments in your fleet are burning budget right now.

Talk to us about your fleet

Questions this article doesn't answer

How do I actually see which environment is costing what in AWS?

Enable cost allocation tags for your environment key in the AWS Billing console, then use Cost Explorer with a Group by filter on that tag. You'll see per-environment spend broken out as individual rows. Our article on per-environment cost visibility walks through the exact steps.
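The same breakdown is available programmatically through Cost Explorer's GetCostAndUsage API. This is a sketch under assumptions: the tag key `Environment` is a placeholder for whatever key you activated, and the helper names are illustrative.

```python
def cost_by_environment_request(start, end, tag_key="Environment"):
    """Keyword arguments for GetCostAndUsage, grouped by a cost allocation tag."""
    return {
        "TimePeriod": {"Start": start, "End": end},  # YYYY-MM-DD, End is exclusive
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
    }


def parse_costs(response):
    """Flatten a GetCostAndUsage response into {tag_value: usd}."""
    totals = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            # Tag groups come back as "Key$value"; untagged spend is "Key$".
            _, _, value = group["Keys"][0].partition("$")
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            label = value or "(untagged)"
            totals[label] = totals.get(label, 0.0) + amount
    return totals


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   ce = boto3.client("ce")
#   request = cost_by_environment_request("2026-06-01", "2026-07-01")
#   print(parse_costs(ce.get_cost_and_usage(**request)))
```

Untagged spend shows up under its own row, which is a quick way to see how much of the bill still has no owner.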

Can I automatically stop ECS environments when there's no active deployment or open PR?

Not with native ECS scheduling alone — you'd need to wire EventBridge to your CI/CD events. A GitHub Actions workflow can call the ECS UpdateService API to set desired count to 0 when a PR is closed and back to 1 when a new deployment completes. Some teams add this to their deploy pipeline directly.
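A sketch of the call a pipeline step could make, assuming a boto3 ECS client is passed in. The mapping from pull-request actions to desired count is a hypothetical policy, and the function names are illustrative; adapt both to your own deploy events.

```python
def set_desired_count(ecs_client, cluster, service, count):
    """Scale one ECS service via UpdateService; returns the new desired count."""
    response = ecs_client.update_service(
        cluster=cluster,
        service=service,
        desiredCount=count,
    )
    return response["service"]["desiredCount"]


def handle_pr_event(ecs_client, action, cluster, service):
    """Map a pull-request action to a desired count (hypothetical policy)."""
    if action == "closed":
        return set_desired_count(ecs_client, cluster, service, 0)
    if action in ("opened", "reopened"):
        return set_desired_count(ecs_client, cluster, service, 1)
    return None  # ignore other actions such as "labeled"


# Usage from a CI job (requires boto3 and AWS credentials):
#   import boto3
#   handle_pr_event(boto3.client("ecs"), "closed", "your-cluster", "your-service")
```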

What's the difference between desired count = 0 and deleting the ECS service entirely?

Setting desired count to 0 stops all running tasks but preserves the service definition, IAM roles, capacity provider strategies, and auto-scaling rules. The service restarts exactly as configured. Deleting the service removes all of this and you'd need to recreate it from Terraform. For scheduling, use desired count = 0 — not service deletion.

Does stopping and restarting ECS tasks affect RDS or other stateful services?

RDS, ElastiCache, and other stateful services run independently of ECS task count. Stopping tasks at 19:00 has no effect on your database — it continues running (and billing) until you separately stop it. Data persists across task restarts. EFS volumes reattach automatically when tasks start again.

Common questions

Is Fargate Spot available for ECS services or only tasks?

Fargate Spot is available for ECS services through capacity provider strategies. You set FARGATE_SPOT as a capacity provider with a weight in your ECS service definition. Tasks get scheduled on Spot capacity when available. If AWS needs the capacity back, tasks receive a SIGTERM with a 2-minute warning before SIGKILL.

Does setting ECS desired count to 0 stop billing immediately?

Yes. When desired count reaches 0 and running tasks drain and stop, Fargate billing stops within seconds, because Fargate charges by the second (with a one-minute minimum per task). However, other resources associated with the environment (ALB if dedicated, CloudWatch Log Groups, RDS) continue to incur charges independently.

How do I set up a schedule to stop ECS services on nights and weekends?

Use ECS Application Auto Scaling scheduled actions; no Lambda is required. Create a scalable target for each ECS service, then add two scheduled actions: one to set desired count to 0 at your stop time and one to restore it in the morning. A cron expression on each scheduled action handles the timing. A Terraform example is included in this article.

Will reducing non-prod ECS task size break anything?

It depends on what the task does. For services that only handle QA traffic or automated tests, dropping from 1 vCPU to 0.5 vCPU rarely causes issues. The risk is for tasks that run build pipelines, data migrations, or integration tests under time constraints — those may fail or time out. Right-size based on actual observed CPU and memory utilization, not on what production uses.

How does Fargate Spot handle interruptions in ECS?

AWS sends a SIGTERM to the task 2 minutes before reclaiming capacity, then sends SIGKILL once the container's stopTimeout expires (30 seconds by default, up to 120 on Fargate). ECS marks the task as stopped with reason SPOT_INTERRUPTION. If the ECS service has a desired count greater than 0, it will launch a replacement task according to your capacity provider strategy. Fargate does not substitute on-demand capacity when Spot is unavailable, so a Spot-only service waits until Spot capacity returns.

See your real per-env cost: fortem.dev/ecs-cost-calculator