Matt

Posted on Jun 4 • Edited on Jun 30 • Originally published at fortem.dev

ECS Multi-Environment Strategy: What Breaks at 10 That Worked Fine at 3

#aws #ecs #fargate #platform

What Breaks When You Scale Past 10 ECS Environments?

Originally published at https://fortem.dev/blog/ecs-multi-environment-strategy
Naming conventions, cluster structure, and AWS limits that surface when ECS environments scale from 3 to 10+. From engineers running 100+ environments.

Three ECS environments are manageable with AWS-native tooling and reasonable discipline. Ten environments expose every naming shortcut, every IAM approximation, and every missing inventory tool. This guide covers what changes — and what to get right before you hit the wall.

TL;DR

Every ECS environment carries $85–100/month in fixed overhead (ALB, NAT Gateway, CloudWatch) before any compute runs — at 10 environments that's $850–1,000/month.
The naming pattern {region}-{account}-{envname} (e.g. use1-prod-main) is the single decision that makes IAM scoping, cost attribution, and CloudWatch filtering work at fleet scale.
Five limits surface between environments 8–12: Fargate vCPU quota, ENI ceiling, Cloud Map 100-service namespace, ALB 32-char target group names, and 100 listener rules per ALB.
AWS-native scheduling requires 160 Auto Scaling actions to manage 10 environments × 8 services — and has no developer override mechanism, no timezone handling, and no startup health checks.

The overhead nobody puts in the spreadsheet

Every ECS environment carries $85-100/month in fixed overhead — ALB ($22/mo), NAT Gateway ($33-66/mo), CloudWatch logs — before a single container runs. At 12 environments, that is $1,000+/month of cost that tags miss and scheduling cannot eliminate.

When engineers estimate ECS environment costs, they calculate compute: vCPU hours, memory hours, maybe RDS. What they miss is the fixed overhead that exists before a single container runs.

Every environment needs its own ALB, NAT Gateway (ideally in each AZ for HA), and CloudWatch log groups. These costs are flat — they don't scale with usage, they don't go away when you stop tasks at night, and they don't appear on the compute line in Cost Explorer.

| Resource | Monthly cost | Notes |
|---|---|---|
| Application Load Balancer | $22/mo | $0.0225/hr base + $0.008/LCU-hr |
| NAT Gateway (2 AZs) | ~$66/mo | $0.045/hr × 2 AZs + $0.045/GB data |
| CloudWatch log retention | $3–15/mo | Depends on log volume + retention days |
| SSM parameters, ECR storage | $1–5/mo | Usually negligible, adds up at scale |
| Total fixed overhead | $85–100/mo | Before first task runs |

At 3 environments, that's ~$300/month in overhead: noticeable but manageable. At 10 environments it's $850–1,000/month before a single task runs. At 20 environments it's a $1,700–2,000/month line item that doesn't appear anywhere obvious. The full breakdown of Fargate's real per-environment cost covers the pricing model in detail.
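The fleet-level figures above are plain multiplication. A minimal sketch, assuming the $85–100 per-environment range from the table (the function name is illustrative, not part of any tool):

```python
# Per-environment fixed overhead range in USD/month, taken from the table above.
OVERHEAD_LOW, OVERHEAD_HIGH = 85, 100


def fleet_overhead(environments: int) -> tuple[int, int]:
    """Return the (low, high) monthly fixed overhead for a fleet of environments."""
    return environments * OVERHEAD_LOW, environments * OVERHEAD_HIGH


for n in (3, 10, 20):
    low, high = fleet_overhead(n)
    print(f"{n:>2} environments: ${low:,}-${high:,}/month before any task runs")
```

Swap in your own per-environment range if you run a single NAT Gateway or share an ALB across non-prod.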

What you can do about it

Share the ALB across non-prod environments using host-based routing rules (one ALB, multiple environments via different hostnames). This eliminates per-environment ALB cost for dev/staging. NAT Gateway is harder to share cleanly — teams that care about NAT cost switch non-prod environments to public subnet placement with no NAT. Slightly less secure, meaningfully cheaper. Prod always gets its own ALB and NAT.

KEY INSIGHT: Every ECS environment carries $85–100/month in fixed overhead — ALB, NAT Gateway, CloudWatch — before a single container runs. At 10 environments that's $850–1,000/month of infrastructure cost that doesn't scale with usage and doesn't stop when tasks are scheduled offline.

Download the skill file — audit your fleet

This skill file inventories every ECS cluster in your account as an environment, checks names against a convention, flags likely-orphaned zero-task environments costing $85-100/mo each, and reports how close you are to the Cloud Map and ALB limits covered below.


Naming: the one convention that rules everything

Use {region_short}-{account}-{envname} (e.g. use1-prod-main) as a single prefix propagated to every ECS resource — cluster, task def, SSM path, IAM role, and log group. grep it in logs, script it in bash, join it with billing data. The convention itself matters less than enforcement: pick one and automate it.

At 3 environments you can get away with ad-hoc names. At 10 you can't — because every AWS resource name is a billing dimension, an IAM scope, and a CloudWatch filter. Inconsistent names mean you can't attribute cost, can't write scoped IAM policies, and can't build dashboards without a lookup table.

The convention that works at fleet scale encodes three things in every resource name: region, account (or account group), and environment name. In this order:

{region_short}-{account}-{envname}

# Examples
use1-prod-main         # us-east-1, prod account, primary production env
use1-prod-stg1         # us-east-1, prod account, staging env
usw2-dev-dev1          # us-west-2, dev account, first dev env
usw2-dev-qa1           # us-west-2, dev account, QA env
usw2-dev-demo          # us-west-2, dev account, demo env

This prefix becomes the root of every resource name in that environment. One Terraform local generates everything (arguments unrelated to naming are omitted for brevity):

locals {
  # e.g. "use1-prod-main" or "usw2-dev-qa1"
  env_prefix = "${var.region_short}-${var.account}-${var.envname}"
}

# ECS cluster
resource "aws_ecs_cluster" "main" {
  name = local.env_prefix
  # → "use1-prod-main"
}

# ECS service (env already in cluster name — service is just the component)
resource "aws_ecs_service" "api" {
  name    = "api"
  cluster = aws_ecs_cluster.main.id
}

# Task definition family (global per account — must carry full prefix)
resource "aws_ecs_task_definition" "api" {
  family = "${local.env_prefix}-api-td"
  # → "use1-prod-main-api-td"
}

# SSM paths (hierarchy enables per-service IAM scoping)
resource "aws_ssm_parameter" "db_host" {
  name = "/${local.env_prefix}/api/DB_HOST"
  # → "/use1-prod-main/api/DB_HOST"
}

# IAM roles (global per account — carry full prefix)
resource "aws_iam_role" "task_role" {
  name = "${local.env_prefix}-api-task-role"
  # → "use1-prod-main-api-task-role"
}

# CloudWatch log group
resource "aws_cloudwatch_log_group" "api" {
  name              = "/ecs/${local.env_prefix}-api"
  retention_in_days = var.log_retention_days
  # → "/ecs/use1-prod-main-api"
}

Why SSM paths matter specifically: the hierarchy /use1-prod-main/api/* lets you write a single IAM policy statement that gives the API task access to exactly its own secrets — nothing else:

{
  "Effect": "Allow",
  "Action": ["ssm:GetParameter", "ssm:GetParameters"],
  "Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/use1-prod-main/api/*"
}

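Flat SSM names (USE1-PROD-MAIN-API-DB_HOST) lose this clean boundary. A prefix wildcard on a flat name also matches any other service whose name begins the same way (api and api-gateway, say), so teams tend to fall back to a wildcard Resource: "*" or a list of 40 individual parameter ARNs. One team's migration from flat to hierarchical SSM naming took two weeks and three deployment freezes.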
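To see what the hierarchical path buys you, here is a rough sketch that builds the Resource pattern for one service and checks which parameter ARNs it would cover. It approximates IAM's `*` wildcard with shell-style matching and does not perform real policy evaluation; the `billing` service and both function names are hypothetical.

```python
from fnmatch import fnmatchcase


def ssm_resource_arn(region: str, account_id: str, env_prefix: str, service: str) -> str:
    """Build the wildcard ARN covering one service's parameters in one environment."""
    return f"arn:aws:ssm:{region}:{account_id}:parameter/{env_prefix}/{service}/*"


def is_covered(resource_pattern: str, parameter_arn: str) -> bool:
    """Approximate IAM's '*' wildcard with case-sensitive shell-style matching."""
    return fnmatchcase(parameter_arn, resource_pattern)


pattern = ssm_resource_arn("us-east-1", "123456789012", "use1-prod-main", "api")
base = "arn:aws:ssm:us-east-1:123456789012:parameter"

print(is_covered(pattern, f"{base}/use1-prod-main/api/DB_HOST"))      # same service
print(is_covered(pattern, f"{base}/use1-prod-main/billing/DB_HOST"))  # other service
print(is_covered(pattern, f"{base}/usw2-dev-dev1/api/DB_HOST"))       # other environment
```

Only the first ARN matches: the path separator after the service name is what keeps a neighbouring service or environment out of scope.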

| Resource | Pattern | Example |
|---|---|---|
| ECS Cluster | {env_prefix} | use1-prod-main |
| ECS Service | {service} | api (inside cluster) |
| Task Def Family | {env_prefix}-{service}-td | use1-prod-main-api-td |
| SSM Path | /{env_prefix}/{service}/{PARAM} | /use1-prod-main/api/DB_HOST |
| IAM Task Role | {env_prefix}-{service}-task-role | use1-prod-main-api-task-role |
| Log Group | /ecs/{env_prefix}-{service} | /ecs/use1-prod-main-api |
| Target Group | {service}-{envname}-tg | api-main-tg ⚠ 32 chars |
| Service Connect NS | {envname}.local | main.local |
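The same patterns are handy outside Terraform, for example in an audit script that needs to know what a resource should be called. A small sketch that mirrors the patterns above; the function and dictionary keys are illustrative names, not an existing API:

```python
def resource_names(region_short: str, account: str, envname: str, service: str) -> dict[str, str]:
    """Derive every per-service resource name from the three prefix parts."""
    env_prefix = f"{region_short}-{account}-{envname}"
    return {
        "cluster": env_prefix,
        "service": service,
        "task_def_family": f"{env_prefix}-{service}-td",
        "ssm_path_root": f"/{env_prefix}/{service}/",
        "task_role": f"{env_prefix}-{service}-task-role",
        "log_group": f"/ecs/{env_prefix}-{service}",
        # Region and account are dropped here to stay under the 32-char limit.
        "target_group": f"{service}-{envname}-tg",
        "service_connect_ns": f"{envname}.local",
    }


for kind, name in resource_names("use1", "prod", "main", "api").items():
    print(f"{kind:<20} {name}")
```

Comparing these expected names against what actually exists in the account is one straightforward way to find resources that drifted from the convention.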

The 32-character ALB target group limit

This is the hardest constraint in the naming stack. A target group named use1-prod-main-payments-api-tg is 30 characters, two short of the limit. Add a longer service name and you blow it. The fix: drop the region and account from target group names (they're already implied by the ALB, which lives in one region and one account), and use only service + envname + tg, matching the pattern in the table above. Plan your abbreviation table before your first service, not after your fifteenth.
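A length check like this is cheap to run in CI before a service name is accepted. This is a sketch under the naming pattern used in this article; the helper name and the long service name are made up for illustration.

```python
TARGET_GROUP_MAX = 32  # ALB target group name limit discussed above


def target_group_name(service: str, envname: str) -> str:
    """Build '{service}-{envname}-tg' and refuse names over the 32-char limit."""
    name = f"{service}-{envname}-tg"
    if len(name) > TARGET_GROUP_MAX:
        raise ValueError(f"{name!r} is {len(name)} chars, limit is {TARGET_GROUP_MAX}")
    return name


print(target_group_name("api", "main"))
print(len("use1-prod-main-payments-api-tg"))  # the full-prefix example from the text

try:
    target_group_name("customer-notification-dispatcher", "main")
except ValueError as err:
    print(f"rejected: {err}")
```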

Enforce naming in Terraform with a variable validation block — reject envnames that don't match your pattern before any resource gets created. If you're building the full module from scratch, the ECS Fargate Terraform module structure guide covers the complete module layout alongside this naming scheme:

variable "envname" {
  type = string
  validation {
    condition     = can(regex("^[a-z][a-z0-9]{1,7}$", var.envname))
    error_message = "The envname must be 2–8 lowercase alphanumeric characters, starting with a letter (e.g. main, dev1, qa2)."
  }
}

Cluster structure at 10+ environments

Two approaches at 10+ environments: one ECS cluster shared across environments (simpler, namespace isolation only) or one cluster per environment (stronger isolation, higher overhead). The soft limit is ~5,000 services per cluster, so you will hit organizational chaos long before the quota.

With the {region}-{account}-{envname} scheme, the cluster structure decision is already mostly made: each envname gets its own ECS cluster. The cluster name is the environment identifier. Everything else in that environment — services, task definitions, log groups, IAM roles — inherits from it.

The practical question is how to organize these clusters across AWS accounts:

One AWS account per environment group (Recommended)

Prod environments in one account, all non-prod in another. This is the most common pattern at 30–200 person companies. It keeps prod IAM boundaries hard, separates Fargate vCPU quota pools, and makes Cost Explorer attribution clean.

use1-prod-main, use1-prod-stg1, usw2-prod-main ← prod account

usw2-dev-dev1, usw2-dev-qa1, usw2-dev-demo, usw2-dev-data1 ← dev account

Single account, all environments

All clusters in one AWS account. Simpler to start, but Fargate quota is shared — a dev load test can exhaust the regional quota and prevent prod from scaling. Works fine at 3 environments; becomes a risk at 10+.

use1-prod-main, use1-dev-dev1, use1-dev-qa1, use1-dev-stg1 ← single account

ECS clusters are free. The cost of having more clusters is management overhead, not AWS billing. At 10+ environments that overhead is real — which is the case for using tooling that treats the environment as the unit of management, not individual services. Once you pick the per-account layout, the day-to-day question becomes how to manage ECS Fargate across those accounts — cross-account IAM, central ECR, networking cost, and the fleet view AWS doesn't ship.

Five problems that appear at 10 environments

Ten environments surface five distinct problems: Fargate quota exhaustion, ENI ceiling, IAM role proliferation, Cloud Map namespace limit, and ALB listener rule cap — all between environment 8 and 12. Each one compounds the others.

These don't show up at 3 environments. They all show up, roughly simultaneously, somewhere between environment 8 and environment 12 — the point where platform engineering for ECS at 10+ environments stops being optional and the operations gap starts costing you real money.

Fargate quota exhaustion in prod (quota)

Fargate vCPU quota is per-region, per-account. Dev and prod share the same pool if they share an account. A developer running load tests against a dev environment can exhaust the regional Fargate quota and prevent production from scaling up during a traffic spike. AWS has no native mechanism to reserve quota for production — the only fix is account separation.

ENI exhaustion before compute limits (networking)

Every Fargate task in awsvpc mode (the only Fargate mode) gets its own ENI. A fleet of 10 environments × 8 services × 2 tasks each = 160 ENIs. Default regional ENI limits can become a hard ceiling before you hit any compute limit. File a support ticket to raise the limit before you need it — AWS processes these routinely but not instantly.

IAM role proliferation (IAM)

The correct pattern — one task execution role + one task role per service per environment — generates 2 × N services × M environments IAM roles. At 10 services and 4 environments that's 80 IAM roles. The temptation is to share roles across environments to reduce the number. Don't. Sharing means a misconfigured dev task can access prod secrets. Generate roles programmatically from your Terraform module; the number stops being a problem when you stop counting them manually.

Cloud Map namespace limit (service discovery)

AWS Cloud Map limits a single namespace to 100 ECS services. If you use ECS Service Connect and point multiple clusters at the same namespace (e.g., prod.local), you'll hit this ceiling sooner than expected. At 10 environments × 10 services = 100 services in one namespace — exactly at the limit. This is a hard limit and cannot be increased. Fix: per-cluster namespaces. Each envname gets its own: main.local, stg1.local, dev1.local.

ALB listener rule ceiling (load balancing)

An ALB supports 100 listener rules per load balancer by default (not counting default rules). If you share one ALB across non-prod environments using host-based routing (recommended for cost), you'll have roughly N environments × M services rules. At 8 environments × 12 services = 96 rules, right at the limit. The quota is adjustable, but the usual workaround (additional ALBs) adds cost and complexity. The simpler fix is dedicated listener rules per environment namespace rather than per service.

"Fargate on-demand vCPU quotas (default 6 vCPUs, auto-increases with usage but starts low per region), VPC network interface limits (default 500 per region), ALB listener rule ceilings (100 rules), and Cloud Map namespace limits (100 services) all become hard constraints as environments scale."

— Amazon ECS service quotas, verified June 2026
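The arithmetic in the five problems above can be folded into one headroom check. This sketch uses the limits as quoted in this article, which you should confirm against the Service Quotas console for your account, and assumes the worst case where every environment shares one namespace and one ALB. The class and function names are illustrative.

```python
from dataclasses import dataclass

# Limits as quoted in this article; confirm current values for your own account.
LIMITS = {"enis": 500, "cloud_map_services": 100, "listener_rules": 100}


@dataclass
class Fleet:
    environments: int
    services_per_env: int
    tasks_per_service: int

    def usage(self) -> dict[str, int]:
        services = self.environments * self.services_per_env
        return {
            "enis": services * self.tasks_per_service,  # one ENI per Fargate task
            "iam_roles": 2 * services,                  # execution role + task role
            "cloud_map_services": services,             # if all share one namespace
            "listener_rules": services,                 # if all share one ALB
        }


def at_or_over_limit(fleet: Fleet) -> list[str]:
    """Return the names of limits the fleet has reached or exceeded."""
    used = fleet.usage()
    return [name for name, limit in LIMITS.items() if used[name] >= limit]


fleet = Fleet(environments=10, services_per_env=10, tasks_per_service=2)
print(fleet.usage())
print(at_or_over_limit(fleet))
```

IAM roles are counted but not checked against a limit here, since the article's point is that the count is fine as long as the roles are generated rather than hand-managed.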

The environment inventory problem

No AWS-native tool shows all environments, owners, running task counts, and costs in one view. Tags help, but 30-40% of resources are untagged or mistagged. At 15+ environments, that inventory gap means you are paying for environments nobody remembers provisioning.

At 3 environments everyone knows what's running. At 10, someone asks "is anyone still using usw2-dev-data1?" and nobody knows for certain.

There is no AWS-native tool that shows you all environments, their owners, their running task counts, their last deployment time, and their monthly cost in one view. What teams do — and why each falls short:

| Approach | ✓ What it gives you | ✗ Where it falls short |
|---|---|---|
| AWS Cost Explorer with tags | Cost attribution if tagging is consistent | No real-time status, no task counts, 24-hour lag on cost data |
| ECS console, cluster by cluster | Real-time task counts | No cost, no ownership, no cross-account view |
| Slack channel where people announce environments | Ownership context | Immediately out of date, no automation, ignored |
| Spreadsheet / wiki page | Good intentions | Stale within a week, nobody updates it after incidents |

AWS's ECS Split Cost Allocation Data (launched 2023) partially closes the cost visibility gap — it attributes Fargate spend per task using aws:ecs:clusterName and aws:ecs:serviceName system tags as billing dimensions. This works well — but only if your cluster and service names are consistent. Which is why naming comes first.

The real cost of invisible environments

Orphaned environments — ones nobody is actively using but nobody has turned off — are the most expensive line in any ECS bill. At $85–100/month fixed overhead plus compute, a forgotten environment running 24/7 costs $200–400/month. Teams with 10+ environments typically have 1–3 orphaned environments at any given time. The inventory problem isn't inconvenient — it's expensive.
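A first-pass orphan check needs only two signals per environment: running task count and last deployment time. The sketch below is a heuristic under assumed inputs; the `EnvRecord` shape, the 30-day threshold, and the sample data are all hypothetical, and a zero-task environment may simply be scheduled offline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class EnvRecord:
    cluster: str
    running_tasks: int
    last_deploy: datetime


def likely_orphans(envs: list[EnvRecord], now: datetime, max_idle_days: int = 30) -> list[str]:
    """Flag clusters with no running tasks or no deployment within max_idle_days."""
    cutoff = now - timedelta(days=max_idle_days)
    return [e.cluster for e in envs if e.running_tasks == 0 or e.last_deploy < cutoff]


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
inventory = [
    EnvRecord("usw2-dev-dev1", 6, now - timedelta(days=2)),
    EnvRecord("usw2-dev-data1", 0, now - timedelta(days=90)),
    EnvRecord("usw2-dev-demo", 3, now - timedelta(days=45)),
]
flagged = likely_orphans(inventory, now)
print(flagged)
print(f"fixed overhead at risk: ${len(flagged) * 85}-${len(flagged) * 100}/month")
```

Treat the output as a list of questions to ask owners, not a list of things to delete.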

Scheduling at fleet scale

AWS-native scheduling requires 160 Auto Scaling actions to manage 10 environments with 8 services each — and provides no developer override, no timezone handling, and no startup health checks. Automating it with Lambda + EventBridge requires per-environment cron expressions, timezone handling, and monitoring for silent failures. The real win is not the automation itself — it is that the savings become predictable and recurring, not ad-hoc.

Non-prod environments run 168 hours a week. Your team works ~55. Scheduling environments offline outside business hours cuts compute cost by 60–70%. For most teams it's the single largest ECS cost lever available.
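The 60–70% range follows directly from the hours. A quick check, assuming compute cost is proportional to running hours (which ignores the fixed overhead covered earlier):

```python
HOURS_PER_WEEK = 168


def offline_fraction(working_hours: float) -> float:
    """Share of the week an environment can be off if it only runs during working hours."""
    return (HOURS_PER_WEEK - working_hours) / HOURS_PER_WEEK


for hours in (50, 55, 65):
    print(f"{hours} working hours/week -> {offline_fraction(hours):.0%} of compute hours avoidable")
```

At the ~55 hours used above, roughly two thirds of the week is avoidable compute time.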

The problem: AWS-native scheduling operates at the service level. To schedule one environment with 8 services, you need 16 Auto Scaling actions (stop + start per service). At 10 environments that's 160 actions to create, maintain, and update when schedules change.

| Environments | Services each | Auto Scaling actions | Schedule change cost |
|---|---|---|---|
| 3 | 8 | 48 | 8 updates |
| 10 | 8 | 160 | 8–16 updates |
| 20 | 10 | 400 | 10–20 updates |
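The action counts in the table come from one stop and one start per service per environment. A short sketch that enumerates them, using made-up cluster and service names:

```python
from itertools import product


def scheduled_actions(clusters: list[str], services: list[str]) -> list[tuple[str, str, str]]:
    """Enumerate every (cluster, service, action) triple a service-level schedule needs."""
    return list(product(clusters, services, ("stop", "start")))


clusters = [f"usw2-dev-dev{i}" for i in range(1, 11)]  # 10 hypothetical environments
services = [f"svc{i}" for i in range(1, 9)]            # 8 hypothetical services

actions = scheduled_actions(clusters, services)
print(len(actions))
print(actions[:2])
```

Every one of those triples is a separate object to create and keep in sync, which is where the maintenance cost comes from.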

There are three additional problems that emerge specifically at fleet scale:

✗ Timezone complexity

EU teams want environments down at 18:00 CET. US East wants 20:00 EST. US West wants 20:00 PST. Each requires separate cron expressions that account for DST. At 10+ environments with multiple team timezones, maintaining these expressions is a part-time job.
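The DST problem is easy to see by converting the same local stop time to UTC on a winter and a summer date. This sketch uses Python's standard `zoneinfo` module and assumes the system timezone database is installed; the dates and the helper name are illustrative.

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo


def stop_time_utc(day: date, local_stop: time, tz_name: str) -> datetime:
    """Convert a local wall-clock stop time on a given day to UTC."""
    local = datetime.combine(day, local_stop, tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)


winter = stop_time_utc(date(2025, 1, 15), time(18, 0), "Europe/Berlin")
summer = stop_time_utc(date(2025, 7, 15), time(18, 0), "Europe/Berlin")
print(winter.hour, summer.hour)  # 17 16

# A 20:00 US East stop lands on the next UTC day, which also shifts the cron day field.
us_east = stop_time_utc(date(2025, 1, 15), time(20, 0), "America/New_York")
print(us_east.isoformat())
```

A cron expression fixed in UTC is therefore wrong for half the year unless something regenerates it, or the scheduler accepts a timezone directly.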

✗ No developer overrides

A developer working late on a deadline wants to keep their environment up past the scheduled stop time. With AWS-native scheduling, that requires either platform engineer access or IAM permissions broad enough to be a security concern. The friction means developers stop requesting overrides — and start asking to remove scheduling entirely.
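One way teams close this gap is a self-service override that the stop job respects. The sketch below assumes a hypothetical `keep-alive-until` tag holding an ISO 8601 timestamp; nothing in AWS defines this tag, and the stop job that reads it is yours to build.

```python
from datetime import datetime, timezone

OVERRIDE_TAG = "keep-alive-until"  # hypothetical tag name, not an AWS convention


def should_stop(tags: dict[str, str], now: datetime) -> bool:
    """Return False while a valid, unexpired override tag is present."""
    raw = tags.get(OVERRIDE_TAG)
    if raw is None:
        return True
    try:
        until = datetime.fromisoformat(raw)
    except ValueError:
        return True  # unreadable override: fall back to the normal schedule
    if until.tzinfo is None:
        until = until.replace(tzinfo=timezone.utc)  # treat naive timestamps as UTC
    return now >= until


now = datetime(2025, 3, 10, 20, 0, tzinfo=timezone.utc)
print(should_stop({}, now))                                             # no override
print(should_stop({OVERRIDE_TAG: "2025-03-10T23:00:00+00:00"}, now))    # still valid
print(should_stop({OVERRIDE_TAG: "2025-03-10T19:00:00+00:00"}, now))    # expired
```

Because the override expires on its own, the permission a developer needs is limited to tagging their own cluster rather than editing schedules.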

✗ Silent failed starts

The scheduled start fires. Lambda runs. Desired count updates. But a service fails to start — image pull error, IAM issue, resource limit. The cron job succeeded; the environment didn't come up. AWS doesn't surface this. You need separate health checking or developers start their morning debugging an environment that's half-running.
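The separate health check can be as small as comparing desired and running counts some minutes after the start fires. This sketch works on plain dictionaries shaped like the service entries returned by the ECS DescribeServices API (`serviceName`, `desiredCount`, `runningCount`); fetching them and choosing the wait time are left out, and the sample data is invented.

```python
def failed_starts(services: list[dict]) -> list[str]:
    """Return the services whose running task count is below the desired count."""
    return [
        s["serviceName"]
        for s in services
        if s["runningCount"] < s["desiredCount"]
    ]


# Invented snapshot of one environment a few minutes after its scheduled start.
snapshot = [
    {"serviceName": "api", "desiredCount": 2, "runningCount": 2},
    {"serviceName": "worker", "desiredCount": 2, "runningCount": 0},
    {"serviceName": "web", "desiredCount": 1, "runningCount": 1},
]

broken = failed_starts(snapshot)
if broken:
    print(f"environment only partially started, check: {', '.join(broken)}")
```

Alerting on a non-empty result turns a silent failure into something the platform team sees before developers do.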

What teams end up doing

Teams start with EventBridge + Lambda at 3–5 environments. By 10 environments they're maintaining a scheduling codebase. By 15–20 environments, the maintenance burden outweighs the savings — and environments quietly go back to running 24/7. The savings disappear not because scheduling doesn't work, but because the tooling to maintain it at scale doesn't exist in AWS natively.

See what 10+ envs really cost: fortem.dev/ecs-cost-calculator
