What Breaks When You Scale Past 10 ECS Environments?
Originally published at https://fortem.dev/blog/ecs-multi-environment-strategy
Naming conventions, cluster structure, and the five AWS limits that surface when environments scale past 10. Written by platform engineers running 100+ ECS environments.
Three ECS environments are manageable with AWS-native tooling and reasonable discipline. Ten environments expose every naming shortcut, every IAM approximation, and every missing inventory tool. This guide covers what actually changes — and what to get right before you hit the wall.
The overhead nobody puts in the spreadsheet
When engineers estimate ECS environment costs, they calculate compute: vCPU hours, memory hours, maybe RDS. What they miss is the fixed overhead that exists before a single container runs.
Every environment needs its own ALB, NAT Gateway (ideally in each AZ for HA), and CloudWatch log groups. These costs are flat — they don't scale with usage, they don't go away when you stop tasks at night, and they don't appear on the compute line in Cost Explorer.
| Resource | Monthly cost | Notes |
|---|---|---|
| Application Load Balancer | $22/mo | $0.0225/hr base + $0.008/LCU-hr |
| NAT Gateway (2 AZs) | ~$66/mo | $0.045/hr × 2 AZs + $0.045/GB data |
| CloudWatch log retention | $3–15/mo | Depends on log volume + retention days |
| SSM parameters, ECR storage | $1–5/mo | Usually negligible, adds up at scale |
| Total fixed overhead | $85–100/mo | Before first task runs |
At 3 environments, that's ~$300/month in overhead: noticeable but manageable. At 10 environments it's $850–1,000/month before a single task runs. At 20 environments it's a $1,700–2,000/month line item that doesn't appear anywhere obvious.
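The scaling arithmetic is easy to sanity-check. Here is a minimal Python sketch, assuming the $85–100 per-environment range from the table above; the function name is illustrative, not part of any tool.

```python
def fixed_overhead(env_count: int, low: float = 85.0, high: float = 100.0) -> tuple[float, float]:
    """Return the (low, high) monthly fixed overhead in USD for env_count environments.

    The defaults are the per-environment range quoted in the table above.
    """
    if env_count < 0:
        raise ValueError("env_count must be non-negative")
    return env_count * low, env_count * high


for n in (3, 10, 20):
    lo, hi = fixed_overhead(n)
    print(f"{n:>2} environments: ${lo:,.0f}-{hi:,.0f}/month before any task runs")
```

Swap in your own per-environment figures if your ALB or NAT setup differs; the point is that the total grows linearly with environment count, not with usage.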
What you can actually do about it
Share the ALB across non-prod environments using host-based routing rules (one ALB, multiple environments via different hostnames). This eliminates per-environment ALB cost for dev/staging. NAT Gateway is harder to share cleanly — teams that care about NAT cost switch non-prod environments to public subnet placement with no NAT. Slightly less secure, meaningfully cheaper. Prod always gets its own ALB and NAT.
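To make the shared-ALB idea concrete, here is a rough Python sketch that builds the arguments for one host-based rule per environment. It assumes you create rules through the ELBv2 CreateRule API (boto3's `elbv2.create_rule`); the ARNs and the `example.com` hostnames are placeholders, not values from a real setup.

```python
def host_rule(listener_arn: str, hostname: str, target_group_arn: str, priority: int) -> dict:
    """Build keyword arguments for the ELBv2 CreateRule API (host-header match -> forward)."""
    if not 1 <= priority <= 50000:
        raise ValueError("ALB rule priority must be between 1 and 50000")
    return {
        "ListenerArn": listener_arn,
        "Priority": priority,
        "Conditions": [
            {"Field": "host-header", "HostHeaderConfig": {"Values": [hostname]}}
        ],
        "Actions": [{"Type": "forward", "TargetGroupArn": target_group_arn}],
    }


# Placeholder ARNs and hostnames, for illustration only.
LISTENER_ARN = "arn:aws:elasticloadbalancing:us-west-2:123456789012:listener/app/EXAMPLE"
TARGET_GROUPS = {
    "dev1": "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/api-dev1-tg/EXAMPLE",
    "qa1": "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/api-qa1-tg/EXAMPLE",
}

rules = [
    host_rule(LISTENER_ARN, f"api.{env}.example.com", tg_arn, priority=10 * (i + 1))
    for i, (env, tg_arn) in enumerate(TARGET_GROUPS.items())
]
print(len(rules), "rules on one shared listener")
```

Each dict can be passed as `boto3.client("elbv2").create_rule(**rule)`. Spacing priorities by 10 leaves room to insert rules later without renumbering.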
Naming: the one convention that rules everything
At 3 environments you can get away with ad-hoc names. At 10 you can't — because every AWS resource name is a billing dimension, an IAM scope, and a CloudWatch filter. Inconsistent names mean you can't attribute cost, can't write scoped IAM policies, and can't build dashboards without a lookup table.
The convention that works at fleet scale encodes three things in every resource name: region, account (or account group), and environment name. In this order:
{region_short}-{account}-{envname}
# Examples
use1-prod-main # us-east-1, prod account, primary production env
use1-prod-stg1 # us-east-1, prod account, staging env
usw2-dev-dev1 # us-west-2, dev account, first dev env
usw2-dev-qa1 # us-west-2, dev account, QA env
usw2-dev-demo # us-west-2, dev account, demo env
This prefix becomes the root of every resource name in that environment. One Terraform local generates everything:
locals {
  # e.g. "use1-prod-main" or "usw2-dev-qa1"
  env_prefix = "${var.region_short}-${var.account}-${var.envname}"
}

# ECS cluster
resource "aws_ecs_cluster" "main" {
  name = local.env_prefix
  # → "use1-prod-main"
}

# ECS service (env is already in the cluster name, so the service is just the component)
resource "aws_ecs_service" "api" {
  name    = "api"
  cluster = aws_ecs_cluster.main.id
}

# Task definition family (unique per account and region, so it carries the full prefix)
resource "aws_ecs_task_definition" "api" {
  family = "${local.env_prefix}-api-td"
  # → "use1-prod-main-api-td"
}

# SSM paths (hierarchy enables per-service IAM scoping)
resource "aws_ssm_parameter" "db_host" {
  name = "/${local.env_prefix}/api/DB_HOST"
  # → "/use1-prod-main/api/DB_HOST"
}

# IAM roles (global per account, so they carry the full prefix)
resource "aws_iam_role" "task_role" {
  name = "${local.env_prefix}-api-task-role"
  # → "use1-prod-main-api-task-role"
}

# CloudWatch log group
resource "aws_cloudwatch_log_group" "api" {
  name              = "/ecs/${local.env_prefix}-api"
  retention_in_days = var.log_retention_days
  # → "/ecs/use1-prod-main-api"
}
Why SSM paths matter specifically: the hierarchy /use1-prod-main/api/* lets you write a single IAM policy statement that gives the API task access to exactly its own secrets — nothing else:
{
  "Effect": "Allow",
  "Action": ["ssm:GetParameter", "ssm:GetParameters"],
  "Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/use1-prod-main/api/*"
}
Flat SSM names (USE1-PROD-MAIN-API-DB_HOST) lose this clean boundary. A prefix wildcard such as USE1-PROD-MAIN-API-* also matches a sibling service like USE1-PROD-MAIN-API-GATEWAY-*, so teams end up with a wildcard Resource: "*" or a list of 40 individual parameter ARNs. One team's migration from flat to hierarchical SSM naming took two weeks and three deployment freezes.
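If you generate policies rather than hand-write them, the hierarchical path makes the statement a one-liner to template. A minimal Python sketch, assuming the `/{env_prefix}/{service}/{PARAM}` layout used above; the function name is hypothetical.

```python
import json


def ssm_read_statement(region: str, account_id: str, env_prefix: str, service: str) -> dict:
    """Build an IAM statement scoped to one service's SSM path in one environment."""
    # The slash before the wildcard keeps "api" from also matching "api-gateway".
    resource = f"arn:aws:ssm:{region}:{account_id}:parameter/{env_prefix}/{service}/*"
    return {
        "Effect": "Allow",
        "Action": ["ssm:GetParameter", "ssm:GetParameters"],
        "Resource": resource,
    }


statement = ssm_read_statement("us-east-1", "123456789012", "use1-prod-main", "api")
print(json.dumps(statement, indent=2))
```

The output matches the hand-written statement above, which is the point: one template covers every service in every environment.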
| Resource | Pattern | Example |
|---|---|---|
| ECS Cluster | {env_prefix} | use1-prod-main |
| ECS Service | {service} | api (inside cluster) |
| Task Def Family | {env_prefix}-{service}-td | use1-prod-main-api-td |
| SSM Path | /{env_prefix}/{service}/{PARAM} | /use1-prod-main/api/DB_HOST |
| IAM Task Role | {env_prefix}-{service}-task-role | use1-prod-main-api-task-role |
| Log Group | /ecs/{env_prefix}-{service} | /ecs/use1-prod-main-api |
| Target Group | {service}-{envname}-tg | api-main-tg (⚠ 32-char limit) |
| Service Connect NS | {envname}.local | main.local |
The 32-character ALB target group limit
This is the hardest constraint in the naming stack. A target group named use1-prod-main-payments-api-tg is 30 characters, just inside the limit. Add a longer service name and you exceed it. The fix: drop the region and account from target group names (they're already implied by the ALB, which lives in one region and one account) and use only service + envname + tg, as in api-main-tg. Plan your abbreviation table before your first service, not after your fifteenth.
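A length check like this is cheap to run in CI before Terraform ever plans. The sketch below is Python and assumes the `{service}-{envname}-tg` pattern from the table; the helper name and the long service name in the example are made up.

```python
MAX_TG_NAME = 32  # ALB target group name limit discussed above


def target_group_name(service: str, envname: str) -> str:
    """Build '{service}-{envname}-tg' and fail fast if it exceeds the limit."""
    name = f"{service}-{envname}-tg"
    if len(name) > MAX_TG_NAME:
        raise ValueError(f"{name!r} is {len(name)} chars; limit is {MAX_TG_NAME}")
    return name


print(target_group_name("api", "main"))
print(len("use1-prod-main-payments-api-tg"))  # the full-prefix name from the text

try:
    target_group_name("payments-reconciliation-api", "main")
except ValueError as err:
    print("rejected:", err)
```

Failing on the name, rather than truncating it silently, forces the abbreviation decision to happen once and on purpose.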
Enforce naming in Terraform with a variable validation block — reject envnames that don't match your pattern before any resource gets created:
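variable "envname" {
  type = string

  validation {
    # One lowercase letter followed by 1-7 lowercase letters or digits.
    condition     = can(regex("^[a-z][a-z0-9]{1,7}$", var.envname))
    error_message = "The envname must be 2-8 lowercase alphanumeric characters starting with a letter (e.g. main, dev1, qa2)."
  }
}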
Cluster structure at 10+ environments
With the {region}-{account}-{envname} scheme, the cluster structure decision is already mostly made: each envname gets its own ECS cluster. The cluster name is the environment identifier. Everything else in that environment — services, task definitions, log groups, IAM roles — inherits from it.
The practical question is how to organize these clusters across AWS accounts:
One AWS account per environment group (recommended)
Prod environments in one account, all non-prod in another. This is the most common pattern at 30–200 person companies. It keeps prod IAM boundaries hard, separates Fargate vCPU quota pools, and makes Cost Explorer attribution clean.
use1-prod-main, use1-prod-stg1, usw2-prod-main ← prod account
usw2-dev-dev1, usw2-dev-qa1, usw2-dev-demo, usw2-dev-data1 ← dev account
Single account, all environments
All clusters in one AWS account. Simpler to start, but Fargate quota is shared — a dev load test can exhaust the regional quota and prevent prod from scaling. Works fine at 3 environments; becomes a risk at 10+.
use1-prod-main, use1-dev-dev1, use1-dev-qa1, use1-dev-stg1 ← single account
ECS clusters are free. The cost of having more clusters is management overhead, not AWS billing. At 10+ environments that overhead is real — which is the case for using tooling that treats the environment as the unit of management, not individual services.
Five problems that appear at 10 environments
These don't show up at 3 environments. They all show up, roughly simultaneously, somewhere between environment 8 and environment 12.
01. Fargate quota exhaustion in prod (quota)
Fargate vCPU quota is per-region, per-account. Dev and prod share the same pool if they share an account. A developer running load tests against a dev environment can exhaust the regional Fargate quota and prevent production from scaling up during a traffic spike. AWS has no native mechanism to reserve quota for production — the only fix is account separation.
02. ENI exhaustion before compute limits (networking)
Every Fargate task in awsvpc mode (the only Fargate mode) gets its own ENI. A fleet of 10 environments × 8 services × 2 tasks each = 160 ENIs. Default regional ENI limits can become a hard ceiling before you hit any compute limit. File a support ticket to raise the limit before you need it — AWS processes these routinely but not instantly.
03. IAM role proliferation (IAM)
The correct pattern — one task execution role + one task role per service per environment — generates 2 × N services × M environments IAM roles. At 10 services and 4 environments that's 80 IAM roles. The temptation is to share roles across environments to reduce the number. Don't. Sharing means a misconfigured dev task can access prod secrets. Generate roles programmatically from your Terraform module; the number stops being a problem when you stop counting them manually.
04. Cloud Map namespace limit (service discovery)
AWS Cloud Map limits a single namespace to 100 ECS services. If you use ECS Service Connect and point multiple clusters at the same namespace (e.g., prod.local), you'll hit this ceiling sooner than expected. At 10 environments × 10 services = 100 services in one namespace — exactly at the limit. This is a hard limit and cannot be increased. Fix: per-cluster namespaces. Each envname gets its own: main.local, stg1.local, dev1.local.
05. ALB listener rule ceiling (load balancing)
An ALB supports 100 rules by default, counted per load balancer and excluding default rules. If you share one ALB across non-prod environments using host-based routing (recommended for cost), you'll have roughly N environments × M services rules. At 8 environments × 12 services = 96 rules, you are right at the limit. The quota is adjustable, but raising it or splitting traffic across multiple ALBs adds cost and complexity. The simpler fix is dedicated listener rules per environment namespace rather than per service.
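Problems 02 through 05 all come from the same multiplication, so it helps to run the numbers for your own fleet before you hit any of them. Below is a small Python sketch; the `Fleet` class and its method names are invented for illustration, and the formulas are the ones stated in the paragraphs above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fleet:
    environments: int
    services_per_env: int
    tasks_per_service: int = 2

    def enis(self) -> int:
        # One ENI per Fargate task (awsvpc mode).
        return self.environments * self.services_per_env * self.tasks_per_service

    def iam_roles(self) -> int:
        # One execution role plus one task role per service per environment.
        return 2 * self.services_per_env * self.environments

    def shared_namespace_services(self) -> int:
        # Services registered if every cluster points at a single Cloud Map namespace.
        return self.environments * self.services_per_env

    def shared_alb_rules(self) -> int:
        # One host-based rule per service per environment on a shared ALB.
        return self.environments * self.services_per_env


print("ENIs:", Fleet(10, 8).enis())
print("IAM roles:", Fleet(4, 10).iam_roles())
print("Services in one namespace:", Fleet(10, 10).shared_namespace_services())
print("Rules on one shared ALB:", Fleet(8, 12).shared_alb_rules())
```

The four calls reproduce the worked figures from the text. Plug in your planned environment and service counts to see which ceiling you reach first.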
The environment inventory problem
At 3 environments everyone knows what's running. At 10, someone asks "is anyone still using usw2-dev-data1?" and nobody knows for certain.
There is no AWS-native tool that shows you all environments, their owners, their running task counts, their last deployment time, and their monthly cost in one view. What teams actually do — and why each falls short:
| Approach | What it gives you | Where it falls short |
|---|---|---|
| AWS Cost Explorer with tags | Cost attribution if tagging is consistent | No real-time status, no task counts, 24-hour lag on cost data |
| ECS console, cluster by cluster | Real-time task counts | No cost, no ownership, no cross-account view |
| Slack channel where people announce environments | Ownership context | Immediately out of date, no automation, ignored |
| Spreadsheet / wiki page | Good intentions | Stale within a week, nobody updates it after incidents |
AWS's ECS Split Cost Allocation Data (launched 2023) partially closes the cost visibility gap — it attributes Fargate spend per task using aws:ecs:clusterName and aws:ecs:serviceName system tags as billing dimensions. This works well — but only if your cluster and service names are consistent. Which is why naming comes first.
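Consistent names also make a home-grown inventory possible, at least for the task-count column. The Python sketch below is deliberately partial: it covers one account and one region, says nothing about owners or cost, and assumes cluster names follow `{region_short}-{account}-{envname}`. The function names are hypothetical; the boto3 calls used are `list_clusters` and `describe_clusters`.

```python
def parse_prefix(cluster_name: str) -> dict:
    """Split '{region_short}-{account}-{envname}' into its three parts."""
    parts = cluster_name.split("-")
    if len(parts) != 3:
        raise ValueError(f"{cluster_name!r} does not follow region-account-envname")
    region_short, account, envname = parts
    return {"region": region_short, "account": account, "envname": envname}


def inventory(clusters: list[dict]) -> list[dict]:
    """Summarise entries shaped like ecs.describe_clusters()['clusters']."""
    rows = []
    for cluster in clusters:
        row = parse_prefix(cluster["clusterName"])
        row["services"] = cluster.get("activeServicesCount", 0)
        row["running_tasks"] = cluster.get("runningTasksCount", 0)
        row["idle"] = row["running_tasks"] == 0  # candidate for "is anyone using this?"
        rows.append(row)
    return sorted(rows, key=lambda r: (r["account"], r["envname"]))


def fetch_clusters(region_name: str) -> list[dict]:
    """Fetch every cluster in one region of the current account."""
    import boto3  # imported lazily so the pure functions above run without AWS access

    ecs = boto3.client("ecs", region_name=region_name)
    arns = [
        arn
        for page in ecs.get_paginator("list_clusters").paginate()
        for arn in page["clusterArns"]
    ]
    clusters = []
    for i in range(0, len(arns), 100):  # describe_clusters accepts up to 100 per call
        clusters += ecs.describe_clusters(clusters=arns[i : i + 100])["clusters"]
    return clusters


# Hand-written sample data in the shape of the API response.
sample = [
    {"clusterName": "usw2-dev-dev1", "activeServicesCount": 8, "runningTasksCount": 16},
    {"clusterName": "usw2-dev-data1", "activeServicesCount": 8, "runningTasksCount": 0},
]
for row in inventory(sample):
    print(row)
```

An idle flag is only a hint. A scheduled-off environment also shows zero running tasks, so treat it as a prompt to ask, not a signal to delete.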
The real cost of invisible environments
Orphaned environments — ones nobody is actively using but nobody has turned off — are the most expensive line in any ECS bill. At $85–100/month fixed overhead plus compute, a forgotten environment running 24/7 costs $200–400/month. Teams with 10+ environments typically have 1–3 orphaned environments at any given time. The inventory problem isn't just inconvenient — it's expensive.
Scheduling at fleet scale
Non-prod environments run 168 hours a week. Your team works ~55. Scheduling environments offline outside business hours cuts compute cost by 60–70%. For most teams it's the single largest ECS cost lever available.
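The 60–70% figure follows directly from the hours. A quick Python check, assuming compute is billed only while tasks run; the function name is illustrative.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def offline_fraction(active_hours_per_week: float) -> float:
    """Fraction of the week an environment can be switched off."""
    if not 0 <= active_hours_per_week <= HOURS_PER_WEEK:
        raise ValueError("active hours must be between 0 and 168")
    return 1 - active_hours_per_week / HOURS_PER_WEEK


print(f"{offline_fraction(55):.0%} of the week is idle")  # prints "67% of the week is idle"
```

This applies to task compute only. The fixed overhead described earlier (ALB, NAT Gateway) keeps billing while tasks are stopped.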
The problem: AWS-native scheduling operates at the service level. To schedule one environment with 8 services, you need 16 Auto Scaling actions (stop + start per service). At 10 environments that's 160 actions to create, maintain, and update when schedules change.
| Environments | Services each | Auto Scaling actions | Schedule change cost |
|---|---|---|---|
| 3 | 8 | 48 | 8 updates |
| 10 | 8 | 160 | 8–16 updates |
| 20 | 10 | 400 | 10–20 updates |
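To see where the action counts come from, it helps to enumerate them instead of multiplying. This Python sketch lists one (environment, service, action) tuple per scheduled action; the environment and service names are placeholders.

```python
from itertools import product


def scheduled_actions(envs: list[str], services: list[str]) -> list[tuple[str, str, str]]:
    """One 'stop' and one 'start' action per service per environment."""
    return list(product(envs, services, ("stop", "start")))


envs = [f"env{i}" for i in range(1, 11)]      # 10 placeholder environments
services = [f"svc{i}" for i in range(1, 9)]   # 8 placeholder services each

actions = scheduled_actions(envs, services)
print(len(actions), "actions to create and maintain")
print(actions[:2])
```

Every tuple is a separate object in AWS that has to be created, kept in sync, and edited when a schedule changes, which is where the maintenance cost in the table comes from.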
There are three additional problems that emerge specifically at fleet scale:
✗ Timezone complexity
EU teams want environments down at 18:00 CET. US East wants 20:00 EST. US West wants 20:00 PST. Each requires separate cron expressions that account for DST. At 10+ environments with multiple team timezones, maintaining these expressions is a part-time job.
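The DST problem is easy to demonstrate: the same local stop time maps to a different UTC hour in winter and summer. A minimal Python sketch using the standard-library `zoneinfo` module follows; the function name and the sample dates are chosen for illustration, and it assumes time zone data is available on the machine.

```python
from datetime import date, datetime, time, timezone
from zoneinfo import ZoneInfo


def utc_schedule(local_time: time, tz_name: str, on: date) -> tuple[int, int, int]:
    """Convert a local wall-clock time on a given date to (utc_hour, utc_minute, day_shift).

    day_shift is 1 when the UTC time falls on the following calendar day.
    """
    local = datetime.combine(on, local_time, tzinfo=ZoneInfo(tz_name))
    utc = local.astimezone(timezone.utc)
    return utc.hour, utc.minute, (utc.date() - on).days


winter, summer = date(2025, 1, 15), date(2025, 7, 15)
for label, tz, stop in [
    ("EU", "Europe/Berlin", time(18, 0)),
    ("US East", "America/New_York", time(20, 0)),
    ("US West", "America/Los_Angeles", time(20, 0)),
]:
    print(label, "winter:", utc_schedule(stop, tz, winter), "summer:", utc_schedule(stop, tz, summer))
```

Note the day shift for the US zones: a 20:00 stop lands after midnight UTC, so a UTC cron expression restricted to weekdays must also move its day-of-week field.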
✗ No developer overrides
A developer working late on a deadline wants to keep their environment up past the scheduled stop time. With AWS-native scheduling, that requires either platform engineer access or IAM permissions broad enough to be a security concern. The friction means developers stop requesting overrides — and start asking to remove scheduling entirely.
✗ Silent failed starts
The scheduled start fires. Lambda runs. Desired count updates. But a service fails to start — image pull error, IAM issue, resource limit. The cron job succeeded; the environment didn't come up. AWS doesn't surface this. You need separate health checking or developers start their morning debugging an environment that's half-running.
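One way to catch this is a follow-up check that compares running against desired counts some minutes after the scheduled start. The Python sketch below is an assumption about how such a check could look, not an AWS feature; the function names are hypothetical, and the boto3 calls used are `list_services` and `describe_services`.

```python
def failed_starts(services: list[dict]) -> list[str]:
    """Report services whose running count is below their desired count.

    Expects entries shaped like ecs.describe_services()['services'].
    """
    failed = []
    for svc in services:
        desired = svc.get("desiredCount", 0)
        running = svc.get("runningCount", 0)
        if running < desired:
            failed.append(f"{svc['serviceName']}: {running}/{desired} running")
    return failed


def fetch_services(cluster: str, region_name: str) -> list[dict]:
    """Fetch every service in one cluster."""
    import boto3  # imported lazily so failed_starts can be tested without AWS access

    ecs = boto3.client("ecs", region_name=region_name)
    arns = [
        arn
        for page in ecs.get_paginator("list_services").paginate(cluster=cluster)
        for arn in page["serviceArns"]
    ]
    services = []
    for i in range(0, len(arns), 10):  # describe_services accepts at most 10 per call
        services += ecs.describe_services(cluster=cluster, services=arns[i : i + 10])["services"]
    return services


# Hand-written sample in the shape of the API response.
sample_services = [
    {"serviceName": "api", "desiredCount": 2, "runningCount": 2},
    {"serviceName": "worker", "desiredCount": 2, "runningCount": 0},
]
print(failed_starts(sample_services))
```

Run it late enough that healthy services have had time to start, and send the result somewhere developers will see it before they open their laptops.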
What teams actually end up doing
Teams start with EventBridge + Lambda at 3–5 environments. By 10 environments they're maintaining a scheduling codebase. By 15–20 environments, the maintenance burden outweighs the savings — and environments quietly go back to running 24/7. The savings disappear not because scheduling doesn't work, but because the tooling to maintain it at scale doesn't exist in AWS natively.
See what 10+ envs really cost: fortem.dev/ecs-cost-calculator