Matt

Originally published at fortem.dev

ECS Multi-Environment Strategy: What Breaks at 10 That Worked Fine at 3

What Breaks When You Scale Past 10 ECS Environments?

Originally published at https://fortem.dev/blog/ecs-multi-environment-strategy
Naming conventions, cluster structure, and the five AWS limits that surface when environments scale past 10. Written by platform engineers running 100+ ECS environments.


Three ECS environments are manageable with AWS-native tooling and reasonable discipline. Ten environments expose every naming shortcut, every IAM approximation, and every missing inventory tool. This guide covers what actually changes — and what to get right before you hit the wall.

The overhead nobody puts in the spreadsheet

When engineers estimate ECS environment costs, they calculate compute: vCPU hours, memory hours, maybe RDS. What they miss is the fixed overhead that exists before a single container runs.

Every environment needs its own ALB, NAT Gateway (ideally in each AZ for HA), and CloudWatch log groups. These costs are flat — they don't scale with usage, they don't go away when you stop tasks at night, and they don't appear on the compute line in Cost Explorer.

| Resource | Monthly cost | Notes |
|---|---|---|
| Application Load Balancer | $22/mo | $0.0225/hr base + $0.008/LCU-hr |
| NAT Gateway (2 AZs) | ~$66/mo | $0.045/hr × 2 AZs + $0.045/GB data |
| CloudWatch log retention | $3–15/mo | Depends on log volume + retention days |
| SSM parameters, ECR storage | $1–5/mo | Usually negligible, adds up at scale |
| Total fixed overhead | $85–100/mo | Before first task runs |

At 3 environments, that's roughly $300/month in overhead: noticeable but manageable. At 10 environments it's $850–1,000/month before a single task runs. At 20 environments it's a $1,700–2,000/month line item that doesn't appear anywhere obvious.

What you can actually do about it

Share the ALB across non-prod environments using host-based routing rules (one ALB, multiple environments via different hostnames). This eliminates per-environment ALB cost for dev/staging. NAT Gateway is harder to share cleanly — teams that care about NAT cost switch non-prod environments to public subnet placement with no NAT. Slightly less secure, meaningfully cheaper. Prod always gets its own ALB and NAT.
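As a rough sketch of the shared-ALB approach, the Terraform below creates one host-based listener rule per non-prod environment on an existing listener. The variable names, the base domain, the port, and the hostname scheme (api.{envname}.{base_domain}) are assumptions made for illustration, not something the setup above prescribes.

```hcl
# Hypothetical inputs: the shared listener, VPC, base domain, and the list of
# non-prod environments are all assumptions for this sketch.
variable "shared_listener_arn" {
  type        = string
  description = "ARN of the listener on the shared non-prod ALB"
}

variable "vpc_id" {
  type = string
}

variable "base_domain" {
  type    = string
  default = "nonprod.example.com"
}

variable "nonprod_envs" {
  type    = list(string)
  default = ["dev1", "qa1", "demo"]
}

# One target group per environment for the api service.
resource "aws_lb_target_group" "api" {
  for_each    = toset(var.nonprod_envs)
  name        = "api-${each.key}-tg"
  port        = 8080
  protocol    = "HTTP"
  target_type = "ip" # required for Fargate tasks in awsvpc mode
  vpc_id      = var.vpc_id
}

# One host-based rule per environment, e.g. api.dev1.nonprod.example.com.
resource "aws_lb_listener_rule" "api" {
  for_each     = toset(var.nonprod_envs)
  listener_arn = var.shared_listener_arn

  # Priorities must be unique per listener; here they follow list position.
  priority = 100 + index(var.nonprod_envs, each.key)

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.api[each.key].arn
  }

  condition {
    host_header {
      values = ["api.${each.key}.${var.base_domain}"]
    }
  }
}
```

Deriving the priority from list position keeps the sketch short, but reordering the list would renumber existing rules. A map of explicit priorities per environment is the more stable choice for anything long-lived.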

Naming: the one convention that rules everything

At 3 environments you can get away with ad-hoc names. At 10 you can't — because every AWS resource name is a billing dimension, an IAM scope, and a CloudWatch filter. Inconsistent names mean you can't attribute cost, can't write scoped IAM policies, and can't build dashboards without a lookup table.

The convention that works at fleet scale encodes three things in every resource name: region, account (or account group), and environment name. In this order:

{region_short}-{account}-{envname}

# Examples
use1-prod-main         # us-east-1, prod account, primary production env
use1-prod-stg1         # us-east-1, prod account, staging env
usw2-dev-dev1          # us-west-2, dev account, first dev env
usw2-dev-qa1           # us-west-2, dev account, QA env
usw2-dev-demo          # us-west-2, dev account, demo env

This prefix becomes the root of every resource name in that environment. One Terraform local generates everything:

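# Naming sketch only: other required arguments (container definitions,
# assume-role policies, parameter types and values) are omitted.
locals {
  # e.g. "use1-prod-main" or "usw2-dev-qa1"
  env_prefix = "${var.region_short}-${var.account}-${var.envname}"
}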

# ECS cluster
resource "aws_ecs_cluster" "main" {
  name = local.env_prefix
  # → "use1-prod-main"
}

# ECS service (env already in cluster name — service is just the component)
resource "aws_ecs_service" "api" {
  name    = "api"
  cluster = aws_ecs_cluster.main.id
}

# Task definition family (shared by every cluster in the account and region, so it must carry the full prefix)
resource "aws_ecs_task_definition" "api" {
  family = "${local.env_prefix}-api-td"
  # → "use1-prod-main-api-td"
}

# SSM paths (hierarchy enables per-service IAM scoping)
resource "aws_ssm_parameter" "db_host" {
  name = "/${local.env_prefix}/api/DB_HOST"
  # → "/use1-prod-main/api/DB_HOST"
}

# IAM roles (global per account — carry full prefix)
resource "aws_iam_role" "task_role" {
  name = "${local.env_prefix}-api-task-role"
  # → "use1-prod-main-api-task-role"
}

# CloudWatch log group
resource "aws_cloudwatch_log_group" "api" {
  name              = "/ecs/${local.env_prefix}-api"
  retention_in_days = var.log_retention_days
  # → "/ecs/use1-prod-main-api"
}

Why SSM paths matter specifically: the hierarchy /use1-prod-main/api/* lets you write a single IAM policy statement that gives the API task access to exactly its own secrets — nothing else:

{
  "Effect": "Allow",
  "Action": ["ssm:GetParameter", "ssm:GetParameters"],
  "Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/use1-prod-main/api/*"
}

Flat SSM names (USE1-PROD-MAIN-API-DB_HOST) lose this entirely. You end up with a wildcard Resource: "*" or a list of 40 individual parameter ARNs. One team's migration from flat to hierarchical SSM naming took two weeks and three deployment freezes.

| Resource | Pattern | Example |
|---|---|---|
| ECS Cluster | {env_prefix} | use1-prod-main |
| ECS Service | {service} | api (inside cluster) |
| Task Def Family | {env_prefix}-{service}-td | use1-prod-main-api-td |
| SSM Path | /{env_prefix}/{service}/{PARAM} | /use1-prod-main/api/DB_HOST |
| IAM Task Role | {env_prefix}-{service}-task-role | use1-prod-main-api-task-role |
| Log Group | /ecs/{env_prefix}-{service} | /ecs/use1-prod-main-api |
| Target Group | {service}-{envname}-tg | api-main-tg (⚠ 32-char limit) |
| Service Connect NS | {envname}.local | main.local |

The 32-character ALB target group limit

This is the hardest constraint in the naming stack. A target group named use1-prod-main-payments-api-tg is 30 characters, just inside the limit. Add a longer service name and you blow it. The fix: drop the region and account from target group names (they're already implied by the ALB, which lives in one region and one account), and use only the service, the envname, and a tg suffix, as in api-main-tg. Plan your abbreviation table before your first service, not after your fifteenth.
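One way to catch an over-long name at plan time rather than at apply time is a precondition on the target group. This is a minimal sketch, assuming Terraform 1.2 or later (where lifecycle preconditions are available). The local values and the port are hypothetical stand-ins for whatever your module already provides.

```hcl
variable "vpc_id" {
  type = string
}

locals {
  # Hypothetical values for illustration; in a real module these would come
  # from var.envname and the service being deployed.
  envname = "main"
  service = "payments-api"

  # {service}-{envname}-tg, with no region or account segment.
  tg_name = "${local.service}-${local.envname}-tg"
}

resource "aws_lb_target_group" "svc" {
  name        = local.tg_name
  port        = 8080
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = var.vpc_id

  lifecycle {
    # Fail the plan if the generated name would exceed the ALB limit.
    precondition {
      condition     = length(local.tg_name) <= 32
      error_message = "Target group name exceeds the 32-character limit."
    }
  }
}
```

With the values above the name is payments-api-main-tg, which is 20 characters and leaves room for longer service names.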

Enforce naming in Terraform with a variable validation block — reject envnames that don't match your pattern before any resource gets created:

variable "envname" {
  type = string
  validation {
    condition     = can(regex("^[a-z][a-z0-9]{1,7}$", var.envname))
    error_message = "The envname must be 2-8 lowercase alphanumeric characters, starting with a letter (e.g. main, dev1, qa2)."
  }
}

Cluster structure at 10+ environments

With the {region}-{account}-{envname} scheme, the cluster structure decision is already mostly made: each envname gets its own ECS cluster. The cluster name is the environment identifier. Everything else in that environment — services, task definitions, log groups, IAM roles — inherits from it.

The practical question is how to organize these clusters across AWS accounts:

One AWS account per environment group (recommended)

Prod environments in one account, all non-prod in another. This is the most common pattern at 30–200 person companies. It keeps prod IAM boundaries hard, separates Fargate vCPU quota pools, and makes Cost Explorer attribution clean.

use1-prod-main, use1-prod-stg1, usw2-prod-main ← prod account

usw2-dev-dev1, usw2-dev-qa1, usw2-dev-demo, usw2-dev-data1 ← dev account

Single account, all environments

All clusters in one AWS account. Simpler to start, but Fargate quota is shared — a dev load test can exhaust the regional quota and prevent prod from scaling. Works fine at 3 environments; becomes a risk at 10+.

use1-prod-main, use1-dev-dev1, use1-dev-qa1, use1-dev-stg1 ← single account

ECS clusters are free. The cost of having more clusters is management overhead, not AWS billing. At 10+ environments that overhead is real — which is the case for using tooling that treats the environment as the unit of management, not individual services.

Five problems that appear at 10 environments

These don't show up at 3 environments. They all show up, roughly simultaneously, somewhere between environment 8 and environment 12.

01. Fargate quota exhaustion in prod (quota)

Fargate vCPU quota is per-region, per-account. Dev and prod share the same pool if they share an account. A developer running load tests against a dev environment can exhaust the regional Fargate quota and prevent production from scaling up during a traffic spike. AWS has no native mechanism to reserve quota for production — the only fix is account separation.

02. ENI exhaustion before compute limits (networking)

Every Fargate task in awsvpc mode (the only Fargate mode) gets its own ENI. A fleet of 10 environments × 8 services × 2 tasks each = 160 ENIs. Default regional ENI limits can become a hard ceiling before you hit any compute limit. File a support ticket to raise the limit before you need it — AWS processes these routinely but not instantly.

03. IAM role proliferation (IAM)

The correct pattern — one task execution role + one task role per service per environment — generates 2 × N services × M environments IAM roles. At 10 services and 4 environments that's 80 IAM roles. The temptation is to share roles across environments to reduce the number. Don't. Sharing means a misconfigured dev task can access prod secrets. Generate roles programmatically from your Terraform module; the number stops being a problem when you stop counting them manually.
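Generating the roles programmatically might look like the sketch below, which builds a task role and an execution role for every service in one environment. The variable names, the two-service default, and the "-exec-role" suffix are assumptions for illustration. Only the "-task-role" suffix comes from the naming convention above.

```hcl
variable "env_prefix" {
  type    = string
  default = "use1-prod-main" # in the module above this is local.env_prefix
}

variable "services" {
  type    = list(string)
  default = ["api", "worker"] # hypothetical service list
}

locals {
  # Two roles per service: "task" (application permissions) and
  # "exec" (image pull and log delivery on behalf of the task).
  role_keys = {
    for pair in setproduct(var.services, ["task", "exec"]) :
    "${pair[0]}-${pair[1]}" => { service = pair[0], kind = pair[1] }
  }
}

# Trust policy that lets ECS tasks assume the roles.
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ecs" {
  for_each           = local.role_keys
  name               = "${var.env_prefix}-${each.value.service}-${each.value.kind}-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

# Only execution roles get the AWS-managed execution policy.
resource "aws_iam_role_policy_attachment" "exec" {
  for_each   = { for k, v in local.role_keys : k => v if v.kind == "exec" }
  role       = aws_iam_role.ecs[each.key].name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
```

Instantiating this once per environment keeps every role scoped to a single env_prefix, so adding a service or an environment adds roles without anyone writing them by hand.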

04. Cloud Map namespace limit (service discovery)

AWS Cloud Map limits a single namespace to 100 ECS services. If you use ECS Service Connect and point multiple clusters at the same namespace (e.g., prod.local), you'll hit this ceiling sooner than expected. At 10 environments × 10 services = 100 services in one namespace — exactly at the limit. This is a hard limit and cannot be increased. Fix: per-cluster namespaces. Each envname gets its own: main.local, stg1.local, dev1.local.
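A per-cluster namespace can be declared alongside the cluster itself. The sketch below assumes an HTTP namespace is sufficient for Service Connect in your setup. The local values are hypothetical and would normally come from the environment module.

```hcl
locals {
  # Hypothetical values for illustration; in the module above these map to
  # local.env_prefix and var.envname.
  env_prefix = "usw2-dev-dev1"
  envname    = "dev1"
}

# One namespace per environment, e.g. "dev1.local".
resource "aws_service_discovery_http_namespace" "env" {
  name        = "${local.envname}.local"
  description = "Service Connect namespace for ${local.env_prefix}"
}

# The cluster uses that namespace as its Service Connect default, so services
# in this cluster never register in another environment's namespace.
resource "aws_ecs_cluster" "env" {
  name = local.env_prefix

  service_connect_defaults {
    namespace = aws_service_discovery_http_namespace.env.arn
  }
}
```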

05. ALB listener rule ceiling (load balancing)

An ALB supports 100 rules by default, not counting default rules, and that quota applies to the load balancer as a whole. If you share one ALB across non-prod environments using host-based routing (recommended for cost), you'll have roughly N environments × M services rules. At 8 environments × 12 services that's 96 rules, right at the limit. The quota is adjustable, but leaning on an increase or splitting environments across multiple ALBs adds cost and complexity. The simpler fix is dedicated listener rules per environment namespace rather than per service.

The environment inventory problem

At 3 environments everyone knows what's running. At 10, someone asks "is anyone still using usw2-dev-data1?" and nobody knows for certain.

There is no AWS-native tool that shows you all environments, their owners, their running task counts, their last deployment time, and their monthly cost in one view. What teams actually do — and why each falls short:

| Approach | ✓ What it gives you | ✗ Where it falls short |
|---|---|---|
| AWS Cost Explorer with tags | Cost attribution if tagging is consistent | No real-time status, no task counts, 24-hour lag on cost data |
| ECS console, cluster by cluster | Real-time task counts | No cost, no ownership, no cross-account view |
| Slack channel where people announce environments | Ownership context | Immediately out of date, no automation, ignored |
| Spreadsheet / wiki page | Good intentions | Stale within a week, nobody updates it after incidents |

AWS's ECS Split Cost Allocation Data (launched 2023) partially closes the cost visibility gap — it attributes Fargate spend per task using aws:ecs:clusterName and aws:ecs:serviceName system tags as billing dimensions. This works well — but only if your cluster and service names are consistent. Which is why naming comes first.
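Consistent tagging is easier to keep when it is set once per environment rather than per resource. The sketch below is one possible arrangement, not a prescribed one: provider-level default tags carry an owner and environment, and the service propagates its tags to tasks. The tag keys, variable names, and region are assumptions.

```hcl
variable "owner" {
  type        = string
  description = "Team or person responsible for this environment (hypothetical input)"
}

variable "env_prefix" {
  type    = string
  default = "usw2-dev-dev1"
}

variable "task_definition_arn" { type = string }
variable "subnet_ids" { type = list(string) }

provider "aws" {
  region = "us-west-2"

  # Applied to every taggable resource this provider creates.
  default_tags {
    tags = {
      Environment = var.env_prefix
      Owner       = var.owner
      ManagedBy   = "terraform"
    }
  }
}

resource "aws_ecs_cluster" "env" {
  name = var.env_prefix
}

resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.env.id
  task_definition = var.task_definition_arn
  desired_count   = 1
  launch_type     = "FARGATE"

  # Tasks do not inherit service tags unless propagation is enabled.
  propagate_tags          = "SERVICE"
  enable_ecs_managed_tags = true # adds aws:ecs:clusterName and aws:ecs:serviceName

  network_configuration {
    subnets = var.subnet_ids
  }
}
```

User-defined tags such as Owner only show up in Cost Explorer after they are activated as cost allocation tags in the billing settings.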

The real cost of invisible environments

Orphaned environments, the ones nobody is actively using but nobody has turned off, are often the most wasteful line in an ECS bill. At $85–100/month fixed overhead plus compute, a forgotten environment running 24/7 costs $200–400/month. Teams with 10+ environments typically have 1–3 orphaned environments at any given time. The inventory problem isn't just inconvenient; it's expensive.

Scheduling at fleet scale

Non-prod environments run 168 hours a week. Your team works ~55, which is about a third of that. Scheduling environments offline outside business hours cuts compute cost by 60–70%. For most teams it's the single largest ECS cost lever available.

The problem: AWS-native scheduling operates at the service level. To schedule one environment with 8 services, you need 16 Auto Scaling actions (stop + start per service). At 10 environments that's 160 actions to create, maintain, and update when schedules change.

| Environments | Services each | Auto Scaling actions | Schedule change cost |
|---|---|---|---|
| 3 | 8 | 48 | 8 updates |
| 10 | 8 | 160 | 8–16 updates |
| 20 | 10 | 400 | 10–20 updates |
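The per-service stop and start actions counted in the table can at least be generated rather than hand-written. The sketch below uses Application Auto Scaling scheduled actions for one environment. The cluster name, service map, schedule times, and time zone are hypothetical, and the `timezone` argument is assumed to be available in your AWS provider version.

```hcl
variable "cluster_name" {
  type    = string
  default = "usw2-dev-dev1"
}

# Service name => task count during working hours (hypothetical values).
variable "services" {
  type    = map(number)
  default = { api = 2, worker = 1 }
}

variable "timezone" {
  type    = string
  default = "Europe/Berlin" # IANA name; cron times below are read in this zone
}

resource "aws_appautoscaling_target" "svc" {
  for_each           = var.services
  service_namespace  = "ecs"
  resource_id        = "service/${var.cluster_name}/${each.key}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 0
  max_capacity       = each.value

  # Scheduled actions rewrite min/max on the target; ignore that drift.
  lifecycle {
    ignore_changes = [min_capacity, max_capacity]
  }
}

# Stop: scale every service to zero at 20:00 on weekdays.
resource "aws_appautoscaling_scheduled_action" "stop" {
  for_each           = var.services
  name               = "${var.cluster_name}-${each.key}-stop"
  service_namespace  = aws_appautoscaling_target.svc[each.key].service_namespace
  resource_id        = aws_appautoscaling_target.svc[each.key].resource_id
  scalable_dimension = aws_appautoscaling_target.svc[each.key].scalable_dimension
  schedule           = "cron(0 20 ? * MON-FRI *)"
  timezone           = var.timezone

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

# Start: restore the working-hours task count at 08:00 on weekdays.
resource "aws_appautoscaling_scheduled_action" "start" {
  for_each           = var.services
  name               = "${var.cluster_name}-${each.key}-start"
  service_namespace  = aws_appautoscaling_target.svc[each.key].service_namespace
  resource_id        = aws_appautoscaling_target.svc[each.key].resource_id
  scalable_dimension = aws_appautoscaling_target.svc[each.key].scalable_dimension
  schedule           = "cron(0 8 ? * MON-FRI *)"
  timezone           = var.timezone

  scalable_target_action {
    min_capacity = each.value
    max_capacity = each.value
  }
}
```

This still creates two actions per service, so the counts in the table do not shrink. What changes is that a schedule edit becomes one variable change per environment instead of an update per action.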

There are three additional problems that emerge specifically at fleet scale:

Timezone complexity

EU teams want environments down at 18:00 CET. US East wants 20:00 EST. US West wants 20:00 PST. Each requires separate cron expressions that account for DST. At 10+ environments with multiple team timezones, maintaining these expressions is a part-time job.

No developer overrides

A developer working late on a deadline wants to keep their environment up past the scheduled stop time. With AWS-native scheduling, that requires either platform engineer access or IAM permissions broad enough to be a security concern. The friction means developers stop requesting overrides — and start asking to remove scheduling entirely.

Silent failed starts

The scheduled start fires. Lambda runs. Desired count updates. But a service fails to start — image pull error, IAM issue, resource limit. The cron job succeeded; the environment didn't come up. AWS doesn't surface this. You need separate health checking or developers start their morning debugging an environment that's half-running.

What teams actually end up doing

Teams start with EventBridge + Lambda at 3–5 environments. By 10 environments they're maintaining a scheduling codebase. By 15–20 environments, the maintenance burden outweighs the savings — and environments quietly go back to running 24/7. The savings disappear not because scheduling doesn't work, but because the tooling to maintain it at scale doesn't exist in AWS natively.


See what 10+ envs really cost: fortem.dev/ecs-cost-calculator
