Matt

Posted on Jun 4 • Edited on Jun 30 • Originally published at fortem.dev

Managing ECS Fargate with Terraform: What Works and What Doesn't

#ecs #aws #terraform #fargate


Originally published at https://fortem.dev/blog/ecs-fargate-terraform
Terraform is right for ECS Fargate infrastructure. But at 10+ environments, state sprawl and the ops gap catch every team — here are the patterns that scale.


TL;DR

Terraform is the correct tool for provisioning ECS Fargate infrastructure — this article won't try to replace it.
Module-per-environment works for ≤10 environments; past that, Terragrunt or a layered directory structure becomes necessary.
A consistent tagging strategy (Environment, ManagedBy, Product, ManagedWith, Component) solves cost attribution and makes automation possible at any scale.
At 50+ environments, you'll write 1,500+ lines of custom code for scheduling, cloning, and self-service — or you can accept that Terraform needs an operations partner.
Fortem reads your Terraform-provisioned resources and adds the ops layer: scheduling, cloning, fleet visibility, and developer self-service — without touching your HCL.

What Terraform does well for ECS Fargate

Terraform is the right tool for ECS Fargate provisioning: one HCL module call creates networking, IAM, compute, and data stores, all versioned in git and reviewed like application code.

Terraform is declarative: you describe the desired state, and Terraform makes it happen. You get task definitions, ECS services, IAM roles, security groups, load balancers, and VPC configuration all in one place, versioned in git.

What matters more than the HCL syntax is the workflow it enables. Infrastructure changes go through the same PR process as application code. Your CI pipeline runs terraform plan on every pull request. A senior engineer reviews the diff before merge. If something goes wrong, you roll back by applying the previous commit. This is the gold standard for infrastructure management, and nothing below suggests replacing it.

Here is a realistic module definition for an ECS environment, the basic building block your team is probably using, or something close to it:

module "dev_ecs" {
  source = "./modules/ecs-environment"

  environment = "dev"
  region      = "us-east-1"

  vpc_cidr        = "10.1.0.0/16"
  public_subnets  = ["10.1.1.0/24", "10.1.2.0/24"]
  private_subnets = ["10.1.10.0/24", "10.1.11.0/24"]

  services = {
    api = {
      cpu    = 512
      memory = 1024
      image  = "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:latest"
      port   = 3000
      env_vars = {
        LOG_LEVEL = "debug"
      }
    }
    worker = {
      cpu    = 1024
      memory = 2048
      image  = "123456789012.dkr.ecr.us-east-1.amazonaws.com/worker:latest"
    }
  }

  rds_instance_class = "db.t3.micro"
  redis_node_type    = "cache.t3.micro"

  tags = {
    Environment = "dev"
    ManagedBy   = "backend"
    ManagedWith = "terraform"
  }
}

This is clean, reviewable, and reproducible. One module call = one fully provisioned environment with networking, compute, and data stores. For a single environment or a handful, this is the right pattern.
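For the module call above to accept services with different shapes (api sets port and env_vars, worker does not), the shared module's variables.tf needs optional attributes. A minimal sketch, assuming Terraform 1.3 or later for optional() with defaults; the attribute list mirrors the example, and the validation rule is an illustrative addition rather than part of the original module:

```hcl
# modules/ecs-environment/variables.tf (hypothetical contents)
variable "services" {
  description = "ECS services to run in this environment, keyed by service name."

  type = map(object({
    cpu      = number                    # Fargate CPU units (1024 = 1 vCPU)
    memory   = number                    # MiB
    image    = string
    port     = optional(number)          # null for workers with no listener
    env_vars = optional(map(string), {}) # defaults to an empty map
  }))

  validation {
    # Fargate only accepts these task-level CPU values.
    condition = alltrue([
      for s in values(var.services) :
      contains([256, 512, 1024, 2048, 4096, 8192, 16384], s.cpu)
    ])
    error_message = "Each service cpu must be a valid Fargate CPU value (256 to 16384)."
  }
}
```

Typing the map this way means a typo such as `memroy = 1024` fails at plan time instead of surfacing as a broken task definition.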

Terraform patterns that scale

Three patterns handle ECS Fargate scale: module-per-environment (up to ~10 envs), Terragrunt with shared modules (15–50 envs), and a layered account/region/environment structure (50+ envs).

Teams adopt one of three patterns as they grow. There's also a fourth, Terraform workspaces per environment, but the community has largely moved past it. Workspaces share one backend and one configuration, so they don't give true state isolation; workspace selection is fragile (apply with the wrong workspace selected and you provision dev where staging should be); and HashiCorp's own documentation advises against CLI workspaces when environments need separate credentials and access controls. We'll skip it.

Pattern 1: Module per environment

A separate directory for each environment, each calling the same shared module with different variables.

terraform/
├── modules/
│   └── ecs-environment/     # shared module
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── dev/
│   └── main.tf             # module "dev_ecs" { ... }
├── staging/
│   └── main.tf             # module "staging_ecs" { ... }
├── qa/
│   └── main.tf
├── demo/
│   └── main.tf
└── prod/
    └── main.tf

Pros: dead simple. Anyone on the team can open a directory and understand what's deployed. No hidden state, no Terraform workspace tricks. CI can run plan/apply independently per environment, so you can deploy dev without touching staging.

Cons: every new environment means copying a directory and editing a near-identical main.tf. At 30 environments, you have 30 almost-identical main.tf files. If you add a required variable to the shared module, you update 30 files. Teams outgrow this around 10–15 environments.
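What makes the per-directory layout safe is that each directory carries its own backend key. A sketch of one environment's root configuration, assuming an S3 backend; the bucket name and key layout are placeholders, not something the layout above prescribes:

```hcl
# dev/main.tf (sketch; bucket name and key layout are assumptions)
terraform {
  required_version = ">= 1.3"

  backend "s3" {
    bucket = "acme-terraform-state"
    key    = "ecs/dev/terraform.tfstate" # one key per environment directory
    region = "us-east-1"
  }
}

provider "aws" {
  region = "us-east-1"
}

module "dev_ecs" {
  source = "../modules/ecs-environment" # relative to dev/, per the tree above

  environment = "dev"
  region      = "us-east-1"
  vpc_cidr    = "10.1.0.0/16"
  # Remaining inputs (subnets, services, tags) are passed the same way.
}
```

Because the backend block cannot use variables, the key is one of the few lines that must be edited by hand in every copied directory, which is part of the duplication cost described above.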

Pattern 2: Terragrunt + shared modules

Terragrunt wraps Terraform, keeping configurations DRY while maintaining separate state per environment. Each environment directory contains only a terragrunt.hcl file with environment-specific values — the module source points to a shared Git ref.

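# terragrunt.hcl in environments/dev/
terraform {
  # Must be a single-line string: module repo, //subdirectory, pinned Git tag
  source = "git::git@github.com:acme/terraform-modules.git//ecs-environment?ref=v2.3.0"
}

inputs = {
  environment = "dev"
  vpc_cidr    = "10.1.0.0/16"
  services    = { api = { cpu = 512, memory = 1024 } }
}

# One state key per environment keeps the blast radius contained.
# The module must declare an empty backend "s3" {} block for this to apply
# (or add a generate block here so Terragrunt writes it).
remote_state {
  backend = "s3"
  config  = {
    bucket = "acme-terraform-state"
    key    = "ecs/dev/terraform.tfstate"
    region = "us-east-1" # bucket region, required by the S3 backend; assumed here
  }
}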

Pros: explicit dependencies, multi-account-friendly, strong state isolation. Each environment has its own S3 state key — corruption stays contained. Pin modules to versioned Git tags for reproducible deploys.

Cons: another tool to learn and maintain. Your team now needs to understand both Terraform and Terragrunt. Debugging failures means tracing through two layers of indirection. Not worth it below 15 environments — the overhead outweighs the benefit.
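The DRY benefit mostly comes from moving the backend definition into one shared file that every environment includes. A sketch of that split, shown as two files in one block; the root.hcl file name, the bucket, and the key layout are assumptions for illustration:

```hcl
# environments/root.hcl (shared by every environment; file name is an assumption)
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket = "acme-terraform-state"
    # path_relative_to_include() resolves to "dev", "staging", ... per child directory
    key    = "ecs/${path_relative_to_include()}/terraform.tfstate"
    region = "us-east-1"
  }
}

# environments/dev/terragrunt.hcl (a separate file)
include "root" {
  path = find_in_parent_folders("root.hcl")
}

terraform {
  source = "git::git@github.com:acme/terraform-modules.git//ecs-environment?ref=v2.3.0"
}

inputs = {
  environment = "dev"
  vpc_cidr    = "10.1.0.0/16"
}
```

With this arrangement a new environment is one short file, and the state key is derived from the directory name rather than typed by hand.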

Pattern 3: Layered (accounts → regions → environments)

The repo mirrors your cloud topology. Shared infrastructure lives at higher layers and cascades down. Each environment is a directory with subdirectories per resource type — datastores, ECS services, secrets — so a single environment change is a single terraform apply in one directory, not a full fleet-wide plan.

terraform/
├── deployment/
│   ├── accounts/
│   │   ├── dev/
│   │   │   ├── global/              # account-wide: IAM, S3, route53
│   │   │   └── regions/
│   │   │       ├── us-east-1/
│   │   │       │   ├── network/     # VPC, subnets, security groups
│   │   │       │   ├── shared/      # ECR, CloudTrail, ECS events
│   │   │       │   └── wenvs/       # environments
│   │   │       │       ├── api-dev/
│   │   │       │       │   ├── datastores/   # RDS, ElastiCache
│   │   │       │       │   ├── ecs/          # task defs, services
│   │   │       │       │   ├── secrets/      # Secrets Manager
│   │   │       │       │   └── services/     # SQS, SNS, Lambda
│   │   │       │       └── api-qa/
│   │   │       │           └── ...same layers
│   │   │       └── eu-west-2/
│   │   │           └── ...same structure
│   │   └── prod/
│   │       └── ...same structure
│   └── variables/
│       ├── accounts/{dev,prod}/     # per-account tfvars
│       └── global/                  # org-wide tfvars
└── lib/                             # shared Terraform modules

Pros: each layer owns its resources and nothing else. terraform apply runs against a single directory, so a security group change doesn't trigger a plan across 60 environments. Adding a new environment means copying a directory and overriding variables. The structure is self-documenting: anyone on the team can navigate the repo and understand the fleet topology without opening a diagram.

Cons: the repo itself is the configuration mechanism, so there's no single file that describes what exists. New team members need to learn the directory tree. Expect some duplication between nearly identical environments unless you lean on shared variables and modules. Best for 20+ environments, where the operational benefit of isolated state outweighs the duplication cost.
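Splitting state by layer means lower layers have to look up what higher layers created. One common way to do that is the terraform_remote_state data source. A sketch for the ecs layer reading the network layer, assuming the network layer exports outputs named vpc_id and private_subnet_ids and stores its state under the key shown (all three are assumptions):

```hcl
# .../wenvs/api-dev/ecs/data.tf (sketch)
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "acme-terraform-state"                     # assumed bucket
    key    = "dev/us-east-1/network/terraform.tfstate"  # assumed key layout
    region = "us-east-1"
  }
}

locals {
  # Values published by the network layer's outputs.tf
  vpc_id             = data.terraform_remote_state.network.outputs.vpc_id
  private_subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}
```

The trade-off is an implicit contract: renaming an output in the network layer breaks every environment that reads it, so treat layer outputs like a public interface.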

Approach	Typical scale	State isolation	Best for
Module per env	~10 envs	Strong (per-directory)	Getting started; small fleet
Terragrunt	15–50 envs	Strong (per-env key)	Multi-account; explicit deps
Layered	50+ envs	Strong (per-layer, per-env)	Fleet scale; multi-region
Workspaces	~5 envs	Weak (shared backend)	Not recommended

KEY INSIGHT: There's no universally correct pattern. A team of two managing 8 environments doesn't need Terragrunt. A team of eight managing 60 environments across three AWS accounts probably does. Pick the simplest structure your team can maintain at your current scale — you can refactor later when you need to.

The tagging strategy that makes everything easier

Set default_tags at the Terraform AWS provider level: four tags (Environment, ManagedBy, Product, ManagedWith) cascade automatically to every taggable resource the provider creates, and a fifth, Component, is set per resource or module because it varies by AWS service.

Before scaling past 10 environments, the single most impactful thing you can do is standardize your tags. Tags feed AWS Cost Explorer, automation scripts, and every operations tool in the chain. If your tags are inconsistent, every downstream system that uses them produces wrong answers.

The simplest way to enforce tags is through the Terraform provider itself — apply them once at the provider level and every resource inherits them automatically:

provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      Environment = "dev"
      ManagedBy   = "platform-team"
      Product     = "acme-saas"
      ManagedWith = "terraform"
    }
  }
}

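Tags set here cascade to every taggable resource the provider manages: ECS services, RDS instances, ALBs, security groups. No per-resource duplication. Override individual resources only when a specific resource genuinely needs a different value. One caveat: running Fargate tasks are launched by ECS, not Terraform, so they carry these tags only if the service sets propagate_tags.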

The minimal set that pays for itself the first time you open a bill:

Tag	Example	Purpose
Environment	dev, staging, qa, prod	Cost grouping; scheduling policy
ManagedBy	platform-team, backend	Who owns it; who to ping
Product	acme-saas, acme-ml	Bill attribution per product
ManagedWith	terraform, pulumi, cdk	IaC tool; filters what to automate
Component	ecs, rds, elasticache	AWS service type; per-service filtering

With these tags in place, and activated as cost allocation tags in the AWS Billing console, Cost Explorer can break spend down per environment, per team, per product, and per AWS service. Without them, you get one aggregate compute number and a spreadsheet nobody maintains.
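Component is the one tag in the table that cannot live in default_tags, because its value differs per resource. A sketch of how it might be set on an ECS service, with task-level tag propagation enabled so Fargate usage is attributable too; the variable names are hypothetical inputs, not part of the module shown earlier:

```hcl
# Hypothetical inputs supplied by the surrounding module
variable "cluster_arn" { type = string }
variable "task_definition_arn" { type = string }
variable "private_subnet_ids" { type = list(string) }
variable "service_security_group_ids" { type = list(string) }

resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = var.service_security_group_ids
  }

  # Copy the service's tags onto the tasks it launches.
  propagate_tags = "SERVICE"

  # Merged with the provider's default_tags; resource-level keys win on conflict.
  tags = {
    Component = "ecs"
  }
}
```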

The naming convention matters too. A predictable pattern like {region}-{account}-{env} — e.g. use1-dev-qa1, usw2-prod-main — is both human-readable and machine-parseable. You can grep it in logs, script it in bash, and join it with billing data. The convention itself doesn't matter as much as the consistency: pick one and automate enforcement.
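One way to automate enforcement is to build the prefix in a single place and validate its parts. A sketch, assuming three inputs named region, account, and env; the region-code map is illustrative, and the euw2 entry is an assumed extension of the two codes used above:

```hcl
variable "region" { type = string }  # e.g. "us-east-1"
variable "account" { type = string } # e.g. "dev"

variable "env" {
  type = string # e.g. "qa1"

  validation {
    condition     = can(regex("^[a-z0-9]+$", var.env))
    error_message = "The env value must contain only lowercase letters and digits."
  }
}

locals {
  # Short region codes; extend the map as regions are added.
  region_codes = {
    "us-east-1" = "use1"
    "us-west-2" = "usw2"
    "eu-west-2" = "euw2"
  }

  # region = "us-east-1", account = "dev", env = "qa1" gives "use1-dev-qa1"
  name_prefix = "${local.region_codes[var.region]}-${var.account}-${var.env}"
}
```

Keeping hyphens out of the individual parts is what makes the name safe to split on "-" in scripts and billing joins.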

Terraform provisions. An operations layer manages.

Terraform (provision):

ECS services & task definitions
IAM roles & policies
VPC, subnets, security groups
ALB, target groups, listeners
RDS, ElastiCache, S3

Operations layer (manage), added on top:

Start/stop on a schedule
Clone to any region or account
One-screen fleet visibility
Developer self-service (RBAC)
Cost attribution per environment
AI diagnostics & anomaly detection

Where Terraform starts to break down at scale

Around 15–20 environments, teams hit state sprawl (roughly 30 resources per environment, which reaches 1,500+ resources and 4-minute plans at 50 environments) plus five operational gaps: no scheduling, no cloning, no self-service, no cost-per-environment reporting, no orphan detection. This is not a Terraform flaw: it was built for provisioning, not fleet operations. You need a separate layer.

Around 15–20 environments, teams hit the same walls. Not because Terraform is bad — because it was designed for provisioning, not operations. The distinction matters.

State sprawl

An ECS environment with VPC, subnets, security groups, ALB, target groups, ECS services, task definitions, IAM roles, RDS, and ElastiCache clocks in at about 30 resources. At 50 environments, that's 1,500 resources in state. A terraform plan across the full fleet takes 4+ minutes. Partial applies become necessary, and state drifts out of sync with reality.

"Each Fargate task in awsvpc mode consumes one elastic network interface. The default Fargate On-Demand vCPU quota is 6 vCPUs per region for new accounts — request an increase via Service Quotas before your first real workload. The ENI limit is 5,000 per region. Both grow in lockstep with your environment count."

— Amazon ECS service quotas, verified June 2026

The operations gap

Terraform provisions environments. It doesn't operate them. Every team eventually hits these five gaps and starts building:

Start/stop environments on a schedule — Write your own Lambda + EventBridge + CloudWatch cron, per environment, per timezone. Maintain it. Debug it when the Lambda silently fails.
Clone an environment — Write a new module call, copy all variable values, remember which 3 things are different between the source and the clone. Hope you didn't miss an env var.
Developer self-service — Build a web UI, or accept that developers will open PRs to the infra repo for restarts. Either way, you're now maintaining application code that isn't your product.
Cost per environment — Tag everything consistently. Wait 24 hours for Cost Explorer to update. Export to CSV. Build a spreadsheet. Repeat monthly.
Orphan detection — Write Cost Explorer queries, cross-reference with your Terraform state, and hope the tags on the orphaned resources are correct. They probably aren't — that's why the environment got orphaned.

The scheduling problem alone (per-environment, per-timezone, with manual override support) typically runs 400–600 lines of Lambda and EventBridge configuration before it's production-ready.
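To see where those lines come from, it helps to look at the simplest slice that can be done without Lambda: scaling one ECS service to zero outside working hours with Application Auto Scaling scheduled actions. This is a sketch under stated assumptions (the cluster name dev-cluster, the service name api, and the timezone are placeholders), and it is not the Lambda approach described above:

```hcl
# Register the service's desired count as a scalable target.
resource "aws_appautoscaling_target" "api" {
  service_namespace  = "ecs"
  resource_id        = "service/dev-cluster/api" # placeholder cluster/service
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 0
  max_capacity       = 2
}

# Mon-Fri 19:00 local time: force the service down to zero tasks.
resource "aws_appautoscaling_scheduled_action" "stop_evening" {
  name               = "dev-api-stop-evening"
  service_namespace  = aws_appautoscaling_target.api.service_namespace
  resource_id        = aws_appautoscaling_target.api.resource_id
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension
  schedule           = "cron(0 19 ? * MON-FRI *)"
  timezone           = "America/New_York"

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

# Mon-Fri 09:00 local time: allow the service to run again.
resource "aws_appautoscaling_scheduled_action" "start_morning" {
  name               = "dev-api-start-morning"
  service_namespace  = aws_appautoscaling_target.api.service_namespace
  resource_id        = aws_appautoscaling_target.api.resource_id
  scalable_dimension = aws_appautoscaling_target.api.scalable_dimension
  schedule           = "cron(0 9 ? * MON-FRI *)"
  timezone           = "America/New_York"

  scalable_target_action {
    min_capacity = 1
    max_capacity = 2
  }
}
```

This covers one service only. The aws_ecs_service resource also needs lifecycle ignore_changes on desired_count so the next apply doesn't undo the schedule, and RDS, ElastiCache, and manual overrides are untouched. Those are the parts that push the real implementation into custom code.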

KEY INSIGHT: None of this is Terraform's fault. It's not what Terraform is for. The same way you wouldn't use Terraform to monitor application health or send Slack alerts, you shouldn't expect it to operate a fleet of running environments. You need a separate operations layer — built or bought.

What the operations layer needs to do

The ECS operations layer needs six things: per-environment scheduling, one-click cloning, fleet-wide visibility, RBAC developer self-service, per-environment cost attribution, and orphan detection.

Whether you build the operations layer yourself or evaluate something that provides it, here is the concrete specification for the layer that sits above Terraform, reads the resources it provisions, and manages what happens after terraform apply finishes:

Environment scheduling. Start and stop environments on a configurable schedule — per environment, per timezone, per team. Dev environments run Mon–Fri 9am–7pm. QA runs Mon–Fri 8am–8pm. Production ignores the scheduler. The system must handle the edge cases: what happens when someone manually starts a scheduled-off environment on a Saturday — does it auto-stop after the override period?

Environment cloning. Take any environment and create a copy in a different region or account, with variable overrides. Not a new Terraform module — a one-click operation that copies networking, compute, data stores, and external service config, then deploys. The mechanics of cloning an ECS environment go deeper than most teams expect when they start writing their own tooling. QA needs an isolated copy of EU production to test a compliance flow — that should be a 30-second operation, not a day of writing HCL.

Fleet visibility. One screen showing every environment: status (running/scheduled/stopped), region, service count, current monthly cost, CI/CD pipeline state, and last activity timestamp. No AWS Console tab switching. No ssh-ing into a box to find out what's running there.

Developer self-service. Developers can restart their environments, redeploy services, and view logs — for environments they own. They cannot touch production. They cannot see secrets. They cannot change infrastructure. This requires RBAC scoped to the environment level, not the AWS account level.

Cost attribution and savings tracking. Cost per environment, cost per team, total fleet savings from scheduling. Not an estimate: actual numbers from AWS billing data, updated daily. When the CTO asks “what are we spending on staging this quarter?” you answer in under 30 seconds.
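The environment-scoped RBAC described under developer self-service can be approximated in plain IAM with tag conditions, which is one reason the tagging strategy matters. A sketch, assuming services carry the Environment and ManagedBy tags from earlier; the policy name and tag values are placeholders, and log access would need its own statement:

```hcl
data "aws_iam_policy_document" "dev_self_service" {
  statement {
    sid       = "RedeployTaggedServices"
    effect    = "Allow"
    actions   = ["ecs:UpdateService", "ecs:DescribeServices"]
    resources = ["*"]

    # Only services in non-production environments...
    condition {
      test     = "StringEquals"
      variable = "aws:ResourceTag/Environment"
      values   = ["dev", "qa"]
    }

    # ...that belong to this team.
    condition {
      test     = "StringEquals"
      variable = "aws:ResourceTag/ManagedBy"
      values   = ["backend"]
    }
  }
}

resource "aws_iam_policy" "dev_self_service" {
  name   = "backend-dev-self-service" # placeholder name
  policy = data.aws_iam_policy_document.dev_self_service.json
}
```

Production is excluded simply because its Environment tag never matches, so the guarantee is only as strong as the tag hygiene behind it.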

How Fortem works with your existing Terraform

Fortem reads Terraform-provisioned ECS resources via AWS tags — no HCL access, no state writes, no repo permissions — adding scheduling, cloning, and cost visibility without changing terraform apply.

Fortem is the operations layer described above. It reads the resources Terraform provisions — ECS services, task definitions, IAM roles, RDS instances — through AWS tags and naming conventions. No HCL parsing. No access to your Terraform repository. No state modifications.

You run terraform apply. Fortem detects the new or changed resources, and the environment appears in the fleet view with its services, cost breakdown, and scheduling status. You didn't register anything — the tags your Terraform already applies are how Fortem discovers what exists.

Scheduling is opt-in: add a tag like schedule = "business-hours" to an environment, and Fortem stops it outside working hours and starts it before the workday begins. Remove the tag, scheduling stops. Your Terraform state was never involved.

Uninstall Fortem and everything keeps running. Your terraform apply still works. Your infrastructure was never dependent on the operations layer; the layer was only reading it. Full IAM model on the security page.

Ready to use

Fortem runs inside your own AWS account and reads your Terraform-provisioned ECS fleet in under 30 minutes — no HCL changes, no repo access — and adds scheduling, cloning, and per-environment cost visibility on top of whatever pattern you're already using. Teams running 15–60 environments cut non-prod compute spend by 40–65% in the first month.

Book a 20-min call →

Common questions

How do you manage ECS Fargate with Terraform?

Define your ECS cluster, task definitions, services, IAM roles, ALB, and target groups as Terraform resources. Use a consistent naming convention (region-account-envname prefix) across all resources. Separate state files per environment to avoid blast radius issues — one terraform apply shouldn't affect production when you're changing staging.

What breaks when you scale ECS Terraform past 10 environments?

State file sprawl, inconsistent naming, manual tagging, and the operations gap (scheduling, cloning, cost visibility). Terraform provisions infrastructure but doesn't manage day-two operations. Teams end up maintaining Lambda scripts for scheduling and Cost Explorer queries for per-env visibility.

Does Fortem modify my Terraform state?

No. Fortem reads the resources Terraform provisions — it never writes to your state, pushes to your repo, or modifies HCL. Your infrastructure runs exactly the same whether Fortem is connected or not.

Can I still use terraform apply when using Fortem?

Yes. terraform apply and terraform destroy work exactly as before. Fortem detects the changes and updates its view automatically. You don't need to notify Fortem of infrastructure changes — it picks them up via tags and naming conventions.

How does Fortem handle Terragrunt, Pulumi, or CDK-provisioned environments?

Through tags and naming conventions. Fortem doesn't care which tool provisioned the resources. If an ECS cluster has the right tags, Fortem picks it up regardless of whether Terraform, Terragrunt, Pulumi, CDK, or CloudFormation created it.

Worth reading

Landing · ECS Environment Scheduling. Terraform manages what runs. Scheduling manages when it runs. See how the two work together at fleet scale.

Guide · How to Cut AWS ECS Fargate Costs by 65%. Scheduling, right-sizing, Spot, and orphaned environments: the four methods that take a 12-environment fleet from $1,730 to $380/month.

See your fleet cost: fortem.dev/ecs-cost-calculator

DEV Community