Matt

Posted on • Originally published at fortem.dev

Managing ECS Fargate with Terraform: What Works and What Doesn't

Originally published at https://fortem.dev/blog/ecs-fargate-terraform
Terraform is the right tool for provisioning ECS Fargate infrastructure. But at 10+ environments, state sprawl and the operations gap catch every team. Here's what to build, what to buy, and the patterns that scale.



TL;DR

  • Terraform is the correct tool for provisioning ECS Fargate infrastructure — this article won't try to replace it.
  • Module-per-environment works for ≤10 environments; past that, Terragrunt or a layered directory structure becomes necessary.
  • A consistent tagging strategy (Environment, ManagedBy, Product, ManagedWith, Component) solves cost attribution and makes automation possible at any scale.
  • At 50+ environments, you'll write 1,500+ lines of custom code for scheduling, cloning, and self-service — or you can accept that Terraform needs an operations partner.
  • Fortem reads your Terraform-provisioned resources and adds the ops layer: scheduling, cloning, fleet visibility, and developer self-service — without touching your HCL.

What Terraform does well for ECS Fargate

Terraform is the right tool for provisioning ECS Fargate infrastructure. It's declarative — you describe the desired state, and Terraform makes it happen. You get task definitions, ECS services, IAM roles, security groups, load balancers, and VPC configuration all in one place, versioned in git.

What matters more than the HCL syntax is the workflow it enables. Infrastructure changes go through the same PR process as application code. Your CI pipeline runs terraform plan on every pull request. A senior engineer reviews the diff before merge. If something goes wrong, you roll back by applying the previous commit. This is the gold standard for infrastructure management, and nothing in this article suggests replacing it.

Here's a realistic module call for an ECS environment. It's the basic building block your team is probably using, or something close to it:

module "dev_ecs" {
  source = "./modules/ecs-environment"

  environment = "dev"
  region      = "us-east-1"

  vpc_cidr        = "10.1.0.0/16"
  public_subnets  = ["10.1.1.0/24", "10.1.2.0/24"]
  private_subnets = ["10.1.10.0/24", "10.1.11.0/24"]

  services = {
    api = {
      cpu    = 512
      memory = 1024
      image  = "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:latest"
      port   = 3000
      env_vars = {
        LOG_LEVEL = "debug"
      }
    }
    worker = {
      cpu    = 1024
      memory = 2048
      image  = "123456789012.dkr.ecr.us-east-1.amazonaws.com/worker:latest"
    }
  }

  rds_instance_class = "db.t3.micro"
  redis_node_type    = "cache.t3.micro"

  tags = {
    Environment = "dev"
    Team        = "backend"
    ManagedBy   = "terraform"
  }
}

This is clean, reviewable, and reproducible. One module call = one fully provisioned environment with networking, compute, and data stores. For a single environment or a handful, this is the right pattern.
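For a module call like this to stay reviewable, the shared module has to declare what a service looks like. Here is a minimal sketch of how the `services` input might be typed inside `modules/ecs-environment`. The file contents are an assumption (the article does not show the module's internals), and `optional()` with a default needs Terraform 1.3 or later.

```hcl
# modules/ecs-environment/variables.tf (hypothetical; mirrors the module call above)
variable "environment" {
  description = "Environment name, e.g. dev or staging."
  type        = string
}

variable "services" {
  description = "ECS services to run in this environment, keyed by service name."
  type = map(object({
    cpu      = number
    memory   = number
    image    = string
    port     = optional(number)          # workers may not expose a port
    env_vars = optional(map(string), {}) # defaults to no extra variables
  }))

  validation {
    # Reject CPU values that Fargate does not accept before any plan runs.
    condition = alltrue([
      for s in values(var.services) :
      contains([256, 512, 1024, 2048, 4096, 8192, 16384], s.cpu)
    ])
    error_message = "Each service cpu must be a valid Fargate CPU value in CPU units."
  }
}
```

With a typed input, a typo such as `cpu = 500` fails at `terraform plan` with a readable message instead of failing later at the ECS API.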

Terraform patterns that actually scale

Teams adopt one of three patterns as they grow. There's also a fourth — Terraform workspaces per environment — but the community has largely moved past it. Workspaces aren't true state isolation, the naming is fragile (apply to the wrong workspace and you provision dev where staging should be), and HashiCorp themselves recommend against using them for environment separation. We'll skip it.

Pattern 1: Module per environment

A separate directory for each environment, each calling the same shared module with different variables.

terraform/
├── modules/
│   └── ecs-environment/     # shared module
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── dev/
│   └── main.tf             # module "dev_ecs" { ... }
├── staging/
│   └── main.tf             # module "staging_ecs" { ... }
├── qa/
│   └── main.tf
├── demo/
│   └── main.tf
└── prod/
    └── main.tf

Pros: dead simple. Anyone on the team can open a directory and understand what's deployed. No hidden state, no Terraform workspace tricks. CI can run plan/apply independently per environment, so you can deploy dev without touching staging.

Cons: every new environment means copying a directory with a 15-line main.tf. At 30 environments, you have 30 almost-identical main.tf files. If you add a required variable to the shared module, you update 30 files. Teams outgrow this around 10–15 environments.
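One way to soften the "update 30 files" problem, when the new input has a sensible fallback, is to give it a default so existing callers keep working. This is a sketch only, and the variable name is hypothetical:

```hcl
# modules/ecs-environment/variables.tf (hypothetical new input)
variable "log_retention_days" {
  description = "CloudWatch Logs retention for service log groups, in days."
  type        = number
  default     = 30 # existing environment directories need no change
}
```

This only helps when a default is actually safe. A variable with no reasonable fallback still has to be set in every environment directory.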

Pattern 2: Terragrunt + shared modules

Terragrunt wraps Terraform, keeping configurations DRY while maintaining separate state per environment. Each environment directory contains only a terragrunt.hcl file with environment-specific values — the module source points to a shared Git ref.

# terragrunt.hcl in environments/dev/
terraform {
  source = "git::git@github.com:acme/terraform-modules.git//ecs-environment?ref=v2.3.0"
}

inputs = {
  environment = "dev"
  vpc_cidr    = "10.1.0.0/16"
  services    = { api = { cpu = 512, memory = 1024 } }
}

remote_state {
  backend = "s3"
  config  = {
    bucket = "acme-terraform-state"
    key    = "ecs/dev/terraform.tfstate"
  }
}

Pros: explicit dependencies, multi-account-friendly, strong state isolation. Each environment has its own S3 state key — corruption stays contained. Pin modules to versioned Git tags for reproducible deploys.

Cons: another tool to learn and maintain. Your team now needs to understand both Terraform and Terragrunt. Debugging failures means tracing through two layers of indirection. Not worth it below 15 environments — the overhead outweighs the benefit.
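The DRY benefit shows up when the backend settings move into a shared parent file and each environment only includes it. A minimal sketch follows, assuming a parent file named `root.hcl` in `environments/`. The file name, region, and `encrypt` setting are assumptions, not something the article specifies.

```hcl
# environments/root.hcl (shared by every environment; file name is an assumption)
remote_state {
  backend = "s3"

  # Write a backend.tf into each environment's working directory.
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket  = "acme-terraform-state"
    # Resolves to "dev" for environments/dev, giving ecs/dev/terraform.tfstate.
    key     = "ecs/${path_relative_to_include()}/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true
  }
}

# environments/dev/terragrunt.hcl
include "root" {
  path = find_in_parent_folders("root.hcl")
}

terraform {
  source = "git::git@github.com:acme/terraform-modules.git//ecs-environment?ref=v2.3.0"
}

inputs = {
  environment = "dev"
  vpc_cidr    = "10.1.0.0/16"
}
```

Each new environment then needs only the second file, and the state key follows from its directory name rather than being typed by hand.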

Pattern 3: Layered (accounts → regions → environments)

The repo mirrors your cloud topology. Shared infrastructure lives at higher layers and cascades down. Each environment is a directory with subdirectories per resource type — datastores, ECS services, secrets — so a single environment change is a single terraform apply in one directory, not a full fleet-wide plan.

terraform/
├── deployment/
│   ├── accounts/
│   │   ├── dev/
│   │   │   ├── global/              # account-wide: IAM, S3, route53
│   │   │   └── regions/
│   │   │       ├── us-east-1/
│   │   │       │   ├── network/     # VPC, subnets, security groups
│   │   │       │   ├── shared/      # ECR, CloudTrail, ECS events
│   │   │       │   └── wenvs/       # environments
│   │   │       │       ├── api-dev/
│   │   │       │       │   ├── datastores/   # RDS, ElastiCache
│   │   │       │       │   ├── ecs/          # task defs, services
│   │   │       │       │   ├── secrets/      # Secrets Manager
│   │   │       │       │   └── services/     # SQS, SNS, Lambda
│   │   │       │       └── api-qa/
│   │   │       │           └── ...same layers
│   │   │       └── eu-west-2/
│   │   │           └── ...same structure
│   │   └── prod/
│   │       └── ...same structure
│   └── variables/
│       ├── accounts/{dev,prod}/     # per-account tfvars
│       └── global/                  # org-wide tfvars
└── lib/                             # shared Terraform modules

Pros: each layer owns its resources and nothing else. terraform apply runs against a single directory, so a security group change doesn't trigger a plan across 60 environments. Adding a new environment copies a directory and overrides variables. The structure is self-documenting: anyone on the team can navigate the repo and understand the fleet topology without opening a diagram.

Cons: the repo itself is the configuration mechanism, so there's no single file that describes what exists. New team members need to learn the directory tree. Some duplication between nearly-identical environments unless you lean on shared variables and modules. Best for 20+ environments where the operational benefit of isolated state outweighs the duplication cost.
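Because every layer has its own state, a lower layer needs some way to read what a higher layer created. One common option is the `terraform_remote_state` data source. The sketch below assumes the network layer stores its state under the key shown and exports an output named `private_subnet_ids`; both are hypothetical.

```hcl
# wenvs/api-dev/ecs/data.tf (hypothetical file in the ECS layer)

# Read the outputs of the network layer's state. Nothing here can modify it.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "acme-terraform-state"
    key    = "dev/us-east-1/network/terraform.tfstate" # assumed key layout
    region = "us-east-1"
  }
}

locals {
  # Assumes the network layer declares: output "private_subnet_ids" { ... }
  private_subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}
```

The dependency runs in one direction: the ECS layer reads network outputs, while the network layer never needs to know which environments consume it.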

| Approach | Practical scale | State isolation | Best for |
|---|---|---|---|
| Module per env | ~10 envs | Strong (per-directory) | Getting started; small fleet |
| Terragrunt | 15–50 envs | Strong (per-env key) | Multi-account; explicit deps |
| Layered | 50+ envs | Strong (per-layer, per-env) | Fleet scale; multi-region |
| Workspaces | ~5 envs | Weak (shared backend) | Not recommended |

KEY INSIGHT: There's no universally correct pattern. A team of two managing 8 environments doesn't need Terragrunt. A team of eight managing 60 environments across three AWS accounts probably does. Pick the simplest structure your team can maintain at your current scale — you can refactor later when you need to.

The tagging strategy that makes everything easier

Before scaling past 10 environments, the single highest-leverage thing you can do is standardize your tags. Tags feed AWS Cost Explorer, automation scripts, and every operations tool in the chain. If your tags are inconsistent, every downstream system that uses them produces wrong answers.

The simplest way to enforce tags is through the Terraform provider itself — apply them once at the provider level and every resource inherits them automatically:

provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      Environment = "dev"
      ManagedBy   = "platform-team"
      Product     = "acme-saas"
      ManagedWith = "terraform"
    }
  }
}

Tags set here cascade to every taggable resource the provider manages: ECS services, RDS instances, ALBs, security groups. No per-resource duplication. Override individual resources only when a specific resource genuinely needs a different value.
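Resource-level tags merge with the provider's `default_tags`, which is where a per-resource value such as Component fits. The sketch below also sets `propagate_tags`, because tags on an ECS service are not copied to the tasks it launches unless you ask for it. The variables stand in for resources defined elsewhere and are assumptions.

```hcl
variable "cluster_arn" {
  type = string
}

variable "task_definition_arn" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets = var.private_subnet_ids
  }

  # Copy the service's tags (including default_tags) onto its tasks.
  propagate_tags = "SERVICE"

  # Merged with the provider's default_tags; a matching key here would win.
  tags = {
    Component = "ecs"
  }
}
```

The service ends up with Environment, ManagedBy, Product, and ManagedWith from the provider plus Component from the resource, and its tasks carry the same set.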

Here's the minimal set that pays for itself the first time you open a bill:

| Tag | Example | Purpose |
|---|---|---|
| Environment | dev, staging, qa, prod | Cost grouping; scheduling policy |
| ManagedBy | platform-team, backend | Who owns it; who to ping |
| Product | acme-saas, acme-ml | Bill attribution per product |
| ManagedWith | terraform, pulumi, cdk | IaC tool; filters what to automate |
| Component | ecs, rds, elasticache | AWS service type; per-service filtering |

With these tags, and once they are activated as cost allocation tags in the AWS Billing console, Cost Explorer can break spend down per environment, per team, per product, and per AWS service. Without them, you get one aggregate compute number and a spreadsheet nobody maintains.

The naming convention matters too. A predictable pattern like {region}-{account}-{env} — e.g. use1-dev-qa1, usw2-prod-main — is both human-readable and machine-parseable. You can grep it in logs, script it in bash, and join it with billing data. The convention itself doesn't matter as much as the consistency: pick one and automate enforcement.
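One way to automate enforcement is to build the name in a single place and validate its parts. This is a sketch under assumptions: the variable names are hypothetical, and the short region codes are inferred from the two examples in the text.

```hcl
variable "region" {
  type    = string
  default = "us-east-1"
}

variable "account" {
  type    = string
  default = "dev"
}

variable "env" {
  type    = string
  default = "qa1"

  validation {
    # Keep names grep-friendly: lowercase letters and digits only.
    condition     = can(regex("^[a-z][a-z0-9]*$", var.env))
    error_message = "The env value must be lowercase alphanumeric and start with a letter."
  }
}

locals {
  # Short codes inferred from the examples above; extend for other regions.
  region_codes = {
    "us-east-1" = "use1"
    "us-west-2" = "usw2"
  }

  # {region}-{account}-{env}; the defaults yield "use1-dev-qa1".
  name_prefix = "${local.region_codes[var.region]}-${var.account}-${var.env}"
}

output "name_prefix" {
  value = local.name_prefix
}
```

A region missing from the map fails at plan time, which is the point: a name that doesn't fit the convention never reaches AWS.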

Terraform provisions.

An operations layer manages.

Terraform — provision

  • ECS services & task definitions
  • IAM roles & policies
  • VPC, subnets, security groups
  • ALB, target groups, listeners
  • RDS, ElastiCache, S3

Operations layer — manage

  • Start/stop on a schedule
  • Clone to any region or account
  • One-screen fleet visibility
  • Developer self-service (RBAC)
  • Cost attribution per environment
  • AI diagnostics & anomaly detection

Where Terraform starts to break down at scale

Around 15–20 environments, teams hit the same walls. Not because Terraform is bad — because it was designed for provisioning, not operations. The distinction matters.

State sprawl

An ECS environment with VPC, subnets, security groups, ALB, target groups, ECS services, task definitions, IAM roles, RDS, and ElastiCache clocks in at about 30 resources. At 50 environments, that's 1,500 resources in state. A terraform plan across the full fleet takes 4+ minutes. Partial applies become necessary, and state drifts out of sync with reality.

The operations gap

Terraform provisions environments. It doesn't operate them. Every team eventually hits these five gaps and starts building:

  • Start/stop environments on a schedule — Write your own Lambda + EventBridge + CloudWatch cron, per environment, per timezone. Maintain it. Debug it when the Lambda silently fails.

  • Clone an environment — Write a new module call, copy all variable values, remember which 3 things are different between the source and the clone. Hope you didn't miss an env var.

  • Developer self-service — Build a web UI, or accept that developers will open PRs to the infra repo for restarts. Either way, you're now maintaining application code that isn't your product.

  • Cost per environment — Tag everything consistently. Wait 24 hours for Cost Explorer to update. Export to CSV. Build a spreadsheet. Repeat monthly.

  • Orphan detection — Write Cost Explorer queries, cross-reference with your Terraform state, and hope the tags on the orphaned resources are correct. They probably aren't — that's why the environment got orphaned.

KEY INSIGHT: None of this is Terraform's fault. It's not what Terraform is for. The same way you wouldn't use Terraform to monitor application health or send Slack alerts, you shouldn't expect it to operate a fleet of running environments. You need a separate operations layer — built or bought.

What the operations layer needs to do

If you're going to build the operations layer yourself — or evaluate something that provides it — here's the concrete list of what it needs to handle. This is the specification for the layer that sits above Terraform, reads the resources it provisions, and manages what happens after terraform apply finishes.

Environment scheduling. Start and stop environments on a configurable schedule — per environment, per timezone, per team. Dev environments run Mon–Fri 9am–7pm. QA runs Mon–Fri 8am–8pm. Production ignores the scheduler. The system must handle the edge cases: what happens when someone manually starts a scheduled-off environment on a Saturday — does it auto-stop after the override period?
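For a sense of what the simplest do-it-yourself version looks like, here is a sketch that scales one ECS service to zero at 7pm and back to one task at 9am on weekdays. It uses Application Auto Scaling scheduled actions rather than the Lambda and EventBridge approach mentioned earlier, which is a swapped-in technique. The variable names and timezone are assumptions.

```hcl
variable "cluster_name" {
  type = string
}

variable "service_name" {
  type = string
}

# Register the service's desired count with Application Auto Scaling.
resource "aws_appautoscaling_target" "svc" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${var.cluster_name}/${var.service_name}"
  min_capacity       = 1
  max_capacity       = 1

  lifecycle {
    # Scheduled actions rewrite these bounds, so ignore them to avoid plan noise.
    ignore_changes = [min_capacity, max_capacity]
  }
}

resource "aws_appautoscaling_scheduled_action" "stop" {
  name               = "${var.service_name}-stop-evening"
  service_namespace  = aws_appautoscaling_target.svc.service_namespace
  resource_id        = aws_appautoscaling_target.svc.resource_id
  scalable_dimension = aws_appautoscaling_target.svc.scalable_dimension
  schedule           = "cron(0 19 ? * MON-FRI *)" # 7pm, Monday to Friday
  timezone           = "America/New_York"

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

resource "aws_appautoscaling_scheduled_action" "start" {
  name               = "${var.service_name}-start-morning"
  service_namespace  = aws_appautoscaling_target.svc.service_namespace
  resource_id        = aws_appautoscaling_target.svc.resource_id
  scalable_dimension = aws_appautoscaling_target.svc.scalable_dimension
  schedule           = "cron(0 9 ? * MON-FRI *)" # 9am, Monday to Friday
  timezone           = "America/New_York"

  scalable_target_action {
    min_capacity = 1
    max_capacity = 1
  }
}
```

This covers one service's task count and nothing else. RDS and ElastiCache keep running, there is no handling for a manual Saturday start, and the ECS service's own `desired_count` will likely need the same `ignore_changes` treatment. Those gaps are the edge cases the paragraph above describes.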

Environment cloning. Take any environment and create a copy in a different region or account, with variable overrides. Not a new Terraform module — a one-click operation that copies networking, compute, data stores, and external service config, then deploys. QA needs an isolated copy of EU production to test a compliance flow. That should be a 30-second operation, not a day of writing HCL.

Fleet visibility. One screen showing every environment: status (running/scheduled/stopped), region, service count, current monthly cost, CI/CD pipeline state, and last activity timestamp. No AWS Console tab switching. No ssh-ing into a box to find out what's running there.

Developer self-service. Developers can restart their environments, redeploy services, and view logs — for environments they own. They cannot touch production. They cannot see secrets. They cannot change infrastructure. This requires RBAC scoped to the environment level, not the AWS account level.
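If you build this yourself on plain IAM, environment-level scoping usually means tag-based conditions. The following sketch allows restarts and redeploys only on ECS services tagged with an owned environment, and explicitly denies anything tagged prod. The policy name and the variable are hypothetical, and this is not a description of how Fortem implements RBAC.

```hcl
variable "owned_environments" {
  description = "Environment tag values this group of developers may operate."
  type        = list(string)
  default     = ["dev", "qa"]
}

data "aws_iam_policy_document" "self_service" {
  # Restart or redeploy (UpdateService with a forced new deployment) and inspect.
  statement {
    sid       = "OperateOwnedEnvironments"
    effect    = "Allow"
    actions   = ["ecs:UpdateService", "ecs:DescribeServices"]
    resources = ["*"]

    condition {
      test     = "StringEquals"
      variable = "aws:ResourceTag/Environment"
      values   = var.owned_environments
    }
  }

  # An explicit deny wins over any allow granted elsewhere.
  statement {
    sid       = "NeverTouchProd"
    effect    = "Deny"
    actions   = ["ecs:*"]
    resources = ["*"]

    condition {
      test     = "StringEquals"
      variable = "aws:ResourceTag/Environment"
      values   = ["prod"]
    }
  }
}

resource "aws_iam_policy" "self_service" {
  name   = "ecs-self-service"
  policy = data.aws_iam_policy_document.self_service.json
}
```

The policy is only as reliable as the Environment tag on each service. It grants no access to secrets or infrastructure changes, and log viewing would need its own statement.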

Cost attribution and savings tracking. Cost per environment, cost per team, total fleet savings from scheduling. Not an estimate, but actual numbers from AWS billing data, updated daily. When the CTO asks “what are we spending on staging this quarter?” you answer in under 30 seconds.

How Fortem works with your existing Terraform

Fortem is the operations layer described above. It reads the resources Terraform provisions — ECS services, task definitions, IAM roles, RDS instances — through AWS tags and naming conventions. No HCL parsing. No access to your Terraform repository. No state modifications.

You run terraform apply. Fortem detects the new or changed resources, and the environment appears in the fleet view with its services, cost breakdown, and scheduling status. You didn't register anything — the tags your Terraform already applies are how Fortem discovers what exists.

Scheduling is opt-in: add a tag like schedule = "business-hours" to an environment, and Fortem stops it outside working hours and starts it before the workday begins. Remove the tag, scheduling stops. Your Terraform state was never involved.

Uninstall Fortem and everything keeps running. Your terraform apply still works. Your infrastructure was never dependent on the operations layer; Fortem was only reading it. The full IAM model is on the security page.

Common questions

Does Fortem modify my Terraform state?

No. Fortem reads the resources Terraform provisions — it never writes to your state, pushes to your repo, or modifies HCL. Your infrastructure runs exactly the same whether Fortem is connected or not.

Can I still use terraform apply to change infrastructure when using Fortem?

Yes. terraform apply and terraform destroy work exactly as before. Fortem detects the changes and updates its view automatically. You don't need to notify Fortem of infrastructure changes — it picks them up via tags and naming conventions.

Does Fortem need access to my Terraform repository?

No. Fortem never touches your Terraform repo or state files. It connects to your AWS account through a cross-account IAM role and reads resources directly — the same resources your Terraform provisions.

What happens to Fortem if I destroy an environment with Terraform?

Fortem detects the resources are gone and removes the environment from its dashboard. No stuck state, no sync errors. One environment disappearing doesn't affect anything else in the Fortem fleet view.

How does Fortem handle environments provisioned by Terragrunt or Pulumi instead of vanilla Terraform?

The same way — through tags and naming conventions. Fortem doesn't care which tool provisioned the resources.

See what the operations layer looks like for your Terraform-provisioned fleet


See your fleet cost: fortem.dev/ecs-cost-calculator
