How to Clone an ECS Environment Without Rewriting Terraform?
Originally published at https://fortem.dev/blog/ecs-environment-clone
Cloning 15 ECS services, an ALB, RDS, and SSM params is a 12-step manual process. Terraform workspaces break at 10+ services. Here's the template approach — and a working Terraform module.
What "clone an environment" actually means
The phrase "clone this environment" lands on you from three different directions, each with its own urgency:
- Compliance audit: "We need an isolated clone of EU production to test the GDPR flows." Need it by Friday. Cannot share production data.
- New engineer onboarding: "Can you spin up a copy of staging so the new hire can break things in private?" Need it today. Prefer it by lunch.
- QA isolation: "QA needs a clone of staging, but with the test Stripe key and a read-only RDS replica." Need it now and three more times this week.
In each case, "clone" does not mean copy one ECS service. It means copy the ensemble: 12-18 services, an Application Load Balancer with listener rules, target groups, RDS instances (or snapshots), SSM parameter paths, ECR repos, NAT Gateway routing, CloudWatch log groups, Secrets Manager entries. 15 things, each with its own copy strategy, no orchestration layer between them.
The AWS CLI has no aws ecs copy-service command. The closest you get is aws ecs describe-services followed by aws ecs create-service, and that copies ONE service. Not the ALB. Not the RDS. Not the SSM params. The rest is a bundle of heterogeneous resources that you hold in your head. No tool knows what belongs together except you and your Terraform.
KEY INSIGHT: An ECS environment is a bundle of services you deploy together, not a single resource. Copying a service copies one brick. Cloning a building means replicating the wiring, plumbing, and foundation too — and those live in different AWS namespaces.
The manual approach — 12 steps, 4 hours
Here's the full walkthrough. Real steps, real commands. This is what happens every time a clone request hits your Slack. The times are from doing this 20+ times.
Step 1. Copy Terraform, find-replace env names (15-20 minutes)
Open your Terraform repo. Copy the environment root module. Find environment = "production" in terraform.tfvars — change to "clone-gdpr". Then go through 8 files and change every reference to the old env name.
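A blind find-replace is where this step usually goes wrong, because it also rewrites words that merely contain the env name. A minimal sketch of a safer rename, assuming the env name always appears as a whole token (the function name and sample tfvars content are illustrative, not from any real repo):

```python
import re


def rename_env(text: str, old: str, new: str) -> str:
    """Replace `old` with `new` only where `old` stands alone as a token."""
    # Letters and digits on either side mean we are inside a longer word.
    pattern = re.compile(rf"(?<![A-Za-z0-9]){re.escape(old)}(?![A-Za-z0-9])")
    return pattern.sub(lambda _match: new, text)


tfvars = (
    'environment = "production"\n'
    'ssm_prefix  = "/production/api"\n'
    'note        = "productionized"\n'
)
print(rename_env(tfvars, "production", "clone-gdpr"))
```

The first two lines are rewritten; "productionized" is left alone because the match is followed by a letter.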
Step 2. Register cloned task definitions (10 minutes)
For each of the 15 services: run aws ecs describe-task-definition on the source, then aws ecs register-task-definition with a new family name. The family must include the clone env name so you don't accidentally deploy to production.
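The describe output cannot be fed straight back into register-task-definition, because it carries read-only fields that the register call rejects. A minimal sketch of the conversion, assuming source families are named "<env>-<service>" (that naming convention and the function name are assumptions for illustration):

```python
import copy

# Returned by DescribeTaskDefinition, not accepted by RegisterTaskDefinition.
READ_ONLY_FIELDS = (
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy", "deregisteredAt",
)


def clone_task_definition(source: dict, source_env: str, clone_env: str) -> dict:
    """Build a register-task-definition payload from a described task definition."""
    payload = {
        key: copy.deepcopy(value)
        for key, value in source.items()
        if key not in READ_ONLY_FIELDS
    }
    family = payload["family"]
    # Assumption: the source family is prefixed with its env name.
    if family.startswith(f"{source_env}-"):
        family = family[len(source_env) + 1:]
    payload["family"] = f"{clone_env}-{family}"
    return payload


described = {
    "family": "production-api",
    "revision": 7,
    "status": "ACTIVE",
    "cpu": "256",
    "containerDefinitions": [{"name": "api", "image": "example/api:1.0"}],
}
print(clone_task_definition(described, "production", "clone-gdpr")["family"])
```

The returned dict is shaped for the keyword arguments of a register call (for example boto3's ecs register_task_definition); the source dict is left untouched.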
Step 3. Create the ECS cluster (2 minutes)
aws ecs create-cluster --cluster-name clone-gdpr. Trivial.
Step 4. Create 15 services (20-25 minutes)
For each service: aws ecs create-service with the cloned task def, target group ARN, subnets, security group, and service discovery namespace. If you get the VPC config wrong, the service launches in the wrong network. If you get the IAM role wrong, it launches and fails 10 minutes later.
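One way to catch the wrong-network mistake before anything launches is to build the create-service request in code and validate the subnets first. A sketch under stated assumptions: the helper name, the allowed_subnets check, and the convention that the container is named after the service are all illustrative; the request keys follow the ECS CreateService API.

```python
def build_create_service_request(cluster, service_name, task_definition_arn,
                                 target_group_arn, subnets, security_groups,
                                 allowed_subnets, container_port=80):
    """Return CreateService parameters, refusing subnets outside the clone VPC."""
    wrong = sorted(set(subnets) - set(allowed_subnets))
    if wrong:
        raise ValueError(f"subnets outside the clone VPC: {wrong}")
    return {
        "cluster": cluster,
        "serviceName": service_name,
        "taskDefinition": task_definition_arn,
        "desiredCount": 1,
        "launchType": "FARGATE",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                "assignPublicIp": "DISABLED",
            }
        },
        "loadBalancers": [{
            "targetGroupArn": target_group_arn,
            # Assumption: the container is named after the service.
            "containerName": service_name,
            "containerPort": container_port,
        }],
    }
```

The IAM role failure mode is not covered here; that one only shows up at task start.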
Step 5. Copy ALB listener rules (25-30 minutes — hardest part)
The ALB has listener rules that route traffic by host header: production.api.example.com. The clone needs clone-gdpr.api.example.com. You need to:
- Add a certificate for the new subdomain in ACM (20 min for DNS validation)
- Create host-based routing rules for each service
- Point each rule at the cloned target group
- Verify priority ordering (rules are evaluated in priority order, lowest number first, and the first match wins; a wrong priority sends traffic to the wrong target group or blocks it)
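The rule-copying part of this step can be planned offline before touching the listener. The sketch below uses a simplified rule shape (priority, host, service) rather than the real ELBv2 response format, and assumes hosts follow the "<env>.<service>.example.com" pattern from the example above; it picks priorities that are not already taken on the listener, since priorities must be unique.

```python
def plan_clone_rules(existing_rules, source_env, clone_env):
    """Plan host-based rules for the clone with unused listener priorities."""
    used = {rule["priority"] for rule in existing_rules}
    next_priority = 1
    planned = []
    # Walk source rules in priority order so the clone keeps their relative order.
    for rule in sorted(existing_rules, key=lambda r: r["priority"]):
        if not rule["host"].startswith(f"{source_env}."):
            continue
        while next_priority in used:
            next_priority += 1
        used.add(next_priority)
        planned.append({
            "priority": next_priority,
            "host": clone_env + rule["host"][len(source_env):],
            "target_group": f"{clone_env}-{rule['service']}",
        })
    return planned


rules = [
    {"priority": 1, "host": "production.api.example.com", "service": "api"},
    {"priority": 2, "host": "production.web.example.com", "service": "web"},
    {"priority": 5, "host": "staging.api.example.com", "service": "api"},
]
for planned_rule in plan_clone_rules(rules, "production", "clone-gdpr"):
    print(planned_rule)
```

Because each cloned rule matches a distinct host, the exact priority values matter less than the fact that none collide with existing rules.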
Steps 6-8. RDS, SSM params, Secrets Manager (20-30 minutes)
RDS: restore a snapshot or create a new instance with the same config. SSM: copy all /production/* params to /clone-gdpr/*. This is the step everyone forgets on the first attempt — the cloned services start but the ECS tasks can't read their config. They fail silently and report "running" for 5 minutes before cycling.
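Since forgotten parameters are the usual failure here, it helps to compute the expected clone paths and diff them against what actually exists. A minimal sketch, assuming the name lists come from something like a get-parameters-by-path call on each prefix (the function names are illustrative):

```python
def remap_parameter_names(names, source_env, clone_env):
    """Map each /<source_env>/... parameter name to its /<clone_env>/... name."""
    prefix = f"/{source_env}/"
    return {
        name: f"/{clone_env}/{name[len(prefix):]}"
        for name in names
        if name.startswith(prefix)
    }


def missing_in_clone(source_names, clone_names, source_env, clone_env):
    """Return clone parameter names that should exist but do not."""
    expected = remap_parameter_names(source_names, source_env, clone_env)
    return sorted(set(expected.values()) - set(clone_names))


source = ["/production/api/DATABASE_URL", "/production/api/STRIPE_KEY"]
clone = ["/clone-gdpr/api/DATABASE_URL"]
print(missing_in_clone(source, clone, "production", "clone-gdpr"))
```

Running the diff before starting the services turns a five-minute restart loop into an immediate list of missing names. Values are deliberately not copied here; secrets such as the Stripe key usually need a clone-specific value anyway.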
Steps 9-11. ECR, log groups, testing (15-20 minutes)
Update ECR repo policy if you use per-env repos. Create CloudWatch log groups for each service (the awslogs driver can create them for you when awslogs-create-group is enabled, but you want the right retention policy). Test: does service A connect to the right RDS? Does service B hit the right ElastiCache? Does the Stripe webhook fire in test mode, not production? The answer to one of these is usually "no" on the first try.
Step 12. Document the clone (10 minutes)
Write the env name, purpose, and expiry date in the team wiki. 50/50 chance this actually happens. Three months later, CTO asks "what's clone-gdpr?" and nobody remembers.
Total: 4-8 hours depending on how many things break, whether ACM decides to take 45 minutes to validate DNS, and how many SSM params you forget on the first pass.
Terraform workspaces — when they work, when they break
Terraform workspaces are the official answer for "same infrastructure, different instances." Each workspace has its own state file. When you run terraform apply, it provisions resources against the selected workspace's state, and you keep the instances apart by embedding the workspace name (terraform.workspace) in resource names and tags.
Workspaces work when:
- All your services are identical across environments (same task definitions, same desired counts)
- You use a single VPC for non-production
- You don't have external services (MongoDB Atlas, Vercel, Firebase) with environment-specific connection strings
Workspaces break when:
- You have 10+ services with different configurations per env. A staging service might run 2 replicas on 0.5 vCPU. A production clone of the same service runs 4 replicas on 2 vCPU. Workspaces reuse the same Terraform code, so you're writing conditional logic inside locals blocks, which defeats the purpose.
- You have 20+ environments. Terraform itself does not enforce a workspace cap, but in practice about 20 workspaces per configuration is where the approach stops being manageable. A team with 15 environments and 5 compliance clones is at that limit.
- RDS cloning isn't a Terraform operation. It's an AWS operation that happens outside of Terraform's lifecycle. You run it separately, then update your Terraform variables to point at the clone. The orchestration gap is still there.
For the full picture of what Terraform handles well for ECS and what it doesn't, the Terraform-specific guide walks through the gaps in detail. Cloning is where the orchestration gap is most visible.
The template approach — 30 seconds, no Terraform
A template defines the environment once — services, configs, dependencies, env vars, secrets references, external service connections. Cloning from a template means: pick the source env → give it a name → pick a region → done.
Under the hood, the template engine calls 6 ECS APIs, copies task definitions, maps SSM parameter paths, sets up ALB listener rules, creates service connections, and points everything at the right resources. These are the same 12 manual steps from section 2. The difference is none of them are manual.
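To make "define once, clone by name" concrete, here is a toy model of a template and its rendering step. It is a sketch only: the class names, fields, and naming patterns are hypothetical and do not describe Fortem's or Proton's actual template format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ServiceSpec:
    name: str
    cpu: int
    memory: int
    desired_count: int


@dataclass(frozen=True)
class EnvironmentTemplate:
    services: Tuple[ServiceSpec, ...]
    ssm_keys: Tuple[str, ...]  # keys relative to the env prefix, e.g. "api/DATABASE_URL"


def render(template: EnvironmentTemplate, env: str, region: str) -> dict:
    """Expand the template into concrete resource names for one environment."""
    return {
        "cluster": env,
        "region": region,
        "services": [f"{env}-{spec.name}" for spec in template.services],
        "hosts": [f"{env}.{spec.name}.example.com" for spec in template.services],
        "ssm_paths": [f"/{env}/{key}" for key in template.ssm_keys],
    }


template = EnvironmentTemplate(
    services=(ServiceSpec("api", 512, 1024, 2), ServiceSpec("worker", 256, 512, 1)),
    ssm_keys=("api/DATABASE_URL", "worker/QUEUE_URL"),
)
print(render(template, "clone-gdpr", "eu-west-1"))
```

The point of the structure is that the env name is the only input that changes between clones; everything a human would otherwise retype is derived from it.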
AWS Proton was supposed to be this — AWS's environment templating service. Proton let you define a CloudFormation template for an environment and deploy instances of it. It was the right idea: define once, clone as many times as you need.
Proton is deprecated October 7, 2026. The migration timeline is public. For teams with 15+ ECS environments and regular cloning needs, Fortem fills the gap — Proton's template engine, rebuilt for ECS Fargate specifically, without the CloudFormation layer.
KEY INSIGHT: The template approach works because it understands what an "environment" is — a bundle of services deployed together. The manual approach doesn't know this. You, the platform engineer, hold the bundle in your head. The script doesn't. When you clone an environment for the 30th time, the error rate converges to a function of how tired you are.
Ready-to-use: parameterized Terraform module
If you want a DIY approach that's better than fully manual but doesn't require a template engine, here's a parameterized Terraform module. It clones an ECS service — for the full env, run it 15 times with different service names.
# Clones a single ECS service; instantiate the module once per service
# for a full environment clone.
# Usage: module "clone_service" { source = "./ecs-service" env = "clone-gdpr" ... }

variable "env" { type = string }
variable "service_name" { type = string }
variable "cluster_name" { type = string }
variable "task_family" { type = string }
variable "subnets" { type = list(string) }
variable "security_groups" { type = list(string) }
variable "container_image" { type = string }
variable "vpc_id" { type = string }
variable "alb_listener_arn" { type = string }

resource "aws_ecs_service" "service" {
  name            = "${var.env}-${var.service_name}"
  cluster         = var.cluster_name
  task_definition = aws_ecs_task_definition.task.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.subnets
    security_groups  = var.security_groups
    assign_public_ip = false
  }

  # Register the tasks with the target group; without this block the
  # listener rule forwards to an empty target group.
  load_balancer {
    target_group_arn = aws_lb_target_group.tg.arn
    container_name   = var.service_name
    container_port   = 80
  }

  # The target group must be attached to the listener before the service is created.
  depends_on = [aws_lb_listener_rule.rule]
}

resource "aws_ecs_task_definition" "task" {
  family                   = "${var.env}-${var.task_family}"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = "256"
  memory                   = "512"
  # Add execution_role_arn here if the image lives in a private ECR repo.

  container_definitions = jsonencode([{
    name         = var.service_name
    image        = var.container_image
    portMappings = [{ containerPort = 80 }]
    environment = [
      { name = "ENV_PREFIX", value = var.env },
      { name = "SSM_PATH", value = "/${var.env}/${var.service_name}/" }
    ]
  }])
}

# ALB target group per service, host-based routing by env
resource "aws_lb_target_group" "tg" {
  name        = "${var.env}-${var.service_name}" # target group names max out at 32 characters
  port        = 80
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"
}

resource "aws_lb_listener_rule" "rule" {
  listener_arn = var.alb_listener_arn
  # priority is left unset so the provider assigns the next free one.
  # Deriving it from var.subnets gave every service the same priority,
  # and priorities must be unique per listener. Set it explicitly if
  # rule order matters.

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.tg.arn
  }

  condition {
    host_header {
      values = ["${var.env}.${var.service_name}.example.com"]
    }
  }
}
This module works for standard ECS services. It requires you to parameterize your infrastructure — essentially, building a template engine by hand. For 3 services, this is clean. For 15 services, each with different vCPU/memory configs and different env vars, you end up with 15 copies of this module, each with its own values — and you've rebuilt Proton by accident.
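One way to avoid hand-maintaining 15 near-identical module calls is to generate them. The sketch below emits Terraform's JSON configuration syntax (a *.tf.json file) with one module block per service. It assumes the module lives at ./ecs-service as in the usage comment, and that the task family matches the service name; the function name, image URIs, and resource IDs are placeholders.

```python
import json


def module_calls(env: str, services: dict, shared: dict) -> dict:
    """Build a Terraform JSON config with one module call per service."""
    modules = {}
    for name, inputs in sorted(services.items()):
        # Module labels work best as plain identifiers, so swap hyphens out.
        label = f"clone_{name}".replace("-", "_")
        modules[label] = {
            "source": "./ecs-service",
            "env": env,
            "service_name": name,
            "task_family": name,  # assumption: family matches the service name
            **shared,
            **inputs,  # per-service values win over shared ones
        }
    return {"module": modules}


shared = {
    "cluster_name": "clone-gdpr",
    "subnets": ["subnet-aaaa1111"],
    "security_groups": ["sg-bbbb2222"],
    "vpc_id": "vpc-cccc3333",
    "alb_listener_arn": "arn:aws:elasticloadbalancing:placeholder",
}
services = {
    "api": {"container_image": "example/api:1.4.2"},
    "billing-api": {"container_image": "example/billing-api:2.0.0"},
}
print(json.dumps(module_calls("clone-gdpr", services, shared), indent=2))
```

Writing the output to a file such as clones.tf.json next to the module directory lets terraform plan pick it up. Any extra per-service variable goes in that service's dict. This is still a hand-built template engine, only with the copy-paste moved into a loop.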
When to stop cloning manually
At 5 environments: manual cloning is ~20 hours/year of copy-paste. The SSM params step is the bottleneck. You're still faster than setting up a template engine.
At 10 environments: manual cloning is ~40 hours/year. Terraform workspaces are approaching the practical limit of about 20 per configuration. You're spending a work week per year on something a machine should do. The parameterized module above makes sense here.
At 20+ environments: you need a template engine. The module approach has become 15 near-identical copies of the same thing, each with slightly different values. You've rebuilt what Fortem ships, minus the UI, the RBAC, and the audit log of who cloned what and when.
When compliance is involved: stop immediately. A single forgotten SSM parameter (a single service pointing at production DB instead of the cloned DB) means a failed audit. Not an extra hour. A failed audit. Template engines don't forget SSM params. Human beings do.
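If you do clone manually for a compliance environment, a cheap guard is to scan the clone's resolved configuration for anything that still names the source env. A minimal sketch, assuming you have already collected each service's env vars into a dict (the input shape and function name are illustrative):

```python
import re


def find_source_references(service_env_vars: dict, source_env: str) -> list:
    """Return (service, key) pairs whose value still mentions the source env."""
    token = re.compile(
        rf"(?<![A-Za-z0-9]){re.escape(source_env)}(?![A-Za-z0-9])"
    )
    leaks = []
    for service, env_vars in sorted(service_env_vars.items()):
        for key, value in sorted(env_vars.items()):
            if token.search(value):
                leaks.append((service, key))
    return leaks


clone_config = {
    "api": {
        "DATABASE_URL": "postgres://production-db.internal:5432/app",
        "SSM_PATH": "/clone-gdpr/api/",
    },
    "web": {"API_URL": "https://clone-gdpr.api.example.com"},
}
print(find_source_references(clone_config, "production"))
```

An empty result does not prove the clone is isolated (a hostname may not contain the env name at all), but a non-empty one is a definite finding to fix before the audit.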
Questions you probably have next
Not product FAQ. The things you actually wonder about after reading.
Can I clone between AWS accounts?
Yes, but it's not a single operation. Cross-account cloning requires copying ECR images across accounts (using cross-account pull permissions), restoring RDS snapshots into the target account, and copying SSM parameters manually. A template engine with cross-account IAM roles (like Fortem on the Scale plan) does it in one operation. Manually, it's 8-12 hours.
Does the clone include RDS data, or just the schema?
RDS cloning is a snapshot operation — it includes data. For GDPR compliance testing, you need the data to validate masking and access controls. For developer sandboxes, an empty schema is often enough. Template engines can do both (restore a snapshot OR create an empty instance with the same config). The manual Terraform module above creates an empty instance.
What about MongoDB Atlas / Vercel — can I clone external services too?
External services (Atlas, Firebase, Vercel, Cloudflare) have their own cloning APIs. No cloud templating engine can touch them directly — they're outside AWS. The Fortem approach stores the external service connection details as parameters and points the cloned services at them automatically. The actual clone of the Atlas cluster or Vercel project is separate.
How is cloning different from a Blue/Green deployment?
Blue/Green is deploying a NEW version of an EXISTING service alongside the old version — same env, different code. Cloning is copying an ENTIRE environment — different env, same infrastructure pattern. Blue/Green is about deployment safety. Cloning is about environment replication. The tools are different, the lifecycle is different. Don't confuse them.
Common questions
Does Fortem clone RDS instances?
Fortem clones ECS Fargate services and orchestrates the surrounding infrastructure (ALB rules, target groups, SSM parameter paths, Secrets Manager references, CloudWatch log groups). RDS cloning is a separate operation — you'd restore a snapshot or use pg_dump. Fortem handles the connection strings and env vars that point the cloned services at the right database instance.
How long does a clone take in Fortem?
The clone itself takes 30-60 seconds. Fortem copies task definitions, creates new ECS services with the same config, duplicates ALB listener rules with host-based routing for the new env name, maps SSM parameter paths, and sets IAM roles. RDS snapshots take longer (10-30 minutes depending on size) but are started by Fortem and completed by AWS.
Can I clone into a different AWS region?
Yes. Fortem's template engine supports region-aware parameters. The task definitions are region-agnostic. Services are created in the target region. RDS snapshots need to be copied to the target region first (Fortem initiates the copy). Cross-account cloning is also supported on the Scale and Enterprise plans.
What permissions does cloning need?
Fortem needs the same 6 read-only ECS permissions for discovery, plus ecs:RegisterTaskDefinition, ecs:CreateService, elasticloadbalancing:CreateRule, ssm:PutParameter, ecs:UpdateService, and iam:PassRole to create the cloned services. All scoped to the target environment's resources by IAM condition keys. The exact policy is published on the security page.
Map your fleet in 5 min: fortem.dev/audit