TL;DR: I led the migration of a fintech platform's entire AWS infrastructure — IAM roles, ECS services, networking, databases, CI/CD pipelines — from manually created "ClickOps" resources to Terraform-managed Infrastructure as Code. Here's what worked, what broke, and the framework I used to import 500+ existing resources without downtime.
Why We Migrated
Our infrastructure was created by clicking through the AWS Console over 3+ years. It worked, but:
- No reproducibility. If a region went down, we couldn't recreate the environment.
- No audit trail. Who changed that security group rule? When? Why? Nobody knew.
- Configuration drift. "Production" and "staging" had diverged in undocumented ways.
- Disaster recovery was impossible. Without IaC, spinning up a new region meant weeks of manual work.
- Multi-region architecture requires IaC. Our Active-Active strategy was dead on arrival without Terraform.
The Scale of the Problem
| Category | Resource Count |
|---|---|
| IAM Roles & Policies | 191 |
| ECS Services | 50 |
| EC2 Instances | 18 |
| Security Groups | 30+ |
| Load Balancers | 18 |
| RDS Databases | 22 |
| S3 Buckets | 15+ |
| CI/CD Pipelines | 20+ |
| Lambda Functions | 15+ |
| Route 53 Records | 50+ |
Two regions: Production in eu-west-2 (London), Staging in us-east-2 (Ohio).
The Migration Framework
Step 1: Resource Inventory
Before writing a single line of Terraform, I cataloged every AWS resource. I used:
- AWS CLI commands to list resources by service
- AWS Config for resource inventory
- Manual console review for resources that don't appear in standard APIs
Step 2: Bulk Import with Terraformer
For the initial heavy lifting, I used Terraformer to pull existing resource configurations:
terraformer import aws \
  --resources=iam,ec2_instance,ecs,alb \
  --regions=eu-west-2,us-east-2
This generates .tf files and .tfstate that represent your current infrastructure. It's a massive time-saver but the output needs significant cleanup.
Step 3: Cleanup & Restructuring
Terraformer's output is verbose and flat. I restructured it:
- Removed redundant auto-generated arguments and defaults
- Moved IAM policies into policies/*.json files and referenced them via locals
- Organized configurations into service-specific modules
- Standardized naming conventions
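As a minimal sketch of the policy-file pattern from the list above: the policy document stays in its own JSON file and a local reads it. The resource name, policy name, and the assumption that this code sits in the root module next to the policies/ directory are all illustrative, not taken from the real codebase.

```hcl
locals {
  # Assumption: policies/ sits directly beside this configuration.
  ecs_execution_policy = file("${path.module}/policies/ecs-execution-policy.json")
}

resource "aws_iam_policy" "ecs_execution" {
  name   = "ecs-execution-policy" # hypothetical name; use the live policy's name when importing
  policy = local.ecs_execution_policy
}
```

Keeping the JSON out of the HCL means a policy change shows up in review as a plain JSON diff.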
Step 4: Manual Imports
For resources that needed precision or were missed in bulk imports:
terraform import aws_iam_role.my_role my_role_name
terraform import aws_ecs_service.my_service my-cluster/my-service
terraform import aws_security_group.my_sg sg-xxxxxxxxx
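On Terraform 1.5 or later, the same imports can also be declared in configuration with import blocks, so they go through code review like any other change. This is an alternative to the CLI commands above, not what the migration itself used; the addresses and IDs simply mirror those commands.

```hcl
# Requires Terraform >= 1.5. Each block needs a matching resource block
# at the "to" address before the plan will succeed.
import {
  to = aws_iam_role.my_role
  id = "my_role_name"
}

import {
  to = aws_ecs_service.my_service
  id = "my-cluster/my-service"
}
```

The import happens on the next terraform apply, and the plan output shows it before anything is written to state.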
Step 5: State Verification
The most critical step: terraform plan must show no changes.
If the plan wants to destroy and recreate resources, the configuration doesn't match reality. I iterated until the plan was clean — no deletions, no replacements.
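One possible guardrail while iterating toward a clean plan is prevent_destroy, which makes Terraform fail any plan that would destroy the resource, replacements included. The sketch below is an assumption about how that could look for the imported security group; the names, description, and the vpc_id variable are hypothetical.

```hcl
variable "vpc_id" {
  type        = string
  description = "VPC that already contains the imported security group (hypothetical input)"
}

resource "aws_security_group" "my_sg" {
  name        = "my-sg"                   # hypothetical; must match the live group
  description = "payment-service ingress" # hypothetical; a mismatch here forces replacement
  vpc_id      = var.vpc_id

  lifecycle {
    # Terraform rejects any plan that would destroy this resource,
    # including destroy-and-recreate.
    prevent_destroy = true
  }
}
```

With this in place, a configuration mismatch surfaces as a plan error rather than a proposed replacement that someone might apply by accident.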
Step 6: Remote State Backend
I migrated the .tfstate to an S3 backend with DynamoDB locking:
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "env/production/terraform.tfstate"
    region         = "eu-west-2"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
Critical protections:
- S3 versioning enabled (rollback state if something goes wrong)
- AES256 encryption at rest
- DynamoDB locking prevents concurrent terraform apply
- Public access blocked on the bucket
- .terraform directory excluded from Git
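The protections above could be expressed in Terraform roughly as follows. This is a sketch that assumes AWS provider v4 or later and a separate bootstrap configuration, since the backend's bucket and lock table have to exist before the backend can use them. Resource labels are hypothetical; the bucket and table names reuse the backend example.

```hcl
resource "aws_s3_bucket" "tf_state" {
  bucket = "my-terraform-state-bucket"
}

# Versioning allows rolling back to an earlier state file.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# AES256 (SSE-S3) encryption at rest.
resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "tf_state" {
  bucket                  = aws_s3_bucket.tf_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The S3 backend expects a string hash key named LockID.
resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```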
Migration Status by Service
Completed ✅
| AWS Service | Scope | Notes |
|---|---|---|
| IAM | Roles, policies, instance profiles | Standardized into reusable modules |
| Security Groups | All production + staging | Per-service isolation enforced |
| VPC & Networking | Subnets, route tables, gateways | Private subnet architecture |
In Progress ⏳
| AWS Service | Scope | Notes |
|---|---|---|
| ECS | Services, task definitions, clusters | Complex due to frequent deployments |
| ALB | Load balancers, target groups, listeners | Dependency on ECS completion |
| ACM | Certificates | Cutover risk — DNS validation |
| Route 53 | DNS records | Cutover risk — must be atomic |
Pending 📋
| AWS Service | Scope | Notes |
|---|---|---|
| RDS | Databases | Snapshot-first approach |
| S3 | Buckets | Encryption + versioning policies |
| CodePipeline | CI/CD pipelines | Artifact bucket dependencies |
| CodeBuild | Build projects | IAM role dependencies |
| ECR | Container registry | Lifecycle rules |
| CloudWatch | Logs, metrics, alarms | Retention policies |
| Lambda | Functions | Event source mappings |
| SNS | Notifications | Slack/email integrations |
What Broke (And How I Fixed It)
Problem 1: State Drift After Manual Changes
Someone modified a security group via the Console after it was imported into Terraform. The next terraform plan wanted to revert the change.
Fix: Establish a rule: once a resource is in Terraform, the Console is read-only. All changes go through code → PR → apply.
Problem 2: ECS Task Definitions Are Append-Only
ECS task definitions are immutable: every change registers a new revision. Terraform tracks one specific revision in state, while deployments keep moving the service to newer revisions, so the plan keeps proposing to roll the service back.
Fix: Use ignore_changes for task definition in the ECS service resource, and manage task definitions separately.
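A minimal sketch of that fix, with hypothetical variable names and values: Terraform sets the task definition when the service is first created or imported, then ignores later revisions.

```hcl
variable "cluster_arn" {
  type = string # hypothetical input: ARN of the existing ECS cluster
}

variable "initial_task_definition_arn" {
  type = string # hypothetical input: revision used only for the first apply
}

resource "aws_ecs_service" "my_service" {
  name            = "my-service"
  cluster         = var.cluster_arn
  task_definition = var.initial_task_definition_arn
  desired_count   = 2 # illustrative value

  lifecycle {
    # Deployments register new revisions outside Terraform; do not roll them back.
    ignore_changes = [task_definition]
  }
}
```

The trade-off is that Terraform no longer detects an unexpected task definition on the service, so the deployment pipeline becomes the source of truth for revisions.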
Problem 3: Import Order Matters
Importing an ALB listener rule before the ALB itself causes dependency errors.
Fix: Build a dependency graph and import in order: VPC → Subnets → Security Groups → ALB → Target Groups → Listener Rules → ECS.
Problem 4: The "Phantom Diff" Problem
terraform plan showed changes for arguments that were set to AWS defaults. Terraformer exports everything, including defaults that Terraform would normally infer.
Fix: Remove explicit default values from the HCL. If Terraform's default matches AWS's default, don't set it.
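As an illustration of that cleanup, here is what a trimmed load balancer might look like. The resource label, name, and variables are hypothetical; the commented-out arguments are ones whose values match the AWS provider's documented defaults for aws_lb, which is worth confirming against the provider version in use.

```hcl
variable "public_subnet_ids" {
  type = list(string) # hypothetical input
}

variable "alb_security_group_id" {
  type = string # hypothetical input
}

resource "aws_lb" "api" {
  name               = "api-alb" # hypothetical name
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids
  security_groups    = [var.alb_security_group_id]

  # Dropped from the generated output because they restate provider defaults:
  # enable_deletion_protection = "false"
  # enable_http2               = "true"
  # idle_timeout               = "60"
}
```

After each removal, rerun terraform plan to confirm it still reports no changes.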
Module Structure
terraform/
├── backend.tf                # S3 + DynamoDB backend config
├── .terraform.lock.hcl       # Provider version lock
├── environments/
│   ├── production/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── staging/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── modules/
│   ├── iam/
│   ├── networking/
│   ├── ecs/
│   ├── alb/
│   ├── rds/
│   └── security/
└── policies/
    ├── codebuild-policy.json
    ├── codepipeline-policy.json
    └── ecs-execution-policy.json
Key design decisions:
- Separate state files per environment (production and staging can be managed independently)
- Shared modules parameterized by environment
- IAM policies as JSON files referenced by locals (easier to review and audit)
- Standard tagging enforced via module defaults
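To make the shared-module idea concrete, an environment's main.tf might wire modules together along these lines. The module inputs and the private_subnet_ids output are assumptions for illustration; the real module interfaces are not shown in this post.

```hcl
# environments/production/main.tf (sketch)
variable "environment" {
  type = string # set to "production" in terraform.tfvars
}

module "networking" {
  source      = "../../modules/networking"
  environment = var.environment
}

module "ecs" {
  source      = "../../modules/ecs"
  environment = var.environment
  subnet_ids  = module.networking.private_subnet_ids # hypothetical module output
}
```

Staging would use the same module sources with its own terraform.tfvars and its own state key, which is what keeps the two environments independently manageable.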
Tagging Strategy
Every resource gets these tags (enforced by Terraform):
tags = {
  Environment = var.environment   # production, staging
  Service     = var.service_name  # payment-service, auth-service
  ManagedBy   = "terraform"       # distinguishes IaC from ClickOps
  Owner       = var.team          # platform, backend, frontend
}
The ManagedBy = terraform tag is crucial. It instantly tells you whether a resource is safe to modify via Console (if it's not tagged) or must be changed via code (if it is).
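One way to apply the common tags everywhere without repeating them is the AWS provider's default_tags block. This is a different mechanism from the module defaults described earlier and is shown only as an option; the variable declarations are hypothetical.

```hcl
variable "environment" {
  type = string # production or staging
}

variable "team" {
  type = string # platform, backend, frontend
}

provider "aws" {
  region = "eu-west-2"

  # Applied to every taggable resource this provider creates.
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Owner       = var.team
    }
  }
}
```

The Service tag varies per resource, so it would still be set in each resource's or module's own tags argument.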
Lessons Learned
Import before modify. Never recreate a production resource. Always import first, verify plan is clean, then start making changes.
Start with IAM. Everything depends on IAM. Get roles and policies into Terraform first — they're the foundation for every other resource.
Separate state per environment. A single state file for production + staging is a disaster waiting to happen.
Plan for the long tail. The first 80% of resources are fast. The last 20% (Lambda event sources, CloudWatch alarms, SNS topics) take as long as the first 80%.
Document what's NOT migrated. Maintain a clear list of resources still managed via Console. This prevents the "is this in Terraform?" question.
Impact
Once complete, this migration enables:
- Disaster recovery: Spin up a new region from code in hours, not weeks
- Multi-region architecture: Active-Active requires identical infrastructure in both regions
- Audit trail: Every change is a Git commit with a PR, reviewer, and timestamp
- Compliance: PCI DSS requires documented, repeatable infrastructure processes
- Onboarding: New engineers can understand the infrastructure by reading code
Migrating from ClickOps to Terraform isn't glamorous, but it's the foundation that makes everything else possible — multi-region, DR, compliance, and team scale. If you're staring at a Console full of manually created resources, start with IAM and work outward.