Anthony Uketui

Posted on Jul 2

Migrating 500+ AWS Resources from ClickOps to Terraform: A Status Report

#terraform #devops #aws #infrastructure

TL;DR: I led the migration of a fintech platform's entire AWS infrastructure — IAM roles, ECS services, networking, databases, CI/CD pipelines — from manually created "ClickOps" resources to Terraform-managed Infrastructure as Code. Here's what worked, what broke, and the framework I used to import 500+ existing resources without downtime.

Why We Migrated

Our infrastructure was created by clicking through the AWS Console over 3+ years. It worked, but:

No reproducibility. If a region went down, we couldn't recreate the environment.
No audit trail. Who changed that security group rule? When? Why? Nobody knew.
Configuration drift. "Production" and "staging" had diverged in undocumented ways.
Disaster recovery was impractical. Without IaC, spinning up a new region meant weeks of manual work.
Multi-region architecture requires IaC. Our Active-Active strategy was dead on arrival without Terraform.

The Scale of the Problem

Category	Resource Count
IAM Roles & Policies	191
ECS Services	50
EC2 Instances	18
Security Groups	30+
Load Balancers	18
RDS Databases	22
S3 Buckets	15+
CI/CD Pipelines	20+
Lambda Functions	15+
Route 53 Records	50+

Two regions: Production in eu-west-2 (London), Staging in us-east-2 (Ohio).

The Migration Framework

Step 1: Resource Inventory

Before writing a single line of Terraform, I cataloged every AWS resource. I used:

AWS CLI commands to list resources by service
AWS Config for resource inventory
Manual console review for resources that don't appear in standard APIs

Step 2: Bulk Import with Terraformer

For the initial heavy lifting, I used Terraformer to pull existing resource configurations:

terraformer import aws \
  --resources=iam,ec2_instance,ecs,alb \
  --regions=eu-west-2,us-east-2

This generates .tf files and a .tfstate file that represent your current infrastructure. It's a massive time-saver, but the output needs significant cleanup.

Step 3: Cleanup & Restructuring

Terraformer's output is verbose and flat. I restructured it:

Removed redundant auto-generated arguments and defaults
Moved IAM policies into policies/*.json files and referenced them via locals
Organized configurations into service-specific modules
Standardized naming conventions
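As a rough illustration of the policies/*.json pattern, a module could load a policy document through a local and attach it to a resource. This is a minimal sketch, not the author's actual code: the local name, the resource name, the policy name, and the relative path are all assumptions, and the path presumes a module two levels below the directory that holds policies/.

```hcl
# Sketch only: names and the relative path are illustrative assumptions.
locals {
  # Read the JSON policy document from the shared policies/ directory.
  ecs_execution_policy = file("${path.module}/../../policies/ecs-execution-policy.json")
}

resource "aws_iam_policy" "ecs_execution" {
  # Assumed name; for an imported policy this must match the existing one.
  name   = "ecs-execution-policy"
  policy = local.ecs_execution_policy
}
```

Keeping the JSON in its own file means a policy change shows up in review as a plain JSON diff rather than an escaped string inside HCL.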

Step 4: Manual Imports

For resources that needed precision or were missed in bulk imports:

terraform import aws_iam_role.my_role my_role_name
terraform import aws_ecs_service.my_service my-cluster/my-service
terraform import aws_security_group.my_sg sg-xxxxxxxxx
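On Terraform 1.5 or later, the same imports can also be declared in configuration with import blocks, which lets them go through code review like any other change. The sketch below mirrors the three commands above; it is an alternative to the CLI form, not necessarily what was used in this migration.

```hcl
# Requires Terraform 1.5+. Addresses and IDs mirror the CLI commands above.
import {
  to = aws_iam_role.my_role
  id = "my_role_name"
}

import {
  # ECS services are imported as cluster-name/service-name.
  to = aws_ecs_service.my_service
  id = "my-cluster/my-service"
}

import {
  to = aws_security_group.my_sg
  id = "sg-xxxxxxxxx" # placeholder ID, as in the command above
}
```

Each `to` address needs a matching resource block. Running `terraform plan -generate-config-out=generated.tf` asks Terraform to draft those blocks for any import target that does not have one yet.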

Step 5: State Verification

The most critical step: terraform plan must show no changes.

If the plan wants to destroy and recreate resources, the configuration doesn't match reality. I iterated until the plan reported nothing to add, change, or destroy.
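One optional guard while iterating, which the steps above do not mention, is Terraform's `prevent_destroy` lifecycle setting: any plan that would destroy (or replace) the resource fails with an error instead of proceeding. A minimal sketch, using a hypothetical bucket name:

```hcl
resource "aws_s3_bucket" "example" {
  bucket = "example-imported-bucket" # hypothetical name

  lifecycle {
    # Terraform rejects any plan that would destroy this resource,
    # including a destroy-and-recreate replacement.
    prevent_destroy = true
  }
}
```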

Step 6: Remote State Backend

Migrated .tfstate to S3 with DynamoDB locking:

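terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"        # versioned, private state bucket
    key            = "env/production/terraform.tfstate" # one key per environment
    region         = "eu-west-2"
    encrypt        = true                               # server-side encryption at rest
    dynamodb_table = "terraform-locks"                  # table used for state locking
  }
}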

Critical protections:

S3 versioning enabled (rollback state if something goes wrong)
AES256 encryption at rest
DynamoDB locking prevents concurrent terraform apply
Public access blocked on the bucket
.terraform directory excluded from Git
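The bucket and lock table behind these protections can themselves be described in Terraform. The sketch below assumes they live in a small separate bootstrap configuration (the backend cannot store state in a bucket that does not exist yet) and reuses the bucket and table names from the backend block above; the resource labels are illustrative.

```hcl
resource "aws_s3_bucket" "tf_state" {
  bucket = "my-terraform-state-bucket"
}

# Versioning allows rolling back to an earlier state file.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# AES256 server-side encryption at rest.
resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Block every form of public access to the state bucket.
resource "aws_s3_bucket_public_access_block" "tf_state" {
  bucket                  = aws_s3_bucket.tf_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The S3 backend expects a string hash key named LockID.
resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```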

Migration Status by Service

Completed ✅

AWS Service	Scope	Notes
IAM	Roles, policies, instance profiles	Standardized into reusable modules
Security Groups	All production + staging	Per-service isolation enforced
VPC & Networking	Subnets, route tables, gateways	Private subnet architecture

In Progress ⏳

AWS Service	Scope	Notes
ECS	Services, task definitions, clusters	Complex due to frequent deployments
ALB	Load balancers, target groups, listeners	Dependency on ECS completion
ACM	Certificates	Cutover risk — DNS validation
Route 53	DNS records	Cutover risk — must be atomic

Pending 📋

AWS Service	Scope	Notes
RDS	Databases	Snapshot-first approach
S3	Buckets	Encryption + versioning policies
CodePipeline	CI/CD pipelines	Artifact bucket dependencies
CodeBuild	Build projects	IAM role dependencies
ECR	Container registry	Lifecycle rules
CloudWatch	Logs, metrics, alarms	Retention policies
Lambda	Functions	Event source mappings
SNS	Notifications	Slack/email integrations

What Broke (And How I Fixed It)

Problem 1: State Drift After Manual Changes

Someone modified a security group via the Console after it was imported into Terraform. The next terraform plan wanted to revert the change.

Fix: Establish a rule: once a resource is in Terraform, the Console is read-only. All changes go through code → PR → apply.

Problem 2: ECS Task Definitions Are Append-Only

ECS task definitions are immutable: every change registers a new revision. Terraform tracks the revision it last applied, but each deployment points the service at a newer one, so the next plan tries to roll the service back.

Fix: Add task_definition to ignore_changes in the lifecycle block of the aws_ecs_service resource, and manage task definitions separately.
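A minimal sketch of that fix is below. The variable names and the desired count are assumptions for illustration; the service and cluster names follow the import example earlier in the post.

```hcl
# Hypothetical inputs: the cluster and the revision used when the service
# was first brought under Terraform.
variable "cluster_arn" { type = string }
variable "bootstrap_task_definition_arn" { type = string }

resource "aws_ecs_service" "my_service" {
  name            = "my-service"
  cluster         = var.cluster_arn
  task_definition = var.bootstrap_task_definition_arn
  desired_count   = 2 # assumed value

  lifecycle {
    # Deployments register new revisions outside Terraform, so changes to
    # this attribute are ignored instead of being reverted on the next apply.
    ignore_changes = [task_definition]
  }
}
```

With this in place Terraform still owns the service's networking, scaling, and load balancer wiring, while the deployment pipeline owns which revision is running.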

Problem 3: Import Order Matters

Importing an ALB listener rule before the ALB itself causes dependency errors.

Fix: Build a dependency graph and import in order: VPC → Subnets → Security Groups → ALB → Target Groups → Listener Rules → ECS.

Problem 4: The "Phantom Diff" Problem

terraform plan showed changes for arguments that were set to AWS defaults. Terraformer exports everything, including defaults that Terraform would normally infer.

Fix: Remove explicit default values from the HCL. If Terraform's default matches AWS's default, don't set it.
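To make the cleanup concrete, here is a sketch of a load balancer resource after trimming. The resource name and variables are hypothetical; the commented-out arguments are ones whose values equal the AWS provider's documented defaults for aws_lb, so dropping them leaves the resulting infrastructure unchanged.

```hcl
# Hypothetical inputs for the example.
variable "alb_subnet_ids" { type = list(string) }
variable "alb_security_group_ids" { type = list(string) }

resource "aws_lb" "example" {
  name            = "example-alb"
  subnets         = var.alb_subnet_ids
  security_groups = var.alb_security_group_ids

  # Removed during cleanup because each value equals the provider default:
  #   load_balancer_type         = "application"
  #   enable_deletion_protection = false
  #   enable_http2               = true
  #   idle_timeout               = 60
}
```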

Module Structure

terraform/
├── backend.tf                 # S3 + DynamoDB backend config
├── .terraform.lock.hcl        # Provider version lock
├── environments/
│   ├── production/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── staging/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── modules/
│   ├── iam/
│   ├── networking/
│   ├── ecs/
│   ├── alb/
│   ├── rds/
│   └── security/
└── policies/
    ├── codebuild-policy.json
    ├── codepipeline-policy.json
    └── ecs-execution-policy.json

Key design decisions:

Separate state files per environment (production and staging can be managed independently)
Shared modules parameterized by environment
IAM policies as JSON files referenced by locals (easier to review and audit)
Standard tagging enforced via module defaults
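As a sketch of how an environment directory might wire the shared modules together, consider a main.tf along these lines. The module input and output names are hypothetical, since the post does not show the module interfaces; only the directory layout comes from the tree above.

```hcl
# environments/production/main.tf (sketch; module interfaces are assumed)
variable "environment" {
  type        = string
  description = "Set per environment in terraform.tfvars"
}

module "networking" {
  source      = "../../modules/networking"
  environment = var.environment
}

module "ecs" {
  source      = "../../modules/ecs"
  environment = var.environment

  # Hypothetical output/input pair, shown to illustrate passing values
  # from one module to another.
  private_subnet_ids = module.networking.private_subnet_ids
}
```

The staging directory would contain the same calls with its own terraform.tfvars, which is what keeps the two environments structurally identical.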

Tagging Strategy

Every resource gets these tags (enforced by Terraform):

tags = {
  Environment = var.environment    # production, staging
  Service     = var.service_name   # payment-service, auth-service
  ManagedBy   = "terraform"        # distinguishes IaC from ClickOps
  Owner       = var.team           # platform, backend, frontend
}

The ManagedBy = terraform tag is crucial. It instantly tells you whether a resource is safe to modify via Console (if it's not tagged) or must be changed via code (if it is).
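One way a module could enforce these tags is to build them in a local and merge in any caller-supplied extras. This is a sketch under assumptions: the `extra_tags` variable and the validation rule are illustrative additions, not something the original setup is known to include.

```hcl
variable "environment" {
  type = string

  validation {
    condition     = contains(["production", "staging"], var.environment)
    error_message = "The environment must be production or staging."
  }
}

variable "service_name" { type = string }
variable "team" { type = string }

# Hypothetical hook for resource-specific tags.
variable "extra_tags" {
  type    = map(string)
  default = {}
}

locals {
  # merge() gives later arguments precedence, so the standard tags
  # always win over anything passed in extra_tags.
  tags = merge(var.extra_tags, {
    Environment = var.environment
    Service     = var.service_name
    ManagedBy   = "terraform"
    Owner       = var.team
  })
}
```

Resources inside the module would then set `tags = local.tags`. The AWS provider's `default_tags` block is another option for applying a baseline set of tags to every resource the provider manages.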

Lessons Learned

Import before modify. Never recreate a production resource. Always import first, verify the plan is clean, then start making changes.
Start with IAM. Everything depends on IAM. Get roles and policies into Terraform first — they're the foundation for every other resource.
Separate state per environment. A single state file for production + staging is a disaster waiting to happen.
Plan for the long tail. The first 80% of resources are fast. The last 20% (Lambda event sources, CloudWatch alarms, SNS topics) take as long as the first 80%.
Document what's NOT migrated. Maintain a clear list of resources still managed via Console. This prevents the "is this in Terraform?" question.

Impact

Once complete, this migration enables:

Disaster recovery: Spin up a new region from code in hours, not weeks
Multi-region architecture: Active-Active requires identical infrastructure in both regions
Audit trail: Every change is a Git commit with a PR, reviewer, and timestamp
Compliance: PCI DSS assessments expect documented, repeatable processes for infrastructure changes
Onboarding: New engineers can understand the infrastructure by reading code

Migrating from ClickOps to Terraform isn't glamorous, but it's the foundation that makes everything else possible — multi-region, DR, compliance, and team scale. If you're staring at a Console full of manually created resources, start with IAM and work outward.

DEV Community