DEV Community

Anthony Uketui


Migrating 500+ AWS Resources from ClickOps to Terraform: A Status Report

TL;DR: I'm leading the migration of a fintech platform's entire AWS infrastructure (IAM roles, ECS services, networking, databases, CI/CD pipelines) from manually created "ClickOps" resources to Terraform-managed Infrastructure as Code. Here's what has worked, what broke, and the framework I'm using to import 500+ existing resources without downtime.


Why We Migrated

Our infrastructure was created by clicking through the AWS Console over 3+ years. It worked, but:

  • No reproducibility. If a region went down, we couldn't recreate the environment.
  • No audit trail. Who changed that security group rule? When? Why? Nobody knew.
  • Configuration drift. "Production" and "staging" had diverged in undocumented ways.
  • Disaster recovery was impossible. Without IaC, spinning up a new region meant weeks of manual work.
  • Multi-region architecture requires IaC. Our Active-Active strategy was dead on arrival without Terraform.

The Scale of the Problem

Category | Resource Count
IAM Roles & Policies | 191
ECS Services | 50
EC2 Instances | 18
Security Groups | 30+
Load Balancers | 18
RDS Databases | 22
S3 Buckets | 15+
CI/CD Pipelines | 20+
Lambda Functions | 15+
Route 53 Records | 50+

Two regions: Production in eu-west-2 (London), Staging in us-east-2 (Ohio).


The Migration Framework

Step 1: Resource Inventory

Before writing a single line of Terraform, I cataloged every AWS resource. I used:

  • AWS CLI commands to list resources by service
  • AWS Config for resource inventory
  • Manual console review for resources that don't appear in standard APIs

Step 2: Bulk Import with Terraformer

For the initial heavy lifting, I used Terraformer to pull existing resource configurations:

terraformer import aws \
  --resources=iam,ec2_instance,ecs,alb \
  --regions=eu-west-2,us-east-2

This generates .tf files and a terraform.tfstate file that represent your current infrastructure. It's a massive time-saver, but the output needs significant cleanup.

Step 3: Cleanup & Restructuring

Terraformer's output is verbose and flat. I restructured it:

  • Removed redundant auto-generated arguments and defaults
  • Moved IAM policies into policies/*.json files and referenced them via locals
  • Organized configurations into service-specific modules
  • Standardized naming conventions
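A minimal sketch of the "policies/*.json referenced via locals" pattern, assuming the layout shown under Module Structure (this file living in modules/iam/, two levels below policies/). The policy name and the role name codebuild-service-role are hypothetical placeholders.

```hcl
locals {
  # Raw JSON policy document kept in version control under policies/.
  # The relative path assumes this file lives in modules/iam/.
  codebuild_policy_json = file("${path.module}/../../policies/codebuild-policy.json")
}

# Inline policy attached to an existing role (hypothetical names).
resource "aws_iam_role_policy" "codebuild" {
  name   = "codebuild-policy"
  role   = "codebuild-service-role"
  policy = local.codebuild_policy_json
}
```

Keeping the JSON verbatim means a pull request diffs the exact document AWS receives, which is what makes policy review and audit easier.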

Step 4: Manual Imports

For resources that needed individual attention or that the bulk import missed, I ran terraform import one resource at a time:

terraform import aws_iam_role.my_role my_role_name
terraform import aws_ecs_service.my_service my-cluster/my-service
terraform import aws_security_group.my_sg sg-xxxxxxxxx
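On Terraform 1.5 or later, the same three imports can also be written declaratively as import blocks, so they go through plan and code review like any other change. A sketch reusing the placeholder addresses and IDs from the commands above; each target still needs a matching resource block, which you can write by hand or draft with terraform plan -generate-config-out=generated.tf.

```hcl
# Requires Terraform >= 1.5. Addresses and IDs are the same
# placeholders used in the CLI commands above.
import {
  to = aws_iam_role.my_role
  id = "my_role_name"
}

import {
  to = aws_ecs_service.my_service
  id = "my-cluster/my-service"
}

import {
  to = aws_security_group.my_sg
  id = "sg-xxxxxxxxx"
}
```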

Step 5: State Verification

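The most critical step: terraform plan must show no changes.

If the plan wants to destroy and recreate resources, the configuration doesn't match reality. I iterated until the plan was completely clean: no deletions, no replacements, no in-place updates. Running terraform plan -detailed-exitcode makes this check scriptable, since it exits with 0 when there are no changes and 2 when there are.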
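An optional extra guard, sketched here as a suggestion rather than part of the workflow above: lifecycle.prevent_destroy makes Terraform reject, with an error, any plan that would destroy the resource, and a destroy-and-recreate replacement counts. The resource name payments is hypothetical and its other arguments are omitted. It does not protect against someone deleting the resource block itself, since the setting goes away with it.

```hcl
# Hypothetical imported RDS instance; real arguments omitted.
resource "aws_db_instance" "payments" {
  # ... arguments matching the existing instance ...

  lifecycle {
    # Any plan that would destroy or replace this instance fails with an error.
    prevent_destroy = true
  }
}
```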

Step 6: Remote State Backend

Migrated .tfstate to S3 with DynamoDB locking:

terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "env/production/terraform.tfstate"
    region         = "eu-west-2"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

Critical protections:

  • S3 versioning enabled (rollback state if something goes wrong)
  • AES256 encryption at rest
  • DynamoDB locking prevents concurrent terraform apply
  • Public access blocked on the bucket
  • .terraform directory excluded from Git
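The bucket and lock table must exist before the backend can use them, so they are usually created by a small, separate bootstrap configuration. A minimal sketch of that bootstrap implementing the protections listed above, assuming AWS provider v4 or later (which split bucket settings into separate resources). Bucket and table names match the backend block; the S3 backend expects the table's partition key to be a string attribute named LockID.

```hcl
provider "aws" {
  region = "eu-west-2"
}

resource "aws_s3_bucket" "tf_state" {
  bucket = "my-terraform-state-bucket"
}

# Versioning: roll back to an earlier state file if something goes wrong.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# SSE-S3 (AES256) encryption at rest.
resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Block every form of public access to the state bucket.
resource "aws_s3_bucket_public_access_block" "tf_state" {
  bucket                  = aws_s3_bucket.tf_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lock table: the S3 backend requires a string hash key named LockID.
resource "aws_dynamodb_table" "tf_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

The usual pairing is to ignore .terraform/ in Git but commit .terraform.lock.hcl, which matches the layout under Module Structure.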

Migration Status by Service

Completed ✅

AWS Service | Scope | Notes
IAM | Roles, policies, instance profiles | Standardized into reusable modules
Security Groups | All production + staging | Per-service isolation enforced
VPC & Networking | Subnets, route tables, gateways | Private subnet architecture

In Progress ⏳

AWS Service | Scope | Notes
ECS | Services, task definitions, clusters | Complex due to frequent deployments
ALB | Load balancers, target groups, listeners | Depends on ECS completion
ACM | Certificates | Cutover risk (DNS validation)
Route 53 | DNS records | Cutover risk (must be atomic)

Pending 📋

AWS Service | Scope | Notes
RDS | Databases | Snapshot-first approach
S3 | Buckets | Encryption + versioning policies
CodePipeline | CI/CD pipelines | Artifact bucket dependencies
CodeBuild | Build projects | IAM role dependencies
ECR | Container registry | Lifecycle rules
CloudWatch | Logs, metrics, alarms | Retention policies
Lambda | Functions | Event source mappings
SNS | Notifications | Slack/email integrations

What Broke (And How I Fixed It)

Problem 1: State Drift After Manual Changes

Someone modified a security group via the Console after it had been imported into Terraform. The next terraform plan wanted to revert the change.

Fix: Establish a rule: once a resource is in Terraform, the Console is read-only. All changes go through code → PR → apply.
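The read-only rule can also be backed by IAM. A hedged sketch, assuming human Console users work through a role separate from the one Terraform runs as: a policy that denies security-group edits on any group tagged ManagedBy = terraform. The policy name is hypothetical, the action list covers security groups only, and the condition key is worth checking against the EC2 entries in the AWS Service Authorization Reference before relying on it. Attach it only to the human role; on Terraform's role it would block terraform apply. It is a guardrail, not a complete control, since anyone allowed to remove the tag can sidestep it.

```hcl
resource "aws_iam_policy" "deny_console_sg_edits" {
  name = "deny-edits-to-terraform-managed-sgs" # hypothetical name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid    = "DenyChangesToTerraformManagedSecurityGroups"
      Effect = "Deny"
      Action = [
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:AuthorizeSecurityGroupEgress",
        "ec2:RevokeSecurityGroupIngress",
        "ec2:RevokeSecurityGroupEgress",
        "ec2:ModifySecurityGroupRules",
        "ec2:DeleteSecurityGroup",
      ]
      Resource = "*"
      # Only applies to groups carrying the ManagedBy = terraform tag.
      Condition = {
        StringEquals = {
          "ec2:ResourceTag/ManagedBy" = "terraform"
        }
      }
    }]
  })
}
```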

Problem 2: ECS Task Definitions Are Append-Only

ECS task definitions are immutable: every change registers a new revision. Terraform records one specific revision in state, but each deployment registers a newer revision and points the service at it, so the next terraform plan tries to roll the service back.

Fix: Add task_definition to lifecycle.ignore_changes on the aws_ecs_service resource, and manage task definition revisions separately.
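A minimal sketch of that fix. The service name, desired count, and the referenced aws_ecs_cluster.main and aws_ecs_task_definition.payment resources are hypothetical; only the lifecycle block is the point.

```hcl
resource "aws_ecs_service" "payment" {
  name            = "payment-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.payment.arn
  desired_count   = 2

  lifecycle {
    # Deployments register newer task definition revisions outside
    # Terraform; ignoring this argument stops plans from rolling back.
    ignore_changes = [task_definition]
  }
}
```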

Problem 3: Import Order Matters

Importing an ALB listener rule before the ALB itself causes dependency errors.

Fix: Build a dependency graph and import in order: VPC → Subnets → Security Groups → ALB → Target Groups → Listener Rules → ECS.
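Terraform builds its own dependency graph from references between resources, and the import order above mirrors that graph. A trimmed, hypothetical sketch of the ALB portion: names, ports, and the path pattern are placeholders, and var.vpc_id, var.public_subnet_ids, and var.alb_security_group_id are assumed inputs from the networking layer. Note that a listener sits between the ALB and its listener rules.

```hcl
resource "aws_lb" "main" {
  name            = "app-alb"
  subnets         = var.public_subnet_ids        # subnets come first
  security_groups = [var.alb_security_group_id]  # then security groups
}

resource "aws_lb_target_group" "payment" {
  name     = "payment-tg"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id
}

# The listener depends on the ALB and the target group.
resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.main.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.payment.arn
  }
}

# The rule depends on the listener, so it is imported after it.
resource "aws_lb_listener_rule" "payment" {
  listener_arn = aws_lb_listener.http.arn
  priority     = 100

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.payment.arn
  }

  condition {
    path_pattern {
      values = ["/payments/*"]
    }
  }
}
```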

Problem 4: The "Phantom Diff" Problem

terraform plan showed changes for arguments that were set to AWS defaults. Terraformer exports everything, including defaults that Terraform would normally infer.

Fix: Remove explicit default values from the HCL. If Terraform's default matches AWS's default, don't set it.
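An illustrative example of that cleanup, not actual Terraformer output: an aws_lb with explicit arguments that restate values matching both the provider's and AWS's defaults. The resource name and var.public_subnet_ids are hypothetical. Deleting those lines changes nothing in AWS and leaves only the settings that actually differ from defaults.

```hcl
resource "aws_lb" "main" {
  name    = "app-alb"
  subnets = var.public_subnet_ids

  # Removed during cleanup: each line restated a default.
  # internal                   = false
  # load_balancer_type         = "application"
  # idle_timeout               = 60
  # enable_http2               = true
  # enable_deletion_protection = false
}
```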


Module Structure

terraform/
├── backend.tf                 # S3 + DynamoDB backend config
├── .terraform.lock.hcl        # Provider version lock
├── environments/
│   ├── production/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── staging/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── modules/
│   ├── iam/
│   ├── networking/
│   ├── ecs/
│   ├── alb/
│   ├── rds/
│   └── security/
└── policies/
    ├── codebuild-policy.json
    ├── codepipeline-policy.json
    └── ecs-execution-policy.json

Key design decisions:

  • Separate state files per environment (production and staging can be managed independently)
  • Shared modules parameterized by environment
  • IAM policies as JSON files referenced by locals (easier to review and audit)
  • Standard tagging enforced via module defaults
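A sketch of an environment root calling the shared modules. The module inputs and outputs (environment, private_subnet_ids) are hypothetical; the production root would look the same and differ only in its terraform.tfvars values and its state key.

```hcl
# environments/staging/main.tf (sketch; input and output names are hypothetical)
module "networking" {
  source      = "../../modules/networking"
  environment = var.environment
}

module "ecs" {
  source      = "../../modules/ecs"
  environment = var.environment

  # Wiring one module's outputs into another keeps the dependency explicit.
  private_subnet_ids = module.networking.private_subnet_ids
}
```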

Tagging Strategy

Every resource gets these tags (enforced by Terraform):

tags = {
  Environment = var.environment    # production, staging
  Service     = var.service_name   # payment-service, auth-service
  ManagedBy   = "terraform"        # distinguishes IaC from ClickOps
  Owner       = var.team           # platform, backend, frontend
}

The ManagedBy = terraform tag is crucial. It instantly tells you whether a resource is safe to modify via Console (if it's not tagged) or must be changed via code (if it is).
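One way to enforce the environment-wide tags, assuming AWS provider v3.38 or later, is the provider's default_tags block, leaving per-resource tags such as Service and Owner on each resource. Resource-level tags are merged with the defaults, and the resource's value wins on a key conflict. The var.region variable and the ECR repository are hypothetical. One caution for this migration: adding default_tags on top of freshly imported resources shows up as in-place tag updates in the plan, so it fits best once the plan is already clean.

```hcl
provider "aws" {
  region = var.region # hypothetical variable

  # Applied to every resource managed through this provider that supports tags.
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
    }
  }
}

resource "aws_ecr_repository" "payment" {
  name = "payment-service" # hypothetical repository

  # Per-resource tags, merged with default_tags.
  tags = {
    Service = "payment-service"
    Owner   = "backend"
  }
}
```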


Lessons Learned

  1. Import before modify. Never recreate a production resource. Always import first, verify the plan is clean, then start making changes.

  2. Start with IAM. Everything depends on IAM. Get roles and policies into Terraform first — they're the foundation for every other resource.

  3. Separate state per environment. A single state file for production + staging is a disaster waiting to happen.

  4. Plan for the long tail. The first 80% of resources are fast. The last 20% (Lambda event sources, CloudWatch alarms, SNS topics) take as long as the first 80%.

  5. Document what's NOT migrated. Maintain a clear list of resources still managed via Console. This prevents the "is this in Terraform?" question.
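For lesson 5, the AWS provider's aws_resourcegroupstaggingapi_resources data source can produce a first draft of that list from the ManagedBy tag. A sketch with two caveats: the resource-type filters are only examples, and the underlying Resource Groups Tagging API returns only resources that are or were tagged, so never-tagged ClickOps resources still need the Step 1 inventory.

```hcl
# Example filters only; extend with the resource types you care about.
data "aws_resourcegroupstaggingapi_resources" "inventory" {
  resource_type_filters = [
    "ec2:security-group",
    "elasticloadbalancing:loadbalancer",
  ]
}

# ARNs lacking ManagedBy = "terraform": candidates for the
# "still managed via Console" list.
output "not_managed_by_terraform" {
  value = [
    for r in data.aws_resourcegroupstaggingapi_resources.inventory.resource_tag_mapping_list :
    r.resource_arn if lookup(r.tags, "ManagedBy", "") != "terraform"
  ]
}
```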


Impact

Once complete, this migration enables:

  • Disaster recovery: Spin up a new region from code in hours, not weeks
  • Multi-region architecture: Active-Active requires identical infrastructure in both regions
  • Audit trail: Every change is a Git commit with a PR, reviewer, and timestamp
  • Compliance: PCI DSS requires documented, repeatable infrastructure processes
  • Onboarding: New engineers can understand the infrastructure by reading code

Migrating from ClickOps to Terraform isn't glamorous, but it's the foundation that makes everything else possible — multi-region, DR, compliance, and team scale. If you're staring at a Console full of manually created resources, start with IAM and work outward.
