Testing automatic rollback by deploying broken code, then codifying 30 AWS resources into Terraform so the entire CI/CD pipeline can be created with one command.
This is Part 2. Read Part 1 here: I Built a Full AWS CI/CD Pipeline with Blue/Green Deployments
Last week I built an AWS-native CI/CD pipeline with blue/green deployments. Someone on LinkedIn asked a great question: "Did you test that rollback works without manual intervention?"
I hadn't. So I did. Then I codified the entire infrastructure in Terraform.
Proving Rollback Actually Works
It's one thing to configure auto-rollback. It's another to watch it fire. I needed to deploy a broken version and confirm the system recovers on its own.
Two Layers of Safety
The pipeline has two layers of protection, and I accidentally discovered the first one.
Layer 1: Tests catch bad code in CodeBuild. My first attempt was changing the /health endpoint to return a 500. But `bun test` covers that endpoint, so the build failed before the code ever reached deployment. The pipeline stopped at the Build stage; Approval and Deploy never ran.
That's actually a great result: the first safety net caught the problem before it could go anywhere. But it wasn't what I wanted to test.
Layer 2: CodeDeploy catches failures during deployment. To test this, I needed code that passes tests but fails during deployment. I left the app healthy and instead broke the validate_service.sh script:
#!/bin/bash
set -e
echo "Simulating deployment validation failure"
exit 1
# ... rest of the script never executes
This time the build passed, tests passed, I approved the deployment, and CodeDeploy deployed to the green environment. Then ValidateService ran, hit the exit 1, and the deployment failed.
The Result
CodeDeploy detected the failure and triggered an automatic rollback:
- Trigger: AutomatedRollback
- Execution type: ROLLBACK
- Status: Succeeded
No SSH. No console clicks. No human intervention. The green instances were terminated, traffic stayed on the original blue environment, and users were never affected.
The rollback cycle took about 15 minutes, mostly ASG provisioning and the 5-minute termination wait. In production you'd tune those timers, but the mechanics are sound.
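Those timers live in the deployment group's blue/green settings. In the Terraform version, the relevant block looks roughly like this (resource and attribute names follow the AWS provider; the surrounding resource is trimmed and the names are illustrative):

```hcl
resource "aws_codedeploy_deployment_group" "app" {
  # ... app name, service role, target group info omitted ...

  deployment_style {
    deployment_option = "WITH_TRAFFIC_CONTROL"
    deployment_type   = "BLUE_GREEN"
  }

  blue_green_deployment_config {
    terminate_blue_instances_on_deployment_success {
      action = "TERMINATE"
      # The 5-minute wait I measured; shorten it in dev to speed up cycles
      termination_wait_time_in_minutes = 5
    }

    deployment_ready_option {
      action_on_timeout = "CONTINUE_DEPLOYMENT"
    }
  }
}
```

Dropping `termination_wait_time_in_minutes` to 0 or 1 in a sandbox environment cuts most of that 15-minute cycle.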
One More IAM Surprise
The automated rollback initially failed because the CodePipeline service role was missing codedeploy:GetApplicationRevision. AWS doesn't tell you this is needed until rollback actually fires.
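The fix was a one-line addition to the pipeline role's policy. Sketched in Terraform's `jsonencode` style, the statement ends up looking something like this (other actions trimmed to the CodeDeploy-related ones; treat the exact list as an assumption for your setup):

```hcl
# Fragment of the CodePipeline service role policy (illustrative)
{
  Effect = "Allow"
  Action = [
    "codedeploy:GetApplication",
    "codedeploy:GetApplicationRevision",   # only exercised when rollback fires
    "codedeploy:GetDeployment",
    "codedeploy:GetDeploymentConfig",
    "codedeploy:CreateDeployment",
    "codedeploy:RegisterApplicationRevision"
  ]
  Resource = "*"
}
```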
This is a theme with this project: you don't discover the permission you're missing until you trigger the specific action that needs it.
Infrastructure as Code with Terraform
After proving everything works, I codified the entire infrastructure. The goal: anyone should be able to recreate this pipeline with terraform apply.
The Structure
terraform/
├── main.tf # Provider, default VPC data sources
├── variables.tf # Configurable inputs
├── outputs.tf # ALB URL, resource names
├── iam.tf # All IAM roles + policies
├── s3.tf # Artifact bucket with lifecycle rules
├── sns.tf # Approval notification topic
├── security-groups.tf # ALB + EC2 security groups
├── alb.tf # Load balancer, target group, listener
├── asg.tf # Launch template + Auto Scaling Group
├── codebuild.tf # Build project with S3 caching
├── codedeploy.tf # App + blue/green deployment group
└── codepipeline.tf # Pipeline + CloudWatch Event trigger
30 resources. One command. The full pipeline, IAM roles, S3 bucket, SNS topic, security groups, ALB, ASG with launch template, CodeBuild project, CodeDeploy blue/green deployment group, CodePipeline with all four stages, and the CloudWatch Event rule to auto-trigger on pushes.
The IAM Battle (Round 2)
I thought I'd learned all the IAM lessons during the manual build. Terraform taught me new ones.
Blue/green needs broad Auto Scaling permissions. I started with a carefully scoped list of 17 specific `autoscaling:` actions. The deployment group wouldn't even create: it needed autoscaling:RecordLifecycleActionHeartbeat, which isn't in any AWS documentation for CodeDeploy. Even after adding that, deployments failed with the vague error "does not give you permission to perform operations in AmazonAutoScaling." The fix: autoscaling:*. AWS's blue/green implementation calls undocumented internal actions.
Same story with Elastic Load Balancing. Specific ELB permissions let CodeDeploy create the deployment group but failed during actual traffic shifting: the replacement instances couldn't register in the target group. The fix: elasticloadbalancing:*.
iam:PassRole needs wide scope. I initially scoped PassRole to specific role ARNs. Blue/green deployments pass roles to service-linked roles with unpredictable ARN patterns. The practical approach is Resource: "*" with a condition restricting which services can receive the role:
{
  Effect   = "Allow"
  Action   = "iam:PassRole"
  Resource = "*"
  Condition = {
    StringLike = {
      "iam:PassedToService" = [
        "ec2.amazonaws.com",
        "autoscaling.amazonaws.com",
        "codedeploy.amazonaws.com"
      ]
    }
  }
}
IAM propagation causes race conditions. The pipeline triggered immediately after Terraform created it, before IAM policies had propagated. The Source stage failed with "Insufficient permissions" even though the policy was correct. Retrying a minute later worked fine. This is a Terraform-specific issue; manual builds don't hit it because there's a human-speed delay between creating roles and using them.
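One way to absorb that delay in Terraform itself is an explicit pause between the IAM resources and the pipeline, using the hashicorp/time provider (a sketch; the resource names are illustrative, and simply re-running the pipeline once also works):

```hcl
resource "time_sleep" "wait_for_iam" {
  # Illustrative: depend on whichever policy attachment the pipeline role needs
  depends_on      = [aws_iam_role_policy.codepipeline]
  create_duration = "30s"
}

resource "aws_codepipeline" "pipeline" {
  name       = var.pipeline_name
  role_arn   = aws_iam_role.codepipeline.arn
  depends_on = [time_sleep.wait_for_iam]
  # ... artifact store and stages omitted ...
}
```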
CodeDeploy Agent Gotchas on Amazon Linux 2023
The launch template user data needs careful handling:
- Add a 30-second sleep before installing packages. The instance needs time to initialize before package managers work reliably.
- Use `/tmp` for downloads, not `/home/ec2-user`; the home directory may not exist during early boot.
- Log everything with `exec > /var/log/user-data.log 2>&1` so you can debug via SSM if the agent doesn't start.
Without these, instances boot but the CodeDeploy agent silently fails to install, and deployments time out with "CodeDeploy agent was not able to receive the lifecycle event."
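Put together, a user-data script with all three fixes looks roughly like this. Treat it as a sketch: the region in the install URL and the package list are assumptions for a us-east-1 Amazon Linux 2023 instance.

```bash
#!/bin/bash
# Log everything so failures are debuggable via SSM
exec > /var/log/user-data.log 2>&1
set -x

# Give the instance time to finish early boot before touching package managers
sleep 30

dnf install -y ruby wget

# Download to /tmp -- /home/ec2-user may not exist yet this early in boot
cd /tmp
wget https://aws-codedeploy-us-east-1.s3.us-east-1.amazonaws.com/latest/install
chmod +x ./install
./install auto
systemctl status codedeploy-agent
```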
Orphaned ASGs, The Undocumented Gotcha
When blue/green deployments fail mid-way, they leave behind orphaned Auto Scaling Groups with names like CodeDeploy_my-dg_d-ABC123. These block subsequent deployments because instances from the orphaned ASG are still registered in the target group as unhealthy.
The fix: check for and delete them before retrying.
# Find orphaned ASGs
aws autoscaling describe-auto-scaling-groups \
--query "AutoScalingGroups[?contains(AutoScalingGroupName, 'CodeDeploy')].AutoScalingGroupName" \
--output text
# Delete them
aws autoscaling delete-auto-scaling-group \
--auto-scaling-group-name "NAME_HERE" --force-delete
This cost me hours of debugging. I'm writing it down so it doesn't cost you the same.
The Cost Advantage of Terraform
With everything in Terraform, the economics change:
# Done for the day? Tear it all down.
terraform destroy
# Ready to work again? Bring it all back.
terraform apply
The ALB alone costs ~$16/month. When you're learning and not actively using the pipeline, that adds up. Terraform lets you pay only for the hours you're actually working.
What I'd Do Differently Next Time
Start with autoscaling:* and elasticloadbalancing:*, then tighten permissions later using CloudTrail logs to see which actions are actually called. Fighting undocumented permissions one at a time wastes hours.
Add CloudWatch alarms for post-deploy monitoring. Right now rollback only triggers if lifecycle hooks fail during deployment. In production, you'd want alarms monitoring error rates and latency after traffic shifts, with automatic rollback if metrics spike.
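In the Terraform version, that means wiring a CloudWatch alarm into the deployment group. The shape (per the AWS provider) is roughly this, with the alarm itself defined elsewhere and its name here purely illustrative:

```hcl
resource "aws_codedeploy_deployment_group" "app" {
  # ... existing blue/green settings ...

  alarm_configuration {
    enabled = true
    alarms  = ["app-alb-5xx-rate"]  # a CloudWatch alarm on ALB 5xx rate
  }

  auto_rollback_configuration {
    enabled = true
    events  = ["DEPLOYMENT_FAILURE", "DEPLOYMENT_STOP_ON_ALARM"]
  }
}
```

With `DEPLOYMENT_STOP_ON_ALARM` in the events list, a firing alarm during or after traffic shift stops the deployment and rolls back, instead of waiting for a lifecycle hook to fail.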
Implement canary deployments. Instead of shifting 100% of traffic at once, shift 10% first, monitor for a few minutes, then complete the rollout. CodeDeploy supports this with canary deployment configurations, though the time-based canary options target the ECS and Lambda compute platforms; EC2 blue/green reroutes traffic all at once, so a true canary there means weighted target groups or a platform change.
Key Takeaways
Test your failure modes. Don't assume rollback works, prove it! Deploy a broken version and watch it recover. That confidence is worth more than any passing test.
Terraform turns a project into a product. A manually-built pipeline is a demo. A Terraform-codified pipeline is something a team can use. And `terraform destroy` changes the economics of learning on AWS.
Start broad with IAM, tighten later. For blue/green deployments specifically, AWS calls undocumented internal actions. Scoped permissions fail silently with vague errors. Get it working first, then use CloudTrail to scope down.
Document the gotchas nobody else does. Orphaned ASGs, agent installation timing, IAM propagation race conditions, these are the problems you'll actually hit in production, and they're not in any tutorial.
Links
- GitHub repo: aws-pipeline-demo (includes Terraform)
- Part 1: I Built a Full AWS CI/CD Pipeline with Blue/Green Deployments
- Portfolio: augusthottie.com
I'm building my DevOps portfolio ahead of targeting the AWS DevOps Professional certification. If you're on a similar journey, I'd love to connect: drop a comment or find me on LinkedIn.