
Mary Mutua


A Workflow for Deploying Infrastructure Code with Terraform

Day 21 of my Terraform journey focused on a workflow that looks familiar at first, but carries a much bigger blast radius than normal application code deployment.

In Chapter 10 of Terraform: Up & Running, Yevgeniy Brikman explains that infrastructure code can follow the same release shape as application code:

branch, review, test, merge, release, deploy.

But infrastructure code has extra risks.

A bad application deploy might return a 500 error.

A bad infrastructure deploy can break networking, remove a load balancer, corrupt state, or destroy critical resources.

That is why Terraform needs more than good code.

It needs a safe deployment workflow around the code.

For Day 21, I deployed a real infrastructure change end-to-end: adding a CloudWatch alarm to a webserver Auto Scaling Group.

GitHub reference:

👉 Day 21 Code

The Infrastructure Change

I created a standalone Day 21 environment and added a CloudWatch alarm that watches the number of healthy in-service instances in my Auto Scaling Group.

The alarm monitors:

Namespace: AWS/AutoScaling
Metric:    GroupInServiceInstances

It fires when the ASG has fewer in-service instances than expected.

This was a good infrastructure workflow change because it was useful but controlled: it improved observability without changing routing, instance capacity, databases, or production resources.

Step 1: Version Control

I started with a feature branch:

git checkout -b add-cloudwatch-alarms-day21

This keeps infrastructure changes reviewable before they touch real cloud resources.

One key lesson from today: branches are good for review, but shared environments should be applied only after merge. Applying stale feature branches to real environments can accidentally undo someone else’s infrastructure change.

Step 2: Run Locally With Terraform Plan

For application code, “run locally” might mean starting the app.

For Terraform, the equivalent is running a plan.

You are not running the infrastructure locally. You are asking Terraform:

“What would you change in the real cloud environment?”

I ran:

terraform -chdir=day_21/live/dev plan -out=day21.tfplan

The plan showed:

Plan: 16 to add, 0 to change, 0 to destroy.

That summary matters.

It told me Terraform would only create new Day 21 dev resources. No existing resources would be changed or destroyed.

Step 3: Make the Infrastructure Change

The main change was this CloudWatch alarm:

resource "aws_cloudwatch_metric_alarm" "asg_capacity_low" {
  alarm_name          = "${var.cluster_name}-${random_id.suffix.hex}-asg-capacity-low"
  alarm_description   = "Alerts when the ASG has fewer in-service instances than expected."
  namespace           = "AWS/AutoScaling"
  metric_name         = "GroupInServiceInstances"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 1
  threshold           = var.desired_capacity
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [module.webserver_cluster.alerts_topic_arn]

  dimensions = {
    AutoScalingGroupName = module.webserver_cluster.asg_name
  }

  tags = merge(local.common_tags, {
    Name = "${var.cluster_name}-asg-capacity-low"
  })
}

This detects when the ASG is not maintaining the expected number of healthy instances.
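The alarm_actions above reference an SNS topic exposed by the webserver module. The module internals are not shown in this post, but a minimal sketch of such a topic and its output might look like this (names are illustrative, not the module's actual code):

```hcl
# Hypothetical sketch: an SNS topic for alarm notifications inside the
# webserver_cluster module, exposed via an output so callers can wire
# alarms to it. Names are illustrative, not the module's actual code.
resource "aws_sns_topic" "alerts" {
  name = "${var.cluster_name}-alerts"
}

output "alerts_topic_arn" {
  description = "ARN of the SNS topic that receives alarm notifications"
  value       = aws_sns_topic.alerts.arn
}
```

Subscribing an email address or webhook to that topic is what turns the alarm state change into an actual notification.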

Step 4: Submit for Review

This is where infrastructure code differs from application code.

For app code, reviewers often focus on the code diff.

For Terraform, reviewers need the code diff and the plan output.

The plan is the real infrastructure preview. It shows what Terraform will actually create, change, or destroy.

In my PR, I included the plan summary:

Plan: 16 to add, 0 to change, 0 to destroy.

I also documented the two sections that are easy to skip but very important: blast radius and rollback plan.

Blast Radius

Blast radius explains what could be affected if the apply goes wrong.

For this change, the blast radius was low:

This creates a standalone Day 21 dev webserver stack and adds a CloudWatch alarm for ASG capacity. It does not modify older day folders, production resources, existing routing, existing security groups, existing databases, or existing capacity.

If the apply failed halfway, the likely impact would be partial Day 21 dev resources such as:

  • ALB
  • ASG
  • security groups
  • CloudWatch alarms

Existing infrastructure should not be affected because the lab used separate Day 21 names, tags, and Terraform workspace state.

This is one of the biggest Day 21 lessons for me: an infrastructure PR should not only say what changed. It should also say what could break.

Rollback Plan

The rollback plan was documented before deployment:

If the apply causes issues, revert the PR and apply the reviewed rollback plan.

Because this was a standalone dev lab, the fastest cleanup path was:

terraform -chdir=day_21/live/dev destroy

Rollback planning matters because infrastructure does not always roll back cleanly like application code. Sometimes rollback means reverting code. Sometimes it means restoring state. Sometimes it means manual cleanup.

The worst time to decide that is during an incident.

Step 5: Run Automated Tests

The PR ran GitHub Actions checks.

For Day 21, the workflow ran:

terraform fmt -check -recursive day_21
terraform validate
terraform test

The native Terraform tests passed:

4 passed, 0 failed

These tests were intentionally fast. Day 21 was not about re-running expensive end-to-end tests. It was about building a safe infrastructure deployment workflow.

The real AWS verification happened after the reviewed apply.
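The post does not include the test files themselves, but for context, a native `terraform test` file uses `run` blocks with assertions. A minimal sketch (file name and assertions assumed, not the repo's actual tests) could look like:

```hcl
# Hypothetical native Terraform test file (e.g. tests/alarm.tftest.hcl).
# `command = plan` keeps the test fast: it asserts against the plan
# without creating real AWS resources.
run "alarm_is_configured" {
  command = plan

  assert {
    condition     = aws_cloudwatch_metric_alarm.asg_capacity_low.metric_name == "GroupInServiceInstances"
    error_message = "Alarm must watch the GroupInServiceInstances metric."
  }

  assert {
    condition     = aws_cloudwatch_metric_alarm.asg_capacity_low.comparison_operator == "LessThanThreshold"
    error_message = "Alarm must fire when capacity drops below the threshold."
  }
}
```

Because these run at plan time, they are cheap enough to put in every PR check.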

Step 6: Merge and Release

After the PR checks passed, I merged the PR and tagged the release:

git tag -a "v1.4.0" -m "Add ASG capacity alarm for Day 21 infrastructure workflow"
git push origin v1.4.0

For application code, the release artifact might be a Docker image.

For Terraform, the Git commit or module tag becomes the release artifact. That makes infrastructure changes traceable.
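One way to consume that release artifact is to pin module sources to the tag. A sketch, with the repository path invented for illustration:

```hcl
# Hypothetical consumer pinning the module to the tagged release.
# The ?ref=v1.4.0 query pins the Git source to an exact, reviewed version,
# so later commits to the module cannot silently change this environment.
module "webserver_cluster" {
  source = "github.com/example-org/terraform-modules//webserver-cluster?ref=v1.4.0"

  cluster_name = "day21-dev"
}
```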

Step 7: Deploy

The safest approach is to apply a saved plan:

terraform apply day21.tfplan

But I hit an important real-world Terraform lesson:

Error: Saved plan is stale

This means the plan was created against an older state snapshot. Terraform could no longer guarantee that applying it would match the current state.

That is a safety feature, not a failure.

The fix was to create a new saved plan from the current state:

terraform -chdir=day_21/live/dev plan -out=day21-reviewed.tfplan

I confirmed the new plan still matched the reviewed intent:

Plan: 16 to add, 0 to change, 0 to destroy.

Then I applied the regenerated saved plan:

terraform -chdir=day_21/live/dev apply day21-reviewed.tfplan

That kept the workflow disciplined: even after the stale-plan issue, I still applied a reviewed saved plan instead of running a plain terraform apply.

Verification

After apply, I verified the application through the ALB:

curl -s http://$(terraform -chdir=day_21/live/dev output -raw alb_dns_name)

The response confirmed the app was live:

Hello from Day 21 infrastructure workflow

Then I verified the CloudWatch alarm existed:

aws cloudwatch describe-alarms \
  --alarm-names "$(terraform -chdir=day_21/live/dev output -raw asg_capacity_alarm_name)" \
  --query "MetricAlarms[*].[AlarmName,Namespace,MetricName,Threshold,ComparisonOperator,StateValue]" \
  --output table

The alarm was created against:

Namespace:           AWS/AutoScaling
Metric:              GroupInServiceInstances
Comparison operator: LessThanThreshold

Finally, I ran another plan:

terraform -chdir=day_21/live/dev plan

Terraform returned:

No changes. Your infrastructure matches the configuration.

That confirmed Terraform state and AWS matched after apply.

Cleanup

Because this deployed real AWS resources, cleanup was part of the workflow.

terraform -chdir=day_21/live/dev destroy

After destroy, I verified that no active Day 21 resources remained.

Cleanup is not an afterthought. It is part of testing infrastructure safely.

Infrastructure Safeguards That Application Code Usually Does Not Need

Day 21 made the infrastructure-specific safeguards very clear.

Saved plans

A saved plan helps ensure that what gets applied is what was reviewed.

terraform plan -out=reviewed.tfplan
terraform apply reviewed.tfplan

State protection

Terraform state represents real infrastructure. If state is wrong, Terraform’s view of the world is wrong.

That is why remote state, locking, and state backups matter.
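A common setup is an S3 backend with versioning and locking. A sketch, with bucket and table names invented for illustration:

```hcl
# Hypothetical S3 backend configuration. The bucket should have
# versioning enabled (so old state files serve as backups), and the
# DynamoDB table provides state locking so two applies cannot run
# against the same state concurrently.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state" # illustrative name
    key            = "day_21/live/dev/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-terraform-locks" # illustrative name
    encrypt        = true
  }
}
```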

Destructive-change approval

Any plan with destroys deserves extra attention.

Plan: 0 to add, 2 to change, 1 to destroy

That should trigger a deeper review before apply.
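Terraform can also enforce this at the resource level. For resources that should never be destroyed by a routine apply, `prevent_destroy` makes any destroying plan fail outright (the resource below is a generic example, not part of the Day 21 code):

```hcl
# lifecycle.prevent_destroy causes `terraform plan`/`apply` to error out
# if the change would destroy this resource, forcing a deliberate,
# explicit removal of the safeguard first.
resource "aws_db_instance" "example" {
  # ... engine, instance_class, and other settings ...

  lifecycle {
    prevent_destroy = true
  }
}
```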

Blast radius documentation

Every infrastructure PR should explain what could be affected if the change fails.

This is especially important for shared resources like:

  • VPCs
  • security groups
  • IAM roles
  • databases
  • load balancers

Rollback plan

A rollback plan should exist before apply.

If something breaks, the team should already know whether the rollback is:

  • revert the PR
  • apply a previous plan
  • restore state
  • restore from backup
  • manually clean up partial resources

Sentinel as the Enforcement Layer

Sentinel is HashiCorp's policy-as-code framework, used with Terraform Cloud and Terraform Enterprise.

terraform validate checks whether Terraform code is valid.

Sentinel checks whether a valid plan is allowed by policy.

For Day 21, I added a Sentinel policy that only allows approved EC2 instance types:

import "tfplan/v2" as tfplan

allowed_instance_types = ["t2.micro", "t2.small", "t2.medium", "t3.micro", "t3.small"]

main = rule {
  all tfplan.resource_changes as _, rc {
    rc.type is not "aws_instance" or
    rc.change.after.instance_type in allowed_instance_types
  }
}

This means a plan could be syntactically correct, but still blocked because it violates infrastructure policy.

That is the difference between validation and governance.

My Main Takeaway

The infrastructure workflow looks similar to the application workflow:

  • branch
  • plan
  • review
  • test
  • merge
  • release
  • deploy

But infrastructure needs stronger safeguards because the consequences are different.

The biggest lesson from Day 21 was this:

the Terraform plan is not just a command output.

It is the review artifact, the safety checkpoint, and the bridge between code and real cloud changes.

Good infrastructure deployment is not just terraform apply.

It is plan, review, blast-radius thinking, rollback planning, policy checks, verification, and cleanup.

That is what turns Terraform code into safe infrastructure engineering.


GitHub reference:

👉 Day 21 Code

Follow My Journey

This is Day 21 of my 30-Day Terraform Challenge.

See you on Day 22.
