
Mary Mutua


A Workflow for Deploying Infrastructure Code with Terraform

Day 21 of my Terraform journey focused on a workflow that looks familiar at first, but carries a much bigger blast radius than normal application code deployment.

In Chapter 10 of Terraform: Up & Running, Yevgeniy Brikman explains that infrastructure code can follow the same release shape as application code:

branch, review, test, merge, release, deploy.

But infrastructure code has extra risks.

A bad application deploy might return a 500 error.

A bad infrastructure deploy can break networking, remove a load balancer, corrupt state, or destroy critical resources.

That is why Terraform needs more than good code.

It needs a safe deployment workflow around the code.

For Day 21, I deployed a real infrastructure change end-to-end: adding a CloudWatch alarm to a webserver Auto Scaling Group.

GitHub reference:

👉 Day 21 Code

The Infrastructure Change

I created a standalone Day 21 environment and added a CloudWatch alarm that watches the number of healthy in-service instances in my Auto Scaling Group.

The alarm monitors:

Namespace: AWS/AutoScaling
Metric:    GroupInServiceInstances

It fires when the ASG has fewer in-service instances than expected.

This was a good infrastructure workflow change because it was useful but controlled: it improved observability without changing routing, instance capacity, databases, or production resources.

Step 1: Version Control

I started with a feature branch:

git checkout -b add-cloudwatch-alarms-day21

This keeps infrastructure changes reviewable before they touch real cloud resources.

One key lesson from today: branches are good for review, but shared environments should be applied only after merge. Applying stale feature branches to real environments can accidentally undo someone else’s infrastructure change.

Step 2: Run Locally With Terraform Plan

For application code, “run locally” might mean starting the app.

For Terraform, the equivalent is running a plan.

You are not running the infrastructure locally. You are asking Terraform:

“What would you change in the real cloud environment?”

I ran:

terraform -chdir=day_21/live/dev plan -out=day21.tfplan

The plan showed:

Plan: 16 to add, 0 to change, 0 to destroy.

That summary matters.

It told me Terraform would only create new Day 21 dev resources. No existing resources would be changed or destroyed.

Step 3: Make the Infrastructure Change

The main change was this CloudWatch alarm:

resource "aws_cloudwatch_metric_alarm" "asg_capacity_low" {
  alarm_name          = "${var.cluster_name}-${random_id.suffix.hex}-asg-capacity-low"
  alarm_description   = "Alerts when the ASG has fewer in-service instances than expected."
  namespace           = "AWS/AutoScaling"
  metric_name         = "GroupInServiceInstances"
  statistic           = "Average"
  period              = 60
  evaluation_periods  = 1
  threshold           = var.desired_capacity
  comparison_operator = "LessThanThreshold"
  treat_missing_data  = "breaching"
  alarm_actions       = [module.webserver_cluster.alerts_topic_arn]

  dimensions = {
    AutoScalingGroupName = module.webserver_cluster.asg_name
  }

  tags = merge(local.common_tags, {
    Name = "${var.cluster_name}-asg-capacity-low"
  })
}

This detects when the ASG is not maintaining the expected number of healthy instances.
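The alarm_actions above reference an SNS topic exposed by the webserver module. The module internals are not shown in this post, but a minimal sketch of such a topic and its output might look like this (names are illustrative, not the module's actual code):

```hcl
# Hypothetical sketch: an SNS topic for alarm notifications inside the
# webserver_cluster module, exposed via an output so callers can wire
# alarms to it. Names are illustrative, not the module's actual code.
resource "aws_sns_topic" "alerts" {
  name = "${var.cluster_name}-alerts"
}

output "alerts_topic_arn" {
  description = "ARN of the SNS topic that receives alarm notifications"
  value       = aws_sns_topic.alerts.arn
}
```

Subscribing an email address or webhook to that topic is what turns the alarm state change into an actual notification.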

Step 4: Submit for Review

This is where infrastructure code differs from application code.

For app code, reviewers often focus on the code diff.

For Terraform, reviewers need the code diff and the plan output.

The plan is the real infrastructure preview. It shows what Terraform will actually create, change, or destroy.

In my PR, I included the plan summary:

Plan: 16 to add, 0 to change, 0 to destroy.

I also documented the two sections that are easy to skip but very important: blast radius and rollback plan.

Blast Radius

Blast radius explains what could be affected if the apply goes wrong.

For this change, the blast radius was low:

This creates a standalone Day 21 dev webserver stack and adds a CloudWatch alarm for ASG capacity. It does not modify older day folders, production resources, existing routing, existing security groups, existing databases, or existing capacity.

If the apply failed halfway, the likely impact would be partial Day 21 dev resources such as:

  • ALB
  • ASG
  • security groups
  • CloudWatch alarms

Existing infrastructure should not be affected because the lab used separate Day 21 names, tags, and Terraform workspace state.

This is one of the biggest Day 21 lessons for me: an infrastructure PR should not only say what changed. It should also say what could break.

Rollback Plan

The rollback plan was documented before deployment:

If the apply causes issues, revert the PR and apply the reviewed rollback plan.

Because this was a standalone dev lab, the fastest cleanup path was:

terraform -chdir=day_21/live/dev destroy

Rollback planning matters because infrastructure does not always roll back cleanly like application code. Sometimes rollback means reverting code. Sometimes it means restoring state. Sometimes it means manual cleanup.

The worst time to decide that is during an incident.

Step 5: Run Automated Tests

The PR ran GitHub Actions checks.

For Day 21, the workflow ran:

terraform fmt -check -recursive day_21
terraform validate
terraform test

The native Terraform tests passed:

4 passed, 0 failed

These tests were intentionally fast. Day 21 was not about re-running expensive end-to-end tests. It was about building a safe infrastructure deployment workflow.

The real AWS verification happened after the reviewed apply.
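The post does not include the test files themselves, but for context, a native `terraform test` file uses `run` blocks with assertions. A minimal sketch (file name and assertions assumed, not the repo's actual tests) could look like:

```hcl
# Hypothetical native Terraform test file (e.g. tests/alarm.tftest.hcl).
# `command = plan` keeps the test fast: it asserts against the plan
# without creating real AWS resources.
run "alarm_is_configured" {
  command = plan

  assert {
    condition     = aws_cloudwatch_metric_alarm.asg_capacity_low.metric_name == "GroupInServiceInstances"
    error_message = "Alarm must watch the GroupInServiceInstances metric."
  }

  assert {
    condition     = aws_cloudwatch_metric_alarm.asg_capacity_low.comparison_operator == "LessThanThreshold"
    error_message = "Alarm must fire when capacity drops below the threshold."
  }
}
```

Because these run at plan time, they are cheap enough to put in every PR check.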

Step 6: Merge and Release

After the PR checks passed, I merged the PR and tagged the release:

git tag -a "v1.4.0" -m "Add ASG capacity alarm for Day 21 infrastructure workflow"
git push origin v1.4.0

For application code, the release artifact might be a Docker image.

For Terraform, the Git commit or module tag becomes the release artifact. That makes infrastructure changes traceable.
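One way to consume that release artifact is to pin module sources to the tag. A sketch, with the repository path invented for illustration:

```hcl
# Hypothetical consumer pinning the module to the tagged release.
# The ?ref=v1.4.0 query pins the Git source to an exact, reviewed version,
# so later commits to the module cannot silently change this environment.
module "webserver_cluster" {
  source = "github.com/example-org/terraform-modules//webserver-cluster?ref=v1.4.0"

  cluster_name = "day21-dev"
}
```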

Step 7: Deploy

The safest approach is to apply a saved plan:

terraform apply day21.tfplan

But I hit an important real-world Terraform lesson:

Error: Saved plan is stale

This means the plan was created against an older state snapshot. Terraform could no longer guarantee that applying it would match the current state.

That is a safety feature, not a failure.

The fix was to create a new saved plan from the current state:

terraform -chdir=day_21/live/dev plan -out=day21-reviewed.tfplan

I confirmed the new plan still matched the reviewed intent:

Plan: 16 to add, 0 to change, 0 to destroy.

Then I applied the regenerated saved plan:

terraform -chdir=day_21/live/dev apply day21-reviewed.tfplan

That kept the workflow disciplined: even after the stale-plan issue, I still applied a reviewed saved plan instead of running a plain terraform apply.

Verification

After apply, I verified the application through the ALB:

curl -s http://$(terraform -chdir=day_21/live/dev output -raw alb_dns_name)

The response confirmed the app was live:

Hello from Day 21 infrastructure workflow

Then I verified the CloudWatch alarm existed:

aws cloudwatch describe-alarms \
  --alarm-names "$(terraform -chdir=day_21/live/dev output -raw asg_capacity_alarm_name)" \
  --query "MetricAlarms[*].[AlarmName,Namespace,MetricName,Threshold,ComparisonOperator,StateValue]" \
  --output table

The alarm was created against:

Namespace:           AWS/AutoScaling
Metric:              GroupInServiceInstances
Comparison operator: LessThanThreshold

Finally, I ran another plan:

terraform -chdir=day_21/live/dev plan

Terraform returned:

No changes. Your infrastructure matches the configuration.

That confirmed Terraform state and AWS matched after apply.

Cleanup

Because this deployed real AWS resources, cleanup was part of the workflow.

terraform -chdir=day_21/live/dev destroy

After destroy, I verified that no active Day 21 resources remained.

Cleanup is not an afterthought. It is part of testing infrastructure safely.

Infrastructure Safeguards That Application Code Usually Does Not Need

Day 21 made the infrastructure-specific safeguards very clear.

Saved plans

A saved plan helps ensure that what gets applied is what was reviewed.

terraform plan -out=reviewed.tfplan
terraform apply reviewed.tfplan

State protection

Terraform state represents real infrastructure. If state is wrong, Terraform’s view of the world is wrong.

That is why remote state, locking, and state backups matter.
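A common setup is an S3 backend with versioning and locking. A sketch, with bucket and table names invented for illustration:

```hcl
# Hypothetical S3 backend configuration. The bucket should have
# versioning enabled (so old state files serve as backups), and the
# DynamoDB table provides state locking so two applies cannot run
# against the same state concurrently.
terraform {
  backend "s3" {
    bucket         = "example-terraform-state" # illustrative name
    key            = "day_21/live/dev/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-terraform-locks" # illustrative name
    encrypt        = true
  }
}
```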

Destructive-change approval

Any plan with destroys deserves extra attention.

Plan: 0 to add, 2 to change, 1 to destroy

That should trigger a deeper review before apply.
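Terraform can also enforce this at the resource level. For resources that should never be destroyed by a routine apply, `prevent_destroy` makes any destroying plan fail outright (the resource below is a generic example, not part of the Day 21 code):

```hcl
# lifecycle.prevent_destroy causes `terraform plan`/`apply` to error out
# if the change would destroy this resource, forcing a deliberate,
# explicit removal of the safeguard first.
resource "aws_db_instance" "example" {
  # ... engine, instance_class, and other settings ...

  lifecycle {
    prevent_destroy = true
  }
}
```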

Blast radius documentation

Every infrastructure PR should explain what could be affected if the change fails.

This is especially important for shared resources like:

  • VPCs
  • security groups
  • IAM roles
  • databases
  • load balancers

Rollback plan

A rollback plan should exist before apply.

If something breaks, the team should already know whether the rollback is:

  • revert the PR
  • apply a previous plan
  • restore state
  • restore from backup
  • manually clean up partial resources

Sentinel as the Enforcement Layer

Sentinel is HashiCorp's policy-as-code framework, used with Terraform Cloud and Terraform Enterprise.

terraform validate checks whether Terraform code is valid.

Sentinel checks whether a valid plan is allowed by policy.

For Day 21, I added a Sentinel policy that only allows approved EC2 instance types:

import "tfplan/v2" as tfplan

allowed_instance_types = ["t2.micro", "t2.small", "t2.medium", "t3.micro", "t3.small"]

main = rule {
  all tfplan.resource_changes as _, rc {
    rc.type is not "aws_instance" or
    rc.change.after.instance_type in allowed_instance_types
  }
}

This means a plan could be syntactically correct, but still blocked because it violates infrastructure policy.

That is the difference between validation and governance.

My Main Takeaway

The infrastructure workflow looks similar to the application workflow:

  • branch
  • plan
  • review
  • test
  • merge
  • release
  • deploy

But infrastructure needs stronger safeguards because the consequences are different.

The biggest lesson from Day 21 was this:

the Terraform plan is not just a command output.

It is the review artifact, the safety checkpoint, and the bridge between code and real cloud changes.

Good infrastructure deployment is not just terraform apply.

It is plan, review, blast-radius thinking, rollback planning, policy checks, verification, and cleanup.

That is what turns Terraform code into safe infrastructure engineering.


GitHub reference:

👉 Day 21 Code

Follow My Journey

This is Day 21 of my 30-Day Terraform Challenge.

See you on Day 22.
