Mary Mutua
Mastering Zero-Downtime Deployments with Terraform

Day 12 of my Terraform journey focused on one of the most practical infrastructure problems: how to deploy changes without taking an application offline.

This is the kind of Terraform topic that matters immediately in real systems. It is one thing to provision infrastructure. It is another thing entirely to update that infrastructure safely while users are still depending on it.

Today I worked through:

  • why Terraform's default replacement behavior can cause downtime
  • how create_before_destroy changes the order of operations
  • why Auto Scaling Group naming becomes a problem
  • how blue/green deployment works with a load balancer
  • how to test the difference between a rolling replacement and an atomic traffic switch

Why Default Terraform Can Cause Downtime

By default, if Terraform needs to replace a resource that cannot be updated in place, it often destroys the old resource before creating the new one.

For a live service, that is risky.

In a setup using an Auto Scaling Group and a load balancer, the default replacement flow can look like this:

  • old ASG is destroyed
  • instances are terminated
  • traffic has nowhere healthy to go
  • new ASG is created
  • new instances boot and pass health checks
  • the app finally comes back

That gap between old capacity disappearing and new capacity becoming healthy is the downtime window.

For production workloads, that is not acceptable.
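In the plan output, this default replacement shows up as Terraform's -/+ marker, which the plan legend describes as "destroy and then create replacement". A rough illustration of what that looks like (the resource name is made up, and exact output varies by Terraform version):

```
  # aws_autoscaling_group.web must be replaced
-/+ resource "aws_autoscaling_group" "web" {
      ...
    }

Plan: 1 to add, 0 to change, 1 to destroy.
```

With create_before_destroy enabled, the marker flips to +/-, "create replacement and then destroy", which is the order we actually want for a live service.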

The Fix: create_before_destroy

Terraform provides a lifecycle rule for this exact problem:

```hcl
lifecycle {
  create_before_destroy = true
}
```

This changes the replacement order:

  • create the new resource first
  • let the replacement become ready
  • destroy the old resource only after that

In my Day 12 lab, I used this on both:

  • launch templates
  • Auto Scaling Groups

That let Terraform attempt a safer rolling replacement instead of immediately dropping the old infrastructure.
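A minimal sketch of how the lifecycle rule might be wired into both resources. This is illustrative, not the exact lab code: the resource names, variables, and sizing are assumptions.

```hcl
resource "aws_launch_template" "web" {
  name_prefix   = "web-"      # name_prefix avoids hardcoded-name collisions
  image_id      = var.ami_id
  instance_type = "t3.micro"

  lifecycle {
    create_before_destroy = true  # build the replacement before removing the old one
  }
}

resource "aws_autoscaling_group" "web" {
  # NOTE: a hardcoded name here breaks create_before_destroy,
  # because old and new ASGs must briefly coexist (see the naming section)
  min_size            = 2
  max_size            = 4
  desired_capacity    = 2
  vpc_zone_identifier = var.subnet_ids
  target_group_arns   = [var.target_group_arn]

  launch_template {
    id      = aws_launch_template.web.id
    version = aws_launch_template.web.latest_version
  }

  lifecycle {
    create_before_destroy = true  # old ASG keeps serving until the new one is healthy
  }
}
```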

The ASG Naming Problem

There is an important catch.

When create_before_destroy = true is enabled, Terraform needs the old and new versions of the resource to exist at the same time for a short period.

That becomes a problem with Auto Scaling Groups because AWS does not allow two ASGs with the same name to exist at once.

If the name is hardcoded, the deployment fails even though the lifecycle rule is correct.

Solving It with random_id

The solution is to give the replacement ASG a unique name.

I used a random_id resource tied to the application version:

```hcl
resource "random_id" "rolling" {
  keepers = {
    app_version = var.app_version
  }

  byte_length = 4
}
```

Then I used that value in the rolling launch template and ASG naming.
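A sketch of how the random suffix can feed into the ASG name (the resource name and prefix are assumptions, not the exact lab code):

```hcl
resource "aws_autoscaling_group" "rolling" {
  # random_id exposes a .hex attribute; the keepers block above means this
  # value is regenerated whenever var.app_version changes, forcing a new name
  name = "web-asg-${random_id.rolling.hex}"

  # ... remaining ASG arguments ...

  lifecycle {
    create_before_destroy = true
  }
}
```

Because the keeper is tied to app_version, an unchanged version keeps the same name, so Terraform only replaces the ASG when the application actually changes.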

That solved a real deployment problem:

  • the old ASG could stay alive
  • the new ASG could be created beside it
  • Terraform could shift traffic without a name collision

This was one of the most important practical lessons of the day.

Rolling Zero-Downtime Deployment

To test the rolling replacement flow, I deployed a versioned response behind an Application Load Balancer.

The app initially returned:

  • Hello World v1

Then I changed the Terraform input to:

  • Hello World v2

and ran terraform apply again while continuously hitting the ALB.
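To watch the transition, a simple polling loop works. The ALB hostname below is a placeholder; substitute the DNS name from terraform output:

```
# Poll the ALB once per second, printing each response with a timestamp
while true; do
  echo "$(date +%T)  $(curl -s http://my-alb-123456.us-east-1.elb.amazonaws.com/)"
  sleep 1
done
```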

The goal was simple:

  • traffic should continue working throughout the apply
  • at some point the response should switch from v1 to v2
  • there should be no outage during the transition

That is the essence of a zero-downtime rolling update.

What I Observed During Testing

One important real-world lesson came up during testing: zero downtime often requires temporary extra capacity.

Because the old and new ASGs briefly coexist, AWS may need to run both old and new instances at the same time. On a small AWS quota, that can hit vCPU limits.

That happened in my lab, so I had to test the rolling and blue/green parts separately instead of running everything at once.

That is actually a useful lesson in itself:

  • zero-downtime is safer
  • but it can temporarily cost more capacity

Blue/Green Deployment

I also implemented a blue/green deployment pattern.

Instead of gradually replacing one environment, blue/green keeps two separate environments:

  • blue = currently live
  • green = next version

Traffic is controlled by the load balancer listener rule:

```hcl
action {
  type             = "forward"
  target_group_arn = var.active_environment == "blue" ? aws_lb_target_group.blue.arn : aws_lb_target_group.green.arn
}
```

That means traffic switching is not done by rebuilding the environment live. It is done by changing a single routing decision.
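A sketch of the input variable that could drive that decision. The variable name matches the listener rule above; the validation block is an assumption I would add as a guard:

```hcl
variable "active_environment" {
  description = "Which environment the ALB listener forwards traffic to"
  type        = string
  default     = "blue"

  validation {
    condition     = contains(["blue", "green"], var.active_environment)
    error_message = "active_environment must be \"blue\" or \"green\"."
  }
}
```

The validation means a typo like "gren" fails at plan time instead of producing a broken listener rule.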

In practice:

  • blue served traffic first
  • I changed active_environment to green
  • ran terraform apply
  • traffic switched to green

This made the cutover feel much cleaner and more atomic than a normal replacement.

One Testing Detail That Matters

During the blue/green test, I initially thought the switch had not worked because the browser kept showing the old environment.

The real issue was browser caching.

Using curl or a hard refresh made it clear that Terraform had updated the listener rule correctly and the green environment was serving traffic.
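A quick cache-free check looks like this (the ALB hostname is a placeholder):

```
# curl does not cache responses, so each request reflects the live listener rule
curl -s http://my-alb-123456.us-east-1.elb.amazonaws.com/

# -i also shows the response headers if you want to confirm which
# target group answered
curl -si http://my-alb-123456.us-east-1.elb.amazonaws.com/
```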

That was a small but very real operational lesson:

  • always verify deployment changes with tools that are less likely to hide the truth behind cached responses

Rolling vs Blue/Green

The two patterns solve similar problems, but in different ways.

Rolling zero-downtime:

  • replaces the existing infrastructure safely
  • uses create_before_destroy
  • depends on overlap and health checks

Blue/green:

  • keeps two environments ready
  • switches traffic at the load balancer
  • makes the cutover more explicit and atomic

Both are valuable. Blue/green feels cleaner, but it also requires more duplicate infrastructure.

My Main Takeaway

Day 12 made one thing very clear: zero downtime is not just about writing correct Terraform syntax.

It depends on:

  • resource lifecycle order
  • naming strategy
  • healthy load balancer targets
  • enough temporary capacity
  • good testing discipline

Terraform gives you the tools, but understanding how those tools behave during replacement is what makes a production deployment safe.

Full Code

GitHub reference:
👉 GitHub Link

Terminal Transition

During the rolling test, the ALB responses transitioned like this:

```
Hello World v1
Hello World v1
Hello World v1
Hello World v2
Hello World v2
Hello World v2
```

And for blue/green:

```
Blue environment v1
Blue environment v1
Green environment v2
Green environment v2
```

Follow My Journey

This is Day 12 of my 30-Day Terraform Challenge.

See you on Day 13 🚀
