Mary Mutua
Mastering Zero-Downtime Deployments with Terraform

Day 12 of my Terraform journey focused on one of the most practical infrastructure problems: how to deploy changes without taking an application offline.

This is the kind of Terraform topic that matters immediately in real systems. It is one thing to provision infrastructure. It is another thing entirely to update that infrastructure safely while users are still depending on it.

Today I worked through:

  • why Terraform's default replacement behavior can cause downtime
  • how create_before_destroy changes the order of operations
  • why Auto Scaling Group naming becomes a problem
  • how blue/green deployment works with a load balancer
  • how to test the difference between a rolling replacement and an atomic traffic switch

Why Default Terraform Can Cause Downtime

By default, if Terraform needs to replace a resource that cannot be updated in place, it often destroys the old resource before creating the new one.

For a live service, that is risky.

In a setup using an Auto Scaling Group and a load balancer, the default replacement flow can look like this:

  • old ASG is destroyed
  • instances are terminated
  • traffic has nowhere healthy to go
  • new ASG is created
  • new instances boot and pass health checks
  • the app finally comes back

That gap between old capacity disappearing and new capacity becoming healthy is the downtime window.

For production workloads, that is not acceptable.
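In the plan output, this default replacement shows up as Terraform's -/+ marker, which the plan legend describes as "destroy and then create replacement". A rough illustration of what that looks like (the resource name is made up, and exact output varies by Terraform version):

```
  # aws_autoscaling_group.web must be replaced
-/+ resource "aws_autoscaling_group" "web" {
      ...
    }

Plan: 1 to add, 0 to change, 1 to destroy.
```

With create_before_destroy enabled, the marker flips to +/-, "create replacement and then destroy", which is the order we actually want for a live service.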

The Fix: create_before_destroy

Terraform provides a lifecycle rule for this exact problem:

```hcl
lifecycle {
  create_before_destroy = true
}
```

This changes the replacement order:

  • create the new resource first
  • let the replacement become ready
  • destroy the old resource only after that

In my Day 12 lab, I used this on both:

  • launch templates
  • Auto Scaling Groups

That let Terraform attempt a safer rolling replacement instead of immediately dropping the old infrastructure.
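A minimal sketch of how the lifecycle rule might be wired into both resources. This is illustrative, not the exact lab code: the resource names, variables, and sizing are assumptions.

```hcl
resource "aws_launch_template" "web" {
  name_prefix   = "web-"      # name_prefix avoids hardcoded-name collisions
  image_id      = var.ami_id
  instance_type = "t3.micro"

  lifecycle {
    create_before_destroy = true  # build the replacement before removing the old one
  }
}

resource "aws_autoscaling_group" "web" {
  # NOTE: a hardcoded name here breaks create_before_destroy,
  # because old and new ASGs must briefly coexist (see the naming section)
  min_size            = 2
  max_size            = 4
  desired_capacity    = 2
  vpc_zone_identifier = var.subnet_ids
  target_group_arns   = [var.target_group_arn]

  launch_template {
    id      = aws_launch_template.web.id
    version = aws_launch_template.web.latest_version
  }

  lifecycle {
    create_before_destroy = true  # old ASG keeps serving until the new one is healthy
  }
}
```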

The ASG Naming Problem

There is an important catch.

When create_before_destroy = true is enabled, Terraform needs the old and new versions of the resource to exist at the same time for a short period.

That becomes a problem with Auto Scaling Groups because AWS does not allow two ASGs with the same name to exist at once.

If the name is hardcoded, the deployment fails even though the lifecycle rule is correct.

Solving It with random_id

The solution is to give the replacement ASG a unique name.

I used a random_id resource tied to the application version:

```hcl
resource "random_id" "rolling" {
  keepers = {
    app_version = var.app_version
  }

  byte_length = 4
}
```

Then I used that value in the rolling launch template and ASG naming.
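A sketch of how the random suffix can feed into the ASG name (the resource name and prefix are assumptions, not the exact lab code):

```hcl
resource "aws_autoscaling_group" "rolling" {
  # random_id exposes a .hex attribute; the keepers block above means this
  # value is regenerated whenever var.app_version changes, forcing a new name
  name = "web-asg-${random_id.rolling.hex}"

  # ... remaining ASG arguments ...

  lifecycle {
    create_before_destroy = true
  }
}
```

Because the keeper is tied to app_version, an unchanged version keeps the same name, so Terraform only replaces the ASG when the application actually changes.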

That solved a real deployment problem:

  • the old ASG could stay alive
  • the new ASG could be created beside it
  • Terraform could shift traffic without a name collision

This was one of the most important practical lessons of the day.

Rolling Zero-Downtime Deployment

To test the rolling replacement flow, I deployed a versioned response behind an Application Load Balancer.

The app initially returned:

  • Hello World v1

Then I changed the Terraform input to:

  • Hello World v2

and ran terraform apply again while continuously hitting the ALB.
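To watch the transition, a simple polling loop works. The ALB hostname below is a placeholder; substitute the DNS name from terraform output:

```
# Poll the ALB once per second, printing each response with a timestamp
while true; do
  echo "$(date +%T)  $(curl -s http://my-alb-123456.us-east-1.elb.amazonaws.com/)"
  sleep 1
done
```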

The goal was simple:

  • traffic should continue working throughout the apply
  • at some point the response should switch from v1 to v2
  • there should be no outage during the transition

That is the essence of a zero-downtime rolling update.

What I Observed During Testing

One important real-world lesson came up during testing: zero downtime often requires temporary extra capacity.

Because the old and new ASGs briefly coexist, AWS may need to run both old and new instances at the same time. On a small AWS quota, that can hit vCPU limits.

That happened in my lab, so I had to test the rolling and blue/green parts separately instead of running everything at once.

That is actually a useful lesson in itself:

  • zero-downtime is safer
  • but it can temporarily cost more capacity

Blue/Green Deployment

I also implemented a blue/green deployment pattern.

Instead of gradually replacing one environment, blue/green keeps two separate environments:

  • blue = currently live
  • green = next version

Traffic is controlled by the load balancer listener rule:

```hcl
action {
  type             = "forward"
  target_group_arn = var.active_environment == "blue" ? aws_lb_target_group.blue.arn : aws_lb_target_group.green.arn
}
```

That means traffic switching is not done by rebuilding the environment live. It is done by changing a single routing decision.
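A sketch of the input variable that could drive that decision. The variable name matches the listener rule above; the validation block is an assumption I would add as a guard:

```hcl
variable "active_environment" {
  description = "Which environment the ALB listener forwards traffic to"
  type        = string
  default     = "blue"

  validation {
    condition     = contains(["blue", "green"], var.active_environment)
    error_message = "active_environment must be \"blue\" or \"green\"."
  }
}
```

The validation means a typo like "gren" fails at plan time instead of producing a broken listener rule.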

In practice:

  • blue served traffic first
  • I changed active_environment to green
  • ran terraform apply
  • traffic switched to green

This made the cutover feel much cleaner and more atomic than a normal replacement.

One Testing Detail That Matters

During the blue/green test, I initially thought the switch had not worked because the browser kept showing the old environment.

The real issue was browser caching.

Using curl or a hard refresh made it clear that Terraform had updated the listener rule correctly and the green environment was serving traffic.
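A quick cache-free check looks like this (the ALB hostname is a placeholder):

```
# curl does not cache responses, so each request reflects the live listener rule
curl -s http://my-alb-123456.us-east-1.elb.amazonaws.com/

# -i also shows the response headers if you want to confirm which
# target group answered
curl -si http://my-alb-123456.us-east-1.elb.amazonaws.com/
```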

That was a small but very real operational lesson:

  • always verify deployment changes with tools that are less likely to hide the truth behind cached responses

Rolling vs Blue/Green

The two patterns solve similar problems, but in different ways.

Rolling zero-downtime:

  • replaces the existing infrastructure safely
  • uses create_before_destroy
  • depends on overlap and health checks

Blue/green:

  • keeps two environments ready
  • switches traffic at the load balancer
  • makes the cutover more explicit and atomic

Both are valuable. Blue/green feels cleaner, but it also requires more duplicate infrastructure.

My Main Takeaway

Day 12 made one thing very clear: zero downtime is not just about writing correct Terraform syntax.

It depends on:

  • resource lifecycle order
  • naming strategy
  • healthy load balancer targets
  • enough temporary capacity
  • good testing discipline

Terraform gives you the tools, but understanding how those tools behave during replacement is what makes a production deployment safe.

Full Code

GitHub reference:
👉 GitHub Link

Terminal Transition

During the rolling test, the ALB responses transitioned like this:

```
Hello World v1
Hello World v1
Hello World v1
Hello World v2
Hello World v2
Hello World v2
```

And for blue/green:

```
Blue environment v1
Blue environment v1
Green environment v2
Green environment v2
```

Follow My Journey

This is Day 12 of my 30-Day Terraform Challenge.

See you on Day 13 🚀
