Mukami

Posted on Mar 28

Mastering Zero-Downtime Deployments with Terraform

#terraform #aws #beginners

How I Updated Production Without Taking My App Offline (Almost)

Day 12 of the 30-Day Terraform Challenge — and today I learned the difference between "it works" and "it works without anyone noticing."

Deploying infrastructure updates without downtime is one of the hardest problems in operations. It's also one of the most valuable skills you can have. Today I made it happen.

Well, almost. Let me show you what worked, what broke, and what I learned.

The Problem: Terraform's Default Behavior is Destructive

When a change forces Terraform to replace a resource (for example, an Auto Scaling Group whose name changes), it does something terrifying by default:

1. Destroy old ASG → All instances terminate → Website goes down ❌
2. Create new ASG → New instances spin up → Website comes back

Between step 1 and step 2, there's a window of silence. Sometimes seconds. Sometimes minutes. In production, that's a disaster.

The fix? A little lifecycle rule called create_before_destroy.

The Solution: create_before_destroy

This simple lifecycle block flips the order:

lifecycle {
  create_before_destroy = true
}

Instead of destroy → create, it does:

1. Create new resources
2. Wait for them to come up (for an ASG, the AWS provider waits for it to reach capacity)
3. Destroy old resources

No gap. No downtime.

The ASG Naming Problem

There's a catch. AWS won't let two Auto Scaling Groups in the same account and region share a name. Because create_before_destroy creates the new ASG while the old one still exists, a fixed name would collide and the apply would fail.

The fix: Use name_prefix instead of a fixed name:

name_prefix = "web-asg-${random_id.asg.hex}-"

Add a random_id that changes when your code changes:

resource "random_id" "asg" {
  keepers = {
    # New ID when user_data changes
    user_data = base64encode(file("${path.module}/user-data.sh"))
  }
  byte_length = 4
}

Now every change to user-data.sh produces a new ID, a new name prefix, and therefore a replacement ASG with a unique name. Problem solved.
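A possible variant of the same idea, sketched here as an assumption rather than what I ran: hash the script with filesha256 so the keeper stays short, and expose the generated prefix as an output so you can watch it change between applies. The keeper key user_data_hash and the output name are made up for illustration.

```hcl
# Variant of random_id.asg: same trigger, but the keeper stores a hash
# of the script instead of the whole base64-encoded file.
resource "random_id" "asg" {
  keepers = {
    # Hypothetical key name; any change to user-data.sh changes this hash
    user_data_hash = filesha256("${path.module}/user-data.sh")
  }
  byte_length = 4
}

# Hypothetical output: prints the prefix so each deployment's name is visible
output "asg_name_prefix" {
  value = "web-asg-${random_id.asg.hex}-"
}
```

Run terraform apply before and after editing user-data.sh and the output value should differ, which is the signal that the ASG will be replaced.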

The Test: Updating from v1 to v2

I set up a traffic monitor in one terminal:

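while true; do
  # -s hides curl's progress meter; the timestamp shows when each response arrived
  echo "$(curl -s http://my-alb-dns-name) - $(date +%T)"
  sleep 2
done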

In another terminal, I updated my user_data script and ran terraform apply.

Here's what happened:

<h1>Version 1: Hello World!</h1> - 14:34:04
<h1>Version 1: Hello World!</h1> - 14:34:06
<h1>Version 1: Hello World!</h1> - 14:34:08
...
502 Bad Gateway - 14:35:08    # Uh oh!
502 Bad Gateway - 14:35:10
502 Bad Gateway - 14:35:12
...
<h1>Version 2: Updated! Zero Downtime Works!</h1> - 14:35:25
<h1>Version 2: Updated! Zero Downtime Works!</h1> - 14:35:27

What happened? I got 17 seconds of 502 errors. Not quite zero downtime.

Why Did I Get Errors?

The new instances took longer to become healthy than the old instances took to terminate. The ALB had no healthy targets for a brief window.

The fix: Increase health_check_grace_period and add wait_for_capacity_timeout:

resource "aws_autoscaling_group" "web" {
  health_check_grace_period = 300   # 5 minutes to boot
  wait_for_capacity_timeout = "10m" # Wait for healthy instances

  lifecycle {
    create_before_destroy = true
  }
}

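With these settings, Terraform waits up to 10 minutes for the new ASG to reach its capacity before destroying the old one, and the grace period stops the ASG from replacing instances that are still booting. One caveat: for Terraform to wait on the ALB's health checks specifically, the ASG also needs its target group attached and min_elb_capacity (or wait_for_elb_capacity) set. Without that, "healthy" only means the EC2 instances are running, and you can still see 502s while the app finishes starting.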
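Here is a minimal sketch of an ASG that makes Terraform wait for load balancer health, not just for running instances. It assumes a launch template named aws_launch_template.web, a target group named aws_lb_target_group.web, and a var.subnet_ids variable, none of which are shown in this post. The sizes are illustrative.

```hcl
resource "aws_autoscaling_group" "web" {
  name_prefix         = "web-asg-${random_id.asg.hex}-"
  min_size            = 2                 # illustrative size
  max_size            = 4                 # illustrative size
  vpc_zone_identifier = var.subnet_ids    # hypothetical variable

  launch_template {
    id      = aws_launch_template.web.id  # assumed launch template
    version = aws_launch_template.web.latest_version
  }

  # Register instances with the ALB target group (assumed name) and
  # let the ASG use the load balancer's health checks.
  target_group_arns         = [aws_lb_target_group.web.arn]
  health_check_type         = "ELB"
  health_check_grace_period = 300

  # On creation, wait until this many instances are healthy in the
  # load balancer, for at most wait_for_capacity_timeout.
  min_elb_capacity          = 2
  wait_for_capacity_timeout = "10m"

  lifecycle {
    create_before_destroy = true
  }
}
```

Because create_before_destroy makes every deployment a fresh creation, the min_elb_capacity wait applies each time, and the old ASG is only destroyed after the new instances pass the target group's health checks.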

The Blue/Green Alternative

create_before_destroy is great, but it has limitations. For mission-critical apps, teams use blue/green deployments:

Blue = current live environment
Green = new version, ready to go

Switch traffic instantly with a listener rule:

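resource "aws_lb_listener_rule" "blue_green" {
  # listener_arn and the condition block are required but omitted here for brevity

  action {
    type = "forward"
    # Keep the conditional on one line: HCL does not allow a bare line break after "?"
    target_group_arn = var.active_environment == "blue" ? aws_lb_target_group.blue.arn : aws_lb_target_group.green.arn
  }
}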

Change active_environment = "green", run terraform apply, and traffic switches in a single API call. Zero downtime. Instant rollback.
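To make the switch concrete, here is a fuller sketch of the same rule with the required pieces filled in. The listener name aws_lb_listener.http, the priority, and the catch-all path pattern are assumptions for illustration. The two target groups are assumed to exist already, and I have not run this variant end to end.

```hcl
variable "active_environment" {
  type    = string
  default = "blue"

  # Reject typos before they reach the load balancer
  validation {
    condition     = contains(["blue", "green"], var.active_environment)
    error_message = "The active_environment value must be \"blue\" or \"green\"."
  }
}

resource "aws_lb_listener_rule" "blue_green" {
  listener_arn = aws_lb_listener.http.arn # hypothetical listener
  priority     = 100                      # illustrative priority

  action {
    type             = "forward"
    target_group_arn = var.active_environment == "blue" ? aws_lb_target_group.blue.arn : aws_lb_target_group.green.arn
  }

  # Match every path so all traffic follows the active environment
  condition {
    path_pattern {
      values = ["/*"]
    }
  }
}
```

Switching is then terraform apply -var="active_environment=green", and rolling back is the same command with blue. Both environments must stay running for the rollback to be instant.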

What I Learned

The default is dangerous. Never assume Terraform will keep your app online. Always test.

create_before_destroy is your friend. But you need to handle ASG naming with name_prefix and random_id.

Health checks matter. Give your instances time to boot (health_check_grace_period = 300) and make Terraform wait for them before it tears down the old ones.

Blue/green is cleaner. More setup, but atomic switches and instant rollbacks.

502s taught me more than success would have. I saw exactly where my setup failed and how to fix it.

The Code That Worked

# Unique ASG name
resource "random_id" "asg" {
  keepers = { user_data = base64encode(file("${path.module}/user-data.sh")) }
  byte_length = 4
}

resource "aws_autoscaling_group" "web" {
  name_prefix = "web-asg-${random_id.asg.hex}-"
  health_check_grace_period = 300
  wait_for_capacity_timeout = "10m"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_launch_template" "web" {
  lifecycle {
    create_before_destroy = true
  }
}

The Bottom Line

Zero-downtime deployments aren't magic. They're a combination of:

The right lifecycle rules (create_before_destroy)
Smart naming (name_prefix + random_id)
Patience (health_check_grace_period, wait_for_capacity_timeout)
The right strategy (blue/green for critical apps)

Today I learned that "it works" isn't enough. It has to work when nobody's watching. And now I know how to make that happen.

P.S. Those 502 errors were humbling. But they taught me more than a perfect run ever would. Sometimes failure is the best teacher.

DEV Community