Stephanie Makori

Posted on Mar 30

Mastering Zero-Downtime Deployments with Terraform

#devops #aws #devchallenge #terraform

Day 12 of my 30-Day Terraform Challenge was all about one of the most practical and important DevOps concepts: deploying updates without taking an application offline.

Today I learned how Terraform handles infrastructure replacement, why its default behavior can cause downtime, and how to solve that using:

create_before_destroy
Auto Scaling Groups
Application Load Balancers
a full blue/green deployment strategy

This was one of the most exciting challenge days so far because it felt much closer to how production systems are actually managed.

Why Default Terraform Can Cause Downtime

Terraform is great at managing infrastructure, but not every resource can be updated in place.

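Some resources, especially those tied to EC2 instance configuration, require replacement instead of modification. Launch Configurations are the classic example: they are immutable, so any change forces Terraform to create a new one. (Launch Templates, by contrast, are versioned and are usually updated in place.)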

By default, when a resource has to be replaced, Terraform follows this sequence:

Destroy the old resource
Create the new resource

That is a problem for live infrastructure.

If an Auto Scaling Group is destroyed before the replacement is ready, then:

the old EC2 instances are terminated
the application stops serving traffic
users experience downtime
only after that does the new infrastructure come online

That gap is the outage window.

In real systems, even a short outage like that can be unacceptable.

The `create_before_destroy` Solution

Terraform provides a lifecycle setting called create_before_destroy to fix this.

Instead of deleting the old resource first, Terraform reverses the order:

Create the new resource
Wait for creation to finish (for an Auto Scaling Group, this can include waiting for instances to pass load balancer health checks)
Destroy the old resource

That simple change makes a huge difference.

It allows new infrastructure to come online before the old infrastructure disappears, which is exactly what you want during updates.

This was the key pattern behind today’s zero-downtime deployment.
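
As a minimal sketch of the lifecycle setting described above, here is a Launch Configuration that is replaced in create-then-destroy order. The resource name, the variables, and the user data script are placeholders for illustration, and the script assumes the AMI ships with python3.

```hcl
# Hypothetical inputs; names and defaults are placeholders.
variable "ami_id" {
  type        = string
  description = "AMI to launch (assumed to include python3)"
}

variable "server_text" {
  type        = string
  default     = "Hello from v1"
  description = "Text the demo web server returns"
}

resource "aws_launch_configuration" "app" {
  # A prefix rather than a fixed name, so old and new can coexist.
  name_prefix   = "app-"
  image_id      = var.ami_id
  instance_type = "t3.micro"

  # Changing server_text changes user_data, which forces a replacement.
  user_data = <<-EOF
    #!/bin/bash
    echo "${var.server_text}" > index.html
    nohup python3 -m http.server 8080 &
  EOF

  lifecycle {
    # Create the replacement first, then destroy the old resource.
    create_before_destroy = true
  }
}
```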

The Auto Scaling Group Naming Problem

There is an important catch with create_before_destroy.

When Terraform creates the replacement resource before destroying the original one, both resources must exist at the same time.

That becomes a problem with Auto Scaling Groups because AWS does not allow two ASGs with the same name to exist simultaneously in the same account and region.

So if your ASG name is fixed, the deployment fails.

The Fix

The correct solution is to avoid hardcoding the name and instead use something like:

name_prefix
or a generated unique suffix

I used name_prefix so Terraform could generate a unique name for each replacement Auto Scaling Group during deployment.

That solved the conflict and made zero-downtime replacement possible.
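
A sketch of how the fix can look on the Auto Scaling Group itself. The variable names are hypothetical, and deriving the prefix from the Launch Configuration name is one common approach (an assumption here, not necessarily the only way) to make the ASG get replaced whenever the Launch Configuration is.

```hcl
# Hypothetical inputs; in a real module these would usually be
# references to other resources rather than variables.
variable "launch_configuration_name" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

variable "target_group_arn" {
  type = string
}

resource "aws_autoscaling_group" "app" {
  # A prefix instead of a fixed name: a unique suffix is appended,
  # so the old and new groups can exist side by side.
  name_prefix = "${var.launch_configuration_name}-"

  launch_configuration = var.launch_configuration_name
  vpc_zone_identifier  = var.subnet_ids
  target_group_arns    = [var.target_group_arn]
  health_check_type    = "ELB"

  min_size = 2
  max_size = 4

  # On creation, wait until this many instances are healthy in the
  # load balancer before treating the new group as ready.
  min_elb_capacity = 2

  lifecycle {
    create_before_destroy = true
  }
}
```

With min_elb_capacity set, the old group is only destroyed after the new one has healthy instances registered, which is what keeps traffic flowing during the swap.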

Deploying Version 1

For the first deployment, I launched my infrastructure with a simple application response showing version 1.

Once the Auto Scaling Group instances passed the load balancer health checks, I verified the application in the browser.

This gave me my initial “live” environment.

At that point, traffic was being served successfully by the original deployment.

Verifying Zero Downtime with a Traffic Loop

To actually prove that my deployment was zero-downtime, I opened a second terminal and ran a continuous traffic check against the ALB.

The goal was simple:

keep sending requests during deployment
observe whether traffic ever failed
confirm that the application stayed available the whole time

This is a very practical way to test infrastructure changes because it simulates a real user continuously accessing the app while updates are happening.

Deploying Version 2

Next, I updated the application response from v1 to v2.

This forced Terraform to replace the underlying instance configuration.

Normally, this kind of change could cause downtime.

But because I had configured Terraform with create_before_destroy, the update behaved differently:

the replacement infrastructure was created first
the new instances came online
health checks passed
traffic remained available
then the old resources were removed

What I Observed

The traffic loop kept returning responses during the entire deployment.

At one point, the output changed from the v1 response to the v2 response.

The important part is that there were:

no connection errors
no timeouts
no interruption in service

That was my clearest proof that the deployment was truly zero-downtime.

This was honestly one of the most satisfying Terraform experiments I’ve done so far.

Taking It Further with Blue/Green Deployment

After proving a zero-downtime replacement, I moved on to a more advanced deployment strategy:

Blue/Green Deployment

Blue/green deployment keeps two complete environments running at the same time:

Blue = current live environment
Green = new environment

Instead of replacing the live environment directly, you deploy the new version separately and then shift traffic when it is ready.

This pattern is extremely powerful because it makes rollbacks much easier and reduces deployment risk.

How the Traffic Switch Works

I implemented blue/green using:

two target groups
two Auto Scaling Groups
one ALB listener rule

A variable called active_environment controlled which target group received traffic.

That meant I could switch between:

blue
green

just by changing one Terraform variable and applying again.
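
The switch described above could be sketched roughly like this. Only active_environment comes from my setup as described; the resource names, ports, priority, and the vpc_id and listener_arn variables are illustrative assumptions.

```hcl
variable "active_environment" {
  type        = string
  default     = "blue"
  description = "Which environment receives live traffic"

  validation {
    condition     = contains(["blue", "green"], var.active_environment)
    error_message = "The active_environment value must be \"blue\" or \"green\"."
  }
}

# Hypothetical inputs for an existing VPC and ALB listener.
variable "vpc_id" {
  type = string
}

variable "listener_arn" {
  type = string
}

resource "aws_lb_target_group" "blue" {
  name     = "app-blue"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id
}

resource "aws_lb_target_group" "green" {
  name     = "app-green"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id
}

locals {
  # Pick the target group that should receive traffic.
  active_target_group_arn = (
    var.active_environment == "blue"
    ? aws_lb_target_group.blue.arn
    : aws_lb_target_group.green.arn
  )
}

resource "aws_lb_listener_rule" "app" {
  listener_arn = var.listener_arn
  priority     = 100

  action {
    type             = "forward"
    target_group_arn = local.active_target_group_arn
  }

  condition {
    path_pattern {
      values = ["/*"]
    }
  }
}
```

Under these assumptions, cutting over is an in-place update of the listener rule, for example by running terraform apply -var="active_environment=green". Neither Auto Scaling Group is touched by the switch.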

Why This Is So Effective

The traffic switch happens at the load balancer level.

That means Terraform does not need to tear down one environment before enabling the other.

Instead, the ALB updates where traffic is routed.

This makes the cutover:

fast
clean
effectively instantaneous

That is what makes blue/green deployment such a strong production deployment pattern.

Testing the Blue/Green Switch

After both environments were deployed, I tested switching traffic from:

blue → green
then green → blue

Each time, the Terraform apply completed successfully and traffic moved to the selected environment.

There was no observable interruption while switching.

This was a really useful demonstration because it showed how infrastructure can support safe application rollouts and fast rollback paths.

That is exactly the kind of deployment flexibility teams want in production systems.

Limitations of `create_before_destroy`

Even though create_before_destroy is powerful, it has tradeoffs.

Some limitations include:

You temporarily run duplicate infrastructure, which increases cost
Some AWS resources still have naming or uniqueness constraints
It does not provide fine-grained traffic control
It is still tied to replacement behavior rather than explicit traffic switching

This is where blue/green becomes more powerful.

Tradeoffs of Blue/Green Deployment

Blue/green solves several of the limitations of create_before_destroy, but it introduces its own tradeoffs.

Advantages:

safer rollouts
fast rollback
cleaner traffic switching
better deployment isolation

Tradeoffs:

higher cost because two environments run at once
more infrastructure complexity
more moving parts to manage

So while blue/green is more powerful, it also requires more operational discipline.

What I Learned from Day 12

Today taught me that infrastructure updates are not just about changing code — they are about changing systems safely.

The biggest lessons for me were:

why Terraform’s default replacement behavior can be dangerous
how create_before_destroy avoids downtime
why naming strategy matters in AWS
how blue/green deployment gives even more control
how to verify deployment safety using real traffic

This was one of the most production-relevant challenge days so far.

Final Thoughts

Day 12 made Terraform feel much closer to real-world DevOps work.

It moved beyond simply provisioning infrastructure and into something more valuable:

deploying live updates safely and predictably

That is one of the most important skills in cloud and infrastructure engineering.

And honestly, seeing traffic switch from v1 → v2 and blue → green without downtime was very satisfying.

Connect With Me

I’m documenting my journey through the 30-Day Terraform Challenge as I continue learning more about:

Terraform
AWS
Infrastructure as Code
DevOps best practices

If you are also learning cloud or IaC, feel free to connect.

#Terraform #IaC #AWS #DevOps #ZeroDowntime #CloudComputing #30DayTerraformChallenge #TerraformChallenge #InfrastructureAsCode

Top comments (2)

Mads Hansen • Mar 31

Blue/green with Terraform is solid but state management during cutover is where most teams hit friction. The pattern that’s saved us: use a separate state file per environment (blue.tfstate, green.tfstate) with a pointer resource that’s the only thing that changes during swap. Avoids the situation where a failed deployment leaves state pointing at the wrong target.
