DEV Community

How I Eliminated Lambda Deployment Downtime with Terraform and Blue-Green Deployments

I Got Tired of Lambda Deployment Headaches, So I Built My Own Blue-Green Deployment System

We've all had that moment.

You push a new Lambda version to production, hit deploy, and then spend the next few minutes refreshing dashboards and logs hoping nothing breaks.

I've been there more times than I'd like to admit.

Sometimes deployments worked perfectly. Other times I would see unexpected 503 errors, traffic routing issues, or situations where rolling back wasn't as fast as I wanted. The Lambda deployment itself wasn't usually the problem. The challenge was everything around it—traffic routing, validation, and rollback.

After hitting the same deployment issues multiple times, I decided to build a deployment model that matched how I actually wanted releases to work.

The result was a Terraform-driven blue-green deployment framework for AWS Lambda that gives me:

  • Zero-downtime deployments
  • IP-based canary testing
  • Weighted traffic shifting
  • Instant rollback
  • No CodeDeploy
  • No Route53 dependency
  • No hardcoded VPC IDs

Everything is controlled through Terraform.


The Problem I Wanted to Solve

AWS already provides deployment options for Lambda through aliases and CodeDeploy.

For many teams, that's enough.

But in my case, there were a few limitations that kept bothering me.

1. I Wanted to Test Production Before Users Did

When deploying a new version, I wanted a way to access the new release from my own machine while everyone else continued using the stable version.

With standard Lambda deployments, that wasn't straightforward.

2. Rollbacks Needed To Be Instant

If something goes wrong after deployment, I don't want to wait for another deployment process.

I want traffic to switch back immediately.

3. I Wanted Full Traffic Control

Sometimes I want:

  • 10% rollout
  • 25% rollout
  • 50% rollout
  • 75% rollout

Not just predefined deployment strategies.

4. I Wanted Less Moving Parts

CodeDeploy works, but it introduces additional services, configurations, hooks, and deployment definitions.

For this project, I wanted something simpler.


The Idea

Instead of thinking about Lambda deployments as version updates, I started thinking about them the same way ECS blue-green deployments work.

What if two Lambda environments always existed?

One would be stable.

One would contain the new release.

Traffic could simply move between them.

That led to this architecture:

  • Blue Lambda (current production)
  • Green Lambda (new version)
  • Blue Target Group
  • Green Target Group
  • ALB Listener Rules controlling traffic

The important detail is that both environments always exist.

Nothing is destroyed.

Nothing is recreated.

Traffic is simply redirected.

That's what eliminates deployment interruptions.

Architecture Diagram


Everything Is Controlled Through Terraform

The deployment workflow is driven through a simple Terraform configuration.

nginx-lambda = {
  family          = "nginx-svc"

  blue_image      = ".../nginx-lambda:blue"
  green_image     = ".../nginx-lambda:green"

  active_color    = "blue"

  activate_canary = false
  green_weight    = 0
}
Enter fullscreen mode Exit fullscreen mode

Whenever I want to change deployment behavior, I update the values and run:

terraform apply
Enter fullscreen mode Exit fullscreen mode

That's it.

No deployment pipelines.

No manual traffic switching.

No complicated release process.


Phase 1: Canary Testing

The first thing I do after deploying a new version is test it myself.

I enable canary mode:

activate_canary = true
green_weight    = 0
Enter fullscreen mode Exit fullscreen mode

Canary Configuration

This creates a higher-priority ALB rule.

If requests come from my whitelisted IP address:

My IP → Green Version
Enter fullscreen mode Exit fullscreen mode

Everyone else still reaches:

Users → Blue Version
Enter fullscreen mode Exit fullscreen mode

Canary Test Result

This allows me to validate:

  • Application functionality
  • API responses
  • Headers
  • Authentication flows
  • Database interactions

All in production.

Without exposing the new version to actual users.


Phase 2: Gradual Traffic Shifting

Once canary testing looks good, I begin moving real traffic.

For example:

green_weight = 50
Enter fullscreen mode Exit fullscreen mode

Weighted Rollout Configuration

Now:

50% → Blue
50% → Green
Enter fullscreen mode Exit fullscreen mode

50-50 Traffic Distribution

The Application Load Balancer handles traffic distribution automatically.

Because both target groups already exist and already have Lambda functions attached, traffic shifting happens smoothly.

No downtime.

No target registration delays.

No deployment gaps.


Phase 3: Full Promotion

When confidence is high enough, I move all traffic to Green.

green_weight = 100
Enter fullscreen mode Exit fullscreen mode

Full Promotion Configuration

At this point:

100% Users → Green
Enter fullscreen mode Exit fullscreen mode

Full Promotion Result

But here's the part I like most.

Blue still exists.

It's sitting there untouched.

Ready for rollback.


Phase 4: Instant Rollback

Every deployment system claims to support rollback.

The real question is:

How fast?

With this setup, rollback is literally:

green_weight = 0
Enter fullscreen mode Exit fullscreen mode

Rollback Configuration

Apply Terraform.

Traffic immediately returns to Blue.

Instant Rollback

No redeployment.

No waiting for Lambda versions.

No waiting for target group registration.

No waiting for containers to start.

The previous stable environment never disappeared.


What About The Next Release?

This was actually one of the hardest design problems.

Let's say:

  • Blue = v1
  • Green = v2

After promoting v2, what happens when v3 arrives?

The answer is surprisingly simple.

Rotate the images.

blue_image  = ".../nginx-lambda:green"
green_image = ".../nginx-lambda:yellow"
Enter fullscreen mode Exit fullscreen mode

v3 Deployment Configuration

Now:

Blue  = v2 (stable)
Green = v3 (candidate)
Enter fullscreen mode Exit fullscreen mode

v3 Canary Testing

The deployment cycle repeats.

Canary.

Weighted rollout.

Promotion.

Rollback if needed.

Forever.


Why I Stopped Seeing 503 Errors

Before building this, I experimented with several approaches.

Lambda Alias Switching

This occasionally introduced timing issues during updates.

Dynamic Target Groups

Using conditional resources sometimes caused Terraform to destroy resources before listener rules updated.

Permission Propagation Delays

Lambda permissions occasionally weren't fully propagated before attachments occurred.

The common problem was simple:

Resources were being created and destroyed during deployment.

Eventually I realized the best deployment is often the one that changes the least.

So I stopped changing infrastructure during releases.

Instead:

  • Blue always exists
  • Green always exists
  • Target groups always exist
  • Attachments always exist

The only thing that changes is traffic routing.

That's why the process is so reliable.


GitHub Repository

https://github.com/jayakrishnayadav24/lambda-blue-green-deployment


Final Thoughts

The biggest lesson from building this wasn't about Lambda or Terraform.

It was about reducing moving parts.

Instead of making deployments smarter, I made them simpler.

Both environments stay alive.

Traffic moves gradually.

Rollback is immediate.

And I can personally validate a release before customers ever touch it.

After several production deployments using this pattern, it's become one of my favorite deployment strategies for Lambda workloads.

If you've been fighting deployment-related downtime or complicated release workflows, you might find this approach useful too.

Top comments (0)