POTHURAJU JAYAKRISHNA YADAV for AWS Community Builders

Posted on Jun 6

How I Eliminated Lambda Deployment Downtime with Terraform and Blue-Green Deployments

#aws #terraform #lambda #devops

I Got Tired of Lambda Deployment Headaches, So I Built My Own Blue-Green Deployment System

We've all had that moment.

You push a new Lambda version to production, hit deploy, and then spend the next few minutes refreshing dashboards and logs hoping nothing breaks.

I've been there more times than I'd like to admit.

Sometimes deployments worked perfectly. Other times I saw unexpected 503 errors, traffic routing issues, or rollbacks that were slower than I wanted. The Lambda deployment itself usually wasn't the problem. The challenge was everything around it: traffic routing, validation, and rollback.

After hitting the same deployment issues multiple times, I decided to build a deployment model that matched how I actually wanted releases to work.

The result was a Terraform-driven blue-green deployment framework for AWS Lambda that gives me:

Zero-downtime deployments
IP-based canary testing
Weighted traffic shifting
Instant rollback
No CodeDeploy
No Route53 dependency
No hardcoded VPC IDs

Everything is controlled through Terraform.

The Problem I Wanted to Solve

AWS already provides deployment options for Lambda through aliases and CodeDeploy.

For many teams, that's enough.

But in my case, there were a few limitations that kept bothering me.

1. I Wanted to Test Production Before Users Did

When deploying a new version, I wanted a way to access the new release from my own machine while everyone else continued using the stable version.

With standard Lambda deployments, that wasn't straightforward.

2. Rollbacks Needed to Be Instant

If something goes wrong after deployment, I don't want to wait for another deployment process.

I want traffic to switch back immediately.

3. I Wanted Full Traffic Control

Sometimes I want:

10% rollout
25% rollout
50% rollout
75% rollout

Not just predefined deployment strategies.

4. I Wanted Fewer Moving Parts

CodeDeploy works, but it introduces additional services, configurations, hooks, and deployment definitions.

For this project, I wanted something simpler.

The Idea

Instead of thinking about Lambda deployments as version updates, I started thinking about them the same way ECS blue-green deployments work.

What if two Lambda environments always existed?

One would be stable.

One would contain the new release.

Traffic could simply move between them.

That led to this architecture:

Blue Lambda (current production)
Green Lambda (new version)
Blue Target Group
Green Target Group
ALB Listener Rules controlling traffic

The important detail is that both environments always exist.

Nothing is destroyed.

Nothing is recreated.

Traffic is simply redirected.

That's what eliminates deployment interruptions.
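To make the always-on layout concrete, here is a minimal sketch of how the two environments could be declared. It is an illustration under stated assumptions, not the repository's actual code: the resource names, the `images` and `lambda_role_arn` variables, and the `nginx-svc-` name prefix are all hypothetical.

```hcl
# Hypothetical inputs: one container image per color, plus an existing IAM role.
variable "images" {
  type        = map(string) # expected keys: "blue" and "green"
  description = "Container image URI for each environment"
}

variable "lambda_role_arn" {
  type        = string
  description = "Execution role shared by both functions (assumed to exist)"
}

# Both functions always exist; a release only changes an image URI.
resource "aws_lambda_function" "env" {
  for_each      = var.images
  function_name = "nginx-svc-${each.key}"
  package_type  = "Image"
  image_uri     = each.value
  role          = var.lambda_role_arn
}

# Lambda target groups take no port, protocol, or vpc_id.
resource "aws_lb_target_group" "env" {
  for_each    = var.images
  name        = "nginx-svc-${each.key}"
  target_type = "lambda"
}

# The ALB needs permission to invoke each function.
resource "aws_lambda_permission" "alb" {
  for_each      = var.images
  statement_id  = "AllowExecutionFromALB"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.env[each.key].function_name
  principal     = "elasticloadbalancing.amazonaws.com"
  source_arn    = aws_lb_target_group.env[each.key].arn
}

# Attach each function to its target group once the permission is in place.
resource "aws_lb_target_group_attachment" "env" {
  for_each         = var.images
  target_group_arn = aws_lb_target_group.env[each.key].arn
  target_id        = aws_lambda_function.env[each.key].arn
  depends_on       = [aws_lambda_permission.alb]
}
```

Because none of these resources is conditional, a release never adds or removes a target group or attachment. Only the listener rules that point at them change.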

Everything Is Controlled Through Terraform

The deployment workflow is driven through a simple Terraform configuration.

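nginx-lambda = {
  family          = "nginx-svc"

  # Container images for the two always-present environments
  blue_image      = ".../nginx-lambda:blue"   # current production
  green_image     = ".../nginx-lambda:green"  # new version

  active_color    = "blue"

  activate_canary = false  # true routes the whitelisted IP to Green
  green_weight    = 0      # percentage of user traffic sent to Green
}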

Whenever I want to change deployment behavior, I update the values and run:

terraform apply

That's it.

No deployment pipelines.

No manual traffic switching.

No complicated release process.
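A settings map like the one shown earlier in this section could be backed by a typed variable, so that a typo in a weight or color fails at plan time instead of during a release. This is a sketch under assumptions: the variable name `lambda_services` and both validation rules are hypothetical and are not taken from the repository.

```hcl
# Hypothetical variable holding one entry per service, e.g. "nginx-lambda".
variable "lambda_services" {
  type = map(object({
    family          = string
    blue_image      = string
    green_image     = string
    active_color    = string
    activate_canary = bool
    green_weight    = number
  }))

  # Reject weights outside the 0-100 percentage range used in this post.
  validation {
    condition = alltrue([
      for svc in values(var.lambda_services) :
      svc.green_weight >= 0 && svc.green_weight <= 100
    ])
    error_message = "The green_weight value must be between 0 and 100."
  }

  # Reject colors other than the two environments that exist.
  validation {
    condition = alltrue([
      for svc in values(var.lambda_services) :
      contains(["blue", "green"], svc.active_color)
    ])
    error_message = "The active_color value must be \"blue\" or \"green\"."
  }
}
```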

Phase 1: Canary Testing

The first thing I do after deploying a new version is test it myself.

I enable canary mode:

activate_canary = true
green_weight    = 0

This creates an ALB listener rule with a higher priority than the normal routing rule, so it is evaluated first.

If requests come from my whitelisted IP address:

My IP → Green Version

Everyone else still reaches:

Users → Blue Version

This allows me to validate:

Application functionality
API responses
Headers
Authentication flows
Database interactions

All in production.

Without exposing the new version to actual users.
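The canary rule described above could be expressed roughly as follows. Treat it as a standalone sketch rather than the repository's code: the variable names, the priority value, and the example CIDR (a documentation-range address) are assumptions made for illustration.

```hcl
variable "activate_canary" {
  type    = bool
  default = false
}

variable "canary_cidrs" {
  type        = list(string)
  description = "Hypothetical: source CIDRs sent to Green, e.g. [\"203.0.113.10/32\"]"
}

variable "listener_arn" { type = string }
variable "green_target_group_arn" { type = string }

# The rule only exists while canary mode is on.
resource "aws_lb_listener_rule" "canary" {
  count        = var.activate_canary ? 1 : 0
  listener_arn = var.listener_arn
  priority     = 10 # ALB evaluates rules from the lowest priority number upward

  action {
    type             = "forward"
    target_group_arn = var.green_target_group_arn
  }

  # Match on the client address as seen by the load balancer.
  condition {
    source_ip {
      values = var.canary_cidrs
    }
  }
}
```

One caveat with this kind of rule: `source_ip` matches the address that connects to the ALB, so if you sit behind a proxy or NAT gateway, that is the address you need to list.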

Phase 2: Gradual Traffic Shifting

Once canary testing looks good, I begin moving real traffic.

For example:

green_weight = 50

Now:

50% → Blue
50% → Green

The Application Load Balancer handles traffic distribution automatically.

Because both target groups already exist and already have Lambda functions attached, traffic shifting happens smoothly.

No downtime.

No target registration delays.

No deployment gaps.
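Weighted shifting like this maps onto the ALB's weighted forward action. The following standalone sketch shows one way to wire `green_weight` into a listener rule. The variable names, the priority, and the catch-all path condition are assumptions, and the same weights could just as well live on the listener's default action.

```hcl
variable "green_weight" {
  type    = number
  default = 0

  validation {
    condition     = var.green_weight >= 0 && var.green_weight <= 100
    error_message = "The green_weight value must be between 0 and 100."
  }
}

variable "listener_arn" { type = string }
variable "blue_target_group_arn" { type = string }
variable "green_target_group_arn" { type = string }

resource "aws_lb_listener_rule" "weighted" {
  listener_arn = var.listener_arn
  priority     = 100 # evaluated after any lower-numbered (canary) rule

  action {
    type = "forward"

    forward {
      # Blue receives whatever share Green does not.
      target_group {
        arn    = var.blue_target_group_arn
        weight = 100 - var.green_weight
      }

      target_group {
        arn    = var.green_target_group_arn
        weight = var.green_weight
      }
    }
  }

  # Catch-all condition so the rule applies to every request path.
  condition {
    path_pattern {
      values = ["/*"]
    }
  }
}
```

Both target groups are always referenced, even at a weight of 0, so changing `green_weight` is an in-place update of one rule rather than a create or destroy.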

Phase 3: Full Promotion

When confidence is high enough, I move all traffic to Green.

green_weight = 100

At this point:

100% Users → Green

But here's the part I like most.

Blue still exists.

It's sitting there untouched.

Ready for rollback.

Phase 4: Instant Rollback

Every deployment system claims to support rollback.

The real question is:

How fast?

With this setup, rollback is literally:

green_weight = 0

Apply Terraform.

As soon as the apply updates the listener rule, traffic returns to Blue.

No redeployment.

No waiting for Lambda versions.

No waiting for target group registration.

No waiting for containers to start.

The previous stable environment never disappeared.

What About the Next Release?

This was actually one of the hardest design problems.

Let's say:

Blue = v1
Green = v2

After promoting v2, what happens when v3 arrives?

The answer is surprisingly simple.

Rotate the images.

blue_image  = ".../nginx-lambda:green"
green_image = ".../nginx-lambda:yellow"

Now:

Blue  = v2 (stable)
Green = v3 (candidate)

The deployment cycle repeats.

Canary.

Weighted rollout.

Promotion.

Rollback if needed.

Forever.
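One detail worth spelling out is the order in which the rotation is applied, because right after a promotion `green_weight` is still 100. The sequence in the comments below is an assumption on my part, not something prescribed by the original configuration. It is simply one ordering that avoids sending users to v3 before it has been canary tested.

```hcl
# Assumed sequence (one terraform apply per step):
#   1. blue_image  -> ".../nginx-lambda:green"   (Blue becomes v2; users stay on Green)
#   2. green_weight -> 0                          (Blue, now v2, takes all traffic)
#   3. green_image -> ".../nginx-lambda:yellow"  (Green becomes the v3 candidate)
#
# Resulting state after step 3:
nginx-lambda = {
  family          = "nginx-svc"

  blue_image      = ".../nginx-lambda:green"  # v2 (stable)
  green_image     = ".../nginx-lambda:yellow" # v3 (candidate)

  active_color    = "blue"

  activate_canary = false
  green_weight    = 0
}
```

From here the next cycle starts the same way as the first one: set `activate_canary = true`, validate, then raise `green_weight`.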

Why I Stopped Seeing 503 Errors

Before building this, I experimented with several approaches.

Lambda Alias Switching

This occasionally introduced timing issues during updates.

Dynamic Target Groups

Using conditional resources sometimes caused Terraform to destroy resources before listener rules updated.

Permission Propagation Delays

Lambda invoke permissions occasionally weren't fully propagated before the target group attachments were created.

The common problem was simple:

Resources were being created and destroyed during deployment.

Eventually I realized the best deployment is often the one that changes the least.

So I stopped changing infrastructure during releases.

Instead:

Blue always exists
Green always exists
Target groups always exist
Attachments always exist

The only thing that changes is traffic routing.

That's why the process is so reliable.

GitHub Repository

https://github.com/jayakrishnayadav24/lambda-blue-green-deployment

Final Thoughts

The biggest lesson from building this wasn't about Lambda or Terraform.

It was about reducing moving parts.

Instead of making deployments smarter, I made them simpler.

Both environments stay alive.

Traffic moves gradually.

Rollback is immediate.

And I can personally validate a release before customers ever touch it.

After several production deployments using this pattern, it's become one of my favorite deployment strategies for Lambda workloads.

If you've been fighting deployment-related downtime or complicated release workflows, you might find this approach useful too.
