Your Deployment Just Took Down Production. Again. Here's How to Never Let That Happen.
It was a Thursday afternoon. The kind where you're mentally halfway out the door, maybe already thinking about the weekend.
Then Slack lights up.
"Hey… the app is down."
You deployed thirty minutes ago. A "small" hotfix. Two lines. "It'll be fine."
If you've been in production engineering long enough, you've lived this story. If you haven't yet — you will. The question isn't whether a bad deployment will happen. The question is: when it does, how fast can you recover?
That's the problem Blue-Green deployment solves. And today I'm walking you through exactly how to implement it on AWS using Elastic Beanstalk and Terraform — zero downtime, instant rollbacks, infrastructure-as-code from day one.
What Blue-Green Actually Means (And Why Most Explanations Miss the Point)
Most articles define it like this:
"You have two identical environments. Blue is live. Green is staging. You swap them."
Technically correct. Completely useless without context.
Here's the mental model that actually sticks:
Imagine your production environment is a patient on an operating table — heart beating, users connected, traffic flowing. Every deployment you push to that live environment is open-heart surgery while the heart is still running. One wrong cut and the patient flatlines. 3am pages. Slack on fire. The works.
Blue-Green says: stop operating on the live patient.
Instead, spin up an identical second patient — your green environment. Do all your surgery there. Test it. Benchmark it. Validate every edge case. When you're 100% confident, flip a switch. One DNS record change. Traffic moves from blue to green. The old patient sits warm and healthy as your fallback.
Something goes wrong in production with the new version? Flip the switch back. Your previous version was never touched.
That's the real power — not just "two environments." It's the ability to deploy with confidence because your escape hatch is always one click away.
The Architecture: What We're Building
Two fully independent Elastic Beanstalk environments, each with its own ALB, Auto Scaling group, health monitoring, and application version stored in S3:
┌──────────────────────────────────────────────────────────────┐
│ Elastic Beanstalk Application │
├──────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ Blue Environment │ │ Green Environment │ │
│ │ (Live Production) │ │ (Staging / Next) │ │
│ │ Version 1.0 │ │ Version 2.0 │ │
│ │ ALB + Auto Scaling │ │ ALB + Auto Scaling │ │
│ │ Health Checks │ │ Health Checks │ │
│ └──────────────────────┘ └──────────────────────┘ │
│ │ │ │
│ └─────────────┬─────────────┘ │
│ ▼ │
│ CNAME Swap ← this is the magic │
└──────────────────────────────────────────────────────────────┘
The "swap" is literally swapping two DNS CNAME records. Elastic Beanstalk handles this natively — one API call, no custom load balancer gymnastics needed.
The Terraform Setup
If it's not in code, it doesn't exist. Let's walk through the meaningful parts.
IAM: The Foundation Nobody Talks About
Before a single instance spins up, Beanstalk needs two distinct IAM roles — and confusing them is the #1 reason I see environments fail to provision.
# Role for EC2 instances (so Beanstalk can manage them)
resource "aws_iam_role" "eb_ec2_role" {
  name = "${var.app_name}-eb-ec2-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}
# Attach the managed policies Beanstalk needs. WebTier is shown here;
# AWSElasticBeanstalkWorkerTier and AWSElasticBeanstalkMulticontainerDocker
# follow the exact same pattern.
resource "aws_iam_role_policy_attachment" "eb_web_tier" {
  role       = aws_iam_role.eb_ec2_role.name
  policy_arn = "arn:aws:iam::aws:policy/AWSElasticBeanstalkWebTier"
}
The eb_ec2_role lets instances do their job. The service role (separate) lets Beanstalk itself make AWS API calls on your behalf — health reporting, managed updates, scaling events. Both are required. Most tutorials only mention one.
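To make the distinction concrete, here's a minimal sketch of that second role, plus the instance profile that actually hands the EC2 role to instances. Resource names here are illustrative, and the managed policies you attach will depend on the features you enable (enhanced health, managed updates):

```hcl
# Sketch (assumed names): the service role that Beanstalk itself assumes
resource "aws_iam_role" "eb_service_role" {
  name = "${var.app_name}-eb-service-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "elasticbeanstalk.amazonaws.com" }
    }]
  })
}

# Enhanced health reporting requires this managed policy on the service role
resource "aws_iam_role_policy_attachment" "eb_enhanced_health" {
  role       = aws_iam_role.eb_service_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSElasticBeanstalkEnhancedHealth"
}

# EC2 instances pick up eb_ec2_role through an instance profile,
# not directly; this is the piece that's easy to forget
resource "aws_iam_instance_profile" "eb_ec2_profile" {
  name = "${var.app_name}-eb-ec2-profile"
  role = aws_iam_role.eb_ec2_role.name
}
```

The instance profile name (not the role name) is what you'd pass to the environment via the aws:autoscaling:launchconfiguration / IamInstanceProfile setting.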
S3: Your App's Artifact Store
resource "aws_s3_bucket" "app_versions" {
  # Account ID in the name = globally unique without hardcoding
  bucket = "${var.app_name}-versions-${data.aws_caller_identity.current.account_id}"
}

resource "aws_s3_bucket_public_access_block" "app_versions" {
  bucket = aws_s3_bucket.app_versions.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
Your deployment artifacts are not public content. Lock the bucket down from day one. The aws_caller_identity data source ensures the bucket name is account-scoped — no manual uniqueness wrangling.
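With the bucket in place, an application version is just a named pointer to a zipped bundle in S3. Here's a sketch of that wiring; the object key and local artifact path are illustrative assumptions:

```hcl
# Upload the bundle (key and source path are illustrative)
resource "aws_s3_object" "v1_bundle" {
  bucket = aws_s3_bucket.app_versions.id
  key    = "releases/app-v1.zip"
  source = "build/app-v1.zip"
}

# Register it as a Beanstalk application version that environments can reference
resource "aws_elastic_beanstalk_application_version" "v1" {
  name        = "${var.app_name}-v1"
  application = aws_elastic_beanstalk_application.app.name
  bucket      = aws_s3_bucket.app_versions.id
  key         = aws_s3_object.v1_bundle.key
}
```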
The Blue Environment (Production)
resource "aws_elastic_beanstalk_environment" "blue" {
  name          = "${var.app_name}-blue"
  application   = aws_elastic_beanstalk_application.app.name
  version_label = aws_elastic_beanstalk_application_version.v1.name
  tier          = "WebServer"

  # Required: pin the platform. Pick a current stack from
  # `aws elasticbeanstalk list-available-solution-stacks`
  solution_stack_name = var.solution_stack_name

  # Rolling deploys: only redeploy 50% of instances at a time
  setting {
    namespace = "aws:elasticbeanstalk:command"
    name      = "DeploymentPolicy"
    value     = "Rolling"
  }
  setting {
    namespace = "aws:elasticbeanstalk:command"
    name      = "BatchSizeType"
    value     = "Percentage"
  }
  setting {
    namespace = "aws:elasticbeanstalk:command"
    name      = "BatchSize"
    value     = "50"
  }

  # Never fly blind
  setting {
    namespace = "aws:elasticbeanstalk:healthreporting:system"
    name      = "SystemType"
    value     = "enhanced"
  }

  tags = merge(var.tags, {
    Environment = "blue"
    Role        = "production"
  })
}
The green environment is structurally identical — same ALB, same scaling config, same health checks — with one difference: it points to v2 of the application. That's the whole point. Production parity is not optional.
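As a sketch of what that looks like (assuming a v2 application version resource defined the same way as v1), green differs from blue only in name, version label, and tags:

```hcl
resource "aws_elastic_beanstalk_environment" "green" {
  name          = "${var.app_name}-green"
  application   = aws_elastic_beanstalk_application.app.name
  version_label = aws_elastic_beanstalk_application_version.v2.name # assumed v2 resource
  tier          = "WebServer"

  # ...same solution stack, DeploymentPolicy, BatchSize, and
  # enhanced health settings as blue...

  tags = merge(var.tags, {
    Environment = "green"
    Role        = "staging"
  })
}
```

In a real codebase you'd likely factor the shared settings into a module or a local so the two environments can't drift apart silently.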
The Swap: One Command, Zero Downtime
You've validated green. Smoke tests pass. Load tests pass. You've slept on it. Time to ship.
AWS CLI:
aws elasticbeanstalk swap-environment-cnames \
  --source-environment-name my-app-blue \
  --destination-environment-name my-app-green \
  --region us-east-1
Console: Elastic Beanstalk → App → Blue Environment → Actions → Swap Environment URLs → Select Green → Swap.
Beanstalk modifies the Route 53 configuration. Within 60-90 seconds, traffic that was hitting your blue URL is now served by your green environment. The environment names stay the same. The URLs swap. Users experience nothing — no error pages, no dropped connections, no 502s.
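One small quality-of-life addition, assuming both environments are managed in Terraform (the green resource name here is hypothetical if you've only defined blue): output each environment's CNAME so you can curl both sides before and after the swap to confirm which version answers where.

```hcl
# Expose each side's environment CNAME for pre/post-swap smoke checks
output "blue_cname" {
  value = aws_elastic_beanstalk_environment.blue.cname
}

output "green_cname" {
  value = aws_elastic_beanstalk_environment.green.cname
}
```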
The Rollback That Doesn't Require a Hero
Here's what makes this strategy genuinely production-grade: your rollback is identical to your deployment.
Green is now production. Something's wrong — a memory leak that only appears under real user load, a third-party integration that behaves differently, anything. Run the swap again. Your v1 environment is still running, still healthy, still warm. That's a 30-second rollback with zero redeployment.
No terraform apply. No container rebuild. Just a DNS flip.
When to Use Blue-Green (And When Not To)
Blue-Green is not a universal answer. Part of being a senior engineer is knowing which tool fits the job.
Reach for Blue-Green when:
- Zero downtime is a hard requirement
- You need instant rollback capability (regulated industries, payment systems, healthcare)
- Your app is stateful or tightly coupled to a DB schema — gradual rollouts get complicated fast
Consider Canary Deployments when:
- You want to validate with 5-10% of real traffic before full rollout
- You're doing ML model deployments or high-risk feature releases
- You have enough traffic volume to get statistically meaningful signal from a subset
Consider Rolling when:
- Cost is a hard constraint — Blue-Green effectively doubles your infrastructure spend during deployment windows
- Your background jobs make "two live versions simultaneously" operationally complex
The Cleanup Reminder (Seriously, Don't Skip This)
Two full Elastic Beanstalk environments with load balancers run roughly $50-100/month. For a learning exercise, spin it up, validate the swap, tear it down.
terraform destroy
The Terraform code is your infrastructure. You can recreate the whole thing in under 20 minutes. That's the point of infrastructure-as-code — your environment is disposable. Your knowledge of it isn't.
The Real Takeaway
Blue-Green deployments aren't about Elastic Beanstalk. Or ECS. Or Kubernetes. The platform changes. The principle doesn't.
The real takeaway is this: production deployments should be boring.
The most dangerous deployment is the one that "should be fine." The hotfix at 4pm on a Thursday. The one-liner that touches the payments table. The change a developer calls "trivial."
Boring means predictable. Boring means you have a plan when things go wrong — not if. Boring means your on-call engineer isn't doing open-heart surgery on a live patient at 2am.
Blue-Green gives you boring deployments. In production, boring is the highest compliment you can receive.
Full Terraform source code in the repo linked below. Questions? Drop them in the comments — I read everything.
Top comments (1)
Solid guide. One thing I'd flag though - after you run swap-environment-cnames outside of Terraform (CLI or console), your state has no idea the swap happened. So the next terraform apply might try to "fix" the version labels back to what's in your .tf files, which can silently undo the swap or cause drift. Been bitten by this one before. Either slap ignore_changes = [version_label] on both environments if swaps happen outside TF, or manage the whole thing through a variable flag like is_green_active so state stays in sync. Also prevent_destroy on the active env is worth considering - one wrong terraform destroy on the wrong workspace and you're having a very bad day.