DEV Community

Zero-Downtime Blue-Green and IP-Based Canary Deployments on ECS Fargate

Most ECS blue-green deployment tutorials eventually lead to the same stack:

  • AWS CodeDeploy
  • Deployment groups
  • AppSpec files
  • Lifecycle hooks
  • Weighted traffic shifting
  • Complex rollback orchestration

And while CodeDeploy works, I kept running into one practical limitation during real deployments:

I couldn’t let my internal team validate a new release on the actual production URL before exposing it to customers.

That became the entire motivation behind this setup.

I didn’t want:

  • separate staging domains
  • duplicate ALBs
  • temporary preview environments
  • “almost production” testing

I wanted something much simpler:

  • Internal users should see the new version first
  • Customers should continue seeing the stable version
  • Both should use the same production domain
  • Rollback should be immediate
  • Deployments should stay zero-downtime

So I built a Terraform-driven deployment workflow using:

  • ECS Fargate
  • Application Load Balancer (ALB)
  • ALB listener priorities
  • Source IP routing
  • Terraform

without using CodeDeploy.

After running this setup in practice, I ended up preferring it for many ECS workloads.


The Core Idea

Both BLUE and GREEN environments run behind the same ALB.

Internal office/VPN IPs get routed to GREEN first.

Everyone else continues hitting BLUE.

That means QA and internal teams can validate the new release directly on the real production infrastructure before public rollout begins.

Same:

  • domain
  • SSL certificate
  • ALB
  • authentication flow
  • redirects
  • networking path

No “staging surprises” later.

A lot of deployment issues only appear on the real production routing path.


Real Example

Internal users open:

https://nginx.jayakrishnayadav.cloud

…and immediately see the GREEN version.

Meanwhile, public users continue seeing BLUE.

No DNS switching.

No duplicate infrastructure.

Just ALB listener routing.


Architecture Overview

The deployment flow looks like this:

                ┌────────────────────┐
                │   Application LB   │
                └─────────┬──────────┘
                          │
         ┌────────────────┴────────────────┐
         │                                 │
 Internal Office/VPN IPs             Public Users
         │                                 │
         ▼                                 ▼
   GREEN Target Group               BLUE Target Group
         │                                 │
    ECS GREEN Tasks                  ECS BLUE Tasks

The canary routing rule gets evaluated first.

If the request source IP matches internal CIDRs, traffic goes to GREEN.

Everything else falls back to BLUE.


Terraform Structure

I kept the Terraform layout modular so it could be reused across multiple services.

.
├── main.tf
├── variables.tf
├── outputs.tf
├── env/
│   ├── backend.hcl
│   └── terraform.tfvars
├── modules/
│   ├── vpc/
│   ├── iam/
│   ├── alb/
│   ├── ecs-cluster/
│   └── ecs-blue-green-service/
└── scripts/
    └── zero-downtime-test.sh

Each ECS service gets:

  • BLUE ECS service
  • GREEN ECS service
  • BLUE target group
  • GREEN target group
  • production listener rule
  • optional canary listener rule

ALB Listener Rule Logic

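The entire deployment behavior depends on ALB listener priorities.

An ALB evaluates listener rules in ascending priority order and stops at the first match, so the canary rule (priority 99) is checked before the production rule (priority 100).

If the request source IP matches the internal CIDRs, traffic gets forwarded to GREEN.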

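resource "aws_lb_listener_rule" "canary" {
  # Created only while canary routing is switched on.
  count        = var.activate_canary ? 1 : 0
  listener_arn = var.https_listener_arn # required; variable name is an assumption
  priority     = 99                     # checked before the production rule (100)

  # Both conditions must match: internal source IP AND the production host.
  condition {
    source_ip {
      values = var.canary_source_ips
    }
  }

  condition {
    host_header {
      values = ["nginx.jayakrishnayadav.cloud"]
    }
  }

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.green.arn
  }
}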

The production rule remains below it:

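resource "aws_lb_listener_rule" "production" {
  listener_arn = var.https_listener_arn # required; variable name is an assumption
  priority     = 100

  condition {
    host_header {
      values = ["nginx.jayakrishnayadav.cloud"]
    }
  }

  action {
    type = "forward"
    # Resolves to the BLUE or GREEN target group ARN depending on promotion state.
    target_group_arn = local.active_target_group
  }
}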

That’s it.

No weighted routing.

No lifecycle hooks.

Just listener priorities.


Real Deployment Workflow

This wasn’t built as a theoretical architecture exercise.

I tested the rollout flow directly from Terraform while continuously validating traffic behavior against live ECS Fargate services.

Terraform initialization:

terraform init -backend-config=env/backend.hcl

Deployment apply:

terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve

During canary validation, I continuously verified my public IP:

curl ifconfig.me

That mattered because the ALB source-IP rule decides whether a request reaches BLUE or GREEN.

Once my IP matched the configured canary CIDRs, traffic immediately started routing to GREEN.
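
To predict which color a given address should see, the CIDR match can be reproduced locally. This is a minimal sketch, not part of the repository: the function names `ip_to_int`, `ip_in_cidr`, and `expected_color` are hypothetical, it handles IPv4 only, and it only approximates what the ALB does with the source address it sees.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  local -a o
  read -r -a o <<< "$1"
  echo "$(( (o[0] << 24) | (o[1] << 16) | (o[2] << 8) | o[3] ))"
}

# Return 0 if IPv4 address $1 falls inside CIDR block $2 (e.g. 10.0.0.0/8).
# A bare address without a prefix length is treated as a /32.
ip_in_cidr() {
  local ip=$1 net=${2%/*} bits=32
  [[ $2 == */* ]] && bits=${2#*/}
  local mask=$(( bits == 0 ? 0 : (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( ( $(ip_to_int "$ip") & mask ) == ( $(ip_to_int "$net") & mask ) ))
}

# Print GREEN if the address matches any of the given canary CIDRs, else BLUE.
expected_color() {
  local ip=$1; shift
  local cidr
  for cidr in "$@"; do
    if ip_in_cidr "$ip" "$cidr"; then
      echo "GREEN"
      return
    fi
  done
  echo "BLUE"
}
```

Usage, assuming the canary CIDR list contains a single office address: `expected_color "$(curl -s ifconfig.me)" "160.30.39.198/32"`. If the printed color differs from what the browser shows, the CIDR list or the egress IP is the first thing to check.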


Deployment Flow

The nice part about this setup is that every stage is driven by three boolean variables: enable_canary, activate_canary, and promote_to_all.


Step 1 — Normal Production State

BLUE handles all production traffic.

GREEN remains scaled down.

enable_canary   = false
activate_canary = false
promote_to_all  = false

Apply:

terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve

Result:

  • BLUE active
  • GREEN inactive
  • minimal Fargate cost

Step 2 — Start GREEN Tasks

Now we start the GREEN environment.

enable_canary   = true
activate_canary = false
promote_to_all  = false

Apply again:

terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve

At this stage:

  • GREEN tasks start
  • ECS health checks complete
  • ALB target registration completes
  • no production traffic reaches GREEN yet

Users never hit partially starting containers.
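
Before moving to the next step, it helps to confirm that every GREEN target actually reports healthy. The sketch below polls the target group with the AWS CLI. The helper names (`fetch_states`, `all_healthy`, `wait_for_green`) and the idea of passing the GREEN target group ARN as an argument are assumptions for illustration, not something taken from the repository.

```shell
#!/usr/bin/env bash
# Print the health state of every target in a target group, whitespace-separated.
fetch_states() {
  aws elbv2 describe-target-health \
    --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[].TargetHealth.State' \
    --output text
}

# Return 0 only if at least one state is given and all of them are "healthy".
all_healthy() {
  (( $# > 0 )) || return 1
  local s
  for s in "$@"; do
    [[ $s == healthy ]] || return 1
  done
}

# Poll until every target is healthy, or give up.
# Arguments: target group ARN, max attempts (default 30), delay in seconds (default 10).
wait_for_green() {
  local arn=$1 attempts=${2:-30} delay=${3:-10} i
  for (( i = 1; i <= attempts; i++ )); do
    # Intentionally unquoted: split the CLI text output into one word per target.
    if all_healthy $(fetch_states "$arn"); then
      echo "GREEN healthy after $i check(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "GREEN not healthy after $attempts checks" >&2
  return 1
}
```

Usage: `wait_for_green "$GREEN_TG_ARN"`, where `GREEN_TG_ARN` is a placeholder for your own target group ARN. The AWS CLI also ships a built-in waiter, `aws elbv2 wait target-in-service --target-group-arn ...`, which may be enough if custom retry settings are not needed.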


Step 3 — Internal Canary Validation

Now we enable canary routing.

enable_canary   = true
activate_canary = true
promote_to_all  = false

Apply again:

terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve

Now:

  • internal office/VPN users hit GREEN
  • public users continue hitting BLUE

This became the most valuable phase of the deployment workflow.

Because now:

  • QA validates production behavior
  • developers inspect logs
  • authentication flows get tested
  • sessions and redirects get verified

while customers remain completely unaffected.


Internal Canary Routing

This is the ALB listener rules view while canary routing is enabled.

The priority 99 rule matches internal source IPs and forwards them to GREEN, while everyone else continues hitting BLUE.

[Image: ALB Canary Routing]


Step 4 — Promote GREEN to Production

Once validation looks good:

enable_canary   = true
activate_canary = false
promote_to_all  = true

Apply again:

terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve

Now:

  • production listener switches to GREEN
  • BLUE scales down
  • all users see the new version

No downtime occurs.

Traffic simply moves from one target group to another.
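
The four steps can be summarized as a mapping from the three flags to the resulting state. The function below is a rough model inferred from the step descriptions and from the `count = var.activate_canary ? 1 : 0` expression on the canary rule. It is not the repository's actual Terraform logic, and `deployment_state` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Model the state described for each step.
# Arguments: enable_canary activate_canary promote_to_all (each "true" or "false").
deployment_state() {
  local enable=$1 activate=$2 promote=$3
  local blue=running green=stopped canary_rule=absent production=BLUE

  # GREEN tasks run once the canary is enabled and must keep running after promotion.
  if [[ $enable == true || $promote == true ]]; then
    green=running
  fi
  # The priority 99 rule exists only while activate_canary is true.
  if [[ $activate == true ]]; then
    canary_rule=present
  fi
  # Promotion points the production rule at GREEN and scales BLUE down.
  if [[ $promote == true ]]; then
    production=GREEN
    blue=stopped
  fi

  echo "blue=$blue green=$green canary_rule=$canary_rule production=$production"
}

deployment_state false false false   # Step 1
deployment_state true  false false   # Step 2
deployment_state true  true  false   # Step 3
deployment_state true  false true    # Step 4
```

Writing the mapping out this way makes it easier to spot flag combinations the steps never use, such as activating the canary rule while GREEN is not enabled.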


Verifying Zero Downtime

I didn’t want to assume the deployment was safe.

I wanted to verify it continuously during rollout.

So I used a simple curl-based validation script that continuously hit both applications while traffic shifted between BLUE and GREEN.

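#!/usr/bin/env bash
# Poll both services 100 times and report the status code and which color answered.
for i in {1..100}
do
  for url in \
    "https://nginx.jayakrishnayadav.cloud/" \
    "https://apache.jayakrishnayadav.cloud/"
  do
    # -k skips TLS verification; -w appends the status code after the body.
    # curl reports 000 when no HTTP response was received at all.
    response=$(curl -k -s -w " HTTPSTATUS:%{http_code}" "$url")

    # Split the combined output back into body and status code.
    body=${response% HTTPSTATUS:*}
    status=${response##*HTTPSTATUS:}

    # The page body carries a "BLUE - v..." or "GREEN - v..." marker.
    if [[ $body == *"BLUE - v"* ]]; then
      color="BLUE"
    elif [[ $body == *"GREEN - v"* ]]; then
      color="GREEN"
    else
      color="UNKNOWN"
    fi

    echo "Run: $i | URL: $url | Status: $status | Version: $color"
  done
done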

Output during deployment:

[Image: Zero Downtime Validation]

The output showed:

  • HTTP 200 responses throughout deployment
  • no failed requests
  • no 503s
  • clean traffic movement from BLUE to GREEN

Every sampled request succeeded, which confirmed that the switch caused no observable downtime.
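
Reading a few hundred log lines by eye is error-prone, so the check can be automated. This is a small sketch that assumes the exact line format printed by the loop above. `summarize_run` is a hypothetical helper, not part of the repository.

```shell
#!/usr/bin/env bash
# Summarize lines of the form:
#   Run: 1 | URL: https://... | Status: 200 | Version: BLUE
# Reads the log on stdin and returns non-zero if any request was not a 200.
summarize_run() {
  awk -F' [|] ' '
    {
      sub(/^Status: /, "", $3)
      sub(/^Version: /, "", $4)
      total++
      if ($3 != "200") failed++
      seen[$4]++
    }
    END {
      printf "total=%d failed=%d blue=%d green=%d unknown=%d\n",
        total, failed, seen["BLUE"], seen["GREEN"], seen["UNKNOWN"]
      exit (failed > 0)
    }'
}
```

Usage, assuming the loop is saved as scripts/zero-downtime-test.sh: `bash scripts/zero-downtime-test.sh | summarize_run`. A non-zero exit status means at least one request returned something other than 200.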


Production Promotion View

After promotion:

  • the canary rule disappears
  • the production listener points directly to GREEN
  • all traffic reaches the new version
  • BLUE scales down to zero

Clean and simple.

[Image: Production Listener Switch]

[Image: Final Traffic Flow]


Rollback

Rollback became extremely simple.

I just reverted the Terraform variables:

enable_canary   = false
activate_canary = false
promote_to_all  = false

Apply Terraform again:

terraform apply \
  -var-file=env/terraform.tfvars \
  -lock=false \
  -auto-approve

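The ALB routes traffic back to BLUE as soon as the listener rules are updated. If BLUE was already scaled down to zero after promotion, its tasks have to be running and healthy again before that switch; otherwise the BLUE target group has no healthy targets to serve requests.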

The rollback process stays predictable because traffic switching is entirely controlled through ALB listener rules.
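
Since every stage, including rollback, is just a different combination of the three flags, the apply command can be wrapped so nobody has to edit the tfvars file by hand. This is a sketch under assumptions: the stage names and the functions `stage_flags` and `apply_stage` are hypothetical. It relies on Terraform giving `-var` flags that come after `-var-file` precedence over the values in the file. Unlike the commands in this post, it leaves state locking on and omits `-auto-approve`, so Terraform asks for confirmation.

```shell
#!/usr/bin/env bash
# Print the -var flags for a named stage (stage names are hypothetical).
stage_flags() {
  case $1 in
    baseline|rollback) set -- false false false ;;
    warmup)            set -- true  false false ;;
    canary)            set -- true  true  false ;;
    promote)           set -- true  false true  ;;
    *) echo "unknown stage: $1" >&2; return 1 ;;
  esac
  echo "-var enable_canary=$1 -var activate_canary=$2 -var promote_to_all=$3"
}

# Apply a stage. With DRY_RUN=1 the command is printed instead of executed.
apply_stage() {
  local flags
  flags=$(stage_flags "$1") || return 1
  # $flags is intentionally unquoted so each word becomes its own argument.
  local cmd=(terraform apply -var-file=env/terraform.tfvars $flags)
  if [[ ${DRY_RUN:-0} == 1 ]]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}
```

Usage: `DRY_RUN=1 apply_stage rollback` prints the command for review, and `apply_stage rollback` runs it.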


HTTPS Configuration

The ALB uses ACM certificates for HTTPS.

Listeners:

  • Port 80 → redirect to HTTPS
  • Port 443 → production traffic
  • optional internal listener → restricted to internal CIDRs

Example:

test_listener_allowed_cidrs = [
  "160.30.39.198/32"
]

That keeps internal preview traffic private while still using the same production infrastructure.


Cost Optimization

One thing I specifically wanted to avoid was permanently doubling infrastructure cost.

Normal state:

  • only BLUE tasks run

Deployment window:

  • BLUE + GREEN both run temporarily

After promotion:

  • BLUE scales down again

So infrastructure cost only increases briefly during deployments.


Final Thoughts

This project started because I wanted a very practical deployment workflow:

Internal users should validate the new version on the actual production URL before customers ever see it.

Once I implemented that using ALB listener priorities and source IP routing, I realized I no longer really needed CodeDeploy for this workflow.

The end result became:

  • simpler
  • easier to operate
  • easier to roll back
  • easier to debug
  • easier to reason about
  • zero-downtime

And because everything is Terraform-driven, the deployment process stays reproducible and predictable.


GitHub Repository

Full Terraform implementation:

https://github.com/jayakrishnayadav24/ecs-blue-green-deployment/tree/canary
