Sherifdeen Adebayo

Blue/Green Deployment with Nginx Auto-Failover

What is Blue/Green Deployment?

Imagine you're running a restaurant. You have two identical kitchens: Kitchen Blue (currently serving customers) and Kitchen Green (on standby). When you want to update the menu or equipment, you:

  1. Update Kitchen Green while Kitchen Blue serves customers
  2. Test Kitchen Green thoroughly
  3. Switch all orders to Kitchen Green
  4. Now Kitchen Blue is on standby for the next update

That's exactly how blue/green deployment works in software! You maintain two identical production environments and can switch between them instantly.

Why Use This Approach?

Traditional deployment problems:

  • Downtime during updates (users see "Site under maintenance")
  • No easy rollback if something breaks
  • Risk of breaking production with untested changes

Blue/Green deployment benefits:

  • Zero downtime - users never notice the switch
  • Instant rollback - just switch back if issues arise
  • Safe testing - new version runs in production environment before switch

Project Architecture

                          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                          β”‚   Nginx     β”‚
                          β”‚   :8080     β”‚
                          β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚                                              β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Blue (Primary)   β”‚                        β”‚  Green (Backup)    β”‚
β”‚      :8081         β”‚                        β”‚      :8082         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

How it works:

  1. Nginx sits at port 8080 (your main entry point)
  2. Blue service at port 8081 (currently handling all traffic)
  3. Green service at port 8082 (backup, ready to take over)
  4. When Blue fails, Nginx automatically routes traffic to Green
  5. No requests are dropped during the transition

Setting Up the Project

Step 1: Project Structure

Create your project directory:

mkdir blue-green-deployment
cd blue-green-deployment

Your final structure will look like this:

blue-green-deployment/
β”œβ”€β”€ docker-compose.yml      # Orchestrates all services
β”œβ”€β”€ nginx.conf.template     # Nginx configuration
β”œβ”€β”€ entrypoint.sh          # Nginx startup script
β”œβ”€β”€ env.example            # Environment variables template
β”œβ”€β”€ .env                   # Your actual environment variables
β”œβ”€β”€ Makefile               # Convenient commands
β”œβ”€β”€ test-failover.sh       # Automated testing script
β”œβ”€β”€ README.md              # Documentation
β”œβ”€β”€ DECISION.md            # Technical decisions
└── PART_B_RESEARCH.md     # Infrastructure research

Step 2: Docker Compose Configuration

The docker-compose.yml file is the heart of this setup. Let's break it down:

services:
  # Nginx acts as the load balancer with automatic failover
  nginx:
    image: nginx:alpine
    container_name: nginx-lb
    ports:
      - "8080:80"           # Main entry point for users
    volumes:
      - ./nginx.conf.template:/etc/nginx/nginx.conf:ro
      - ./entrypoint.sh:/entrypoint.sh:ro
    environment:
      - ACTIVE_POOL=${ACTIVE_POOL:-blue}    # Which pool is primary?
      - BLUE_UPSTREAM=app_blue:${PORT:-3000}
      - GREEN_UPSTREAM=app_green:${PORT:-3000}
    depends_on:
      - app_blue
      - app_green
    entrypoint: ["/bin/sh", "/entrypoint.sh"]
    networks:
      - app-network
    restart: unless-stopped

  # Blue pool - primary by default
  app_blue:
    image: ${BLUE_IMAGE}
    container_name: app_blue
    ports:
      - "8081:${PORT:-3000}"    # Direct access for testing
    environment:
      - APP_POOL=blue
      - RELEASE_ID=${RELEASE_ID_BLUE:-v1.0.0-blue}
      - PORT=${PORT:-3000}
    networks:
      - app-network
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider",
             "http://localhost:${PORT:-3000}/healthz"]
      interval: 5s
      timeout: 3s
      retries: 3

  # Green pool - backup by default
  app_green:
    image: ${GREEN_IMAGE}
    container_name: app_green
    ports:
      - "8082:${PORT:-3000}"    # Direct access for testing
    environment:
      - APP_POOL=green
      - RELEASE_ID=${RELEASE_ID_GREEN:-v1.0.0-green}
      - PORT=${PORT:-3000}
    networks:
      - app-network
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider",
             "http://localhost:${PORT:-3000}/healthz"]
      interval: 5s
      timeout: 3s
      retries: 3

networks:
  app-network:
    driver: bridge

Key points to understand:

  1. Port Strategy:

    • Port 8080: Public-facing Nginx (users hit this)
    • Port 8081: Direct access to Blue (for chaos testing)
    • Port 8082: Direct access to Green (for chaos testing)
  2. Health Checks:

    • Every 5 seconds, Docker checks if the service is healthy
    • Uses the /healthz endpoint
    • After 3 failed attempts (3s timeout each), marks as unhealthy
  3. Environment Variables:

    • ACTIVE_POOL: Which service gets traffic first (blue/green)
    • RELEASE_ID: Tracks which version is running
    • PORT: Application port (default: 3000)

Step 3: Nginx Configuration Magic

The nginx.conf.template is where the failover magic happens:

events {
    worker_connections 1024;
}

http {
    # Logging to help debug issues
    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log warn;

    # Combined upstream with failover logic
    upstream app_backend {
        # Active pool (primary)
        server ${ACTIVE_UPSTREAM} max_fails=2 fail_timeout=5s;

        # Backup pool - only used if primary fails
        server ${BACKUP_UPSTREAM} max_fails=2 fail_timeout=5s backup;
    }

    server {
        listen 80;
        server_name localhost;

        # Aggressive timeouts for quick failover detection
        proxy_connect_timeout 2s;
        proxy_send_timeout 3s;
        proxy_read_timeout 3s;

        # Retry logic - crucial for zero-downtime failover
        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 10s;

        # Don't buffer - we want real-time responses
        proxy_buffering off;

        # Forward original client info
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        location / {
            proxy_pass http://app_backend;
            proxy_pass_request_headers on;
        }
    }
}

Let's decode this configuration:

1. Upstream Block:

upstream app_backend {
    server ${ACTIVE_UPSTREAM} max_fails=2 fail_timeout=5s;
    server ${BACKUP_UPSTREAM} max_fails=2 fail_timeout=5s backup;
}
  • max_fails=2: After 2 failed requests, mark server as down
  • fail_timeout=5s: Server is down for 5 seconds before retry
  • backup: This server only receives traffic when primary fails

2. Timeout Settings:

proxy_connect_timeout 2s;    # Can't connect? Fail fast
proxy_read_timeout 3s;       # No response? Move to backup

These are aggressive timeouts for quick failure detection. In production, you might increase these based on your app's response times.

3. Retry Logic:

proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 2;

This tells Nginx: "If you get an error, timeout, or 5xx status code, try the next server (backup) automatically."
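You can watch this retry from the client side by timing requests while blue is failing: a request that fails over should take roughly one extra connect timeout compared to a normal one. A small sketch, assuming the stack from this post is running:

# Failed-over requests show the retry as added latency (~2s connect timeout)
for i in {1..5}; do
  curl -s -o /dev/null \
       -w "request $i: %{http_code} in %{time_total}s\n" \
       http://localhost:8080/version
done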

Step 4: The Entrypoint Script

Nginx doesn't support environment variables natively in its config. We use envsubst to template it at runtime:

#!/bin/sh
set -e

# Figure out which upstream is active and which is backup
if [ "$ACTIVE_POOL" = "blue" ]; then
    ACTIVE_UPSTREAM="$BLUE_UPSTREAM"
    BACKUP_UPSTREAM="$GREEN_UPSTREAM"
else
    ACTIVE_UPSTREAM="$GREEN_UPSTREAM"
    BACKUP_UPSTREAM="$BLUE_UPSTREAM"
fi

echo "==> Setting up Nginx with active pool: $ACTIVE_POOL"
echo "    Active upstream: $ACTIVE_UPSTREAM"
echo "    Backup upstream: $BACKUP_UPSTREAM"

# Use envsubst to replace variables in the template
export ACTIVE_UPSTREAM
export BACKUP_UPSTREAM

envsubst '${ACTIVE_UPSTREAM} ${BACKUP_UPSTREAM}' \
    < /etc/nginx/nginx.conf > /tmp/nginx.conf

# Test the config before starting (ALWAYS do this!)
nginx -t -c /tmp/nginx.conf

# Start nginx in foreground
echo "==> Starting Nginx..."
exec nginx -g 'daemon off;' -c /tmp/nginx.conf

What this script does:

  1. Reads the ACTIVE_POOL environment variable
  2. Sets up which upstream is active vs backup
  3. Replaces placeholders in nginx.conf
  4. Tests the configuration (prevents broken configs from starting)
  5. Starts Nginx with the processed config
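If you ever doubt what the script rendered, you can inspect the processed file inside the running container (using the nginx-lb container name from the compose file):

# Peek at the rendered config and re-run the syntax check
docker exec nginx-lb cat /tmp/nginx.conf
docker exec nginx-lb nginx -t -c /tmp/nginx.conf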

Step 5: Environment Configuration

Create your .env file from the example:

cp env.example .env

The .env file contents:

# Docker images for blue and green pools
BLUE_IMAGE=your-registry/your-app:blue
GREEN_IMAGE=your-registry/your-app:green

# Which pool should be active? (blue or green)
ACTIVE_POOL=blue

# Release identifiers - show up in X-Release-Id header
RELEASE_ID_BLUE=v1.0.0-blue-20250129
RELEASE_ID_GREEN=v1.0.0-green-20250129

# Application port
PORT=3000

Pro tip: In a real deployment, BLUE_IMAGE and GREEN_IMAGE might point to different versions of your app:

  • Blue: myapp:v1.2.3
  • Green: myapp:v1.2.4 (new version being tested)

Step 6: Makefile for Convenience

The Makefile provides friendly commands:

.PHONY: help up down restart logs test clean

help:  ## Show available commands
    @grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | \
    awk 'BEGIN {FS = ":.*?## "}; {printf "  %-15s %s\n", $$1, $$2}'

up:  ## Start all services
    @if [ ! -f .env ]; then \
        echo "Creating .env from env.example..."; \
        cp env.example .env; \
    fi
    docker-compose up -d
    @echo "Services started!"
    @echo "Nginx:  http://localhost:8080"
    @echo "Blue:   http://localhost:8081"
    @echo "Green:  http://localhost:8082"

down:  ## Stop all services
    docker-compose down

test:  ## Run failover test
    ./test-failover.sh

blue:  ## Switch to blue as active
    @sed -i.bak 's/ACTIVE_POOL=.*/ACTIVE_POOL=blue/' .env
    @$(MAKE) restart

green:  ## Switch to green as active
    @sed -i.bak 's/ACTIVE_POOL=.*/ACTIVE_POOL=green/' .env
    @$(MAKE) restart

Usage:

make help      # See all commands
make up        # Start everything
make test      # Run failover tests
make green     # Switch to green pool

Running the Project

Start the Stack

Option 1: Using Make (recommended)

make up

Option 2: Manual

cp env.example .env
docker-compose up -d

Verify It's Working

# Check service status
docker-compose ps

# Hit the main endpoint
curl -i http://localhost:8080/version

You should see headers like:

HTTP/1.1 200 OK
X-App-Pool: blue
X-Release-Id: v1.0.0-blue-20250129
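To keep an eye on which pool is serving while you experiment, a simple polling loop does the job (assuming the app sets the X-App-Pool header as shown above; stop it with Ctrl+C):

# Print the serving pool once per second
while true; do
  curl -s -i http://localhost:8080/version | grep -i "X-App-Pool"
  sleep 1
done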

Testing the Failover

This is where it gets exciting! Let's break the blue service and watch Nginx automatically switch to green.

Automated Test Script

The test-failover.sh script automates the entire test:

#!/bin/bash
set -e

echo "πŸ”΅ Blue/Green Failover Test"
echo "============================"
echo ""

# Step 1: Baseline check
echo "πŸ“Š Step 1: Checking baseline (should be blue)..."
for i in {1..3}; do
    response=$(curl -s -i http://localhost:8080/version)
    pool=$(echo "$response" | grep -i "X-App-Pool:" | cut -d: -f2 | tr -d ' \r')
    echo "  Request $i: Pool=$pool"
done

# Step 2: Trigger chaos on blue
echo "πŸ’₯ Step 2: Triggering chaos on blue..."
echo "  Chaos initiated: $(curl -s -X POST "http://localhost:8081/chaos/start?mode=error")"
sleep 1

# Step 3: Test failover
echo "πŸ”„ Step 3: Testing failover (should switch to green)..."
success_count=0
green_count=0
for i in {1..10}; do
    http_code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/version)
    if [ "$http_code" = "200" ]; then
        success_count=$((success_count + 1))
        response=$(curl -s -i http://localhost:8080/version)
        pool=$(echo "$response" | grep -i "X-App-Pool:" | cut -d: -f2 | tr -d ' \r')
        if [ "$pool" = "green" ]; then
            green_count=$((green_count + 1))
        fi
        echo "  Request $i: HTTP $http_code, Pool=$pool βœ“"
    else
        echo "  Request $i: HTTP $http_code βœ— FAILED"
    fi
    sleep 0.5
done

# Step 4: Stop chaos
echo "πŸ›‘ Step 4: Stopping chaos..."
curl -s -X POST "http://localhost:8081/chaos/stop"

# Results
echo "πŸ“ˆ Results:"
echo "  β”œβ”€ Total requests: 10"
echo "  β”œβ”€ Successful (200): $success_count"
echo "  └─ Routed to green: $green_count"

if [ $success_count -eq 10 ] && [ $green_count -ge 9 ]; then
    echo "βœ… Test PASSED - Failover working correctly!"
else
    echo "❌ Test FAILED - Check the logs"
fi

Run the test:

chmod +x test-failover.sh
./test-failover.sh

Expected output:

πŸ”΅ Blue/Green Failover Test
============================

πŸ“Š Step 1: Checking baseline (should be blue)...
  Request 1: Pool=blue
  Request 2: Pool=blue
  Request 3: Pool=blue

πŸ’₯ Step 2: Triggering chaos on blue...
  Chaos initiated: {"status":"chaos_started","mode":"error"}

πŸ”„ Step 3: Testing failover (should switch to green)...
  Request 1: HTTP 200, Pool=green βœ“
  Request 2: HTTP 200, Pool=green βœ“
  Request 3: HTTP 200, Pool=green βœ“
  ...
  Request 10: HTTP 200, Pool=green βœ“

πŸ“ˆ Results:
  β”œβ”€ Total requests: 10
  β”œβ”€ Successful (200): 10
  └─ Routed to green: 10

βœ… Test PASSED - Failover working correctly!

What just happened?

  1. All requests initially went to blue
  2. We triggered chaos mode (blue starts returning 500 errors)
  3. Nginx detected blue was failing
  4. Zero requests failed - Nginx automatically retried on green
  5. All subsequent requests went to green
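You can corroborate this from Nginx's side too: the failed attempts against blue are recorded in its error log before each retry. For example:

# Look for upstream errors that triggered the failover
docker logs nginx-lb 2>&1 | grep -i "upstream"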

Manual Testing (Understanding Each Step)

Let's do it manually to understand what's happening:

1. Check baseline:

for i in {1..5}; do
  curl -s http://localhost:8080/version | grep -E "pool|release"
done

All responses show "pool": "blue".

2. Break blue service:

# Trigger chaos mode - makes blue return 500 errors
curl -X POST "http://localhost:8081/chaos/start?mode=error"

3. Watch the magic:

# Keep hitting the endpoint
for i in {1..10}; do
  curl -s -w "\nStatus: %{http_code}\n" http://localhost:8080/version | \
    grep -E "pool|release|Status"
  sleep 1
done

You'll see:

  • All requests still return 200 (no failures!)
  • Pool changes from "blue" to "green"
  • Headers now show "pool": "green"

4. Fix blue:

curl -X POST "http://localhost:8081/chaos/stop"

After a few seconds, traffic goes back to blue (it's the primary).
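You can watch the recovery with the same kind of loop; once fail_timeout (5s here) expires and blue answers again, the header flips back. A sketch, assuming the stack above is running:

# Blue should reappear within a few seconds of chaos stopping
for i in {1..10}; do
  curl -s -i http://localhost:8080/version | grep -i "X-App-Pool"
  sleep 1
done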

Key Concepts Explained

Why Aggressive Timeouts?

proxy_connect_timeout 2s;
proxy_read_timeout 3s;

Scenario: Your blue service starts hanging (taking 10+ seconds to respond).

With loose timeouts (10s):

  • User makes request β†’ Nginx waits 10s on blue β†’ Fails β†’ Retries on green
  • User waited 10+ seconds (bad experience)

With tight timeouts (2-3s):

  • User makes request β†’ Nginx waits 2s on blue β†’ Fails fast β†’ Retries on green
  • User gets response in ~2-3 seconds (much better!)

Trade-off: If your app legitimately takes 5 seconds to respond, 3s timeouts will cause false failures. Tune based on your app's performance.
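Before settling on numbers, measure what your app actually does. curl's timing variables give a rough read; for serious tuning you'd reach for a proper load-testing tool:

# Sample end-to-end latency through the proxy; eyeball the slowest requests
for i in {1..20}; do
  curl -s -o /dev/null -w "%{time_total}\n" http://localhost:8080/version
done | sort -n | tail -3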

The Backup Directive

server ${ACTIVE_UPSTREAM} max_fails=2 fail_timeout=5s;
server ${BACKUP_UPSTREAM} max_fails=2 fail_timeout=5s backup;

The backup keyword is crucial:

  • Without it: Nginx load-balances 50/50 between blue and green
  • With it: Nginx sends all traffic to blue, green only gets traffic when blue is down

This is what makes it true blue/green deployment (not load balancing).
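An easy way to convince yourself: fire a batch of requests while both pools are healthy and count where they land. With backup in place, every response should come from blue (a sketch, assuming the X-App-Pool header):

# With `backup`, all healthy-state traffic should report the primary pool
for i in {1..20}; do
  curl -s -i http://localhost:8080/version | grep -i "X-App-Pool"
done | sort | uniq -c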

Health Checks vs Failover

Docker health checks:

healthcheck:
  test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:3000/healthz"]
  interval: 5s

Nginx failover:

max_fails=2 fail_timeout=5s

These serve different purposes:

  • Docker health checks: Tell Docker orchestration layer if container is healthy
  • Nginx failover: Actual traffic routing decisions

Both are important, but Nginx failover is what keeps users happy during failures.
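To see the Docker side of the picture, you can ask Docker for each container's health state directly:

# Docker's view of each pool (healthy / unhealthy / starting)
docker inspect --format '{{.State.Health.Status}}' app_blue
docker inspect --format '{{.State.Health.Status}}' app_green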

Production Considerations

What Would Change in Production?

1. Timeouts would be tuned:

  • Measure your app's real response times
  • Set timeouts slightly above p95 latency
  • Balance fast failover vs false positives

2. Monitoring & Alerting:

# Basic Nginx status endpoint (stub_status isn't Prometheus format by itself;
# scrape it with an exporter such as nginx-prometheus-exporter)
location /metrics {
    stub_status on;
}

3. Structured Logging:

log_format json escape=json '{'
  '"time":"$time_iso8601",'
  '"remote_addr":"$remote_addr",'
  '"request":"$request",'
  '"status":$status,'
  '"upstream":"$upstream_addr"'
'}';

access_log /var/log/nginx/access.log json;

4. SSL/TLS:

server {
    listen 443 ssl http2;
    ssl_certificate /etc/ssl/certs/cert.pem;
    ssl_certificate_key /etc/ssl/private/key.pem;
    # ... rest of config
}

5. Rate Limiting:

limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location / {
    limit_req zone=api burst=20 nodelay;
    proxy_pass http://app_backend;
}
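You can check a limit like this by bursting past it and counting status codes; requests over the burst should come back as 503, Nginx's default rejection status for limit_req (a sketch against the hypothetical zone above):

# Burst past the limit and tally responses (expect a mix of 200s and 503s)
for i in {1..50}; do
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/
done | sort | uniq -c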

Troubleshooting Guide

Problem: All requests failing

Check if containers are running:

docker-compose ps

Expected output:

NAME        STATE    PORTS
nginx-lb    Up       0.0.0.0:8080->80/tcp
app_blue    Up       0.0.0.0:8081->3000/tcp
app_green   Up       0.0.0.0:8082->3000/tcp

If containers are down:

docker-compose logs

Problem: Not seeing failover

Check Nginx logs:

docker logs nginx-lb

# Or live tail:
docker logs -f nginx-lb

Look for:

[error] ... upstream timed out ... while connecting to upstream
[warn] ... upstream server temporarily disabled while connecting to upstream

Problem: Services won't start

Check if ports are already in use:

lsof -i :8080
lsof -i :8081
lsof -i :8082

If something is using these ports:

# Kill the process
kill -9 <PID>

# Or change ports in docker-compose.yml
ports:
  - "9080:80"  # Use 9080 instead of 8080

Problem: Changes to .env not taking effect

You need to recreate containers:

docker-compose down
docker-compose up -d

Just restarting isn't enough - environment variables are set at container creation time.
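A quick sanity check after recreating is to read the variables back out of the running container:

# Confirm the container actually picked up the new values
docker exec app_blue env | grep -E "APP_POOL|RELEASE_ID|PORT"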

Real-World Use Cases

Scenario 1: Deploying a New Version

Without blue/green:

# Take site down
docker-compose down

# Update to new version
docker-compose up -d

# Hope nothing broke (users see downtime)

With blue/green:

# Update green to new version while blue serves traffic
sed -i 's/GREEN_IMAGE=.*/GREEN_IMAGE=myapp:v2.0.0/' .env

# Start green with new version
docker-compose up -d app_green

# Test green directly
curl http://localhost:8082/version

# Switch traffic to green (instant, zero downtime)
make green

# If something's wrong, instant rollback
make blue

Scenario 2: Database Migration

Challenge: New version needs schema changes.

Approach:

  1. Make schema changes backward-compatible
  2. Deploy new version to green (with migration)
  3. Test thoroughly
  4. Switch traffic
  5. Keep blue running for quick rollback
  6. After validation, update blue too

Example:

# Update green with new code + migrations
GREEN_IMAGE=myapp:v2.0.0 docker-compose up -d app_green

# Run migrations on green
docker exec app_green npm run migrate

# Test green
curl http://localhost:8082/test-endpoint

# Switch traffic
make green

# Monitor for issues
docker logs -f app_green

# If problems occur, instant rollback
make blue
