Sherifdeen Adebayo

Blue/Green Deployment with Nginx Auto-Failover

What is Blue/Green Deployment?

Imagine you're running a restaurant. You have two identical kitchens: Kitchen Blue (currently serving customers) and Kitchen Green (on standby). When you want to update the menu or equipment, you:

  1. Update Kitchen Green while Kitchen Blue serves customers
  2. Test Kitchen Green thoroughly
  3. Switch all orders to Kitchen Green
  4. Now Kitchen Blue is on standby for the next update

That's exactly how blue/green deployment works in software! You maintain two identical production environments and can switch between them instantly.

Why Use This Approach?

Traditional deployment problems:

  • Downtime during updates (users see "Site under maintenance")
  • No easy rollback if something breaks
  • Risk of breaking production with untested changes

Blue/Green deployment benefits:

  • Zero downtime - users never notice the switch
  • Instant rollback - just switch back if issues arise
  • Safe testing - new version runs in production environment before switch

Project Architecture

                          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                          β”‚   Nginx     β”‚
                          β”‚   :8080     β”‚
                          β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
                                 β”‚
          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
          β”‚                                              β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Blue (Primary)   β”‚                        β”‚  Green (Backup)    β”‚
β”‚      :8081         β”‚                        β”‚      :8082         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

How it works:

  1. Nginx sits at port 8080 (your main entry point)
  2. Blue service at port 8081 (currently handling all traffic)
  3. Green service at port 8082 (backup, ready to take over)
  4. When Blue fails, Nginx automatically routes traffic to Green
  5. No requests are dropped during the transition

Setting Up the Project

Step 1: Project Structure

Create your project directory:

mkdir blue-green-deployment
cd blue-green-deployment

Your final structure will look like this:

blue-green-deployment/
β”œβ”€β”€ docker-compose.yml      # Orchestrates all services
β”œβ”€β”€ nginx.conf.template     # Nginx configuration
β”œβ”€β”€ entrypoint.sh          # Nginx startup script
β”œβ”€β”€ env.example            # Environment variables template
β”œβ”€β”€ .env                   # Your actual environment variables
β”œβ”€β”€ Makefile               # Convenient commands
β”œβ”€β”€ test-failover.sh       # Automated testing script
β”œβ”€β”€ README.md              # Documentation
β”œβ”€β”€ DECISION.md            # Technical decisions
└── PART_B_RESEARCH.md     # Infrastructure research

Step 2: Docker Compose Configuration

The docker-compose.yml file is the heart of this setup. Let's break it down:

services:
  # Nginx acts as the load balancer with automatic failover
  nginx:
    image: nginx:alpine
    container_name: nginx-lb
    ports:
      - "8080:80"           # Main entry point for users
    volumes:
      - ./nginx.conf.template:/etc/nginx/nginx.conf:ro
      - ./entrypoint.sh:/entrypoint.sh:ro
    environment:
      - ACTIVE_POOL=${ACTIVE_POOL:-blue}    # Which pool is primary?
      - BLUE_UPSTREAM=app_blue:${PORT:-3000}
      - GREEN_UPSTREAM=app_green:${PORT:-3000}
    depends_on:
      - app_blue
      - app_green
    entrypoint: ["/bin/sh", "/entrypoint.sh"]
    networks:
      - app-network
    restart: unless-stopped

  # Blue pool - primary by default
  app_blue:
    image: ${BLUE_IMAGE}
    container_name: app_blue
    ports:
      - "8081:${PORT:-3000}"    # Direct access for testing
    environment:
      - APP_POOL=blue
      - RELEASE_ID=${RELEASE_ID_BLUE:-v1.0.0-blue}
      - PORT=${PORT:-3000}
    networks:
      - app-network
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider",
             "http://localhost:${PORT:-3000}/healthz"]
      interval: 5s
      timeout: 3s
      retries: 3

  # Green pool - backup by default
  app_green:
    image: ${GREEN_IMAGE}
    container_name: app_green
    ports:
      - "8082:${PORT:-3000}"    # Direct access for testing
    environment:
      - APP_POOL=green
      - RELEASE_ID=${RELEASE_ID_GREEN:-v1.0.0-green}
      - PORT=${PORT:-3000}
    networks:
      - app-network
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider",
             "http://localhost:${PORT:-3000}/healthz"]
      interval: 5s
      timeout: 3s
      retries: 3

networks:
  app-network:
    driver: bridge

Key points to understand:

  1. Port Strategy:

    • Port 8080: Public-facing Nginx (users hit this)
    • Port 8081: Direct access to Blue (for chaos testing)
    • Port 8082: Direct access to Green (for chaos testing)
  2. Health Checks:

    • Every 5 seconds, Docker checks if the service is healthy
    • Uses the /healthz endpoint
    • After 3 failed attempts (3s timeout each), marks as unhealthy
  3. Environment Variables:

    • ACTIVE_POOL: Which service gets traffic first (blue/green)
    • RELEASE_ID: Tracks which version is running
    • PORT: Application port (default: 3000)

Step 3: Nginx Configuration Magic

The nginx.conf.template is where the failover magic happens:

events {
    worker_connections 1024;
}

http {
    # Logging to help debug issues
    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log warn;

    # Combined upstream with failover logic
    upstream app_backend {
        # Active pool (primary)
        server ${ACTIVE_UPSTREAM} max_fails=2 fail_timeout=5s;

        # Backup pool - only used if primary fails
        server ${BACKUP_UPSTREAM} max_fails=2 fail_timeout=5s backup;
    }

    server {
        listen 80;
        server_name localhost;

        # Aggressive timeouts for quick failover detection
        proxy_connect_timeout 2s;
        proxy_send_timeout 3s;
        proxy_read_timeout 3s;

        # Retry logic - crucial for zero-downtime failover
        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 10s;

        # Don't buffer - we want real-time responses
        proxy_buffering off;

        # Forward original client info
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        location / {
            proxy_pass http://app_backend;
            proxy_pass_request_headers on;
        }
    }
}

Let's decode this configuration:

1. Upstream Block:

upstream app_backend {
    server ${ACTIVE_UPSTREAM} max_fails=2 fail_timeout=5s;
    server ${BACKUP_UPSTREAM} max_fails=2 fail_timeout=5s backup;
}
  • max_fails=2: After 2 failed requests, mark server as down
  • fail_timeout=5s: Server is down for 5 seconds before retry
  • backup: This server only receives traffic when primary fails

2. Timeout Settings:

proxy_connect_timeout 2s;    # Can't connect? Fail fast
proxy_read_timeout 3s;       # No response? Move to backup

These are aggressive timeouts for quick failure detection. In production, you might increase these based on your app's response times.

3. Retry Logic:

proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 2;

This tells Nginx: "If you get an error, timeout, or 5xx status code, try the next server (backup) automatically."
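You can watch this retry from the client side by timing requests while blue is failing: a request that fails over should take roughly one extra connect timeout compared to a normal one. A small sketch, assuming the stack from this post is running:

# Failed-over requests show the retry as added latency (~2s connect timeout)
for i in {1..5}; do
  curl -s -o /dev/null \
       -w "request $i: %{http_code} in %{time_total}s\n" \
       http://localhost:8080/version
done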

Step 4: The Entrypoint Script

Nginx doesn't support environment variables natively in its config. We use envsubst to template it at runtime:

#!/bin/sh
set -e

# Figure out which upstream is active and which is backup
if [ "$ACTIVE_POOL" = "blue" ]; then
    ACTIVE_UPSTREAM="$BLUE_UPSTREAM"
    BACKUP_UPSTREAM="$GREEN_UPSTREAM"
else
    ACTIVE_UPSTREAM="$GREEN_UPSTREAM"
    BACKUP_UPSTREAM="$BLUE_UPSTREAM"
fi

echo "==> Setting up Nginx with active pool: $ACTIVE_POOL"
echo "    Active upstream: $ACTIVE_UPSTREAM"
echo "    Backup upstream: $BACKUP_UPSTREAM"

# Use envsubst to replace variables in the template
export ACTIVE_UPSTREAM
export BACKUP_UPSTREAM

envsubst '${ACTIVE_UPSTREAM} ${BACKUP_UPSTREAM}' \
    < /etc/nginx/nginx.conf > /tmp/nginx.conf

# Test the config before starting (ALWAYS do this!)
nginx -t -c /tmp/nginx.conf

# Start nginx in foreground
echo "==> Starting Nginx..."
exec nginx -g 'daemon off;' -c /tmp/nginx.conf

What this script does:

  1. Reads the ACTIVE_POOL environment variable
  2. Sets up which upstream is active vs backup
  3. Replaces placeholders in nginx.conf
  4. Tests the configuration (prevents broken configs from starting)
  5. Starts Nginx with the processed config
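If you ever doubt what the script rendered, you can inspect the processed file inside the running container (using the nginx-lb container name from the compose file):

# Peek at the rendered config and re-run the syntax check
docker exec nginx-lb cat /tmp/nginx.conf
docker exec nginx-lb nginx -t -c /tmp/nginx.conf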

Step 5: Environment Configuration

Create your .env file from the example:

cp env.example .env

The .env file contents:

# Docker images for blue and green pools
BLUE_IMAGE=your-registry/your-app:blue
GREEN_IMAGE=your-registry/your-app:green

# Which pool should be active? (blue or green)
ACTIVE_POOL=blue

# Release identifiers - show up in X-Release-Id header
RELEASE_ID_BLUE=v1.0.0-blue-20250129
RELEASE_ID_GREEN=v1.0.0-green-20250129

# Application port
PORT=3000

Pro tip: In a real deployment, BLUE_IMAGE and GREEN_IMAGE might point to different versions of your app:

  • Blue: myapp:v1.2.3
  • Green: myapp:v1.2.4 (new version being tested)

Step 6: Makefile for Convenience

The Makefile provides friendly commands:

.PHONY: help up down restart logs test clean

help:  ## Show available commands
    @grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | \
    awk 'BEGIN {FS = ":.*?## "}; {printf "  %-15s %s\n", $$1, $$2}'

up:  ## Start all services
    @if [ ! -f .env ]; then \
        echo "Creating .env from env.example..."; \
        cp env.example .env; \
    fi
    docker-compose up -d
    @echo "Services started!"
    @echo "Nginx:  http://localhost:8080"
    @echo "Blue:   http://localhost:8081"
    @echo "Green:  http://localhost:8082"

down:  ## Stop all services
    docker-compose down

test:  ## Run failover test
    ./test-failover.sh

blue:  ## Switch to blue as active
    @sed -i.bak 's/ACTIVE_POOL=.*/ACTIVE_POOL=blue/' .env
    @$(MAKE) restart

green:  ## Switch to green as active
    @sed -i.bak 's/ACTIVE_POOL=.*/ACTIVE_POOL=green/' .env
    @$(MAKE) restart

Usage:

make help      # See all commands
make up        # Start everything
make test      # Run failover tests
make green     # Switch to green pool

Running the Project

Start the Stack

Option 1: Using Make (recommended)

make up

Option 2: Manual

cp env.example .env
docker-compose up -d

Verify It's Working

# Check service status
docker-compose ps

# Hit the main endpoint
curl -i http://localhost:8080/version

You should see headers like:

HTTP/1.1 200 OK
X-App-Pool: blue
X-Release-Id: v1.0.0-blue-20250129
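To keep an eye on which pool is serving while you experiment, a simple polling loop does the job (assuming the app sets the X-App-Pool header as shown above; stop it with Ctrl+C):

# Print the serving pool once per second
while true; do
  curl -s -i http://localhost:8080/version | grep -i "X-App-Pool"
  sleep 1
done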

Testing the Failover

This is where it gets exciting! Let's break the blue service and watch Nginx automatically switch to green.

Automated Test Script

The test-failover.sh script automates the entire test:

#!/bin/bash
set -e

echo "πŸ”΅ Blue/Green Failover Test"
echo "============================"
echo ""

# Step 1: Baseline check
echo "πŸ“Š Step 1: Checking baseline (should be blue)..."
for i in {1..3}; do
    response=$(curl -s -i http://localhost:8080/version)
    pool=$(echo "$response" | grep -i "X-App-Pool:" | cut -d: -f2 | tr -d ' \r')
    echo "  Request $i: Pool=$pool"
done

# Step 2: Trigger chaos on blue
echo "πŸ’₯ Step 2: Triggering chaos on blue..."
echo "  Chaos initiated: $(curl -s -X POST "http://localhost:8081/chaos/start?mode=error")"
sleep 1

# Step 3: Test failover
echo "πŸ”„ Step 3: Testing failover (should switch to green)..."
success_count=0
green_count=0
for i in {1..10}; do
    http_code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/version)
    if [ "$http_code" = "200" ]; then
        success_count=$((success_count + 1))
        response=$(curl -s -i http://localhost:8080/version)
        pool=$(echo "$response" | grep -i "X-App-Pool:" | cut -d: -f2 | tr -d ' \r')
        if [ "$pool" = "green" ]; then
            green_count=$((green_count + 1))
        fi
        echo "  Request $i: HTTP $http_code, Pool=$pool βœ“"
    else
        echo "  Request $i: HTTP $http_code βœ— FAILED"
    fi
    sleep 0.5
done

# Step 4: Stop chaos
echo "πŸ›‘ Step 4: Stopping chaos..."
curl -s -X POST "http://localhost:8081/chaos/stop"

# Results
echo "πŸ“ˆ Results:"
echo "  β”œβ”€ Total requests: 10"
echo "  β”œβ”€ Successful (200): $success_count"
echo "  └─ Routed to green: $green_count"

if [ $success_count -eq 10 ] && [ $green_count -ge 9 ]; then
    echo "βœ… Test PASSED - Failover working correctly!"
else
    echo "❌ Test FAILED - Check the logs"
fi

Run the test:

chmod +x test-failover.sh
./test-failover.sh

Expected output:

πŸ”΅ Blue/Green Failover Test
============================

πŸ“Š Step 1: Checking baseline (should be blue)...
  Request 1: Pool=blue
  Request 2: Pool=blue
  Request 3: Pool=blue

πŸ’₯ Step 2: Triggering chaos on blue...
  Chaos initiated: {"status":"chaos_started","mode":"error"}

πŸ”„ Step 3: Testing failover (should switch to green)...
  Request 1: HTTP 200, Pool=green βœ“
  Request 2: HTTP 200, Pool=green βœ“
  Request 3: HTTP 200, Pool=green βœ“
  ...
  Request 10: HTTP 200, Pool=green βœ“

πŸ“ˆ Results:
  β”œβ”€ Total requests: 10
  β”œβ”€ Successful (200): 10
  └─ Routed to green: 10

βœ… Test PASSED - Failover working correctly!

What just happened?

  1. All requests initially went to blue
  2. We triggered chaos mode (blue starts returning 500 errors)
  3. Nginx detected blue was failing
  4. Zero requests failed - Nginx automatically retried on green
  5. All subsequent requests went to green
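You can corroborate this from Nginx's side too: the failed attempts against blue are recorded in its error log before each retry. For example:

# Look for upstream errors that triggered the failover
docker logs nginx-lb 2>&1 | grep -i "upstream"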

Manual Testing (Understanding Each Step)

Let's do it manually to understand what's happening:

1. Check baseline:

for i in {1..5}; do
  curl -s http://localhost:8080/version | grep -E "pool|release"
done

All responses show "pool": "blue".

2. Break blue service:

# Trigger chaos mode - makes blue return 500 errors
curl -X POST "http://localhost:8081/chaos/start?mode=error"

3. Watch the magic:

# Keep hitting the endpoint
for i in {1..10}; do
  curl -s -w "\nStatus: %{http_code}\n" http://localhost:8080/version | \
    grep -E "pool|release|Status"
  sleep 1
done

You'll see:

  • All requests still return 200 (no failures!)
  • Pool changes from "blue" to "green"
  • Headers now show "pool": "green"

4. Fix blue:

curl -X POST "http://localhost:8081/chaos/stop"

After a few seconds, traffic goes back to blue (it's the primary).
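You can watch the recovery with the same kind of loop; once fail_timeout (5s here) expires and blue answers again, the header flips back. A sketch, assuming the stack above is running:

# Blue should reappear within a few seconds of chaos stopping
for i in {1..10}; do
  curl -s -i http://localhost:8080/version | grep -i "X-App-Pool"
  sleep 1
done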

Key Concepts Explained

Why Aggressive Timeouts?

proxy_connect_timeout 2s;
proxy_read_timeout 3s;

Scenario: Your blue service starts hanging (taking 10+ seconds to respond).

With loose timeouts (10s):

  • User makes request β†’ Nginx waits 10s on blue β†’ Fails β†’ Retries on green
  • User waited 10+ seconds (bad experience)

With tight timeouts (2-3s):

  • User makes request β†’ Nginx waits 2s on blue β†’ Fails fast β†’ Retries on green
  • User gets response in ~2-3 seconds (much better!)

Trade-off: If your app legitimately takes 5 seconds to respond, 3s timeouts will cause false failures. Tune based on your app's performance.
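Before settling on numbers, measure what your app actually does. curl's timing variables give a rough read; for serious tuning you'd reach for a proper load-testing tool:

# Sample end-to-end latency through the proxy; eyeball the slowest requests
for i in {1..20}; do
  curl -s -o /dev/null -w "%{time_total}\n" http://localhost:8080/version
done | sort -n | tail -3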

The Backup Directive

server ${ACTIVE_UPSTREAM} max_fails=2 fail_timeout=5s;
server ${BACKUP_UPSTREAM} max_fails=2 fail_timeout=5s backup;

The backup keyword is crucial:

  • Without it: Nginx load-balances 50/50 between blue and green
  • With it: Nginx sends all traffic to blue, green only gets traffic when blue is down

This is what makes it true blue/green deployment (not load balancing).
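An easy way to convince yourself: fire a batch of requests while both pools are healthy and count where they land. With backup in place, every response should come from blue (a sketch, assuming the X-App-Pool header):

# With `backup`, all healthy-state traffic should report the primary pool
for i in {1..20}; do
  curl -s -i http://localhost:8080/version | grep -i "X-App-Pool"
done | sort | uniq -c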

Health Checks vs Failover

Docker health checks:

healthcheck:
  test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:3000/healthz"]
  interval: 5s

Nginx failover:

max_fails=2 fail_timeout=5s

These serve different purposes:

  • Docker health checks: Tell Docker orchestration layer if container is healthy
  • Nginx failover: Actual traffic routing decisions

Both are important, but Nginx failover is what keeps users happy during failures.
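To see the Docker side of the picture, you can ask Docker for each container's health state directly:

# Docker's view of each pool (healthy / unhealthy / starting)
docker inspect --format '{{.State.Health.Status}}' app_blue
docker inspect --format '{{.State.Health.Status}}' app_green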

Production Considerations

What Would Change in Production?

1. Timeouts would be tuned:

  • Measure your app's real response times
  • Set timeouts slightly above p95 latency
  • Balance fast failover vs false positives

2. Monitoring & Alerting:

# Basic Nginx status endpoint (stub_status isn't Prometheus format by itself;
# scrape it with an exporter such as nginx-prometheus-exporter)
location /metrics {
    stub_status on;
}

3. Structured Logging:

log_format json escape=json '{'
  '"time":"$time_iso8601",'
  '"remote_addr":"$remote_addr",'
  '"request":"$request",'
  '"status":$status,'
  '"upstream":"$upstream_addr"'
'}';

access_log /var/log/nginx/access.log json;

4. SSL/TLS:

server {
    listen 443 ssl http2;
    ssl_certificate /etc/ssl/certs/cert.pem;
    ssl_certificate_key /etc/ssl/private/key.pem;
    # ... rest of config
}

5. Rate Limiting:

limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location / {
    limit_req zone=api burst=20 nodelay;
    proxy_pass http://app_backend;
}
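You can check a limit like this by bursting past it and counting status codes; requests over the burst should come back as 503, Nginx's default rejection status for limit_req (a sketch against the hypothetical zone above):

# Burst past the limit and tally responses (expect a mix of 200s and 503s)
for i in {1..50}; do
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/
done | sort | uniq -c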

Troubleshooting Guide

Problem: All requests failing

Check if containers are running:

docker-compose ps

Expected output:

NAME        STATE    PORTS
nginx-lb    Up       0.0.0.0:8080->80/tcp
app_blue    Up       0.0.0.0:8081->3000/tcp
app_green   Up       0.0.0.0:8082->3000/tcp

If containers are down:

docker-compose logs

Problem: Not seeing failover

Check Nginx logs:

docker logs nginx-lb

# Or live tail:
docker logs -f nginx-lb

Look for:

[error] ... upstream timed out ... while connecting to upstream
[warn] ... upstream server temporarily disabled while connecting to upstream

Problem: Services won't start

Check if ports are already in use:

lsof -i :8080
lsof -i :8081
lsof -i :8082

If something is using these ports:

# Kill the process
kill -9 <PID>

# Or change ports in docker-compose.yml
ports:
  - "9080:80"  # Use 9080 instead of 8080

Problem: Changes to .env not taking effect

You need to recreate containers:

docker-compose down
docker-compose up -d

Just restarting isn't enough - environment variables are set at container creation time.
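A quick sanity check after recreating is to read the variables back out of the running container:

# Confirm the container actually picked up the new values
docker exec app_blue env | grep -E "APP_POOL|RELEASE_ID|PORT"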

Real-World Use Cases

Scenario 1: Deploying a New Version

Without blue/green:

# Take site down
docker-compose down

# Update to new version
docker-compose up -d

# Hope nothing broke (users see downtime)

With blue/green:

# Update green to new version while blue serves traffic
sed -i 's/GREEN_IMAGE=.*/GREEN_IMAGE=myapp:v2.0.0/' .env

# Start green with new version
docker-compose up -d app_green

# Test green directly
curl http://localhost:8082/version

# Switch traffic to green (instant, zero downtime)
make green

# If something's wrong, instant rollback
make blue

Scenario 2: Database Migration

Challenge: New version needs schema changes.

Approach:

  1. Make schema changes backward-compatible
  2. Deploy new version to green (with migration)
  3. Test thoroughly
  4. Switch traffic
  5. Keep blue running for quick rollback
  6. After validation, update blue too

Example:

# Update green with new code + migrations
GREEN_IMAGE=myapp:v2.0.0 docker-compose up -d app_green

# Run migrations on green
docker exec app_green npm run migrate

# Test green
curl http://localhost:8082/test-endpoint

# Switch traffic
make green

# Monitor for issues
docker logs -f app_green

# If problems occur, instant rollback
make blue
