What is Blue/Green Deployment?
Imagine you're running a restaurant. You have two identical kitchens: Kitchen Blue (currently serving customers) and Kitchen Green (on standby). When you want to update the menu or equipment, you:
- Update Kitchen Green while Kitchen Blue serves customers
- Test Kitchen Green thoroughly
- Switch all orders to Kitchen Green
- Now Kitchen Blue is on standby for the next update
That's exactly how blue/green deployment works in software! You maintain two identical production environments and can switch between them instantly.
Why Use This Approach?
Traditional deployment problems:
- Downtime during updates (users see "Site under maintenance")
- No easy rollback if something breaks
- Risk of breaking production with untested changes
Blue/Green deployment benefits:
- Zero downtime - users never notice the switch
- Instant rollback - just switch back if issues arise
- Safe testing - new version runs in production environment before switch
Project Architecture
              ┌─────────────┐
              │    Nginx    │
              │    :8080    │
              └──────┬──────┘
                     │
         ┌───────────┴───────────┐
         │                       │
┌────────▼─────────┐   ┌─────────▼────────┐
│  Blue (Primary)  │   │  Green (Backup)  │
│      :8081       │   │      :8082       │
└──────────────────┘   └──────────────────┘
How it works:
- Nginx sits at port 8080 (your main entry point)
- Blue service at port 8081 (currently handling all traffic)
- Green service at port 8082 (backup, ready to take over)
- When Blue fails, Nginx automatically routes traffic to Green
- No requests are dropped during the transition
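Once the stack from the following sections is running, you can see this topology for yourself. A quick sketch (the /version endpoint and X-App-Pool header come from the demo app used throughout this post):
# Through the load balancer (the path users take)
curl -s -i http://localhost:8080/version | grep -i "X-App-Pool"   # expect: blue

# Straight to each pool, bypassing Nginx (useful for chaos testing later)
curl -s http://localhost:8081/version   # blue
curl -s http://localhost:8082/version   # green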
Setting Up the Project
Step 1: Project Structure
Create your project directory:
mkdir blue-green-deployment
cd blue-green-deployment
Your final structure will look like this:
blue-green-deployment/
├── docker-compose.yml      # Orchestrates all services
├── nginx.conf.template     # Nginx configuration
├── entrypoint.sh           # Nginx startup script
├── env.example             # Environment variables template
├── .env                    # Your actual environment variables
├── Makefile                # Convenient commands
├── test-failover.sh        # Automated testing script
├── README.md               # Documentation
├── DECISION.md             # Technical decisions
└── PART_B_RESEARCH.md      # Infrastructure research
Step 2: Docker Compose Configuration
The docker-compose.yml file is the heart of this setup. Let's break it down:
services:
  # Nginx acts as the load balancer with automatic failover
  nginx:
    image: nginx:alpine
    container_name: nginx-lb
    ports:
      - "8080:80"  # Main entry point for users
    volumes:
      - ./nginx.conf.template:/etc/nginx/nginx.conf:ro
      - ./entrypoint.sh:/entrypoint.sh:ro
    environment:
      - ACTIVE_POOL=${ACTIVE_POOL:-blue}  # Which pool is primary?
      - BLUE_UPSTREAM=app_blue:${PORT:-3000}
      - GREEN_UPSTREAM=app_green:${PORT:-3000}
    depends_on:
      - app_blue
      - app_green
    entrypoint: ["/bin/sh", "/entrypoint.sh"]
    networks:
      - app-network
    restart: unless-stopped

  # Blue pool - primary by default
  app_blue:
    image: ${BLUE_IMAGE}
    container_name: app_blue
    ports:
      - "8081:${PORT:-3000}"  # Direct access for testing
    environment:
      - APP_POOL=blue
      - RELEASE_ID=${RELEASE_ID_BLUE:-v1.0.0-blue}
      - PORT=${PORT:-3000}
    networks:
      - app-network
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:${PORT:-3000}/healthz"]
      interval: 5s
      timeout: 3s
      retries: 3

  # Green pool - backup by default
  app_green:
    image: ${GREEN_IMAGE}
    container_name: app_green
    ports:
      - "8082:${PORT:-3000}"  # Direct access for testing
    environment:
      - APP_POOL=green
      - RELEASE_ID=${RELEASE_ID_GREEN:-v1.0.0-green}
      - PORT=${PORT:-3000}
    networks:
      - app-network
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:${PORT:-3000}/healthz"]
      interval: 5s
      timeout: 3s
      retries: 3

networks:
  app-network:
    driver: bridge
Key points to understand:
Port Strategy:
- Port 8080: Public-facing Nginx (users hit this)
- Port 8081: Direct access to Blue (for chaos testing)
- Port 8082: Direct access to Green (for chaos testing)

Health Checks:
- Every 5 seconds, Docker checks if the service is healthy
- Uses the /healthz endpoint (you can hit it directly; see the check below)
- After 3 failed attempts (3s timeout each), the container is marked unhealthy

Environment Variables:
- ACTIVE_POOL: Which service gets traffic first (blue or green)
- RELEASE_ID: Tracks which version is running
- PORT: Application port (default: 3000)
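A quick sanity check, assuming your app images actually serve /healthz on the mapped ports as the health checks expect:
# Hit each pool's health endpoint directly, bypassing Nginx
curl -s http://localhost:8081/healthz   # blue
curl -s http://localhost:8082/healthz   # green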
Step 3: Nginx Configuration Magic
The nginx.conf.template is where the failover magic happens:
events {
    worker_connections 1024;
}

http {
    # Logging to help debug issues
    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log warn;

    # Combined upstream with failover logic
    upstream app_backend {
        # Active pool (primary)
        server ${ACTIVE_UPSTREAM} max_fails=2 fail_timeout=5s;
        # Backup pool - only used if primary fails
        server ${BACKUP_UPSTREAM} max_fails=2 fail_timeout=5s backup;
    }

    server {
        listen 80;
        server_name localhost;

        # Aggressive timeouts for quick failover detection
        proxy_connect_timeout 2s;
        proxy_send_timeout 3s;
        proxy_read_timeout 3s;

        # Retry logic - crucial for zero-downtime failover
        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 10s;

        # Don't buffer - we want real-time responses
        proxy_buffering off;

        # Forward original client info
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        location / {
            proxy_pass http://app_backend;
            proxy_pass_request_headers on;
        }
    }
}
Let's decode this configuration:
1. Upstream Block:
upstream app_backend {
    server ${ACTIVE_UPSTREAM} max_fails=2 fail_timeout=5s;
    server ${BACKUP_UPSTREAM} max_fails=2 fail_timeout=5s backup;
}
- max_fails=2: After 2 failed requests, mark the server as down
- fail_timeout=5s: The server stays marked down for 5 seconds before Nginx retries it
- backup: This server only receives traffic when the primary fails
2. Timeout Settings:
proxy_connect_timeout 2s; # Can't connect? Fail fast
proxy_read_timeout 3s; # No response? Move to backup
These are aggressive timeouts for quick failure detection. In production, you might increase these based on your app's response times.
3. Retry Logic:
proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 2;
This tells Nginx: "If you get an error, timeout, or 5xx status code, try the next server (backup) automatically."
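Once the stack is up, you can confirm which upstreams Nginx is actually running with. The entrypoint script in the next step writes the rendered config to /tmp/nginx.conf, so (assuming the container name nginx-lb from the compose file):
# Show the upstream block Nginx is actually running with
docker exec nginx-lb sh -c 'grep -A 3 "upstream app_backend" /tmp/nginx.conf'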
Step 4: The Entrypoint Script
Nginx doesn't support environment variables natively in its config. We use envsubst to template it at runtime:
#!/bin/sh
set -e

# Figure out which upstream is active and which is backup
if [ "$ACTIVE_POOL" = "blue" ]; then
    ACTIVE_UPSTREAM="$BLUE_UPSTREAM"
    BACKUP_UPSTREAM="$GREEN_UPSTREAM"
else
    ACTIVE_UPSTREAM="$GREEN_UPSTREAM"
    BACKUP_UPSTREAM="$BLUE_UPSTREAM"
fi

echo "==> Setting up Nginx with active pool: $ACTIVE_POOL"
echo "    Active upstream: $ACTIVE_UPSTREAM"
echo "    Backup upstream: $BACKUP_UPSTREAM"

# Use envsubst to replace variables in the template
export ACTIVE_UPSTREAM
export BACKUP_UPSTREAM
envsubst '${ACTIVE_UPSTREAM} ${BACKUP_UPSTREAM}' \
    < /etc/nginx/nginx.conf > /tmp/nginx.conf

# Test the config before starting (ALWAYS do this!)
nginx -t -c /tmp/nginx.conf

# Start nginx in foreground
echo "==> Starting Nginx..."
exec nginx -g 'daemon off;' -c /tmp/nginx.conf
What this script does:
- Reads the ACTIVE_POOL environment variable
- Sets up which upstream is active vs backup
- Replaces placeholders in nginx.conf
- Tests the configuration (prevents broken configs from starting)
- Starts Nginx with the processed config
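To sanity-check the templating without starting any containers, you can run the same envsubst step on your machine (a sketch with example values; envsubst ships with the gettext package):
# Render the template locally and preview the resulting server lines
export ACTIVE_UPSTREAM=app_blue:3000 BACKUP_UPSTREAM=app_green:3000
envsubst '${ACTIVE_UPSTREAM} ${BACKUP_UPSTREAM}' < nginx.conf.template | grep 'server '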
Step 5: Environment Configuration
Create your .env file from the example:
cp env.example .env
The .env file contents:
# Docker images for blue and green pools
BLUE_IMAGE=your-registry/your-app:blue
GREEN_IMAGE=your-registry/your-app:green
# Which pool should be active? (blue or green)
ACTIVE_POOL=blue
# Release identifiers - show up in X-Release-Id header
RELEASE_ID_BLUE=v1.0.0-blue-20250129
RELEASE_ID_GREEN=v1.0.0-green-20250129
# Application port
PORT=3000
Pro tip: In a real deployment, BLUE_IMAGE and GREEN_IMAGE might point to different versions of your app:
- Blue: myapp:v1.2.3
- Green: myapp:v1.2.4 (new version being tested)
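To verify what Docker Compose will actually use once your .env is in place, you can print the fully resolved configuration:
# Print the compose file with all .env values substituted
docker-compose config

# Or spot-check a single variable
docker-compose config | grep RELEASE_ID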
Step 6: Makefile for Convenience
The Makefile provides friendly commands:
.PHONY: help up down restart logs test clean

help: ## Show available commands
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | \
		awk 'BEGIN {FS = ":.*?## "}; {printf "  %-15s %s\n", $$1, $$2}'

up: ## Start all services
	@if [ ! -f .env ]; then \
		echo "Creating .env from env.example..."; \
		cp env.example .env; \
	fi
	docker-compose up -d
	@echo "Services started!"
	@echo "Nginx: http://localhost:8080"
	@echo "Blue:  http://localhost:8081"
	@echo "Green: http://localhost:8082"

down: ## Stop all services
	docker-compose down

restart: ## Recreate services so .env changes take effect
	docker-compose down
	docker-compose up -d

test: ## Run failover test
	./test-failover.sh

blue: ## Switch to blue as active
	@sed -i.bak 's/ACTIVE_POOL=.*/ACTIVE_POOL=blue/' .env
	@$(MAKE) restart

green: ## Switch to green as active
	@sed -i.bak 's/ACTIVE_POOL=.*/ACTIVE_POOL=green/' .env
	@$(MAKE) restart
Usage:
make help # See all commands
make up # Start everything
make test # Run failover tests
make green # Switch to green pool
Running the Project
Start the Stack
Option 1: Using Make (recommended)
make up
Option 2: Manual
cp env.example .env
docker-compose up -d
Verify It's Working
# Check service status
docker-compose ps
# Hit the main endpoint
curl -i http://localhost:8080/version
You should see headers like:
HTTP/1.1 200 OK
X-App-Pool: blue
X-Release-Id: v1.0.0-blue-20250129
Testing the Failover
This is where it gets exciting! Let's break the blue service and watch Nginx automatically switch to green.
Automated Test Script
The test-failover.sh script automates the entire test:
#!/bin/bash
set -e

echo "🔵 Blue/Green Failover Test"
echo "============================"
echo ""

# Step 1: Baseline check
echo "📊 Step 1: Checking baseline (should be blue)..."
for i in {1..3}; do
  response=$(curl -s -i http://localhost:8080/version)
  pool=$(echo "$response" | grep -i "X-App-Pool:" | cut -d: -f2 | tr -d ' \r')
  echo "  Request $i: Pool=$pool"
done

# Step 2: Trigger chaos on blue
echo "💥 Step 2: Triggering chaos on blue..."
curl -s -X POST "http://localhost:8081/chaos/start?mode=error"
sleep 1

# Step 3: Test failover
echo "🔄 Step 3: Testing failover (should switch to green)..."
success_count=0
green_count=0
for i in {1..10}; do
  http_code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/version)
  if [ "$http_code" = "200" ]; then
    success_count=$((success_count + 1))
    response=$(curl -s -i http://localhost:8080/version)
    pool=$(echo "$response" | grep -i "X-App-Pool:" | cut -d: -f2 | tr -d ' \r')
    if [ "$pool" = "green" ]; then
      green_count=$((green_count + 1))
    fi
    echo "  Request $i: HTTP $http_code, Pool=$pool ✓"
  else
    echo "  Request $i: HTTP $http_code ✗ FAILED"
  fi
  sleep 0.5
done

# Step 4: Stop chaos
echo "🛑 Step 4: Stopping chaos..."
curl -s -X POST "http://localhost:8081/chaos/stop"

# Results
echo "📊 Results:"
echo "  ├─ Total requests: 10"
echo "  ├─ Successful (200): $success_count"
echo "  └─ Routed to green: $green_count"

if [ $success_count -eq 10 ] && [ $green_count -ge 9 ]; then
  echo "✅ Test PASSED - Failover working correctly!"
else
  echo "❌ Test FAILED - Check the logs"
fi
Run the test:
chmod +x test-failover.sh
./test-failover.sh
Expected output:
🔵 Blue/Green Failover Test
============================
📊 Step 1: Checking baseline (should be blue)...
  Request 1: Pool=blue
  Request 2: Pool=blue
  Request 3: Pool=blue
💥 Step 2: Triggering chaos on blue...
Chaos initiated: {"status":"chaos_started","mode":"error"}
🔄 Step 3: Testing failover (should switch to green)...
  Request 1: HTTP 200, Pool=green ✓
  Request 2: HTTP 200, Pool=green ✓
  Request 3: HTTP 200, Pool=green ✓
  ...
  Request 10: HTTP 200, Pool=green ✓
📊 Results:
  ├─ Total requests: 10
  ├─ Successful (200): 10
  └─ Routed to green: 10
✅ Test PASSED - Failover working correctly!
What just happened?
- All requests initially went to blue
- We triggered chaos mode (blue starts returning 500 errors)
- Nginx detected blue was failing
- Zero requests failed - Nginx automatically retried on green
- All subsequent requests went to green
Manual Testing (Understanding Each Step)
Let's do it manually to understand what's happening:
1. Check baseline:
for i in {1..5}; do
  curl -s http://localhost:8080/version | grep -E "pool|release"
done
All responses show "pool": "blue".
2. Break blue service:
# Trigger chaos mode - makes blue return 500 errors
curl -X POST "http://localhost:8081/chaos/start?mode=error"
3. Watch the magic:
# Keep hitting the endpoint
for i in {1..10}; do
  curl -s -w "\nStatus: %{http_code}\n" http://localhost:8080/version | \
    grep -E "pool|release|Status"
  sleep 1
done
You'll see:
- All requests still return 200 (no failures!)
- Pool changes from "blue" to "green"
- The response body now shows "pool": "green"
4. Fix blue:
curl -X POST "http://localhost:8081/chaos/stop"
After a few seconds, traffic goes back to blue (it's the primary).
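To watch the recovery happen, a simple polling loop (the X-App-Pool header is the same one used by the test script above):
# Poll once per second and watch the pool flip back after fail_timeout (5s)
for i in {1..10}; do
  curl -s -i http://localhost:8080/version | grep -i "X-App-Pool"
  sleep 1
done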
Key Concepts Explained
Why Aggressive Timeouts?
proxy_connect_timeout 2s;
proxy_read_timeout 3s;
Scenario: Your blue service starts hanging (taking 10+ seconds to respond).
With loose timeouts (10s):
- User makes request → Nginx waits 10s on blue → fails → retries on green
- User waited 10+ seconds (bad experience)
With tight timeouts (2-3s):
- User makes request → Nginx waits 2s on blue → fails fast → retries on green
- User gets a response in ~2-3 seconds (much better!)
Trade-off: If your app legitimately takes 5 seconds to respond, 3s timeouts will cause false failures. Tune based on your app's performance.
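One rough way to ground that tuning is to measure your app's actual latency distribution. A sketch against the local stack (not a proper benchmark, just a ballpark):
# Rough p95 latency over 100 requests
for i in {1..100}; do
  curl -s -o /dev/null -w "%{time_total}\n" http://localhost:8080/version
done | sort -n | awk 'NR==95 {print "p95: " $1 "s"}'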
The Backup Directive
server ${ACTIVE_UPSTREAM} max_fails=2 fail_timeout=5s;
server ${BACKUP_UPSTREAM} max_fails=2 fail_timeout=5s backup;
The backup keyword is crucial:
- Without it: Nginx load-balances 50/50 between blue and green
- With it: Nginx sends all traffic to blue, green only gets traffic when blue is down
This is what makes it true blue/green deployment (not load balancing).
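You can confirm this behavior empirically: with the backup directive in place, every request should land on the active pool while it is healthy.
# Tally which pool served 20 consecutive requests; with `backup` set,
# all 20 should count toward the active pool
for i in {1..20}; do
  curl -s -i http://localhost:8080/version | grep -i "X-App-Pool"
done | sort | uniq -c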
Health Checks vs Failover
Docker health checks:
healthcheck:
  test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:3000/healthz"]
  interval: 5s
Nginx failover:
max_fails=2 fail_timeout=5s
These serve different purposes:
- Docker health checks: Tell Docker orchestration layer if container is healthy
- Nginx failover: Actual traffic routing decisions
Both are important, but Nginx failover is what keeps users happy during failures.
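You can inspect Docker's side of the picture at any time:
# What Docker's health checks currently report (independent of Nginx routing)
docker inspect --format '{{.Name}}: {{.State.Health.Status}}' app_blue app_green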
Production Considerations
What Would Change in Production?
1. Timeouts would be tuned:
- Measure your app's real response times
- Set timeouts slightly above p95 latency
- Balance fast failover vs false positives
2. Monitoring & Alerting:
# Expose nginx's connection stats; a Prometheus exporter can scrape this
location /metrics {
    stub_status on;
}
3. Structured Logging:
log_format json escape=json '{'
    '"time":"$time_iso8601",'
    '"remote_addr":"$remote_addr",'
    '"request":"$request",'
    '"status":$status,'
    '"upstream":"$upstream_addr"'
'}';
access_log /var/log/nginx/access.log json;
4. SSL/TLS:
server {
    listen 443 ssl http2;
    ssl_certificate /etc/ssl/certs/cert.pem;
    ssl_certificate_key /etc/ssl/private/key.pem;
    # ... rest of config
}
5. Rate Limiting:
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location / {
    limit_req zone=api burst=20 nodelay;
    proxy_pass http://app_backend;
}
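If you apply that rate-limit config, a quick smoke test shows it working; this assumes the zone and burst values above:
# Send 30 back-to-back requests; once the burst allowance of 20 is spent,
# the overflow should come back as 503 (limit_req's default status)
for i in {1..30}; do
  curl -s -o /dev/null -w "%{http_code} " http://localhost:8080/version
done; echo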
Troubleshooting Guide
Problem: All requests failing
Check if containers are running:
docker-compose ps
Expected output:
NAME        STATE    PORTS
nginx-lb    Up       0.0.0.0:8080->80/tcp
app_blue    Up       0.0.0.0:8081->3000/tcp
app_green   Up       0.0.0.0:8082->3000/tcp
If containers are down:
docker-compose logs
Problem: Not seeing failover
Check Nginx logs:
docker logs nginx-lb
# Or live tail:
docker logs -f nginx-lb
Look for:
[error] ... upstream timed out ... while connecting to upstream
[warn] ... marking server app_blue:3000 as down
Problem: Services won't start
Check if ports are already in use:
lsof -i :8080
lsof -i :8081
lsof -i :8082
If something is using these ports:
# Kill the process
kill -9 <PID>
# Or change ports in docker-compose.yml
ports:
  - "9080:80"  # Use 9080 instead of 8080
Problem: Changes to .env not taking effect
You need to recreate containers:
docker-compose down
docker-compose up -d
Just restarting isn't enough - environment variables are set at container creation time.
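An equivalent one-liner that forces recreation without a full stop:
# Recreate containers in place so the new .env values are picked up
docker-compose up -d --force-recreate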
Real-World Use Cases
Scenario 1: Deploying a New Version
Without blue/green:
# Take site down
docker-compose down
# Update to new version
docker-compose up -d
# Hope nothing broke (users see downtime)
With blue/green:
# Update green to new version while blue serves traffic
sed -i 's/GREEN_IMAGE=.*/GREEN_IMAGE=myapp:v2.0.0/' .env
# Start green with new version
docker-compose up -d app_green
# Test green directly
curl http://localhost:8082/version
# Switch traffic to green (instant, zero downtime)
make green
# If something's wrong, instant rollback
make blue
Scenario 2: Database Migration
Challenge: New version needs schema changes.
Approach:
- Make schema changes backward-compatible
- Deploy new version to green (with migration)
- Test thoroughly
- Switch traffic
- Keep blue running for quick rollback
- After validation, update blue too
Example:
# Update green with new code + migrations
GREEN_IMAGE=myapp:v2.0.0 docker-compose up -d app_green
# Run migrations on green
docker exec app_green npm run migrate
# Test green
curl http://localhost:8082/test-endpoint
# Switch traffic
make green
# Monitor for issues
docker logs -f app_green
# If problems occur, instant rollback
make blue