Mart Young

Posted on Dec 9, 2025

Build a Blue/Green deployment with Nginx Auto-Failover

#devops #node #architecture #tutorial

Imagine you run two identical kitchens: Blue and Green. One serves customers, the other is warmed up and ready. If the active kitchen has trouble, you quietly switch orders to the standby and nobody notices. That’s Blue/Green. In this post we’ll build it ourselves, line by line, with Nginx doing the instant handoff—no prior code or prebuilt images required.

How It Fits Together (Quick Map)

Nginx: Front door. Sends traffic to the main pool, retries fast, and falls back to backup if the main one misbehaves. Logs everything as JSON.
Apps (Blue & Green): Same Node.js app, two copies. Env vars label which is which. They expose /healthz, /version, and chaos endpoints so we can test.
Dockerfile: Builds the app once; both Blue and Green use it.
docker-compose.yaml: Starts both apps, Nginx, and (if you want) the Slack watcher. Sets ports and health checks.
nginx.conf.template: Tells Nginx who’s primary, who’s backup, and to be impatient with failures.
watcher.py: Reads Nginx logs and posts to Slack when failover or high errors happen (optional, but helpful).
.env: One place to pick the active pool and set labels/alert thresholds.

What You’ll Learn

Blue/Green basics (two identical apps, one live, one ready).
How Nginx routes to a primary and instantly falls back to a backup.
Why health checks, short timeouts, and retries make failover fast.
How to add chaos endpoints to prove failover works.
How to read structured logs (and send Slack alerts) so you know which pool served traffic.
How to wire it all together with Docker Compose—no Kubernetes needed.

Prerequisites

Docker + Docker Compose.
Node.js (so we can build the tiny app locally).
(Optional) Slack webhook URL if you want alerts.
A terminal and a text editor. That’s it.

1) Create the Project from Scratch

Let’s start with nothing and build every file ourselves. Copy/paste is fine—understanding why each piece exists is the real goal.

1.1 package.json

This defines our minimal Node app and its dependencies.

cat > package.json <<'EOF'
{
  "name": "blue-green-app",
  "version": "1.0.0",
  "main": "app.js",
  "license": "MIT",
  "scripts": {
    "start": "node app.js"
  },
  "dependencies": {
    "express": "^4.18.2"
  }
}
EOF

1.2 app.js (with health + chaos endpoints)

This tiny server:

Responds to /healthz so Nginx can decide if we’re alive.
Responds to /version with headers that tell us which pool handled the request.
Has chaos endpoints so we can intentionally break one pool and watch traffic fail over.

cat > app.js <<'EOF'
const express = require('express');
const app = express();

const APP_POOL = process.env.APP_POOL || 'unknown';
const RELEASE_ID = process.env.RELEASE_ID || 'unknown';
const PORT = process.env.PORT || 3000;

let chaosMode = false;
let chaosType = 'error'; // 'error' or 'timeout'

// Add headers for tracing
app.use((req, res, next) => {
  res.setHeader('X-App-Pool', APP_POOL);
  res.setHeader('X-Release-Id', RELEASE_ID);
  next();
});

app.get('/', (req, res) => {
  res.json({
    service: 'Blue/Green Demo',
    pool: APP_POOL,
    releaseId: RELEASE_ID,
    status: chaosMode ? 'chaos' : 'healthy',
    chaosMode,
    chaosType: chaosMode ? chaosType : null,
    timestamp: new Date().toISOString(),
    endpoints: { version: '/version', health: '/healthz', chaos: '/chaos/start, /chaos/stop' }
  });
});

app.get('/healthz', (req, res) => {
  res.status(200).json({ status: 'healthy', pool: APP_POOL });
});

app.get('/version', (req, res) => {
  if (chaosMode && chaosType === 'error') return res.status(500).json({ error: 'Chaos: server error' });
  if (chaosMode && chaosType === 'timeout') return; // simulate hang
  res.json({ version: '1.0.0', pool: APP_POOL, releaseId: RELEASE_ID, timestamp: new Date().toISOString() });
});

app.post('/chaos/start', (req, res) => {
  const mode = req.query.mode || 'error';
  chaosMode = true;
  chaosType = mode;
  res.json({ message: 'Chaos started', mode, pool: APP_POOL });
});

app.post('/chaos/stop', (req, res) => {
  chaosMode = false;
  chaosType = 'error';
  res.json({ message: 'Chaos stopped', pool: APP_POOL });
});

app.listen(PORT, '0.0.0.0', () => {
  console.log(`App (${APP_POOL}) listening on ${PORT}`);
  console.log(`Release ID: ${RELEASE_ID}`);
});
EOF

1.3 Dockerfile (build the app image)

We’ll build the same image for Blue and Green; only the environment variables differ.

cat > Dockerfile <<'EOF'
FROM node:18-alpine
WORKDIR /app

# Install dependencies
COPY package*.json ./
RUN npm install --only=production

# Copy app code
COPY . .

EXPOSE 3000
CMD ["npm", "start"]
EOF

2) Nginx Config (Auto-Failover Upstreams)

Nginx is our traffic director. We template it so a single env var (ACTIVE_POOL) chooses who is primary. Create nginx.conf.template:

cat > nginx.conf.template <<'EOF'
events {
    worker_connections 1024;
}

http {
    # Structured JSON access logs
    log_format custom_json '{"time":"$time_iso8601"'
                          ',"remote_addr":"$remote_addr"'
                          ',"method":"$request_method"'
                          ',"uri":"$request_uri"'
                          ',"status":$status'
                          ',"bytes_sent":$bytes_sent'
                          ',"request_time":$request_time'
                          ',"upstream_response_time":"$upstream_response_time"'
                          ',"upstream_status":"$upstream_status"'
                          ',"upstream_addr":"$upstream_addr"'
                          ',"pool":"$sent_http_x_app_pool"'
                          ',"release":"$sent_http_x_release_id"}';

    upstream blue_pool {
        server app-blue:3000 max_fails=1 fail_timeout=3s;
        server app-green:3000 backup;
    }

    upstream green_pool {
        server app-green:3000 max_fails=1 fail_timeout=3s;
        server app-blue:3000 backup;
    }

    server {
        listen 80;
        server_name localhost;

        # Write JSON logs (shared volume)
        access_log /var/log/nginx/access.json custom_json;

        # Health check for LB
        location /healthz {
            access_log off;
            return 200 "healthy\n";
            add_header Content-Type text/plain;
        }

        location / {
            proxy_pass http://$UPSTREAM_POOL;

            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            proxy_connect_timeout 1s;
            proxy_send_timeout 3s;
            proxy_read_timeout 3s;

            proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
            proxy_next_upstream_tries 2;
            proxy_next_upstream_timeout 10s;

            proxy_pass_request_headers on;
            proxy_hide_header X-Powered-By;
        }
    }
}
EOF

Why these settings? (plain English)

max_fails=1 fail_timeout=3s: one bad request is enough to say “try the other one” for a few seconds.
Short timeouts (1s connect, 3s send/read): don’t wait around; switch fast.
proxy_next_upstream + retries: if the main one errors or stalls, immediately try the backup within ~10s total.

What just happened? Nginx knows who’s main, who’s backup, and to give up quickly on a slow/broken main.

3) Optional Alerts: watcher.py + requirements.txt

Think of this as a friendly pager: it reads Nginx’s JSON logs and pings Slack when failover happens or errors spike. If you don’t want alerts, you can skip this section and remove the watcher service later.

requirements.txt:

cat > requirements.txt <<'EOF'
requests==2.32.3
EOF

watcher.py:

cat > watcher.py <<'EOF'
import json, os, time, requests
from collections import deque
from datetime import datetime, timezone

LOG_PATH = os.environ.get("NGINX_LOG_FILE", "/var/log/nginx/access.json")
SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "")
SLACK_PREFIX = os.environ.get("SLACK_PREFIX", "from: @Watcher")
ACTIVE_POOL = os.environ.get("ACTIVE_POOL", "blue")
ERROR_RATE_THRESHOLD = float(os.environ.get("ERROR_RATE_THRESHOLD", "2"))
WINDOW_SIZE = int(os.environ.get("WINDOW_SIZE", "200"))
ALERT_COOLDOWN_SEC = int(os.environ.get("ALERT_COOLDOWN_SEC", "300"))
MAINTENANCE_MODE = os.environ.get("MAINTENANCE_MODE", "false").lower() == "true"

def now_iso(): return datetime.now(timezone.utc).isoformat()

def post_to_slack(text: str):
    if not SLACK_WEBHOOK_URL:
        return
    try:
        requests.post(SLACK_WEBHOOK_URL, json={"text": f"{SLACK_PREFIX} | {text}"}, timeout=5).raise_for_status()
    except Exception:
        pass

def parse(line: str):
    try:
        data = json.loads(line.strip())
        return {
            "pool": data.get("pool"),
            "release": data.get("release"),
            "status": int(data["status"]) if data.get("status") else None,
            "upstream_status": str(data.get("upstream_status") or ""),
            "upstream_addr": data.get("upstream_addr"),
        }
    except Exception:
        return None

class AlertState:
    def __init__(self):
        self.last_pool = ACTIVE_POOL
        self.window = deque(maxlen=WINDOW_SIZE)
        self.cooldowns = {}
    def cooldown_ok(self, key):
        now = time.time()
        last = self.cooldowns.get(key)
        if last is None or (now - last) >= ALERT_COOLDOWN_SEC:
            self.cooldowns[key] = now
            return True
        return False
    def error_rate_pct(self):
        if not self.window: return 0.0
        err = 0
        for evt in self.window:
            if any(s.startswith("5") for s in evt.get("upstream_status","").split(",") if s):
                err += 1
            elif evt.get("status") and 500 <= int(evt["status"]) <= 599:
                err += 1
        return (err / len(self.window)) * 100.0
    def handle(self, evt):
        self.window.append(evt)
        if MAINTENANCE_MODE:
            return
        pool = evt.get("pool")
        if pool and self.last_pool and pool != self.last_pool:
            if self.cooldown_ok(f"failover_to_{pool}"):
                post_to_slack(f"*Failover Detected*: {self.last_pool} → {pool}\n• time: {now_iso()}\n• error_rate: {self.error_rate_pct():.2f}%\n• upstream: {evt.get('upstream_addr')}")
            self.last_pool = pool
        if len(self.window) >= max(10, int(WINDOW_SIZE * 0.5)):
            rate = self.error_rate_pct()
            if rate > ERROR_RATE_THRESHOLD and self.cooldown_ok(f"error_rate_{int(round(rate))}"):
                post_to_slack(f"*High Error Rate*: {rate:.2f}% over last {len(self.window)} requests\n• time: {now_iso()}\n• active_pool: {pool or self.last_pool}")

def tail(path):
    with open(path, "r") as f:
        f.seek(0, os.SEEK_END)
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.2)
                continue
            yield line

def main():
    state = AlertState()
    while not os.path.exists(LOG_PATH):
        time.sleep(0.5)
    for line in tail(LOG_PATH):
        evt = parse(line)
        if evt: state.handle(evt)

if __name__ == "__main__":
    main()
EOF

4) docker-compose.yaml (Build + Orchestrate)

Compose glues everything together: it builds the single app image, runs it twice (Blue/Green), starts Nginx, and (optionally) the Slack watcher. This is the “one file to rule them all.”

cat > docker-compose.yaml <<'EOF'
version: '3.8'

services:
  app-blue:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: blue-app
    environment:
      - APP_POOL=blue
      - RELEASE_ID=${RELEASE_ID_BLUE}
      - PORT=${PORT:-3000}
    ports:
      - "8081:3000"
    healthcheck:
      test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://127.0.0.1:3000/healthz || exit 1"]
      interval: 5s
      timeout: 3s
      retries: 3
      start_period: 10s

  app-green:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: green-app
    environment:
      - APP_POOL=green
      - RELEASE_ID=${RELEASE_ID_GREEN}
      - PORT=${PORT:-3000}
    ports:
      - "8082:3000"
    healthcheck:
      test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider http://127.0.0.1:3000/healthz || exit 1"]
      interval: 5s
      timeout: 3s
      retries: 3
      start_period: 10s

  nginx:
    image: nginx:alpine
    container_name: nginx-lb
    ports:
      - "8080:80"
    environment:
      - ACTIVE_POOL=${ACTIVE_POOL}
      - UPSTREAM_POOL=${ACTIVE_POOL}_pool
    volumes:
      - ./nginx.conf.template:/etc/nginx/nginx.conf.template:ro
      - nginx_logs:/var/log/nginx
    depends_on:
      - app-blue
      - app-green
    command: >
      sh -c "
        envsubst '$$UPSTREAM_POOL' < /etc/nginx/nginx.conf.template > /etc/nginx/nginx.conf &&
        nginx -g 'daemon off;'
      "

  alert_watcher:
    image: python:3.11-slim
    container_name: alert-watcher
    depends_on:
      - nginx
    environment:
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
      - SLACK_PREFIX=${SLACK_PREFIX:-from: @Watcher}
      - ACTIVE_POOL=${ACTIVE_POOL}
      - ERROR_RATE_THRESHOLD=${ERROR_RATE_THRESHOLD:-2}
      - WINDOW_SIZE=${WINDOW_SIZE:-200}
      - ALERT_COOLDOWN_SEC=${ALERT_COOLDOWN_SEC:-300}
      - MAINTENANCE_MODE=${MAINTENANCE_MODE:-false}
      - NGINX_LOG_FILE=/var/log/nginx/access.json
    volumes:
      - nginx_logs:/var/log/nginx
      - ./watcher.py:/opt/watcher/watcher.py:ro
      - ./requirements.txt:/opt/watcher/requirements.txt:ro
    command: >
      sh -c "pip install --no-cache-dir -r /opt/watcher/requirements.txt && python /opt/watcher/watcher.py"

volumes:
  nginx_logs:
EOF

Want it ultra-minimal? Comment out/remove alert_watcher if you don’t need Slack alerts. The stack still works without it.

What just happened? We wired four pieces: one shared app image, two containers (Blue/Green) with different env vars, Nginx in front, and an optional watcher that shares Nginx logs.

5) .env (Wire It All Up)

One place for all the knobs: which pool is primary, release labels, and alert thresholds. Changing ACTIVE_POOL later lets you flip who is “live” without touching code.

cat > .env <<'EOF'
# Which pool is primary (blue or green)
ACTIVE_POOL=blue

# Release IDs (just labels for observability)
RELEASE_ID_BLUE=release-v1.0.0-blue
RELEASE_ID_GREEN=release-v1.0.0-green

# App port inside the container
PORT=3000

# Optional Slack alerts
SLACK_WEBHOOK_URL=
SLACK_PREFIX=from: @YourName
ERROR_RATE_THRESHOLD=2
WINDOW_SIZE=200
ALERT_COOLDOWN_SEC=300
MAINTENANCE_MODE=false
EOF

6) Run Everything

Bring the whole stack up. Compose will build the image once and reuse it for both Blue and Green, then start Nginx and the watcher.

docker compose up -d
docker compose ps

You should see containers for blue, green, nginx, and (optionally) alert-watcher.

7) Sanity Checks

These calls prove traffic flows and headers are set so you can tell which pool responded.

# Through Nginx (main entry)
curl http://localhost:8080/version

# Direct to Blue
curl http://localhost:8081/version

# Direct to Green
curl http://localhost:8082/version

You should see JSON with pool and releaseId. By default, Blue is
active.

8) Prove Auto-Failover (Chaos Testing)

Time to break things on purpose. We’ll poison Blue and watch Nginx slide traffic to Green without customers seeing errors.

1) Baseline (Blue active):

curl http://localhost:8080/version
# Expect X-App-Pool: blue

2) Break Blue:

curl -X POST http://localhost:8081/chaos/start?mode=error

3) Check via Nginx:

curl http://localhost:8080/version
# Expect X-App-Pool: green (failover)

4) Heal Blue:

curl -X POST http://localhost:8081/chaos/stop

5) Try timeout chaos:

curl -X POST http://localhost:8081/chaos/start?mode=timeout

6) Light load test (should stay 200s, most from active pool):

for i in {1..50}; do curl -s http://localhost:8080/version >/dev/null; done

What just happened? We proved failover under two kinds of pain: errors and timeouts. Nginx noticed, retried, and shifted traffic to keep responses healthy.

9) Switch Pools Manually

Edit .env:

ACTIVE_POOL=green

Restart:

docker compose down
docker compose up -d

Nginx will now route to Green as primary, Blue as backup.

10) Slack Alerts (If Enabled)

If you kept alert_watcher, set SLACK_WEBHOOK_URL in .env, then:

docker compose up -d

Trigger chaos on Blue:

curl -X POST http://localhost:8081/chaos/start?mode=error
for i in {1..50}; do curl -s http://localhost:8080/version >/dev/null; done
curl -X POST http://localhost:8081/chaos/stop

You should see Slack messages for failover and (if errors > threshold) high error rate. Tune thresholds in .env.

What just happened? The watcher tailed Nginx’s JSON logs, spotted failover/high-error signals, and pinged Slack so humans know immediately.

11) Cleanup

docker compose down
# Full clean (images/volumes):
docker compose down -v --rmi all

What just happened? We shut down everything, and if you ran the full clean, you also removed images and volumes for a fresh slate next time.

Troubleshooting Quick Hits

Ports busy: Free 8080/8081/8082 or change mappings in compose.
No failover: Check health endpoints (/healthz), timeouts, and chaos mode.
Headers missing: Ensure app sets X-App-Pool/X-Release-Id and Nginx passes headers.
Slack silent: Verify webhook, internet egress, and watcher logs.
Slow failover: Tighten timeouts (proxy_connect_timeout, max_fails, fail_timeout).

Why This Matters

Zero downtime: Swap versions or survive failures without users noticing.
Confidence: Chaos testing proves failover actually works.
Clarity: Structured logs + headers show exactly who served each request.
Simplicity: Docker Compose + Nginx — no Kubernetes required.

Next Steps

Add CI to build/push the app image, tag blue/green releases.
Add canary routing (gradual traffic shift) on top of blue/green.
Ship logs to ELK/Datadog, add dashboards.
Extend alerts to email/PagerDuty.

You just built Blue/Green with automatic failover, chaos testing, and optional Slack alerts — from scratch. Happy shipping! 🚀

DEV Community