DEV Community

cypher682

Complete Beginner's Guide to Blue-Green Deployment with Nginx and Real-Time Alerting

Introduction

Welcome to this comprehensive guide on Blue-Green Deployment - a powerful deployment strategy used by companies like Netflix, Amazon, and Facebook to achieve zero-downtime deployments. This project demonstrates how to implement a production-ready blue-green deployment system with automatic failover and real-time Slack alerting.

Repository: HNG DevOps on GitHub

What You'll Learn:

  • What blue-green deployment is and why it matters
  • How to implement automatic failover with Nginx
  • How to build a real-time monitoring and alerting system
  • How to achieve zero-downtime deployments
  • How to integrate Slack notifications for DevOps alerts

Prerequisites:

  • Basic understanding of Docker
  • Familiarity with command line
  • Basic knowledge of web servers (helpful but not required)

What is Blue-Green Deployment?

The Problem: Traditional Deployments

Imagine you're running a website. When you deploy a new version:

  1. You stop the old version
  2. Deploy the new version
  3. Start the new version

Problem: During steps 1-3, your website is DOWN. Users see errors. You lose money.

The Solution: Blue-Green Deployment

Instead of having one environment, you have TWO identical environments:

  • BLUE (Production) - Currently serving users
  • GREEN (Staging) - New version waiting to go live

When you're ready to deploy:

  1. Deploy new version to GREEN
  2. Test GREEN thoroughly
  3. Switch traffic from BLUE to GREEN instantly
  4. If something goes wrong, switch back to BLUE instantly

Result: ZERO DOWNTIME
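
The four steps above can be sketched as a tiny state machine. This is only an illustrative model, not code from the project: the `BlueGreenRouter` class, its method names, and the version strings are all hypothetical.

```python
class BlueGreenRouter:
    """Toy model of a blue-green switch (hypothetical, for illustration)."""

    def __init__(self):
        self.live = "blue"    # environment currently serving users
        self.idle = "green"   # environment that receives the next release
        self.versions = {"blue": "1.0.0", "green": "1.0.0"}

    def deploy_to_idle(self, version):
        # Steps 1-2: only the idle environment changes, so users are unaffected.
        self.versions[self.idle] = version

    def switch(self):
        # Step 3 (and step 4 when called again): swap live and idle.
        self.live, self.idle = self.idle, self.live

    def serving_version(self):
        return self.versions[self.live]


router = BlueGreenRouter()
router.deploy_to_idle("1.1.0")
print(router.serving_version())  # 1.0.0 (users are still on blue)
router.switch()
print(router.serving_version())  # 1.1.0 (green is now live)
router.switch()                  # rollback
print(router.serving_version())  # 1.0.0
```

The old version stays deployed on the idle side after the switch, which is what makes the rollback a single swap rather than a redeploy.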

Real-World Analogy

Think of it like having two stages at a concert:

  • Stage 1 (Blue): Band is performing, audience is watching
  • Stage 2 (Green): Next band is setting up and sound-checking

When it's time to switch:

  • Rotate the stage 180°
  • Audience now sees Stage 2 (Green)
  • Stage 1 (Blue) becomes the setup area for the next act

If the new band has technical issues, rotate back to Stage 1 instantly!


Project Overview

This project implements a production-ready blue-green deployment system with:

Core Features

  1. Automatic Failover

    • Nginx detects when Blue instance fails
    • Automatically routes all traffic to Green
    • Failed requests are retried on Green, so users should not see errors
  2. Real-Time Alerting

    • Python watcher monitors Nginx logs
    • Detects failover events instantly
    • Sends alerts to Slack
    • Monitors error rates
  3. Zero-Downtime Deployment

    • Deploy new version to inactive instance
    • Switch traffic instantly
    • Rollback in seconds if needed
  4. Structured Logging

    • Every request logged with metadata
    • Pool information (blue/green)
    • Release version
    • Response times

Understanding the Core Concepts

1. Blue-Green vs. Load Balancing

Load Balancing:

Request 1 → Blue
Request 2 → Green
Request 3 → Blue
Request 4 → Green

Traffic is distributed between instances.

Blue-Green (This Project):

All Requests → Blue (Primary)
              Green (Backup, standby)

If Blue fails:
All Requests → Green (Backup becomes active)

Traffic goes to ONE instance at a time. The other is a hot standby.
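
To make the contrast concrete, here is a minimal sketch of the two selection strategies. The function names and the `healthy` dictionary are assumptions made for this example; Nginx implements both behaviours internally.

```python
from itertools import cycle


def round_robin(servers):
    """Load balancing: each request goes to the next server in turn."""
    return cycle(servers)


def primary_with_backup(primary, backup, is_healthy):
    """Blue-green as used here: always the primary unless it is unhealthy."""
    return primary if is_healthy(primary) else backup


rr = round_robin(["blue", "green"])
print([next(rr) for _ in range(4)])  # ['blue', 'green', 'blue', 'green']

healthy = {"blue": True, "green": True}
print(primary_with_backup("blue", "green", healthy.get))  # blue
healthy["blue"] = False
print(primary_with_backup("blue", "green", healthy.get))  # green
```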

2. Nginx Upstream Configuration

Nginx can route traffic to multiple backend servers (upstreams). This project uses:

upstream app {
    server app_blue:3000 max_fails=1 fail_timeout=5s;
    server app_green:3000 backup max_fails=1 fail_timeout=5s;
}

Key Directives:

  • server app_blue:3000 - Primary server
  • backup - Only use if primary fails
  • max_fails=1 - Mark the server as unavailable after 1 failed attempt within fail_timeout
  • fail_timeout=5s - Keep a failed server out of rotation for 5 seconds, then try it again

3. Failover Mechanism

When a request fails:

  1. Nginx tries Blue instance
  2. Blue refuses the connection, times out, or returns 502, 503, or 504
  3. Nginx marks Blue as failed
  4. Nginx retries request to Green (backup)
  5. User receives successful response from Green
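  6. Subsequent requests go to Green until fail_timeout expires and Nginx tries Blue again

Result: For idempotent requests such as GET, the user does not see the error, because Nginx retries on Green before responding.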
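
The sequence above can be approximated in a few lines of Python. This is a simplified model of passive failure detection, not Nginx's actual algorithm: the `Peer` class, `handle_request`, and the `attempt` callback are hypothetical names chosen for the sketch.

```python
class Peer:
    """One upstream server with max_fails / fail_timeout bookkeeping."""

    def __init__(self, name, max_fails=1, fail_timeout=5.0, backup=False):
        self.name = name
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.backup = backup
        self.fails = 0
        self.down_until = 0.0  # the peer is skipped until this time

    def available(self, now):
        return now >= self.down_until

    def record_failure(self, now):
        self.fails += 1
        if self.fails >= self.max_fails:
            self.down_until = now + self.fail_timeout
            self.fails = 0


def handle_request(peers, attempt, now):
    """Try available primaries first, then backups; return the serving peer."""
    primaries = [p for p in peers if not p.backup and p.available(now)]
    backups = [p for p in peers if p.backup and p.available(now)]
    for peer in primaries + backups:
        if attempt(peer.name):
            return peer.name
        peer.record_failure(now)
    return None  # every attempt failed; the client would see an error


blue = Peer("app_blue")
green = Peer("app_green", backup=True)
up = {"app_blue": False, "app_green": True}  # Blue has crashed

print(handle_request([blue, green], up.get, now=0.0))  # app_green
print(blue.available(1.0))                             # False
up["app_blue"] = True                                  # Blue is restarted
print(handle_request([blue, green], up.get, now=6.0))  # app_blue
```

The first call shows the retry within a single request; the last call shows traffic returning to Blue once the 5-second window has passed.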


Architecture Overview

System Architecture Diagram

                Internet
                   |
                   v
            +--------------+
            |    Nginx     |
            |  (Port 8080) |
            +--------------+
                   |
    +--------------+--------------+
    |                             |
    v                             v
+----------+                +----------+
| App Blue | (Primary)      |App Green | (Backup)
|Port 3000 |                |Port 3000 |
+----------+                +----------+
    |                             |
    +-------------+---------------+
                  |
                  v
          +--------------+
          | Nginx Logs   |
          +--------------+
                  |
                  v
        +------------------+
        | Alert Watcher    |
        | (Python)         |
        +------------------+
                  |
                  v
          +--------------+
          |    Slack     |
          +--------------+

Component Breakdown

1. Nginx Proxy

Role: Reverse proxy and failover router

Responsibilities:

  • Route all incoming requests
  • Detect backend failures
  • Perform automatic failover
  • Log all requests with metadata

2. App Blue (Primary Instance)

Role: Primary application server

Environment Variables:

APP_POOL=blue
RELEASE_ID=blue-release-1.0.0
PORT=3000

3. App Green (Backup Instance)

Role: Backup application server (hot standby)

Environment Variables:

APP_POOL=green
RELEASE_ID=green-release-1.0.0
PORT=3000

4. Alert Watcher (Python)

Role: Real-time log monitoring and alerting

Responsibilities:

  • Tail Nginx access logs
  • Parse structured log entries
  • Detect failover events
  • Monitor error rates
  • Send Slack alerts

Technology Stack

Component        | Technology     | Purpose
Reverse Proxy    | Nginx (Alpine) | Traffic routing & failover
Application      | Python/Flask   | Demo web application
Monitoring       | Python 3.11    | Log watcher & alerting
Alerting         | Slack Webhooks | Real-time notifications
Containerization | Docker         | Package all services
Orchestration    | Docker Compose | Manage multi-container setup

Setting Up the Project

Step 1: Clone the Repository

git clone https://github.com/cypher682/hng13-stage-3-devops.git
cd hng13-stage-3-devops

Step 2: Understand the Project Structure

hng13-stage-3-devops/
├── docker-compose.yml       # Container orchestration
├── nginx.conf.template      # Nginx configuration template
├── entrypoint.sh           # Nginx startup script
├── watcher.py              # Python log monitoring script
├── requirements.txt        # Python dependencies
├── .env.example            # Environment variables template
├── test-failover.sh        # Failover testing script
└── public/                 # Static HTML files

Step 3: Configure Environment Variables

Copy the example environment file:

cp .env.example .env

Edit .env with your configuration:

# Application Configuration
PORT=3000
ACTIVE_POOL=blue

# Docker Images
BLUE_IMAGE=yimikaade/wonderful:latest
GREEN_IMAGE=yimikaade/wonderful:latest

# Release Identifiers
RELEASE_ID_BLUE=blue-release-1.0.0
RELEASE_ID_GREEN=green-release-1.0.0

# Slack Integration
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL

# Alert Configuration
# Error-rate threshold, in percent
ERROR_RATE_THRESHOLD=2
# Sliding window size, in requests
WINDOW_SIZE=200
# Minimum seconds between repeated alerts (5 minutes)
ALERT_COOLDOWN_SEC=300
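
On the Python side, these settings have to be read from the environment and converted to numbers. The sketch below shows one way to do that. `load_config` and its dictionary keys are hypothetical, and the project's watcher.py may read its settings differently.

```python
import os


def load_config(env=os.environ):
    """Read watcher settings from the environment, with the defaults shown above."""
    return {
        "active_pool": env.get("ACTIVE_POOL", "blue"),
        "slack_webhook_url": env.get("SLACK_WEBHOOK_URL", ""),
        "error_rate_threshold": float(env.get("ERROR_RATE_THRESHOLD", "2")),
        "window_size": int(env.get("WINDOW_SIZE", "200")),
        "alert_cooldown_sec": int(env.get("ALERT_COOLDOWN_SEC", "300")),
    }


config = load_config({"WINDOW_SIZE": "50"})
print(config["window_size"], config["error_rate_threshold"])  # 50 2.0
```

Converting with `int()` and `float()` at startup means a malformed value fails immediately instead of surfacing later inside the alerting logic.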

Step 4: Set Up Slack Webhook

  1. Go to Slack API
  2. Create a new Slack App
  3. Enable Incoming Webhooks
  4. Create a webhook for your channel
  5. Copy the webhook URL to .env
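
Once the webhook exists, posting to it is a single JSON request. The following is a minimal standard-library sketch. The helper names `build_slack_request` and `send_slack_message` are assumptions for this example, not functions taken from watcher.py.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_slack_message(webhook_url, text, timeout=5):
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Example (requires a real webhook URL):
# send_slack_message(os.environ["SLACK_WEBHOOK_URL"], "Test alert")
```

Keeping request construction separate from sending makes the payload easy to inspect without touching the network.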

Step 5: Build and Start Services

# Build all containers
docker compose build

# Start all services in detached mode
docker compose up -d

# Verify all services are running
docker compose ps

Expected Output:

NAME              STATUS    PORTS
nginx_proxy       Up        0.0.0.0:8080->80/tcp
app_blue          Up        3000/tcp
app_green         Up        3000/tcp
alert_watcher     Up

Step 6: Verify the Deployment

Open your browser and navigate to:

http://localhost:8080/version

You should see:

{
  "pool": "blue",
  "release": "blue-release-1.0.0",
  "status": "healthy"
}

Understanding the Configuration

Nginx Configuration Template

The nginx.conf.template uses environment variable substitution:

upstream app {
    server app_blue:${PORT} max_fails=1 fail_timeout=5s;
    server app_green:${PORT} backup max_fails=1 fail_timeout=5s;
}

server {
    listen 80;

    location / {
        proxy_pass http://app;
        proxy_next_upstream error timeout http_502 http_503 http_504;

        # Timeouts for fast failure detection
        proxy_connect_timeout 2s;
        proxy_send_timeout 3s;
        proxy_read_timeout 3s;
    }
}

Key Configuration Explained

1. Proxy Next Upstream

proxy_next_upstream error timeout http_502 http_503 http_504;

What it does:
Automatically retry the request to the next upstream (Green) if:

  • Connection error occurs
  • Request times out
  • Upstream returns 502, 503, or 504

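Result: For retryable requests, the user does not see these errors as long as Green is healthy. Note that http_500 is not in the list, so a plain 500 from Blue is returned to the user as-is.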
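
The retry decision can be summarised as a small predicate. This is a sketch of the documented behaviour rather than Nginx source code. `should_retry` and `classify` are hypothetical names, and the rule for non-idempotent methods reflects Nginx's default when the `non_idempotent` parameter is not set.

```python
# Conditions listed in: proxy_next_upstream error timeout http_502 http_503 http_504;
RETRY_CONDITIONS = {"error", "timeout", "http_502", "http_503", "http_504"}

# Methods Nginx treats as non-idempotent when deciding whether to retry.
NON_IDEMPOTENT = {"POST", "LOCK", "PATCH"}


def classify(outcome):
    """Map a status code or 'error'/'timeout' to a proxy_next_upstream condition."""
    return outcome if isinstance(outcome, str) else f"http_{outcome}"


def should_retry(outcome, method="GET", request_sent=True):
    if classify(outcome) not in RETRY_CONDITIONS:
        return False
    # By default, a non-idempotent request that was already sent is not retried.
    if method in NON_IDEMPOTENT and request_sent:
        return False
    return True


print(should_retry(502))                                       # True
print(should_retry(500))                                       # False
print(should_retry("error", method="POST", request_sent=False))  # True
print(should_retry(503, method="POST"))                        # False
```

The last case matters in practice: a POST that Blue received before failing is not replayed on Green, which avoids duplicating side effects.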

2. Aggressive Timeouts

proxy_connect_timeout 2s;
proxy_send_timeout 3s;
proxy_read_timeout 3s;

Why so short?

  • Detect failures fast
  • Failover happens in seconds, not minutes
  • Better user experience

3. Structured Logging

log_format detailed 
    'pool=$upstream_http_x_app_pool '
    'release=$upstream_http_x_release_id '
    'upstream_status=$upstream_status '
    'latency=$request_time ';

access_log /var/log/nginx/access.log detailed;

Log Example:

pool=blue release=blue-release-1.0.0 upstream_status=200 latency=0.045
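
A line in this format can be parsed with one regular expression. The sketch below is an assumption about how such parsing could look, and `parse_log_line` is a hypothetical helper. It also handles the case where Nginx logs several upstream statuses for one request, separated by commas.

```python
import re

# key=value pairs; a value runs until the next " key=" or the end of the line,
# so "upstream_status=502, 200" stays in one piece.
PAIR = re.compile(r"(\w+)=(.*?)(?=\s+\w+=|\s*$)")


def parse_log_line(line):
    """Turn one structured access-log line into a dictionary."""
    fields = dict(PAIR.findall(line))
    statuses = [
        int(s) for s in re.findall(r"\b\d{3}\b", fields.get("upstream_status", ""))
    ]
    try:
        latency = float(fields.get("latency", ""))
    except ValueError:
        latency = None
    return {
        "pool": fields.get("pool"),
        "release": fields.get("release"),
        "upstream_statuses": statuses,
        "latency": latency,
    }


entry = parse_log_line(
    "pool=green release=green-release-1.0.0 upstream_status=502, 200 latency=0.045"
)
print(entry["pool"], entry["upstream_statuses"], entry["latency"])
# green [502, 200] 0.045
```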

How Failover Works

Normal Operation (Blue Active)

User Request → Nginx → App Blue → Success (200 OK)
                       ↓
                   Nginx Logs: pool=blue

Failure Scenario

Let's trace what happens when Blue crashes:

Step 1: User Makes Request

User → GET http://localhost:8080/

Step 2: Nginx Tries Blue

Nginx → app_blue:3000
        ↓
    Connection Refused (Blue is down)

Step 3: Nginx Detects Failure

Nginx marks app_blue as FAILED
(max_fails=1 threshold reached)

Step 4: Nginx Retries to Green

Nginx → app_green:3000
        ↓
    Success! (200 OK)

Step 5: User Receives Response

User ← 200 OK from Green
(User never knew Blue failed!)

Step 6: Alert Watcher Detects Failover

# watcher.py detects pool change
if pool != self.last_alerted_pool:
    send_slack_alert("Failover detected: blue → green")

Step 7: Slack Alert Sent

Failover detected: blue → green (to backup)
Release: green-release-1.0.0
Upstream: 172.18.0.4:3000
Request time: 0.05s

Real-Time Alerting System

Alert Watcher Architecture

The watcher.py script monitors Nginx logs in real-time:

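import os
from collections import deque

# Settings come from the environment (see .env)
ACTIVE_POOL = os.environ.get("ACTIVE_POOL", "blue")
WINDOW_SIZE = int(os.environ.get("WINDOW_SIZE", "200"))

class AlertWatcher:
    def __init__(self):
        # Track current state
        self.current_pool = ACTIVE_POOL
        self.last_alerted_pool = ACTIVE_POOL

        # Rolling window for error rate
        self.request_window = deque(maxlen=WINDOW_SIZE)

        # Cooldown timers
        self.last_failover_alert_time = 0
        self.last_error_alert_time = 0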

How It Works

1. Log Tailing

def tail_log_file(self):
    with open(LOG_FILE, 'r') as log:
        log.seek(0, 2)  # Go to end of file
        while True:
            line = log.readline()
            if not line:
                time.sleep(0.1)
                continue
            self.process_log_line(line)

Explanation:

  • Opens log file in read mode
  • Seeks to end (like tail -f)
  • Continuously reads new lines
  • Processes each line in real-time
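
The same tailing loop can be written as a generator, which separates reading from processing and is easier to test. This is a sketch under stated assumptions: `follow`, `max_idle_polls`, and `from_end` are hypothetical additions so the loop can terminate in a demo, and log rotation is not handled.

```python
import io
import time


def follow(log, poll_interval=0.1, max_idle_polls=None, from_end=True):
    """Yield lines appended to an open file, similar to `tail -f`.

    max_idle_polls=None follows forever; a number stops after that many
    consecutive empty reads.
    """
    if from_end:
        log.seek(0, 2)  # skip what is already in the file
    idle_polls = 0
    while max_idle_polls is None or idle_polls < max_idle_polls:
        line = log.readline()
        if not line:
            idle_polls += 1
            time.sleep(poll_interval)
            continue
        idle_polls = 0
        yield line.rstrip("\n")


# Demo on an in-memory file, reading from the start instead of the end.
demo = io.StringIO("pool=blue upstream_status=200\npool=green upstream_status=502, 200\n")
for entry in follow(demo, poll_interval=0.01, max_idle_polls=2, from_end=False):
    print(entry)
```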

2. Failover Detection

def check_failover(self, pool):
    if pool == self.last_alerted_pool:
        return  # No change since the last alert

    # Determine alert type
    if pool == ACTIVE_POOL:
        title = f"Recovery detected: back to {pool}"
    else:
        title = f"Failover detected: {self.last_alerted_pool} → {pool}"

    # Send alert, then remember the pool so the same change is not re-alerted
    self.send_slack_alert(title)
    self.last_alerted_pool = pool
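
For experimenting outside the watcher, the detection logic can be isolated into a small class that returns the alert text instead of sending it. This is a hedged sketch: `FailoverDetector` and `observe` are hypothetical names, and the cooldown handling is one possible use of ALERT_COOLDOWN_SEC rather than the project's confirmed behaviour.

```python
import time


class FailoverDetector:
    """Report pool changes, at most one alert per cooldown period."""

    def __init__(self, primary_pool="blue", cooldown_sec=300):
        self.primary_pool = primary_pool
        self.cooldown_sec = cooldown_sec
        self.last_alerted_pool = primary_pool
        self.last_alert_time = None

    def observe(self, pool, now=None):
        """Return an alert title when the serving pool changes, else None."""
        now = time.time() if now is None else now
        if pool == self.last_alerted_pool:
            return None
        in_cooldown = (
            self.last_alert_time is not None
            and now - self.last_alert_time < self.cooldown_sec
        )
        if in_cooldown:
            return None
        if pool == self.primary_pool:
            title = f"Recovery detected: back to {pool}"
        else:
            title = f"Failover detected: {self.last_alerted_pool} → {pool}"
        self.last_alerted_pool = pool
        self.last_alert_time = now
        return title


detector = FailoverDetector(cooldown_sec=10)
print(detector.observe("green", now=1))  # Failover detected: blue → green
print(detector.observe("green", now=2))  # None
print(detector.observe("blue", now=5))   # None (still in cooldown)
print(detector.observe("blue", now=12))  # Recovery detected: back to blue
```

With this design a change that happens during the cooldown is reported late rather than dropped, because the pool is only recorded as alerted once an alert is actually produced.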

3. Error Rate Monitoring

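def check_error_rate(self, all_statuses):
    # Check if any upstream attempt returned a 5xx status
    is_5xx = any(500 <= s < 600 for s in all_statuses)

    # Add to sliding window
    self.request_window.append(is_5xx)

    # Wait until the window is full so one early error cannot trigger an alert
    total_requests = len(self.request_window)
    if total_requests < WINDOW_SIZE:
        return

    # Calculate error rate
    error_count = sum(self.request_window)
    error_rate = (error_count / total_requests) * 100

    # Check threshold, respecting the alert cooldown
    now = time.time()
    if (error_rate > ERROR_RATE_THRESHOLD
            and now - self.last_error_alert_time >= ALERT_COOLDOWN_SEC):
        self.last_error_alert_time = now
        self.send_slack_alert(
            f"High upstream error rate: {error_rate:.2f}%"
        )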
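
The sliding-window calculation is easier to follow with small numbers. The class below is a standalone sketch, not the project's code: `ErrorRateMonitor` and `record` are hypothetical names, and it returns the rate instead of sending an alert.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the share of requests with a 5xx upstream attempt."""

    def __init__(self, window_size=200, threshold_pct=2.0):
        self.window = deque(maxlen=window_size)  # oldest entries fall off
        self.threshold_pct = threshold_pct

    def record(self, upstream_statuses):
        """Record one request; return the rate (%) if it exceeds the threshold."""
        self.window.append(any(500 <= s < 600 for s in upstream_statuses))
        if len(self.window) < self.window.maxlen:
            return None  # not enough data yet
        rate = 100.0 * sum(self.window) / len(self.window)
        return rate if rate > self.threshold_pct else None


monitor = ErrorRateMonitor(window_size=10, threshold_pct=20)
for _ in range(10):
    monitor.record([200])            # fill the window with successes

print(monitor.record([502, 200]))    # None (1 of 10 = 10%)
print(monitor.record([502, 200]))    # None (2 of 10 = 20%, not above threshold)
print(monitor.record([502, 200]))    # 30.0
```

A request that failed on Blue and succeeded on Green still counts as an error here, because the window tracks upstream attempts rather than what the user saw.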

Alert Types

1. Failover Alert

Failover detected: blue → green (to backup)
Release: green-release-1.0.0
Upstream: 172.18.0.4:3000
Request time: 0.05s

When: Primary instance fails, traffic switches to backup

2. Recovery Alert

Recovery detected: back to blue (primary)
Release: blue-release-1.0.0
Upstream: 172.18.0.3:3000
Request time: 0.03s

When: Primary instance recovers, traffic returns to primary

3. High Error Rate Alert

High upstream error rate
5xx in upstream attempts: 5.50% over last 200 requests (threshold 2%).
Current pool: green

When: Error rate exceeds threshold in sliding window


Testing the Deployment

Test 1: Verify Normal Operation

# Check which pool is active
curl http://localhost:8080/version

# Expected output:
{
  "pool": "blue",
  "release": "blue-release-1.0.0"
}

Test 2: Trigger Failover

Manual Failover Test:

# Stop Blue instance
docker compose stop app_blue

# Make requests
for i in {1..10}; do
  curl http://localhost:8080/version
  sleep 1
done

# You should see responses from Green
# Check Slack for failover alert

Test 3: Verify Zero Downtime

# In one terminal, continuously make requests
while true; do
  curl -s http://localhost:8080/version | jq -r '.pool'
  sleep 0.5
done

# In another terminal, stop Blue
docker compose stop app_blue

# Observe: No failed requests!
# Output switches from "blue" to "green" seamlessly
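
Watching the terminal works, but counting failures is more convincing. The sketch below does the same polling in Python and summarises the result. `fetch_pool` and `summarize` are hypothetical helpers, and the URL assumes the default port mapping used in this guide.

```python
import http.client
import json
import urllib.request


def fetch_pool(url="http://localhost:8080/version", timeout=2):
    """Return the serving pool name, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read()).get("pool")
    except (OSError, ValueError, http.client.HTTPException):
        return None


def summarize(pools):
    """Count failed requests and pool switches in a list of poll results."""
    failures = sum(1 for p in pools if p is None)
    seen = [p for p in pools if p is not None]
    switches = sum(1 for a, b in zip(seen, seen[1:]) if a != b)
    return {"requests": len(pools), "failures": failures, "switches": switches}


# Against a running stack, collect results while stopping Blue:
# results = []
# for _ in range(40):
#     results.append(fetch_pool())
#     time.sleep(0.5)
print(summarize(["blue", "blue", "green", "green"]))
# {'requests': 4, 'failures': 0, 'switches': 1}
```

A clean failover shows up as zero failures and exactly one switch.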

Test 4: Test Recovery

# Restart Blue instance
docker compose start app_blue

# Wait for Blue to start and for Nginx's fail_timeout (5s) to expire (10-15 seconds in total)
sleep 15

# Make requests
curl http://localhost:8080/version

# Should show Blue is active again
# Check Slack for recovery alert

Test 5: View Nginx Logs

# View real-time logs
docker compose exec nginx_proxy tail -f /var/log/nginx/access.log

# Example output:
pool=blue release=blue-release-1.0.0 upstream_status=200 latency=0.023
pool=green release=green-release-1.0.0 upstream_status=502, 200 latency=0.045

Common Issues and Troubleshooting

Issue 1: Failover Not Happening

Symptoms: Blue fails but traffic doesn't switch to Green

Solutions:

# Check Nginx config
docker compose exec nginx_proxy cat /etc/nginx/nginx.conf.processed

# Restart Nginx
docker compose restart nginx_proxy

# Verify Green is healthy
docker compose exec app_green wget -qO- http://localhost:3000/healthz

Issue 2: Alerts Not Sending to Slack

Symptoms: Failover happens but no Slack notification

Solutions:

  1. Verify Webhook URL:
# Check .env file
cat .env | grep SLACK_WEBHOOK_URL

# Test webhook manually
curl -X POST YOUR_WEBHOOK_URL \
  -H 'Content-Type: application/json' \
  -d '{"text":"Test alert"}'
  2. Restart Watcher:
docker compose restart alert_watcher

Issue 3: Both Instances Receiving Traffic

Symptoms: Requests alternate between Blue and Green

Solution:

# Ensure Green has "backup" directive
# Check nginx.conf.template
# Rebuild and restart
docker compose down
docker compose up -d --build

Debugging Commands Cheat Sheet

# View all container status
docker compose ps

# View all logs
docker compose logs

# View specific service logs
docker compose logs -f nginx_proxy

# Execute command in container
docker compose exec nginx_proxy sh

# Restart specific service
docker compose restart app_blue

# Clean up everything
docker compose down -v

Key Takeaways

What You've Learned

  1. Blue-Green Deployment

    • How to implement zero-downtime deployments
    • Difference between blue-green and load balancing
    • When to use blue-green vs. other strategies
  2. Nginx as a Reverse Proxy

    • Upstream configuration
    • Failover mechanisms
    • Health checks and timeouts
    • Structured logging
  3. Real-Time Monitoring

    • Log tailing and parsing
    • Event detection
    • Sliding window calculations
    • Alert deduplication
  4. Slack Integration

    • Webhook setup
    • Alert formatting
    • Error handling
  5. Docker Orchestration

    • Multi-container applications
    • Service dependencies
    • Volume management
    • Health checks

Real-World Applications

Variations of this pattern are commonly associated with:

  • Netflix - Canary deployments with instant rollback
  • Amazon - Blue-green for critical services
  • Heroku - Platform-level blue-green deployments
  • GitHub - Zero-downtime deployments

Next Steps

  1. Enhance the System

    • Add database with replication
    • Implement canary deployments
    • Add A/B testing capabilities
  2. Improve Monitoring

    • Add Prometheus metrics
    • Create Grafana dashboards
    • Implement distributed tracing
  3. Scale the Architecture

    • Deploy to Kubernetes
    • Use managed load balancers
    • Implement auto-scaling

Additional Resources

Tools

  • Nginx - High-performance web server
  • Docker - Containerization platform
  • Slack - Team communication

Conclusion

Congratulations! You've just learned how to implement a production-ready blue-green deployment system with automatic failover and real-time alerting. This is a critical skill for modern DevOps engineers.

Remember:

  • Test thoroughly before deploying to production
  • Monitor continuously - you can't fix what you can't see
  • Automate everything - manual processes lead to errors
  • Document your decisions - future you will thank you

Blue-green deployment is just one piece of the DevOps puzzle. The principles you've learned here - automation, monitoring, resilience, and rapid recovery - apply to all aspects of modern infrastructure.

Keep experimenting, keep learning, and most importantly, keep building!

Full Project Repository: HNG DevOps on GitHub

Happy deploying!
