Introduction
Welcome to this comprehensive guide on Blue-Green Deployment - a strategy used by companies like Netflix, Amazon, and Facebook to release new versions with zero downtime. This project demonstrates how to implement a production-ready blue-green deployment system with automatic failover and real-time Slack alerting.
Repository: HNG DevOps on GitHub
What You'll Learn:
- What blue-green deployment is and why it matters
- How to implement automatic failover with Nginx
- How to build a real-time monitoring and alerting system
- How to achieve zero-downtime deployments
- How to integrate Slack notifications for DevOps alerts
Prerequisites:
- Basic understanding of Docker
- Familiarity with command line
- Basic knowledge of web servers (helpful but not required)
What is Blue-Green Deployment?
The Problem: Traditional Deployments
Imagine you're running a website. When you deploy a new version:
1. You stop the old version
2. Deploy the new version
3. Start the new version
Problem: During steps 1-3, your website is DOWN. Users see errors. You lose money.
The Solution: Blue-Green Deployment
Instead of having one environment, you have TWO identical environments:
- BLUE (Production) - Currently serving users
- GREEN (Staging) - New version waiting to go live
When you're ready to deploy:
- Deploy new version to GREEN
- Test GREEN thoroughly
- Switch traffic from BLUE to GREEN instantly
- If something goes wrong, switch back to BLUE instantly
Result: ZERO DOWNTIME
Real-World Analogy
Think of it like having two stages at a concert:
- Stage 1 (Blue): Band is performing, audience is watching
- Stage 2 (Green): Next band is setting up and sound-checking
When it's time to switch:
- Rotate the stage 180°
- Audience now sees Stage 2 (Green)
- Stage 1 (Blue) becomes the setup area for the next act
If the new band has technical issues, rotate back to Stage 1 instantly!
Project Overview
This project implements a production-ready blue-green deployment system with:
Core Features
Automatic Failover
- Nginx detects when the Blue instance fails
- Automatically routes all traffic to Green
- Zero failed requests reach users

Real-Time Alerting
- Python watcher monitors Nginx logs
- Detects failover events instantly
- Sends alerts to Slack
- Monitors error rates

Zero-Downtime Deployment
- Deploy the new version to the inactive instance
- Switch traffic instantly
- Roll back in seconds if needed

Structured Logging
- Every request logged with metadata
- Pool information (blue/green)
- Release version
- Response times
Understanding the Core Concepts
1. Blue-Green vs. Load Balancing
Load Balancing:
Request 1 → Blue
Request 2 → Green
Request 3 → Blue
Request 4 → Green
Traffic is distributed between instances.
Blue-Green (This Project):
All Requests → Blue (Primary)
Green (Backup, standby)
If Blue fails:
All Requests → Green (Backup becomes active)
Traffic goes to ONE instance at a time. The other is a hot standby.
2. Nginx Upstream Configuration
Nginx can route traffic to multiple backend servers (upstreams). This project uses:
upstream app {
server app_blue:3000 max_fails=1 fail_timeout=5s;
server app_green:3000 backup max_fails=1 fail_timeout=5s;
}
Key Directives:
- server app_blue:3000 - Primary server
- backup - Only use this server if the primary fails
- max_fails=1 - Mark the server as failed after 1 error
- fail_timeout=5s - Try the failed server again after 5 seconds
3. Failover Mechanism
When a request fails:
- Nginx tries Blue instance
- Blue returns 5xx error or times out
- Nginx marks Blue as failed
- Nginx retries request to Green (backup)
- User receives successful response from Green
- All subsequent requests go to Green
Result: User never sees an error!
Architecture Overview
System Architecture Diagram
Internet
|
v
+--------------+
| Nginx |
| (Port 8080) |
+--------------+
|
+--------------+--------------+
| |
v v
+----------+ +----------+
| App Blue | (Primary) |App Green | (Backup)
|Port 3000 | |Port 3000 |
+----------+ +----------+
| |
+-------------+---------------+
|
v
+--------------+
| Nginx Logs |
+--------------+
|
v
+------------------+
| Alert Watcher |
| (Python) |
+------------------+
|
v
+--------------+
| Slack |
+--------------+
Component Breakdown
1. Nginx Proxy
Role: Traffic router and failover controller
Responsibilities:
- Route all incoming requests
- Detect backend failures
- Perform automatic failover
- Log all requests with metadata
2. App Blue (Primary Instance)
Role: Primary application server
Environment Variables:
APP_POOL=blue
RELEASE_ID=blue-release-1.0.0
PORT=3000
3. App Green (Backup Instance)
Role: Backup application server (hot standby)
Environment Variables:
APP_POOL=green
RELEASE_ID=green-release-1.0.0
PORT=3000
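Both instances run the same demo application; only the environment variables differ. The actual application code ships inside the Docker image, but a minimal sketch of what each instance needs to expose might look like the following (assuming Flask, per the technology stack below; the X-App-Pool and X-Release-Id response headers are what the Nginx log format reads later via $upstream_http_x_app_pool and $upstream_http_x_release_id):

import os
from flask import Flask, jsonify

app = Flask(__name__)
POOL = os.environ.get("APP_POOL", "blue")
RELEASE = os.environ.get("RELEASE_ID", "unknown")

@app.after_request
def add_identity_headers(response):
    # Expose pool/release so Nginx can log which instance served the request
    response.headers["X-App-Pool"] = POOL
    response.headers["X-Release-Id"] = RELEASE
    return response

@app.route("/version")
def version():
    return jsonify({"pool": POOL, "release": RELEASE, "status": "healthy"})

@app.route("/healthz")
def healthz():
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 3000)))

This is an illustrative sketch, not the repository's exact code, but it shows the contract the rest of the system relies on: a /version endpoint, a health check, and identity headers on every response.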
4. Alert Watcher (Python)
Role: Real-time log monitoring and alerting
Responsibilities:
- Tail Nginx access logs
- Parse structured log entries
- Detect failover events
- Monitor error rates
- Send Slack alerts
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| Reverse Proxy | Nginx (Alpine) | Traffic routing & failover |
| Application | Python/Flask | Demo web application |
| Monitoring | Python 3.11 | Log watcher & alerting |
| Alerting | Slack Webhooks | Real-time notifications |
| Containerization | Docker | Package all services |
| Orchestration | Docker Compose | Manage multi-container setup |
Setting Up the Project
Step 1: Clone the Repository
git clone https://github.com/cypher682/hng13-stage-3-devops.git
cd hng13-stage-3-devops
Step 2: Understand the Project Structure
hng13-stage-3-devops/
├── docker-compose.yml # Container orchestration
├── nginx.conf.template # Nginx configuration template
├── entrypoint.sh # Nginx startup script
├── watcher.py # Python log monitoring script
├── requirements.txt # Python dependencies
├── .env.example # Environment variables template
├── test-failover.sh # Failover testing script
└── public/ # Static HTML files
Step 3: Configure Environment Variables
Copy the example environment file:
cp .env.example .env
Edit .env with your configuration:
# Application Configuration
PORT=3000
ACTIVE_POOL=blue
# Docker Images
BLUE_IMAGE=yimikaade/wonderful:latest
GREEN_IMAGE=yimikaade/wonderful:latest
# Release Identifiers
RELEASE_ID_BLUE=blue-release-1.0.0
RELEASE_ID_GREEN=green-release-1.0.0
# Slack Integration
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
# Alert Configuration
ERROR_RATE_THRESHOLD=2 # Percentage
WINDOW_SIZE=200 # Number of requests
ALERT_COOLDOWN_SEC=300 # 5 minutes
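These settings are read by the alert watcher at startup. A hedged sketch of how that might look (the variable names mirror the .env keys used later in the watcher excerpts; the real watcher.py may read them differently):

import os

ACTIVE_POOL = os.environ.get("ACTIVE_POOL", "blue")
SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "")
ERROR_RATE_THRESHOLD = float(os.environ.get("ERROR_RATE_THRESHOLD", "2"))  # percent
WINDOW_SIZE = int(os.environ.get("WINDOW_SIZE", "200"))                    # requests
ALERT_COOLDOWN_SEC = int(os.environ.get("ALERT_COOLDOWN_SEC", "300"))      # seconds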
Step 4: Set Up Slack Webhook
- Go to the Slack API site (api.slack.com)
- Create a new Slack app
- Enable Incoming Webhooks
- Create a webhook for your channel
- Copy the webhook URL into .env
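Before wiring the webhook into the watcher, you can sanity-check it with a few lines of Python (the payload format follows Slack's incoming-webhook convention; replace the URL with your own):

import json
import urllib.request

webhook_url = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"  # replace with your webhook
payload = {"text": "Test alert from the blue-green deployment watcher"}

req = urllib.request.Request(
    webhook_url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # Slack responds with the plain text "ok" when the message is accepted
    print(resp.status, resp.read().decode())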
Step 5: Build and Start Services
# Build all containers
docker compose build
# Start all services in detached mode
docker compose up -d
# Verify all services are running
docker compose ps
Expected Output:
NAME STATUS PORTS
nginx_proxy Up 0.0.0.0:8080->80/tcp
app_blue Up 3000/tcp
app_green Up 3000/tcp
alert_watcher Up
Step 6: Verify the Deployment
Open your browser and navigate to:
http://localhost:8080/version
You should see:
{
"pool": "blue",
"release": "blue-release-1.0.0",
"status": "healthy"
}
Understanding the Configuration
Nginx Configuration Template
The nginx.conf.template uses environment variable substitution:
upstream app {
server app_blue:${PORT} max_fails=1 fail_timeout=5s;
server app_green:${PORT} backup max_fails=1 fail_timeout=5s;
}
server {
listen 80;
location / {
proxy_pass http://app;
proxy_next_upstream error timeout http_502 http_503 http_504;
# Timeouts for fast failure detection
proxy_connect_timeout 2s;
proxy_send_timeout 3s;
proxy_read_timeout 3s;
}
}
Key Configuration Explained
1. Proxy Next Upstream
proxy_next_upstream error timeout http_502 http_503 http_504;
What it does:
Automatically retry the request to the next upstream (Green) if:
- Connection error occurs
- Request times out
- Upstream returns 502, 503, or 504
Result: User never sees these errors!
2. Aggressive Timeouts
proxy_connect_timeout 2s;
proxy_send_timeout 3s;
proxy_read_timeout 3s;
Why so short?
- Detect failures fast
- Failover happens in seconds, not minutes
- Better user experience
3. Structured Logging
log_format detailed
'pool=$upstream_http_x_app_pool '
'release=$upstream_http_x_release_id '
'upstream_status=$upstream_status '
'latency=$request_time ';
access_log /var/log/nginx/access.log detailed;
Log Example:
pool=blue release=blue-release-1.0.0 upstream_status=200 latency=0.045
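This key=value layout is deliberately easy to parse. Here is a simplified sketch of how a monitoring script could pull the fields out of a line like the one above (the real watcher.py may use a different parsing approach):

import re

LOG_PATTERN = re.compile(
    r"pool=(?P<pool>\S+) release=(?P<release>\S+) "
    r"upstream_status=(?P<status>[\d, ]+) latency=(?P<latency>\S+)"
)

line = "pool=blue release=blue-release-1.0.0 upstream_status=200 latency=0.045"
match = LOG_PATTERN.search(line)
if match:
    pool = match.group("pool")
    # upstream_status can hold several attempts, e.g. "502, 200" after a retry
    statuses = [int(s) for s in match.group("status").replace(",", " ").split()]
    latency = float(match.group("latency"))
    print(pool, statuses, latency)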
How Failover Works
Normal Operation (Blue Active)
User Request → Nginx → App Blue → Success (200 OK)
↓
Nginx Logs: pool=blue
Failure Scenario
Let's trace what happens when Blue crashes:
Step 1: User Makes Request
User → GET http://localhost:8080/
Step 2: Nginx Tries Blue
Nginx → app_blue:3000
↓
Connection Refused (Blue is down)
Step 3: Nginx Detects Failure
Nginx marks app_blue as FAILED
(max_fails=1 threshold reached)
Step 4: Nginx Retries to Green
Nginx → app_green:3000
↓
Success! (200 OK)
Step 5: User Receives Response
User ← 200 OK from Green
(User never knew Blue failed!)
Step 6: Alert Watcher Detects Failover
# watcher.py detects pool change
if pool != self.last_alerted_pool:
    send_slack_alert("Failover detected: blue → green")
Step 7: Slack Alert Sent
Failover detected: blue → green (to backup)
Release: green-release-1.0.0
Upstream: 172.18.0.4:3000
Request time: 0.05s
Real-Time Alerting System
Alert Watcher Architecture
The watcher.py script monitors Nginx logs in real-time:
class AlertWatcher:
    def __init__(self):
        # Track current state
        self.current_pool = ACTIVE_POOL
        self.last_alerted_pool = ACTIVE_POOL
        # Rolling window for error rate
        self.request_window = deque(maxlen=WINDOW_SIZE)
        # Cooldown timers
        self.last_failover_alert_time = 0
        self.last_error_alert_time = 0
How It Works
1. Log Tailing
def tail_log_file(self):
    with open(LOG_FILE, 'r') as log:
        log.seek(0, 2)  # Go to end of file
        while True:
            line = log.readline()
            if not line:
                time.sleep(0.1)
                continue
            self.process_log_line(line)
Explanation:
- Opens the log file in read mode
- Seeks to the end (like tail -f)
- Continuously reads new lines
- Processes each line in real time
2. Failover Detection
def check_failover(self, pool):
    if pool == self.last_alerted_pool:
        return  # No change
    # Determine alert type
    if pool == ACTIVE_POOL:
        title = f"Recovery detected: back to {pool}"
    else:
        title = f"Failover detected: {self.last_alerted_pool} → {pool}"
    # Send alert
    self.send_slack_alert(title)
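The excerpt above only decides what the alert should say. In the full watcher, the cooldown timers initialized in __init__ presumably gate how often alerts fire, and the new pool has to be recorded so the same transition isn't reported on every request. A hedged sketch of that bookkeeping (names follow the excerpt; the actual implementation may differ):

import time

def maybe_send_failover_alert(self, title, pool):
    now = time.time()
    # Suppress repeat alerts within the cooldown window
    if now - self.last_failover_alert_time < ALERT_COOLDOWN_SEC:
        return
    self.send_slack_alert(title)
    self.last_failover_alert_time = now
    # Remember the pool so the same switch isn't re-reported on every request
    self.last_alerted_pool = pool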
3. Error Rate Monitoring
def check_error_rate(self, all_statuses):
    # Check if any status is 5xx
    is_5xx = any(500 <= s < 600 for s in all_statuses)
    # Add to sliding window
    self.request_window.append(is_5xx)
    # Calculate error rate
    error_count = sum(self.request_window)
    total_requests = len(self.request_window)
    error_rate = (error_count / total_requests) * 100
    # Check threshold
    if error_rate > ERROR_RATE_THRESHOLD:
        self.send_slack_alert(
            f"High upstream error rate: {error_rate:.2f}%"
        )
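With the default settings from .env, the arithmetic is simple: the window holds the last 200 requests, so 5 upstream 5xx responses in that window is a 2.5% error rate, which crosses the 2% threshold. A tiny worked example:

from collections import deque

WINDOW_SIZE = 200
ERROR_RATE_THRESHOLD = 2  # percent

# 5 errors among the last 200 requests
window = deque([False] * 195 + [True] * 5, maxlen=WINDOW_SIZE)
error_rate = sum(window) / len(window) * 100
print(error_rate)                          # 2.5
print(error_rate > ERROR_RATE_THRESHOLD)   # True -> alert fires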
Alert Types
1. Failover Alert
Failover detected: blue → green (to backup)
Release: green-release-1.0.0
Upstream: 172.18.0.4:3000
Request time: 0.05s
When: Primary instance fails, traffic switches to backup
2. Recovery Alert
Recovery detected: back to blue (primary)
Release: blue-release-1.0.0
Upstream: 172.18.0.3:3000
Request time: 0.03s
When: Primary instance recovers, traffic returns to primary
3. High Error Rate Alert
High upstream error rate
5xx in upstream attempts: 5.50% over last 200 requests (threshold 2%).
Current pool: green
When: Error rate exceeds threshold in sliding window
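All three alert types go out through the same webhook call. A minimal sketch of what send_slack_alert might look like (the actual message formatting in watcher.py may differ; SLACK_WEBHOOK_URL comes from .env):

import json
import urllib.request

def send_slack_alert(self, title, details=""):
    # Build a simple text payload; Slack renders it as a plain message
    payload = {"text": f"{title}\n{details}".strip()}
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(req, timeout=5)
    except Exception as exc:
        # Never let an alerting failure crash the watcher
        print(f"Slack alert failed: {exc}")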
Testing the Deployment
Test 1: Verify Normal Operation
# Check which pool is active
curl http://localhost:8080/version
# Expected output:
{
"pool": "blue",
"release": "blue-release-1.0.0"
}
Test 2: Trigger Failover
Manual Failover Test:
# Stop Blue instance
docker compose stop app_blue
# Make requests
for i in {1..10}; do
curl http://localhost:8080/version
sleep 1
done
# You should see responses from Green
# Check Slack for failover alert
Test 3: Verify Zero Downtime
# In one terminal, continuously make requests
while true; do
curl -s http://localhost:8080/version | jq -r '.pool'
sleep 0.5
done
# In another terminal, stop Blue
docker compose stop app_blue
# Observe: No failed requests!
# Output switches from "blue" to "green" seamlessly
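If you prefer a quantified result over eyeballing the loop above, a small Python probe can count failed requests while you stop the Blue container (an illustrative helper, not part of the repository):

import time
import urllib.request

URL = "http://localhost:8080/version"
failures = 0
pools_seen = []

for _ in range(60):  # probe for roughly 30 seconds
    try:
        with urllib.request.urlopen(URL, timeout=2) as resp:
            body = resp.read().decode()
            # Crude extraction of the pool name from the JSON body
            pools_seen.append("green" if '"green"' in body else "blue")
    except Exception:
        failures += 1
    time.sleep(0.5)

print(f"failed requests: {failures}")
print(f"pools observed: {sorted(set(pools_seen))}")

Run it in one terminal, stop app_blue in another, and you should see both pools observed with zero failed requests.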
Test 4: Test Recovery
# Restart Blue instance
docker compose start app_blue
# Wait for health check to pass (10-15 seconds)
sleep 15
# Make requests
curl http://localhost:8080/version
# Should show Blue is active again
# Check Slack for recovery alert
Test 5: View Nginx Logs
# View real-time logs
docker compose exec nginx_proxy tail -f /var/log/nginx/access.log
# Example output:
pool=blue release=blue-release-1.0.0 upstream_status=200 latency=0.023
pool=green release=green-release-1.0.0 upstream_status=502, 200 latency=0.045
Note the comma-separated upstream_status in the second line: it records both attempts for a single request - Blue failed with a 502 and the retry on Green returned 200.
Common Issues and Troubleshooting
Issue 1: Failover Not Happening
Symptoms: Blue fails but traffic doesn't switch to Green
Solutions:
# Check Nginx config
docker compose exec nginx_proxy cat /etc/nginx/nginx.conf.processed
# Restart Nginx
docker compose restart nginx_proxy
# Verify Green is healthy
docker compose exec app_green wget -qO- http://localhost:3000/healthz
Issue 2: Alerts Not Sending to Slack
Symptoms: Failover happens but no Slack notification
Solutions:
- Verify Webhook URL:
# Check .env file
cat .env | grep SLACK_WEBHOOK_URL
# Test webhook manually
curl -X POST YOUR_WEBHOOK_URL \
-H 'Content-Type: application/json' \
-d '{"text":"Test alert"}'
- Restart Watcher:
docker compose restart alert_watcher
Issue 3: Both Instances Receiving Traffic
Symptoms: Requests alternate between Blue and Green
Solution:
# Ensure Green has "backup" directive
# Check nginx.conf.template
# Rebuild and restart
docker compose down
docker compose up -d --build
Debugging Commands Cheat Sheet
# View all container status
docker compose ps
# View all logs
docker compose logs
# View specific service logs
docker compose logs -f nginx_proxy
# Execute command in container
docker compose exec nginx_proxy sh
# Restart specific service
docker compose restart app_blue
# Clean up everything
docker compose down -v
Key Takeaways
What You've Learned
Blue-Green Deployment
- How to implement zero-downtime deployments
- The difference between blue-green and load balancing
- When to use blue-green vs. other strategies

Nginx as a Reverse Proxy
- Upstream configuration
- Failover mechanisms
- Health checks and timeouts
- Structured logging

Real-Time Monitoring
- Log tailing and parsing
- Event detection
- Sliding window calculations
- Alert deduplication

Slack Integration
- Webhook setup
- Alert formatting
- Error handling

Docker Orchestration
- Multi-container applications
- Service dependencies
- Volume management
- Health checks
Real-World Applications
This pattern is used by:
- Netflix - Canary deployments with instant rollback
- Amazon - Blue-green for critical services
- Heroku - Platform-level blue-green deployments
- GitHub - Zero-downtime deployments
Next Steps
Enhance the System
- Add a database with replication
- Implement canary deployments
- Add A/B testing capabilities

Improve Monitoring
- Add Prometheus metrics
- Create Grafana dashboards
- Implement distributed tracing

Scale the Architecture
- Deploy to Kubernetes
- Use managed load balancers
- Implement auto-scaling
Additional Resources
Documentation
- Nginx Upstream Documentation
- Docker Compose Documentation
- Slack Webhooks Guide
- Blue-Green Deployment Pattern
Conclusion
Congratulations! You've just learned how to implement a production-ready blue-green deployment system with automatic failover and real-time alerting. This is a critical skill for modern DevOps engineers.
Remember:
- Test thoroughly before deploying to production
- Monitor continuously - you can't fix what you can't see
- Automate everything - manual processes lead to errors
- Document your decisions - future you will thank you
Blue-green deployment is just one piece of the DevOps puzzle. The principles you've learned here - automation, monitoring, resilience, and rapid recovery - apply to all aspects of modern infrastructure.
Keep experimenting, keep learning, and most importantly, keep building!
Full Project Repository: HNG DevOps on GitHub
Happy deploying!