Introduction
Welcome to this comprehensive guide on Blue-Green Deployment - a strategy used by companies like Netflix, Amazon, and Facebook to release new versions with zero downtime. This project demonstrates how to implement a production-ready blue-green deployment system with automatic failover and real-time Slack alerting.
Repository: HNG DevOps on GitHub
What You'll Learn:
- What blue-green deployment is and why it matters
- How to implement automatic failover with Nginx
- How to build a real-time monitoring and alerting system
- How to achieve zero-downtime deployments
- How to integrate Slack notifications for DevOps alerts
Prerequisites:
- Basic understanding of Docker
- Familiarity with command line
- Basic knowledge of web servers (helpful but not required)
What is Blue-Green Deployment?
The Problem: Traditional Deployments
Imagine you're running a website. When you deploy a new version:
1. You stop the old version
2. Deploy the new version
3. Start the new version
Problem: During steps 1-3, your website is DOWN. Users see errors. You lose money.
The Solution: Blue-Green Deployment
Instead of having one environment, you have TWO identical environments:
- BLUE (Production) - Currently serving users
- GREEN (Staging) - New version waiting to go live
When you're ready to deploy:
- Deploy new version to GREEN
- Test GREEN thoroughly
- Switch traffic from BLUE to GREEN instantly
- If something goes wrong, switch back to BLUE instantly
Result: ZERO DOWNTIME
Real-World Analogy
Think of it like having two stages at a concert:
- Stage 1 (Blue): Band is performing, audience is watching
- Stage 2 (Green): Next band is setting up and sound-checking
When it's time to switch:
- Rotate the stage 180°
- Audience now sees Stage 2 (Green)
- Stage 1 (Blue) becomes the setup area for the next act
If the new band has technical issues, rotate back to Stage 1 instantly!
Project Overview
This project implements a production-ready blue-green deployment system with:
Core Features
Automatic Failover
- Nginx detects when the Blue instance fails
- Automatically routes all traffic to Green
- Zero failed requests reach users

Real-Time Alerting
- Python watcher monitors Nginx logs
- Detects failover events instantly
- Sends alerts to Slack
- Monitors error rates

Zero-Downtime Deployment
- Deploy the new version to the inactive instance
- Switch traffic instantly
- Roll back in seconds if needed

Structured Logging
- Every request logged with metadata
- Pool information (blue/green)
- Release version
- Response times
Understanding the Core Concepts
1. Blue-Green vs. Load Balancing
Load Balancing:
Request 1 → Blue
Request 2 → Green
Request 3 → Blue
Request 4 → Green
Traffic is distributed between instances.
Blue-Green (This Project):
All Requests → Blue (Primary)
Green (Backup, standby)
If Blue fails:
All Requests → Green (Backup becomes active)
Traffic goes to ONE instance at a time. The other is a hot standby.
2. Nginx Upstream Configuration
Nginx can route traffic to multiple backend servers (upstreams). This project uses:
upstream app {
server app_blue:3000 max_fails=1 fail_timeout=5s;
server app_green:3000 backup max_fails=1 fail_timeout=5s;
}
Key Directives:
- server app_blue:3000 - Primary server
- backup - Only use this server if the primary fails
- max_fails=1 - Mark the server as failed after 1 error
- fail_timeout=5s - Try the failed server again after 5 seconds
3. Failover Mechanism
When a request fails:
- Nginx tries Blue instance
- Blue returns 5xx error or times out
- Nginx marks Blue as failed
- Nginx retries request to Green (backup)
- User receives successful response from Green
- All subsequent requests go to Green
Result: User never sees an error!
Architecture Overview
System Architecture Diagram
Internet
|
v
+--------------+
| Nginx |
| (Port 8080) |
+--------------+
|
+--------------+--------------+
| |
v v
+----------+ +----------+
| App Blue | (Primary) |App Green | (Backup)
|Port 3000 | |Port 3000 |
+----------+ +----------+
| |
+-------------+---------------+
|
v
+--------------+
| Nginx Logs |
+--------------+
|
v
+------------------+
| Alert Watcher |
| (Python) |
+------------------+
|
v
+--------------+
| Slack |
+--------------+
Component Breakdown
1. Nginx Proxy
Role: Traffic router and failover controller
Responsibilities:
- Route all incoming requests
- Detect backend failures
- Perform automatic failover
- Log all requests with metadata
2. App Blue (Primary Instance)
Role: Primary application server
Environment Variables:
APP_POOL=blue
RELEASE_ID=blue-release-1.0.0
PORT=3000
3. App Green (Backup Instance)
Role: Backup application server (hot standby)
Environment Variables:
APP_POOL=green
RELEASE_ID=green-release-1.0.0
PORT=3000
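Both instances run the same demo application; only the environment variables differ. The actual application code ships inside the Docker image, but a minimal sketch of what each instance needs to expose might look like the following (assuming Flask, per the technology stack below; the X-App-Pool and X-Release-Id response headers are what the Nginx log format reads later via $upstream_http_x_app_pool and $upstream_http_x_release_id):

import os
from flask import Flask, jsonify

app = Flask(__name__)
POOL = os.environ.get("APP_POOL", "blue")
RELEASE = os.environ.get("RELEASE_ID", "unknown")

@app.after_request
def add_identity_headers(response):
    # Expose pool/release so Nginx can log which instance served the request
    response.headers["X-App-Pool"] = POOL
    response.headers["X-Release-Id"] = RELEASE
    return response

@app.route("/version")
def version():
    return jsonify({"pool": POOL, "release": RELEASE, "status": "healthy"})

@app.route("/healthz")
def healthz():
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 3000)))

This is an illustrative sketch, not the repository's exact code, but it shows the contract the rest of the system relies on: a /version endpoint, a health check, and identity headers on every response.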
4. Alert Watcher (Python)
Role: Real-time log monitoring and alerting
Responsibilities:
- Tail Nginx access logs
- Parse structured log entries
- Detect failover events
- Monitor error rates
- Send Slack alerts
Technology Stack
| Component | Technology | Purpose |
|---|---|---|
| Reverse Proxy | Nginx (Alpine) | Traffic routing & failover |
| Application | Python/Flask | Demo web application |
| Monitoring | Python 3.11 | Log watcher & alerting |
| Alerting | Slack Webhooks | Real-time notifications |
| Containerization | Docker | Package all services |
| Orchestration | Docker Compose | Manage multi-container setup |
Setting Up the Project
Step 1: Clone the Repository
git clone https://github.com/cypher682/hng13-stage-3-devops.git
cd hng13-stage-3-devops
Step 2: Understand the Project Structure
hng13-stage-3-devops/
├── docker-compose.yml # Container orchestration
├── nginx.conf.template # Nginx configuration template
├── entrypoint.sh # Nginx startup script
├── watcher.py # Python log monitoring script
├── requirements.txt # Python dependencies
├── .env.example # Environment variables template
├── test-failover.sh # Failover testing script
└── public/ # Static HTML files
Step 3: Configure Environment Variables
Copy the example environment file:
cp .env.example .env
Edit .env with your configuration:
# Application Configuration
PORT=3000
ACTIVE_POOL=blue
# Docker Images
BLUE_IMAGE=yimikaade/wonderful:latest
GREEN_IMAGE=yimikaade/wonderful:latest
# Release Identifiers
RELEASE_ID_BLUE=blue-release-1.0.0
RELEASE_ID_GREEN=green-release-1.0.0
# Slack Integration
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
# Alert Configuration
ERROR_RATE_THRESHOLD=2 # Percentage
WINDOW_SIZE=200 # Number of requests
ALERT_COOLDOWN_SEC=300 # 5 minutes
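These settings are read by the alert watcher at startup. A hedged sketch of how that might look (the variable names mirror the .env keys used later in the watcher excerpts; the real watcher.py may read them differently):

import os

ACTIVE_POOL = os.environ.get("ACTIVE_POOL", "blue")
SLACK_WEBHOOK_URL = os.environ.get("SLACK_WEBHOOK_URL", "")
ERROR_RATE_THRESHOLD = float(os.environ.get("ERROR_RATE_THRESHOLD", "2"))  # percent
WINDOW_SIZE = int(os.environ.get("WINDOW_SIZE", "200"))                    # requests
ALERT_COOLDOWN_SEC = int(os.environ.get("ALERT_COOLDOWN_SEC", "300"))      # seconds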
Step 4: Set Up Slack Webhook
- Go to the Slack API site (api.slack.com)
- Create a new Slack app
- Enable Incoming Webhooks
- Create a webhook for your channel
- Copy the webhook URL into .env
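Before wiring the webhook into the watcher, you can sanity-check it with a few lines of Python (the payload format follows Slack's incoming-webhook convention; replace the URL with your own):

import json
import urllib.request

webhook_url = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"  # replace with your webhook
payload = {"text": "Test alert from the blue-green deployment watcher"}

req = urllib.request.Request(
    webhook_url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # Slack responds with the plain text "ok" when the message is accepted
    print(resp.status, resp.read().decode())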
Step 5: Build and Start Services
# Build all containers
docker compose build
# Start all services in detached mode
docker compose up -d
# Verify all services are running
docker compose ps
Expected Output:
NAME STATUS PORTS
nginx_proxy Up 0.0.0.0:8080->80/tcp
app_blue Up 3000/tcp
app_green Up 3000/tcp
alert_watcher Up
Step 6: Verify the Deployment
Open your browser and navigate to:
http://localhost:8080/version
You should see:
{
"pool": "blue",
"release": "blue-release-1.0.0",
"status": "healthy"
}
Understanding the Configuration
Nginx Configuration Template
The nginx.conf.template uses environment variable substitution:
upstream app {
server app_blue:${PORT} max_fails=1 fail_timeout=5s;
server app_green:${PORT} backup max_fails=1 fail_timeout=5s;
}
server {
listen 80;
location / {
proxy_pass http://app;
proxy_next_upstream error timeout http_502 http_503 http_504;
# Timeouts for fast failure detection
proxy_connect_timeout 2s;
proxy_send_timeout 3s;
proxy_read_timeout 3s;
}
}
Key Configuration Explained
1. Proxy Next Upstream
proxy_next_upstream error timeout http_502 http_503 http_504;
What it does:
Automatically retry the request to the next upstream (Green) if:
- Connection error occurs
- Request times out
- Upstream returns 502, 503, or 504
Result: User never sees these errors!
2. Aggressive Timeouts
proxy_connect_timeout 2s;
proxy_send_timeout 3s;
proxy_read_timeout 3s;
Why so short?
- Detect failures fast
- Failover happens in seconds, not minutes
- Better user experience
3. Structured Logging
log_format detailed
'pool=$upstream_http_x_app_pool '
'release=$upstream_http_x_release_id '
'upstream_status=$upstream_status '
'latency=$request_time ';
access_log /var/log/nginx/access.log detailed;
Log Example:
pool=blue release=blue-release-1.0.0 upstream_status=200 latency=0.045
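This key=value layout is deliberately easy to parse. Here is a simplified sketch of how a monitoring script could pull the fields out of a line like the one above (the real watcher.py may use a different parsing approach):

import re

LOG_PATTERN = re.compile(
    r"pool=(?P<pool>\S+) release=(?P<release>\S+) "
    r"upstream_status=(?P<status>[\d, ]+) latency=(?P<latency>\S+)"
)

line = "pool=blue release=blue-release-1.0.0 upstream_status=200 latency=0.045"
match = LOG_PATTERN.search(line)
if match:
    pool = match.group("pool")
    # upstream_status can hold several attempts, e.g. "502, 200" after a retry
    statuses = [int(s) for s in match.group("status").replace(",", " ").split()]
    latency = float(match.group("latency"))
    print(pool, statuses, latency)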
How Failover Works
Normal Operation (Blue Active)
User Request → Nginx → App Blue → Success (200 OK)
↓
Nginx Logs: pool=blue
Failure Scenario
Let's trace what happens when Blue crashes:
Step 1: User Makes Request
User → GET http://localhost:8080/
Step 2: Nginx Tries Blue
Nginx → app_blue:3000
↓
Connection Refused (Blue is down)
Step 3: Nginx Detects Failure
Nginx marks app_blue as FAILED
(max_fails=1 threshold reached)
Step 4: Nginx Retries to Green
Nginx → app_green:3000
↓
Success! (200 OK)
Step 5: User Receives Response
User ← 200 OK from Green
(User never knew Blue failed!)
Step 6: Alert Watcher Detects Failover
# watcher.py detects pool change
if pool != self.last_alerted_pool:
    send_slack_alert("Failover detected: blue → green")
Step 7: Slack Alert Sent
Failover detected: blue → green (to backup)
Release: green-release-1.0.0
Upstream: 172.18.0.4:3000
Request time: 0.05s
Real-Time Alerting System
Alert Watcher Architecture
The watcher.py script monitors Nginx logs in real-time:
class AlertWatcher:
    def __init__(self):
        # Track current state
        self.current_pool = ACTIVE_POOL
        self.last_alerted_pool = ACTIVE_POOL
        # Rolling window for error rate
        self.request_window = deque(maxlen=WINDOW_SIZE)
        # Cooldown timers
        self.last_failover_alert_time = 0
        self.last_error_alert_time = 0
How It Works
1. Log Tailing
def tail_log_file(self):
    with open(LOG_FILE, 'r') as log:
        log.seek(0, 2)  # Go to end of file
        while True:
            line = log.readline()
            if not line:
                time.sleep(0.1)
                continue
            self.process_log_line(line)
Explanation:
- Opens the log file in read mode
- Seeks to the end (like tail -f)
- Continuously reads new lines
- Processes each line in real time
2. Failover Detection
def check_failover(self, pool):
    if pool == self.last_alerted_pool:
        return  # No change
    # Determine alert type
    if pool == ACTIVE_POOL:
        title = f"Recovery detected: back to {pool}"
    else:
        title = f"Failover detected: {self.last_alerted_pool} → {pool}"
    # Send alert
    self.send_slack_alert(title)
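The excerpt above only decides what the alert should say. In the full watcher, the cooldown timers initialized in __init__ presumably gate how often alerts fire, and the new pool has to be recorded so the same transition isn't reported on every request. A hedged sketch of that bookkeeping (names follow the excerpt; the actual implementation may differ):

import time

def maybe_send_failover_alert(self, title, pool):
    now = time.time()
    # Suppress repeat alerts within the cooldown window
    if now - self.last_failover_alert_time < ALERT_COOLDOWN_SEC:
        return
    self.send_slack_alert(title)
    self.last_failover_alert_time = now
    # Remember the pool so the same switch isn't re-reported on every request
    self.last_alerted_pool = pool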
3. Error Rate Monitoring
def check_error_rate(self, all_statuses):
    # Check if any status is 5xx
    is_5xx = any(500 <= s < 600 for s in all_statuses)
    # Add to sliding window
    self.request_window.append(is_5xx)
    # Calculate error rate
    error_count = sum(self.request_window)
    total_requests = len(self.request_window)
    error_rate = (error_count / total_requests) * 100
    # Check threshold
    if error_rate > ERROR_RATE_THRESHOLD:
        self.send_slack_alert(
            f"High upstream error rate: {error_rate:.2f}%"
        )
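With the default settings from .env, the arithmetic is simple: the window holds the last 200 requests, so 5 upstream 5xx responses in that window is a 2.5% error rate, which crosses the 2% threshold. A tiny worked example:

from collections import deque

WINDOW_SIZE = 200
ERROR_RATE_THRESHOLD = 2  # percent

# 5 errors among the last 200 requests
window = deque([False] * 195 + [True] * 5, maxlen=WINDOW_SIZE)
error_rate = sum(window) / len(window) * 100
print(error_rate)                          # 2.5
print(error_rate > ERROR_RATE_THRESHOLD)   # True -> alert fires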
Alert Types
1. Failover Alert
Failover detected: blue → green (to backup)
Release: green-release-1.0.0
Upstream: 172.18.0.4:3000
Request time: 0.05s
When: Primary instance fails, traffic switches to backup
2. Recovery Alert
Recovery detected: back to blue (primary)
Release: blue-release-1.0.0
Upstream: 172.18.0.3:3000
Request time: 0.03s
When: Primary instance recovers, traffic returns to primary
3. High Error Rate Alert
High upstream error rate
5xx in upstream attempts: 5.50% over last 200 requests (threshold 2%).
Current pool: green
When: Error rate exceeds threshold in sliding window
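All three alert types go out through the same webhook call. A minimal sketch of what send_slack_alert might look like (the actual message formatting in watcher.py may differ; SLACK_WEBHOOK_URL comes from .env):

import json
import urllib.request

def send_slack_alert(self, title, details=""):
    # Build a simple text payload; Slack renders it as a plain message
    payload = {"text": f"{title}\n{details}".strip()}
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(req, timeout=5)
    except Exception as exc:
        # Never let an alerting failure crash the watcher
        print(f"Slack alert failed: {exc}")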
Testing the Deployment
Test 1: Verify Normal Operation
# Check which pool is active
curl http://localhost:8080/version
# Expected output:
{
"pool": "blue",
"release": "blue-release-1.0.0"
}
Test 2: Trigger Failover
Manual Failover Test:
# Stop Blue instance
docker compose stop app_blue
# Make requests
for i in {1..10}; do
curl http://localhost:8080/version
sleep 1
done
# You should see responses from Green
# Check Slack for failover alert
Test 3: Verify Zero Downtime
# In one terminal, continuously make requests
while true; do
curl -s http://localhost:8080/version | jq -r '.pool'
sleep 0.5
done
# In another terminal, stop Blue
docker compose stop app_blue
# Observe: No failed requests!
# Output switches from "blue" to "green" seamlessly
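If you prefer a quantified result over eyeballing the loop above, a small Python probe can count failed requests while you stop the Blue container (an illustrative helper, not part of the repository):

import time
import urllib.request

URL = "http://localhost:8080/version"
failures = 0
pools_seen = []

for _ in range(60):  # probe for roughly 30 seconds
    try:
        with urllib.request.urlopen(URL, timeout=2) as resp:
            body = resp.read().decode()
            # Crude extraction of the pool name from the JSON body
            pools_seen.append("green" if '"green"' in body else "blue")
    except Exception:
        failures += 1
    time.sleep(0.5)

print(f"failed requests: {failures}")
print(f"pools observed: {sorted(set(pools_seen))}")

Run it in one terminal, stop app_blue in another, and you should see both pools observed with zero failed requests.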
Test 4: Test Recovery
# Restart Blue instance
docker compose start app_blue
# Wait for health check to pass (10-15 seconds)
sleep 15
# Make requests
curl http://localhost:8080/version
# Should show Blue is active again
# Check Slack for recovery alert
Test 5: View Nginx Logs
# View real-time logs
docker compose exec nginx_proxy tail -f /var/log/nginx/access.log
# Example output:
pool=blue release=blue-release-1.0.0 upstream_status=200 latency=0.023
pool=green release=green-release-1.0.0 upstream_status=502, 200 latency=0.045
Note the comma-separated upstream_status in the second line: it records both attempts for a single request - Blue failed with a 502 and the retry on Green returned 200.
Common Issues and Troubleshooting
Issue 1: Failover Not Happening
Symptoms: Blue fails but traffic doesn't switch to Green
Solutions:
# Check Nginx config
docker compose exec nginx_proxy cat /etc/nginx/nginx.conf.processed
# Restart Nginx
docker compose restart nginx_proxy
# Verify Green is healthy
docker compose exec app_green wget -qO- http://localhost:3000/healthz
Issue 2: Alerts Not Sending to Slack
Symptoms: Failover happens but no Slack notification
Solutions:
- Verify Webhook URL:
# Check .env file
cat .env | grep SLACK_WEBHOOK_URL
# Test webhook manually
curl -X POST YOUR_WEBHOOK_URL \
-H 'Content-Type: application/json' \
-d '{"text":"Test alert"}'
- Restart Watcher:
docker compose restart alert_watcher
Issue 3: Both Instances Receiving Traffic
Symptoms: Requests alternate between Blue and Green
Solution:
# Ensure Green has "backup" directive
# Check nginx.conf.template
# Rebuild and restart
docker compose down
docker compose up -d --build
Debugging Commands Cheat Sheet
# View all container status
docker compose ps
# View all logs
docker compose logs
# View specific service logs
docker compose logs -f nginx_proxy
# Execute command in container
docker compose exec nginx_proxy sh
# Restart specific service
docker compose restart app_blue
# Clean up everything
docker compose down -v
Key Takeaways
What You've Learned
Blue-Green Deployment
- How to implement zero-downtime deployments
- The difference between blue-green and load balancing
- When to use blue-green vs. other strategies

Nginx as a Reverse Proxy
- Upstream configuration
- Failover mechanisms
- Health checks and timeouts
- Structured logging

Real-Time Monitoring
- Log tailing and parsing
- Event detection
- Sliding window calculations
- Alert deduplication

Slack Integration
- Webhook setup
- Alert formatting
- Error handling

Docker Orchestration
- Multi-container applications
- Service dependencies
- Volume management
- Health checks
Real-World Applications
This pattern is used by:
- Netflix - Canary deployments with instant rollback
- Amazon - Blue-green for critical services
- Heroku - Platform-level blue-green deployments
- GitHub - Zero-downtime deployments
Next Steps
Enhance the System
- Add a database with replication
- Implement canary deployments
- Add A/B testing capabilities

Improve Monitoring
- Add Prometheus metrics
- Create Grafana dashboards
- Implement distributed tracing

Scale the Architecture
- Deploy to Kubernetes
- Use managed load balancers
- Implement auto-scaling
Additional Resources
Documentation
- Nginx Upstream Documentation
- Docker Compose Documentation
- Slack Webhooks Guide
- Blue-Green Deployment Pattern
Conclusion
Congratulations! You've just learned how to implement a production-ready blue-green deployment system with automatic failover and real-time alerting. This is a critical skill for modern DevOps engineers.
Remember:
- Test thoroughly before deploying to production
- Monitor continuously - you can't fix what you can't see
- Automate everything - manual processes lead to errors
- Document your decisions - future you will thank you
Blue-green deployment is just one piece of the DevOps puzzle. The principles you've learned here - automation, monitoring, resilience, and rapid recovery - apply to all aspects of modern infrastructure.
Keep experimenting, keep learning, and most importantly, keep building!
Full Project Repository: HNG DevOps on GitHub
Happy deploying!