Introduction
In the previous article, we explored the Blue/Green deployment strategy using Nginx as a reverse proxy to maintain service availability whenever one of the upstream servers fails. You can read that article here: Blue/Green Deployment with Nginx Upstreams
In this follow-up article, we extend the concept by adding monitoring and alerting mechanisms to the deployment.
In DevOps, monitoring and alerting are critical for maintaining system reliability and availability. When a server fails or becomes unstable, the alerting system notifies the responsible team immediately so the issue can be resolved quickly, maintaining high availability.
Here, we introduce a Watcher Service that runs as a sidecar container in the Docker Compose stack. It monitors Nginx logs in real time and sends alerts to Slack whenever:
- A failover event occurs, switching traffic from the Blue pool to the Green pool
- The traffic pool switches back from Green to Blue
- There is a high error rate over a given period
1. Formatting Nginx Logs
To enable meaningful monitoring, the Nginx access logs must include specific upstream details such as:
- Application pool (x-app-pool)
- Release version (x-release-id)
- Upstream status
- Upstream request and response times
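The x-app-pool and x-release-id values are custom response headers emitted by the upstream applications; Nginx exposes them to the log format as $upstream_http_x_app_pool and $upstream_http_x_release_id. As a purely illustrative sketch of how an upstream service might set these headers (shown here with Flask; the actual Blue/Green services from the previous article may be implemented differently):

from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical values; in practice these would come from the environment or build metadata
APP_POOL = "blue"
RELEASE_ID = "v1.0.0"

@app.after_request
def add_pool_headers(response):
    # These surface as $upstream_http_x_app_pool / $upstream_http_x_release_id in the Nginx logs
    response.headers["X-App-Pool"] = APP_POOL
    response.headers["X-Release-Id"] = RELEASE_ID
    return response

@app.route("/version")
def version():
    return jsonify(pool=APP_POOL, release=RELEASE_ID)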
These logs are stored in a shared volume so that the watcher application can also access them.
log_format main '$remote_addr - $remote_user [$time_local] '
                '"$request" status=$status body_bytes_sent=$body_bytes_sent '
                'pool=$upstream_http_x_app_pool release=$upstream_http_x_release_id '
                'upstream_status=$upstream_status upstream_addr=$upstream_addr '
                'request_time=$request_time upstream_response_time=$upstream_response_time';

error_log /var/log/nginx/error.log warn;
access_log /var/log/nginx/access.log main;
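For the shared volume to work, the Nginx container must write its logs to the same host directory that the watcher container mounts. A minimal sketch of the Nginx service's volume configuration (the service name, image, and paths are assumptions; adapt them to the Compose file from the previous article):

nginx:
  image: nginx:alpine
  volumes:
    - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    - ./logs:/var/log/nginx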
2. Creating the Log Watcher Application
The watcher script performs these major tasks:
- Monitor logs in real time and detect pool switches (Blue to Green and Green back to Blue)
- Detect and alert on high failure rates (e.g., Blue server repeatedly failing)
- Implement alert cooldown to prevent spamming your Slack channel
Below are the major sections of the watcher.py application.
Imports and Configuration
import time
import os
import re
import requests
import logging
from collections import deque
LOG_FILE = os.getenv("NGINX_LOG_PATH")
SLACK_WEBHOOK_URL = os.getenv("SLACK_WEBHOOK_URL")
ERROR_RATE_THRESHOLD = float(os.getenv("ERROR_RATE_THRESHOLD", "2.0"))
WINDOW_SIZE = int(os.getenv("WINDOW_SIZE", "200"))
CHECK_INTERVAL = int(os.getenv("CHECK_INTERVAL", "10"))
ALERT_COOLDOWN = float(os.getenv("ALERT_COOLDOWN", "300"))
logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s] %(levelname)s: %(message)s",
    datefmt="%H:%M:%S",
)

# Regex matching the custom log_format defined in the Nginx configuration
pattern = re.compile(
    r'(?P<ip>\S+) - - \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'status=(?P<status>\d+) [^ ]* pool=(?P<pool>\S+) release=(?P<release>\S+) '
    r'upstream_status=(?P<upstream_status>[0-9,\s]+) upstream_addr=(?P<upstream_addr>[0-9\.:,\s]+) '
    r'request_time=(?P<request_time>[\d\.]+) upstream_response_time=(?P<upstream_response_time>[\d\.,\s]+)'
)
recent = deque(maxlen=WINDOW_SIZE)
last_pool = os.getenv("ACTIVE_POOL", "blue")
last_check = 0.0
last_alert_time = {"failover": 0, "switch": 0, "error_rate": 0}
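To sanity-check the regular expression, you can run it alongside the configuration above against a sample line in the log format defined earlier (the values below are made up for illustration):

# Illustrative only: a fabricated log line in the format produced by the
# Nginx log_format above; real requests will have different values.
sample = (
    '172.18.0.1 - - [01/Jan/2025:12:00:00 +0000] "GET /version HTTP/1.1" '
    'status=200 body_bytes_sent=123 pool=blue release=v1.0.0 '
    'upstream_status=200 upstream_addr=172.18.0.5:3000 '
    'request_time=0.012 upstream_response_time=0.010'
)

m = pattern.search(sample)
if m:
    print(m.group("pool"), m.group("release"), m.group("upstream_status"))
    # -> blue v1.0.0 200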
Slack Notification Function
def send_slack_alert(message: str):
    if not SLACK_WEBHOOK_URL:
        logging.warning("SLACK_WEBHOOK_URL not set. Cannot send alert.")
        return
    try:
        # Timeout so a slow Slack endpoint cannot block the watcher indefinitely
        response = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
        response.raise_for_status()
        logging.info("Slack alert sent.")
    except requests.exceptions.RequestException as e:
        logging.error(f"Failed to send Slack alert: {e}")
High Error Rate Detection
This function checks the percentage of upstream 5xx errors over a defined window. Once the rate exceeds the threshold, it sends an alert notification to the Slack channel.
def check_alert():
    now = time.time()
    # Drop entries that have fallen outside the monitoring window
    while recent and now - recent[0][0] > WINDOW_SIZE * CHECK_INTERVAL:
        recent.popleft()

    total = len(recent)
    if total == 0:
        return

    errors = sum(1 for _, _, upstream_status in recent if upstream_status.startswith("5"))
    rate = (errors / total) * 100

    if rate >= ERROR_RATE_THRESHOLD and now - last_alert_time["error_rate"] > ALERT_COOLDOWN:
        last_alert_time["error_rate"] = now
        send_slack_alert(
            f"🚨 *High Error Rate Detected!*\n"
            f"• Error rate: `{rate:.2f}%`\n"
            f"• Threshold: `{ERROR_RATE_THRESHOLD}%`\n"
            f"• Total requests: `{total}`\n"
            f"• Active Pool: `{last_pool}`\n"
            f"• Time: `{time.strftime('%Y-%m-%d %H:%M:%S')}`"
        )
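As a quick illustration of the threshold logic, seeding the window with synthetic entries shows when an alert would fire (this snippet is for local experimentation only and is not part of watcher.py):

# Illustrative experiment: 5 upstream errors out of 100 requests is a 5% error
# rate, which exceeds the default 2% threshold, so check_alert() sends an alert
# (or logs a warning if SLACK_WEBHOOK_URL is not set).
now = time.time()
for _ in range(95):
    recent.append((now, "blue", "200"))
for _ in range(5):
    recent.append((now, "blue", "502"))
check_alert()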
Traffic Switch & Failover Monitoring
This function tails the Nginx access log in real time and detects:
- Failover events
- Pool changes
- Upstream server changes
def monitor_log():
    global last_pool, last_check
    logging.info("Starting log watcher...")
    send_slack_alert("👀 Log watcher started. Monitoring Nginx logs...")

    # Wait until Nginx has created the log file in the shared volume
    while not os.path.exists(LOG_FILE):
        time.sleep(5)

    with open(LOG_FILE, "r") as f:
        # Start at the end of the file so only new entries are processed
        f.seek(0, 2)
        while True:
            try:
                line = f.readline()
                if not line:
                    time.sleep(CHECK_INTERVAL)
                    continue

                match = pattern.search(line)
                if not match:
                    continue

                data = match.groupdict()
                pool = data["pool"]
                release = data["release"]
                upstream_status = data["upstream_status"]
                upstream = data["upstream_addr"]

                # Nginx reports comma-separated values when a request is retried
                # on another upstream (e.g. "502, 200" for status and two addresses)
                status_list = [s.strip() for s in upstream_status.split(",")]
                addr_list = [a.strip() for a in upstream.split(",")]
                latest_status = status_list[-1]
                previous_status = status_list[0]
                previous_upstream = addr_list[0]
                current_upstream = addr_list[-1]

                recent.append((time.time(), pool, upstream_status))

                # Failover detection
                if upstream_status.startswith("5") and time.time() - last_alert_time["failover"] > ALERT_COOLDOWN:
                    last_alert_time["failover"] = time.time()
                    send_slack_alert(
                        f"⚠️ *Failover Detected!*\n"
                        f"• Previous Pool: `{last_pool}`\n"
                        f"• New Pool: `{pool}`\n"
                        f"• Release: `{release}`\n"
                        f"• Upstream: `{current_upstream}`\n"
                        f"• Status: `{latest_status}`"
                    )
                # Traffic pool switch
                elif pool != last_pool and time.time() - last_alert_time["switch"] > ALERT_COOLDOWN:
                    last_alert_time["switch"] = time.time()
                    send_slack_alert(
                        f"🔄 *Traffic Switch Detected!*\n"
                        f"• `{last_pool}` → `{pool}`\n"
                        f"• Release: `{release}`\n"
                        f"• Upstream: `{current_upstream}`\n"
                        f"• Status: `{latest_status}`"
                    )

                last_pool = pool

                if time.time() - last_check >= CHECK_INTERVAL:
                    check_alert()
                    last_check = time.time()
            except Exception as e:
                logging.error(f"Error processing log line: {e}")
                time.sleep(2)
3. Main Application Entry Point
if __name__ == "__main__":
    try:
        monitor_log()
    except KeyboardInterrupt:
        send_slack_alert("Log monitor stopped manually by user.")
        logging.info("Log watcher stopped manually.")
    except Exception as e:
        send_slack_alert(f"Log monitor stopped due to error: `{e}`")
        logging.error(f"Log watcher stopped due to error: {e}")
Add the requests package to your requirements.txt file, as shown below.
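A minimal requirements.txt for the watcher contains just this one dependency (everything else the script imports is in the standard library):

requests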
4. Running the Watcher in a Sidecar Container
The watcher runs alongside Nginx using the sidecar pattern. A sidecar container is a secondary container that runs next to a main application container and extends it, in this case by reading its logs. The shared log directory allows the watcher application to read the Nginx logs in real time, as shown in this excerpt from the Docker Compose file:
monitor:
  image: python:3.11-slim
  container_name: monitor
  volumes:
    - ./logs:/var/log/nginx
    - ./log_watcher/watcher.py:/app/watcher.py
    - ./log_watcher/requirements.txt:/app/requirements.txt
  environment:
    - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
    - ACTIVE_POOL=${ACTIVE_POOL}
    - ERROR_RATE_THRESHOLD=${ERROR_RATE_THRESHOLD}
    - WINDOW_SIZE=${WINDOW_SIZE}
    - ALERT_COOLDOWN=${ALERT_COOLDOWN}
    - MAINTENANCE_MODE=${MAINTENANCE_MODE}
    - CHECK_INTERVAL=${CHECK_INTERVAL}
    - NGINX_LOG_PATH=/var/log/nginx/access.log
  working_dir: /app
  command: >
    /bin/sh -c "pip install -r requirements.txt && python watcher.py"

networks:
  default:
    driver: bridge

volumes:
  nginx_logs:
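The ${...} values above are expected to come from a .env file next to the Compose file. A sketch with example values (the webhook URL is a placeholder; use your own, and tune the thresholds to your needs):

# .env (example values)
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/XXX/YYY/ZZZ
ACTIVE_POOL=blue
ERROR_RATE_THRESHOLD=2.0
WINDOW_SIZE=200
CHECK_INTERVAL=10
ALERT_COOLDOWN=300
MAINTENANCE_MODE=false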
5. Run the Docker Compose file
After adding the monitor service to the Docker Compose file from the previous article (see that article for the existing Compose file), start the stack by running docker compose up -d --build.
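To confirm the watcher is running and tailing the access log, follow the monitor container's logs:

docker compose logs -f monitor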
6. Simulate Monitoring
To simulate monitoring and alerting, we use the bash script below. It sends 120 requests to the http://localhost:8080/version endpoint and, every 3 requests, toggles chaos on the Blue server by calling POST http://localhost:8081/chaos/start?mode=error (and the matching /chaos/stop endpoint). This generates enough failures on the Blue (default) server to exceed the threshold and trigger the high error rate alert.
#!/bin/bash
BASE_NGINX="http://localhost:8080/version"
CHAOS_ON="http://localhost:8081/chaos/start?mode=error"
CHAOS_OFF="http://localhost:8081/chaos/stop"
TOTAL_REQUESTS=120
TOGGLE_INTERVAL=3 # Toggle error every 3 requests
IN_ERROR_MODE=false
echo "Starting Chaos Simulation..."
echo "Sending $TOTAL_REQUESTS requests to $BASE_NGINX with chaos toggled every $TOGGLE_INTERVAL requests"
echo ""
for ((i=1; i<=TOTAL_REQUESTS; i++)); do
  # Toggle chaos mode every N requests
  if (( i % TOGGLE_INTERVAL == 0 )); then
    if [ "$IN_ERROR_MODE" = true ]; then
      echo -e "\n[$i] Turning OFF error mode on Blue..."
      if curl -s -X POST "$CHAOS_OFF" > /dev/null; then
        IN_ERROR_MODE=false
      else
        echo "Failed to stop chaos."
      fi
    else
      echo -e "\n[$i] Turning ON error mode on Blue..."
      if curl -s -X POST "$CHAOS_ON" > /dev/null; then
        IN_ERROR_MODE=true
      else
        echo "Failed to start chaos."
      fi
    fi
  fi

  # Send request to Nginx (load balancer)
  HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$BASE_NGINX")
  if [[ "$HTTP_STATUS" == "200" ]]; then
    echo "[$i] Status: $HTTP_STATUS"
  else
    echo "[$i] Error: $HTTP_STATUS"
  fi

  # Sleep 0.1 second between requests
  sleep 0.1
done
# Ensure chaos mode is turned off at the end
echo -e "\n Stopping any remaining chaos mode..."
curl -s -X POST "$CHAOS_OFF" > /dev/null || echo "Cleanup failed."
echo -e "\n Simulation complete!"
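Save the script, make it executable, and run it (the filename chaos_test.sh is only an assumption; use whatever name you saved it under):

chmod +x chaos_test.sh
./chaos_test.sh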
The screenshots below show the alerts for a high error rate, a failover, and the traffic switch back once the Blue server is healthy again.

Conclusion
This monitoring and alerting system enhances Blue/Green deployments by providing real-time insights into application stability. It ensures that:
- Failovers are detected immediately
- Traffic switches are tracked
- High error rates trigger alerts
- Your team can respond quickly to maintain maximum uptime
This is just a simplified monitoring and alerting system. Advanced observability tools such as Grafana Loki and Prometheus can be used for more complex systems.

