Oluchi Oraekwe

Monitoring and Alerting for Blue/Green Deployment

Introduction

In the previous article, we explored the Blue/Green deployment strategy using Nginx as a reverse proxy to maintain service availability whenever one of the upstream servers fails. You can read that article here: Blue/Green Deployment with Nginx Upstreams

In this follow-up article, we extend the concept by adding monitoring and alerting mechanisms to the deployment.
In DevOps, monitoring and alerting are critical for maintaining system reliability and availability. When a server fails or becomes unstable, the alerting system notifies the responsible team immediately so the issue can be resolved quickly, maintaining high availability.

Here, we introduce a Watcher Service that runs as a sidecar container in the Docker Compose stack. It monitors Nginx logs in real time and sends alerts to Slack whenever:

  • A failover event occurs and traffic switches from the Blue pool to Green
  • The traffic pool switches back from Green to Blue
  • There is a high error rate over a given period

1. Formatting Nginx Logs

To enable meaningful monitoring, the Nginx access logs need to include specific upstream details such as:

  • Application pool (x-app-pool)
  • Release version (x-release-id)
  • Upstream status
  • Upstream request and response times

These logs are stored in a shared volume so that the watcher application can also access them.
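
For this sharing to work, the Nginx service in the Compose stack should write its logs to the same host directory that the watcher mounts. A minimal sketch of that mount, assuming the same ./logs bind mount used by the monitor service later in this article:

nginx:
    volumes:
      - ./logs:/var/log/nginx

The log_format directive below adds the upstream fields listed above to every access log entry: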

log_format main '$remote_addr - $remote_user [$time_local] '
        '"$request" status=$status body_bytes_sent=$body_bytes_sent '
        'pool=$upstream_http_x_app_pool release=$upstream_http_x_release_id '
        'upstream_status=$upstream_status upstream_addr=$upstream_addr '
        'request_time=$request_time upstream_response_time=$upstream_response_time';

error_log /var/log/nginx/error.log warn;
access_log /var/log/nginx/access.log main;
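
With this format, each entry in access.log looks roughly like the line below (the addresses, release id, and timings are hypothetical values for illustration); the watcher's regular expression in the next section parses exactly these fields:

172.18.0.1 - - [07/Oct/2025:12:34:56 +0000] "GET /version HTTP/1.1" status=200 body_bytes_sent=42 pool=blue release=v1.0.0 upstream_status=200 upstream_addr=172.18.0.3:3000 request_time=0.004 upstream_response_time=0.003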

2. Creating the Log Watcher Application

The watcher script performs these major tasks:

  • Monitor logs in real time and detect pool switches (Blue to Green and Green back to Blue)
  • Detect and alert on high failure rates (e.g., Blue server repeatedly failing)
  • Implement alert cooldown to prevent spamming your Slack channel

Below are the major sections of the watcher.py application.

Imports and Configuration

import time
import os
import re
import requests
import logging
from collections import deque

LOG_FILE = os.getenv("NGINX_LOG_PATH")
SLACK_WEBHOOK_URL = os.getenv("SLACK_WEBHOOK_URL")
ERROR_RATE_THRESHOLD = float(os.getenv("ERROR_RATE_THRESHOLD", "2.0"))
WINDOW_SIZE = int(os.getenv("WINDOW_SIZE", "200"))
CHECK_INTERVAL = int(os.getenv("CHECK_INTERVAL", "10"))
ALERT_COOLDOWN = float(os.getenv("ALERT_COOLDOWN", "300"))

logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s] %(levelname)s: %(message)s",
    datefmt="%H:%M:%S",
)

pattern = re.compile(
    r'(?P<ip>\S+) - - \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'status=(?P<status>\d+) [^ ]* pool=(?P<pool>\S+) release=(?P<release>\S+) '
    r'upstream_status=(?P<upstream_status>[0-9,\s]+) upstream_addr=(?P<upstream_addr>[0-9\.:,\s]+) '
    r'request_time=(?P<request_time>[\d\.]+) upstream_response_time=(?P<upstream_response_time>[\d\.,\s]+)'
)

recent = deque(maxlen=WINDOW_SIZE)
last_pool = os.getenv("ACTIVE_POOL", "blue")
last_check = 0.0
last_alert_time = {"failover": 0, "switch": 0, "error_rate": 0}

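As a quick sanity check, the compiled pattern can be tested against a sample log line before it is wired into the tailing loop. A minimal sketch, assuming watcher.py is importable from the working directory (the requests package must be installed, since watcher.py imports it) and reusing the hypothetical log line shown earlier:

from watcher import pattern

sample = (
    '172.18.0.1 - - [07/Oct/2025:12:34:56 +0000] "GET /version HTTP/1.1" '
    'status=200 body_bytes_sent=42 pool=blue release=v1.0.0 '
    'upstream_status=200 upstream_addr=172.18.0.3:3000 '
    'request_time=0.004 upstream_response_time=0.003'
)

# prints the named groups (pool, release, upstream_status, ...) if the line matches
match = pattern.search(sample)
print(match.groupdict() if match else "no match")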

Slack Notification Function

def send_slack_alert(message: str):
    if not SLACK_WEBHOOK_URL:
        logging.warning("SLACK_WEBHOOK_URL not set. Cannot send alert.")
        return
    try:
        response = requests.post(SLACK_WEBHOOK_URL, json={"text": message})
        response.raise_for_status()
        logging.info("Slack alert sent.")
    except requests.exceptions.RequestException as e:
        logging.error(f"Failed to send Slack alert: {e}")
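
To confirm the webhook works end to end before running the full watcher, the function can be called once on its own. A minimal sketch, assuming SLACK_WEBHOOK_URL is exported in the environment and watcher.py is importable:

from watcher import send_slack_alert

# sends a one-off test message to the configured Slack channel
send_slack_alert("Test alert from the log watcher")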

High Error Rate Detection

This function calculates the percentage of upstream 5xx errors across the recent request window. Whenever the rate exceeds the threshold and the cooldown period has passed, it sends an alert to the Slack channel.

def check_alert():
    now = time.time()

    while recent and now - recent[0][0] > WINDOW_SIZE * CHECK_INTERVAL:
        recent.popleft()

    total = len(recent)
    if total == 0:
        return

    errors = sum(1 for _, _, upstream_status in recent if upstream_status.startswith("5"))
    rate = (errors / total) * 100

    if rate >= ERROR_RATE_THRESHOLD and now - last_alert_time["error_rate"] > ALERT_COOLDOWN:
        last_alert_time["error_rate"] = now
        send_slack_alert(
            f"🚨 *High Error Rate Detected!*\n"
            f"• Error rate: `{rate:.2f}%`\n"
            f"• Threshold: `{ERROR_RATE_THRESHOLD}%`\n"
            f"• Total requests: `{total}`\n"
            f"• Active Pool: `{last_pool}`\n"
            f"• Time: {time.strftime('%Y-%m-%d %H:%M:%S')}`"
        )
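
As a rough worked example with the defaults above (ERROR_RATE_THRESHOLD=2.0, ALERT_COOLDOWN=300): 5 upstream 5xx responses out of 100 recent requests is a 5% error rate, which crosses the 2% threshold. The sketch below feeds synthetic entries into the window and calls check_alert() directly; it assumes watcher.py is importable, and if SLACK_WEBHOOK_URL is unset the alert is only logged rather than sent:

import time

import watcher

now = time.time()
# 5 upstream 5xx responses out of 100 requests -> 5% error rate
for i in range(100):
    status = "502" if i < 5 else "200"
    watcher.recent.append((now, "blue", status))

watcher.check_alert()  # fires the high error rate alert (5% >= 2%)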

Traffic Switch & Failover Monitoring

This function tails the Nginx access log in real time and detects:

  • Failover events
  • Pool changes
  • Upstream server changes

def monitor_log():
    global last_pool, last_check
    logging.info("Starting log watcher...")
    send_slack_alert("👀 Log watcher started. Monitoring Nginx logs...")

    while not os.path.exists(LOG_FILE):
        time.sleep(5)

    with open(LOG_FILE, "r") as f:
        f.seek(0, 2)
        while True:
            try:
                line = f.readline()
                if not line:
                    time.sleep(CHECK_INTERVAL)
                    continue

                match = pattern.search(line)
                if not match:
                    continue

                data = match.groupdict()
                pool = data["pool"]
                release = data["release"]
                upstream_status = data["upstream_status"]
                upstream = data["upstream_addr"]

                status_list = [s.strip() for s in upstream_status.split(",")]
                addr_list = [a.strip() for a in upstream.split(",")]

                latest_status = status_list[-1]
                previous_status = status_list[0]
                previous_upstream = addr_list[0]
                current_upstream = addr_list[-1]

                recent.append((time.time(), pool, upstream_status))

                # Failover detection
                if upstream_status.startswith("5") and time.time() - last_alert_time["failover"] > ALERT_COOLDOWN:
                    last_alert_time["failover"] = time.time()
                    send_slack_alert(
                        f"⚠️ *Failover Detected!*\n"
                        f"• Previous Pool: `{last_pool}`\n"
                        f"• New Pool: `{pool}`\n"
                        f"• Release: `{release}`\n"
                        f"• Upstream: `{current_upstream}`\n"
                        f"• Status: `{latest_status}`"
                    )

                # Traffic pool switch
                elif pool != last_pool and time.time() - last_alert_time["switch"] > ALERT_COOLDOWN:
                    last_alert_time["switch"] = time.time()
                    send_slack_alert(
                        f"🔄 *Traffic Switch Detected!*\n"
                        f"• `{last_pool}` → `{pool}`\n"
                        f"• Release: `{release}`\n"
                        f"• Upstream: `{current_upstream}`\n"
                        f"• Status: `{latest_status}`"
                    )

                last_pool = pool

                if time.time() - last_check >= CHECK_INTERVAL:
                    check_alert()
                    last_check = time.time()

            except Exception as e:
                logging.error(f"Error processing log line: {e}")
                time.sleep(2)


3. Main Application Entry Point

if __name__ == "__main__":
    try:
        monitor_log()
    except KeyboardInterrupt:
        send_slack_alert("Log monitor stopped manually by user.")
        logging.info("Log watcher stopped manually.")
    except Exception as e:
        send_slack_alert(f"Log monitor stopped due to error: `{e}`")
        logging.error(f"Log watcher stopped due to error: {e}")

Add the requests package to your requirements.txt file; it is the only third-party dependency the watcher uses, as shown below.
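
The complete file contains just this one line (everything else the script uses comes from the Python standard library):

requests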

4. Running the Watcher in a Sidecar Container

The watcher runs alongside Nginx using the sidecar pattern: a secondary container that runs next to the main application container. The shared log directory allows the watcher to read the Nginx logs in real time, as shown in this excerpt from the Docker Compose file:

monitor:
    image: python:3.11-slim
    container_name: monitor
    volumes:
      - ./logs:/var/log/nginx
      - ./log_watcher/watcher.py:/app/watcher.py
      - ./log_watcher/requirements.txt:/app/requirements.txt
    environment:
      - SLACK_WEBHOOK_URL=${SLACK_WEBHOOK_URL}
      - ACTIVE_POOL=${ACTIVE_POOL}
      - ERROR_RATE_THRESHOLD=${ERROR_RATE_THRESHOLD}
      - WINDOW_SIZE=${WINDOW_SIZE}
      - ALERT_COOLDOWN=${ALERT_COOLDOWN}
      - MAINTENANCE_MODE=${MAINTENANCE_MODE}
      - CHECK_INTERVAL=${CHECK_INTERVAL}
      - NGINX_LOG_PATH=/var/log/nginx/access.log
    working_dir: /app
    command: >
      /bin/sh -c "pip install -r requirements.txt && python watcher.py"

networks:
  default:
    driver: bridge
volumes:
  nginx_logs:

5. Running the Docker Compose File

After adding the monitor service to the Docker Compose file (see the previous article for the rest of the Compose file), start the application by running docker compose up -d --build.
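
Once the stack is up, you can follow the watcher's output with docker compose logs -f monitor to confirm it has started tailing the access log and has posted its startup message to the Slack channel.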

6. Simulating Monitoring and Alerting

To simulate the monitoring and alerting, we will use the bash script below. It sends 120 requests to the http://localhost:8080/version endpoint and, every 3 requests, toggles chaos mode on the Blue server on or off via POST http://localhost:8081/chaos/start?mode=error and POST http://localhost:8081/chaos/stop. This produces enough failures on the default (Blue) server to push the error rate above the threshold and trigger the high error rate alert.

#!/bin/bash

BASE_NGINX="http://localhost:8080/version"
CHAOS_ON="http://localhost:8081/chaos/start?mode=error"
CHAOS_OFF="http://localhost:8081/chaos/stop"
TOTAL_REQUESTS=120
TOGGLE_INTERVAL=3  # Toggle error every 3 requests
IN_ERROR_MODE=false

echo "Starting Chaos Simulation..."
echo "Sending $TOTAL_REQUESTS requests to $BASE_NGINX with chaos toggled every $TOGGLE_INTERVAL requests"
echo ""

for ((i=1; i<=TOTAL_REQUESTS; i++)); do
  # Toggle chaos mode every N requests
  if (( i % TOGGLE_INTERVAL == 0 )); then
    if [ "$IN_ERROR_MODE" = true ]; then
      echo -e "\n[$i] Turning OFF error mode on Blue..."
      if curl -s -X POST "$CHAOS_OFF" > /dev/null; then
        IN_ERROR_MODE=false
      else
        echo "Failed to stop chaos."
      fi
    else
      echo -e "\n[$i] Turning ON error mode on Blue..."
      if curl -s -X POST "$CHAOS_ON" > /dev/null; then
        IN_ERROR_MODE=true
      else
        echo "Failed to start chaos."
      fi
    fi
  fi

  # Send request to Nginx (load balancer)
  HTTP_STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$BASE_NGINX")
  if [[ "$HTTP_STATUS" == "200" ]]; then
    echo "[$i] Status: $HTTP_STATUS"
  else
    echo "[$i] Error: $HTTP_STATUS"
  fi

  # Sleep 0.1 second
  sleep 0.1
done

# Ensure chaos mode is turned off at the end
echo -e "\n Stopping any remaining chaos mode..."
curl -s -X POST "$CHAOS_OFF" > /dev/null || echo "Cleanup failed."

echo -e "\n Simulation complete!"

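Save the script (for example as chaos_test.sh, a name chosen here for illustration), make it executable with chmod +x chaos_test.sh, and run it while watching the Slack channel; as the chaos toggles on and off you should see the high error rate, failover, and traffic switch alerts come through.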

The screenshots below show the Slack alerts for the high error rate, the failover, and the traffic switch back once the Blue server is healthy again.

[Screenshot: high error rate alert]

[Screenshot: failover alert]

[Screenshot: traffic switch alert]

Conclusion

This monitoring and alerting system enhances Blue/Green deployments by providing real-time insights into application stability. It ensures that:

  • Failovers are detected immediately
  • Traffic switches are tracked
  • High error rates trigger alerts
  • Your team can respond quickly to maintain maximum uptime

This is just a simplified monitoring and alerting system; for more complex setups, advanced observability tools such as Grafana Loki and Prometheus can be used.
