Sangwoo Lee

Building a Zero-Downtime CI/CD Pipeline: Blue-Green Deployments for 100K+ Daily Requests

How I implemented production-grade CI/CD for a high-traffic push notification service handling 100K+ daily requests with zero downtime

When your push notification service handles 100,000+ daily requests and can't afford a single second of downtime, deployment strategy becomes critical. After experimenting with various CI/CD approaches, I settled on a GitHub Actions + Docker + Blue-Green deployment pattern that achieves true zero-downtime releases.

Here's the complete implementation story, including why I chose this specific stack over alternatives like Jenkins, why Blue-Green over Canary or Rolling deployments, and all the gotchas I encountered in production.

The Requirements: Zero Tolerance for Downtime

My push notification service has strict uptime requirements:

  • 24/7 availability: No maintenance windows allowed
  • High traffic: 100,000+ notifications daily
  • Long-running processes: BullMQ workers processing jobs for 5-15 minutes
  • Database connections: Existing connections must drain gracefully
  • No dropped requests: In-flight requests must complete successfully

Traditional deployment approaches all had deal-breakers:

| Strategy | Downtime | Complexity | Issue |
| --- | --- | --- | --- |
| Direct replacement | 30-60s | Low | ❌ Unacceptable downtime |
| Rolling update | ~10s | Medium | ❌ Partial outage during rollout |
| Canary | Minimal | High | ❌ Requires traffic splitting infrastructure |
| Blue-Green | 0s | Medium | ✅ Perfect fit |

Architecture Overview

High-Level Flow

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   GitHub    │────▢│GitHub Actions│────▢│  EC2 SSH   │────▢│  Deploy      β”‚
β”‚   Push      β”‚      β”‚   Workflow   β”‚     β”‚  Trigger    β”‚     β”‚  Script      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                                                                      β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
      β”‚          Docker Compose Build                β”‚
      β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
      β”‚  β”‚    Multi-Stage Dockerfile              β”‚  β”‚
      β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚  β”‚
      β”‚  β”‚  β”‚Build Stage   │─▢│Runtime Stageβ”‚    β”‚  β”‚
      β”‚  β”‚  β”‚(Node 22)     β”‚  β”‚(Optimized)   β”‚    β”‚  β”‚
      β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚  β”‚
      β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
      β”‚        Blue-Green Container Switch           β”‚
      β”‚                                              β”‚
      β”‚  Before:                    After:           β”‚
      β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
      β”‚  β”‚  Blue    │◀─Primary    β”‚  Green   │◀─Primary β”‚
      β”‚  β”‚  :3011   β”‚              β”‚  :3012   β”‚     β”‚
      β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
      β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
      β”‚  β”‚  Green   │◀─Backup     β”‚  Blue    │◀─Backup  β”‚
      β”‚  β”‚  :3012   β”‚              β”‚  :3011   β”‚     β”‚
      β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
      β”‚          Nginx Config Switch                  β”‚
      β”‚                                               β”‚
      β”‚  server localhost:3011; β†’ :3012;              β”‚
      β”‚  server localhost:3012 backup; β†’ :3011 backup;β”‚
      β”‚                                               β”‚
      β”‚  nginx -s reload (instant switch)             β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
      β”‚      Graceful Old Container Shutdown         β”‚
      β”‚                                              β”‚
      β”‚  Wait 30s for in-flight requests             β”‚
      β”‚  Stop old container (keep for rollback)      β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Infrastructure Components

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚        AWS EC2 Instance         β”‚
                    β”‚  (Amazon Linux 2023, t3.large)  β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                  β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚                         β”‚                         β”‚
        β–Ό                         β–Ό                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Nginx        β”‚        β”‚  Docker       β”‚        β”‚  Shell       β”‚
β”‚  (Reverse     β”‚        β”‚  Compose      β”‚        β”‚  Scripts     β”‚
β”‚   Proxy)      β”‚        β”‚               β”‚        β”‚              β”‚
β”‚               β”‚        β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚        β”‚ deploy-      β”‚
β”‚ Port 443      │───────▢││  Blue     β”‚ β”‚        β”‚ messaging.sh β”‚
β”‚ (SSL/TLS)     β”‚        β”‚ β”‚  :3011    β”‚ β”‚        β”‚              β”‚
β”‚               β”‚        β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ Upstream:     β”‚        β”‚               β”‚
β”‚ :3011/:3012   β”‚        β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β”‚ β”‚  Green    β”‚ β”‚
                         β”‚ β”‚  :3012    β”‚ β”‚
                         β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
                         β”‚               β”‚
                         β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
                         β”‚ β”‚  Redis    β”‚ β”‚
                         β”‚ β”‚  :6380    β”‚ β”‚
                         β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why This Stack? Decision Breakdown

GitHub Actions vs Jenkins vs GitLab CI

I evaluated several CI/CD platforms:

Jenkins

Pros:

  • Mature ecosystem
  • Highly customizable
  • Self-hosted (complete control)

Cons:

  • ❌ Requires dedicated server maintenance
  • ❌ Plugin hell (compatibility issues)
  • ❌ Steep learning curve
  • ❌ Need to manage Jenkins updates/security

GitLab CI

Pros:

  • Excellent integration with GitLab
  • Strong container registry
  • Built-in Kubernetes support

Cons:

  • ❌ My code is on GitHub (migration friction)
  • ❌ GitLab Runner setup required
  • ❌ Additional cost for private repos

GitHub Actions (Winner) ✅

Pros:

  • ✅ Native GitHub integration (zero setup)
  • ✅ Free for public repos, generous free tier for private
  • ✅ Managed infrastructure (no server to maintain)
  • ✅ Massive marketplace of actions
  • ✅ YAML-based (simple, declarative)
  • ✅ Built-in secrets management
  • ✅ Excellent documentation

Cons:

  • Vendor lock-in (acceptable trade-off)
  • Minute limits on free tier (not an issue for my usage)

The deciding factor: I could set up the entire pipeline in 30 minutes vs 2-3 days for Jenkins setup, configuration, and maintenance. For a solo developer or small team, this is a no-brainer.

Blue-Green vs Canary vs Rolling Deployments

Rolling Deployment

How it works: Gradually replace instances one-by-one

Pros:

  • Efficient resource usage
  • Gradual rollout

Cons:

  • ❌ Mixed versions running simultaneously (API compatibility issues)
  • ❌ Partial downtime as each instance restarts
  • ❌ Complex rollback (need to track which instances updated)
  • ❌ Long deployment time for multiple instances

Canary Deployment

How it works: Route small percentage of traffic to new version

Pros:

  • Test with real traffic
  • Gradual risk mitigation
  • Easy to catch issues early

Cons:

  • ❌ Requires sophisticated traffic splitting (ALB, Istio, etc.)
  • ❌ Complex metrics/monitoring setup needed
  • ❌ Need automated rollback logic
  • ❌ Infrastructure overhead (traffic routing, health checks)
  • ❌ Overkill for small-scale services

Blue-Green Deployment (Winner) ✅

How it works: Maintain two identical environments, switch instantly

Pros:

  • ✅ True zero-downtime (instant switch)
  • ✅ Easy rollback (just switch back)
  • ✅ Simple implementation (Nginx config change)
  • ✅ Full testing before switch
  • ✅ Clear state: old or new (no mixed versions)
  • ✅ Works great with Docker Compose

Cons:

  • Double resource usage (acceptable on t3.large)
  • Requires health checks

The deciding factor: For a push notification service with long-running jobs, Blue-Green guarantees that:

  1. No in-flight requests are dropped
  2. BullMQ workers complete their jobs
  3. Database connections drain properly
  4. Rollback is instant if issues occur

The resource cost is worth the operational simplicity and reliability.

Why Not Docker Hub?

Many tutorials use Docker Hub as an intermediary:

Build → Push to Docker Hub → Pull on EC2 → Run

Why I skipped it:

  1. Security: No credentials stored in GitHub Actions for registry
  2. Simplicity: Direct build on EC2 (one less moving part)
  3. Speed: No image push/pull over internet
  4. Cost: Docker Hub rate limits on free tier
  5. Privacy: Source code never leaves my infrastructure

Trade-off: EC2 must have enough resources for building. Solution: t3.large with 2 vCPU / 8GB RAM handles builds fine.
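
If you adopt a build-on-the-box setup like this, it's worth sanity-checking headroom before the first deploy. A minimal check using only standard tools (which thresholds matter depends on your build):

# Pre-build headroom check
echo "CPUs:   $(nproc)"
echo "Memory: $(free -h | awk '/^Mem:/ {print $7 " available"}')"
echo "Disk:   $(df -h / | awk 'NR==2 {print $4 " free"}')"
docker system df   # Space consumed by images, containers, volumes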

Implementation: Step by Step

Step 1: Multi-Stage Dockerfile (Optimized)

Key insight: Separate build and runtime stages to minimize image size and attack surface.

# Dockerfile

# ============= Stage 1: Build =============
FROM node:22-alpine AS builder

WORKDIR /app

# Copy dependency files first (layer caching)
COPY package*.json ./
RUN npm install

# Copy source code and build
COPY . .
RUN npm run build

# ============= Stage 2: Runtime =============
FROM node:22-alpine AS runner

WORKDIR /app

# Copy only production artifacts from builder
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/package*.json ./

# Install only production dependencies + curl for health checks
RUN apk add --no-cache curl \
  && npm install --only=production

EXPOSE 3002

CMD ["node", "dist/main"]

Why Multi-Stage?

| Metric | Single-Stage | Multi-Stage | Improvement |
| --- | --- | --- | --- |
| Image size | ~1.2 GB | ~350 MB | 3.4x smaller |
| Build dependencies | Included | Excluded | Attack surface ↓ |
| TypeScript files | Included | Excluded | Security ↑ |
| Layer caching | Poor | Excellent | Build speed ↑ |

Pro tips:

  • node:22-alpine vs node:22: Alpine is 4x smaller (120MB vs 500MB base)
  • npm install --only=production: Excludes devDependencies (TypeScript, Jest, etc.)
  • COPY package*.json before source: Docker layer caching speeds up builds
  • apk add curl: Needed for Docker health checks
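
The size claims above are easy to verify locally after a build (the image name below is a placeholder; Docker Compose v2 typically names images <project>-<service>):

# List local images with sizes
docker images

# See which layers contribute the most weight
docker history messaging-service-messaging-blue | head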

Step 2: Docker Compose with Health Checks

Critical requirement: Know when the container is actually ready (not just "running").

# docker-compose.yml
services:

  ### Blue Container
  messaging-blue:
    build: .
    env_file: .env
    depends_on:
      redis-messaging:
        condition: service_healthy  # Wait for Redis
    networks: [messaging-net]
    ports:
      - "3011:3002"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3002/health/live"]
      interval: 10s      # Check every 10 seconds
      timeout: 5s        # Timeout after 5 seconds
      retries: 5         # Unhealthy after 5 consecutive failures
      start_period: 20s  # Grace period for app startup

  ### Green Container
  messaging-green:
    build: .
    env_file: .env
    depends_on:
      redis-messaging:
        condition: service_healthy
    networks: [messaging-net]
    ports:
      - "3012:3002"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3002/health/live"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 20s

  ### Redis (shared by both containers)
  redis-messaging:
    container_name: redis-messaging
    image: redis:7-alpine
    command: >
      redis-server 
      --appendonly yes 
      --requirepass password123 
      --maxmemory 512mb 
      --maxmemory-policy noeviction
    ports:
      - "6380:6379"
    restart: always
    networks: [messaging-net]
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "password123", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

networks:
  messaging-net:
    driver: bridge

Key Design Decisions:

  1. Separate ports (3011/3012): Blue and Green can run simultaneously
  2. Shared Redis: Both containers use the same Redis instance (job queue persistence)
  3. Health checks: The deploy script waits for healthy status before switching
  4. start_period: 20s: Grace period for NestJS app initialization
  5. retries: 5: A container is marked unhealthy only after 5 consecutive failed checks (tolerates transient blips)

Why /health/live endpoint?

// src/main.ts (NestJS)
async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  // Health check endpoint (simple but effective).
  // Note: INestApplication has no app.get(path, handler) route method,
  // so register the route on the underlying Express instance.
  app.getHttpAdapter().getInstance().get('/health/live', (req, res) => {
    res.status(200).send('OK');
  });

  await app.listen(3002, '0.0.0.0');
}
bootstrap();

This endpoint confirms:

  • ✅ NestJS app started
  • ✅ HTTP server listening
  • ✅ Ready to handle requests
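
When debugging, it helps to exercise the endpoint at each layer; the container name follows the deploy script's naming, and the ports follow the compose file above:

# Inside the container (what the Docker health check runs)
docker exec messaging-service-messaging-blue-1 curl -f http://localhost:3002/health/live

# From the host, through the published ports
curl -f http://localhost:3011/health/live   # Blue
curl -f http://localhost:3012/health/live   # Green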

For production, consider more sophisticated health checks:

// Registered on the Express instance, as above. checkDatabase, checkRedis,
// and checkFirebase are app-specific helpers, not shown here.
app.getHttpAdapter().getInstance().get('/health/ready', async (req, res) => {
  // Check database connection
  const dbOk = await checkDatabase();
  // Check Redis connection
  const redisOk = await checkRedis();
  // Check external API
  const fcmOk = await checkFirebase();

  if (dbOk && redisOk && fcmOk) {
    res.status(200).json({ status: 'ready' });
  } else {
    res.status(503).json({ status: 'not ready' });
  }
});

Step 3: Nginx Configuration (Traffic Router)

Nginx acts as the traffic cop, instantly routing all requests to the active container.

# /etc/nginx/conf.d/messaging.goodtv.co.kr.conf

# Blue-Green upstream definition
upstream messaging-server {
    server localhost:3011;         # Blue (initially primary)
    server localhost:3012 backup;  # Green (initially backup)
}

server {
    listen 443 ssl;
    server_name messaging.goodtv.co.kr;

    # SSL certificates
    ssl_certificate      /etc/nginx/ssl/goodtv.co.kr.pem;
    ssl_certificate_key  /etc/nginx/ssl/WILD.goodtv.co.kr.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_prefer_server_ciphers on;
    ssl_ciphers HIGH:!aNULL:!MD5;

    location / {
        proxy_pass http://messaging-server;
        proxy_http_version 1.1;

        # WebSocket support (for long-polling if needed)
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;

        # Forward client info
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
        proxy_set_header Cookie $http_cookie;

        proxy_redirect off;
    }

    access_log /var/log/nginx/api-ssl-access.log;
    error_log /var/log/nginx/api-ssl-error.log;
}

How Nginx backup works:

Normal state:
  Primary: :3011  β†’  100% traffic
  Backup:  :3012  β†’  0% traffic (only used if :3011 fails)

After deployment switch:
  Primary: :3012  β†’  100% traffic
  Backup:  :3011  β†’  0% traffic (old version, kept for rollback)

Key benefit: Switching traffic is a simple text replacement + Nginx reload (< 10ms).
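
A rough way to convince yourself the backup directive behaves as described, assuming Blue is currently primary: stop it and watch Nginx fail over. Expect the first request after the stop to be slightly slower while Nginx marks :3011 as failed.

# Stop the primary container
docker compose -p messaging-service stop messaging-blue

# Requests still succeed -- now served by :3012 (Green) via the backup entry
curl -sf https://messaging.goodtv.co.kr/health/live && echo "failover OK"

# Bring Blue back when done
docker compose -p messaging-service start messaging-blue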

Step 4: GitHub Actions Workflow

Trigger on main branch push, SSH into EC2, run deploy script.

# .github/workflows/ci-cd-messaging.yml
name: Messaging CI/CD

on:
  push:
    branches: [ main ]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest

    steps:
      # Step 1: Checkout code
      - name: Checkout repository
        uses: actions/checkout@v3

      # Step 2: Setup Node.js (for local build/test if needed)
      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '22'

      # Step 3: Install dependencies (optional, for tests)
      - name: Install dependencies
        run: npm install

      # Step 4: Build project (validates TypeScript)
      - name: Build project
        run: npm run build

      # Step 5: SSH into EC2 and trigger deploy script
      - name: Deploy to EC2
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.SSH_HOST }}
          port: ${{ secrets.SSH_PORT }}
          username: ${{ secrets.SSH_USER }}
          key: ${{ secrets.SSH_KEY }}
          script: |
            /home/ec2-user/deploy-messaging.sh

GitHub Secrets Setup:

Settings → Secrets and variables → Actions → New repository secret

SSH_HOST: 54.123.45.67 (EC2 public IP)
SSH_PORT: 22
SSH_USER: ec2-user
SSH_KEY: <contents of private key>
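
If you prefer the terminal to the web UI, the same secrets can be set with the GitHub CLI (values are the placeholders from above; the key path is hypothetical):

gh secret set SSH_HOST --body "54.123.45.67"
gh secret set SSH_PORT --body "22"
gh secret set SSH_USER --body "ec2-user"
gh secret set SSH_KEY < ~/.ssh/ec2-deploy-key  # private key file (path is an example)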

Why build twice (GitHub Actions + EC2)?

  • GitHub Actions build: Validation only (catch TypeScript errors early)
  • EC2 build: Actual deployment (ensures consistency with production environment)

Alternative approach: Build on GitHub Actions, push to Docker registry, pull on EC2. I skipped this for simplicity.

Step 5: The Deploy Script (The Magic)

This script orchestrates the entire Blue-Green switch.

#!/bin/bash
# /home/ec2-user/deploy-messaging.sh
set -euo pipefail  # Exit on error, undefined variable, or pipe failure

# ============= 0. Cleanup Dangling Images =============
echo "πŸ” Checking for dangling images..."
DANGLING_COUNT=$(docker images -f "dangling=true" -q | wc -l)

if [ "$DANGLING_COUNT" -gt 0 ]; then
    echo "🧹 Found $DANGLING_COUNT dangling images. Cleaning up..."
    docker image prune -f
else
    echo "βœ… No dangling images found."
fi

# Navigate to project directory
cd /home/ec2-user/push-messaging-server


# ============= 1. Remove Old Stopped Containers =============
for OLD_COLOR in messaging-blue messaging-green; do
  CONTAINER_NAME="messaging-service-${OLD_COLOR}-1"
  if docker inspect "$CONTAINER_NAME" >/dev/null 2>&1; then
    STATUS=$(docker inspect --format='{{.State.Status}}' "$CONTAINER_NAME")
    if [ "$STATUS" = "exited" ]; then
      echo "πŸ—‘  Removing old stopped container: $CONTAINER_NAME"
      docker compose -p messaging-service rm -f "$OLD_COLOR"
    fi
  fi
done


# ============= 2. Determine Blue-Green Target =============
CURRENT=$(docker compose -p messaging-service ps -q messaging-blue | wc -l)

if [ "$CURRENT" -gt 0 ]; then
    # Blue is running β†’ deploy Green
    NEW=messaging-green
    OLD=messaging-blue
    NEW_PORT=3012
    OLD_PORT=3011
else
    # Green is running (or nothing) β†’ deploy Blue
    NEW=messaging-blue
    OLD=messaging-green
    NEW_PORT=3011
    OLD_PORT=3012
fi

echo "🎯 Deployment target: $NEW (port: $NEW_PORT)"


# ============= 3. Pull Latest Code & Build New Container =============
git fetch --all
git checkout main
git reset --hard origin/main

docker compose -p messaging-service build $NEW
docker compose -p messaging-service up -d $NEW || {
    echo "🚨 Container startup failed. Showing logs:"
    docker logs messaging-service-$NEW-1
    exit 1
}


# ============= 4. Wait for Health Check =============
MAX_RETRIES=30
COUNT=0

while [ "$(docker inspect --format='{{.State.Health.Status}}' messaging-service-$NEW-1)" != "healthy" ]; do
    if [ "$COUNT" -ge "$MAX_RETRIES" ]; then
        echo "❌ Health check failed: $NEW container not healthy."
        docker logs messaging-service-$NEW-1
        docker compose -p messaging-service stop $NEW
        exit 1
    fi
    echo "🟑 Waiting for health check... ($COUNT/$MAX_RETRIES)"
    sleep 5
    COUNT=$((COUNT + 1))
done

echo "βœ… $NEW container is healthy!"


# ============= 5. Switch Nginx Configuration =============
if [ "$NEW" == "messaging-green" ]; then
    # Switch to Green
    sudo sed -i "s/^ *server localhost:3011;/server localhost:3012;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
    sudo sed -i "s/^ *server localhost:3012 backup;/server localhost:3011 backup;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
else
    # Switch to Blue
    sudo sed -i "s/^ *server localhost:3012;/server localhost:3011;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
    sudo sed -i "s/^ *server localhost:3011 backup;/server localhost:3012 backup;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
fi


# ============= 6. Reload Nginx =============
if ! sudo nginx -t; then
  echo "❌ Nginx configuration test failed. Aborting."
  exit 1
fi

sudo nginx -s reload
echo "βœ… Nginx reloaded. Traffic now routing to $NEW (port: $NEW_PORT)"


# ============= 7. Gracefully Stop Old Container =============
sleep 30  # Wait for in-flight requests to complete

docker compose -p messaging-service stop $OLD || true

if docker inspect messaging-service-$OLD-1 >/dev/null 2>&1; then
  echo "⏳ Waiting for $OLD container to stop..."
  MAX_RETRIES=30
  COUNT=0

  while [ "$(docker inspect --format='{{.State.Status}}' messaging-service-$OLD-1)" != "exited" ]; do
    if [ "$COUNT" -ge "$MAX_RETRIES" ]; then
      echo "❌ Failed to stop $OLD container gracefully."
      docker logs messaging-service-$OLD-1
      break
    fi
    sleep 2
    COUNT=$((COUNT + 1))
  done

  echo "πŸ”’ $OLD container stopped (kept for rollback)."
else
  echo "πŸ’‘ No $OLD container found. Skipping."
fi

echo "πŸŽ‰ Deployment complete! Active: $NEW, Standby: $OLD"
Enter fullscreen mode Exit fullscreen mode

Script Flow Visualization:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Deploy Script Flow                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1. Cleanup
   └─ Remove dangling images (free disk space)

2. Determine Target
   └─ Blue running? → Deploy Green
   └─ Green running? → Deploy Blue
   └─ Nothing running? → Deploy Blue

3. Build & Start New Container
   └─ git pull latest code
   └─ docker compose build NEW
   └─ docker compose up -d NEW
   └─ Wait for health check (up to 2.5 min)

4. Switch Traffic (instant)
   └─ sed: Replace port in Nginx config
   └─ nginx -t (validate config)
   └─ nginx -s reload (< 10ms switch)

5. Graceful Shutdown
   └─ sleep 30s (drain in-flight requests)
   └─ docker stop OLD (keep container for rollback)

Result:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Blue        β”‚                    β”‚  Green       β”‚
β”‚  (OLD)       β”‚  Traffic switch    β”‚  (NEW)       β”‚
β”‚  Stopped     │◀───────────────────│  Active      β”‚
β”‚  (Rollback)  β”‚       Instant      β”‚  (Primary)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Critical implementation details:

  1. set -euo pipefail: Fail fast on any error
   set -e  # Exit on error
   set -u  # Exit on undefined variable
   set -o pipefail  # Exit if any command in pipe fails

  2. Health check loop: Don't switch traffic until the container is truly ready
   # Docker health status: starting → healthy → unhealthy
   while [ "$(docker inspect --format='{{.State.Health.Status}}' ...)" != "healthy" ]; do
     # Wait...
   done

  3. sed for config replacement: Fast, atomic text replacement
   # Before: server localhost:3011;
   # After:  server localhost:3012;
   sudo sed -i "s/server localhost:3011;/server localhost:3012;/" config.conf

  4. nginx -s reload: Graceful reload (no dropped connections)

    • New workers start with new config
    • Old workers finish current requests
    • Old workers shut down after requests complete
    • Total switch time: < 10ms

  5. 30-second drain: Wait for in-flight requests to complete
   sleep 30  # Typical request: 1-5s, long-running: up to 30s
   docker stop $OLD  # Now safe to stop

  6. Keep stopped container: Easy rollback if needed
   # Don't delete, just stop
   docker compose stop $OLD
   # To rollback: just reverse the Nginx switch

Production Deployment in Action

Typical Deployment Timeline

T+0:00  - GitHub push to main
T+0:05  - GitHub Actions triggered
T+0:15  - Actions build validates (TypeScript compilation)
T+0:20  - SSH into EC2, deploy script starts
T+0:25  - git pull, docker build starts
T+3:00  - New container starts
T+3:30  - Health check passes (5 retries × 10s)
T+3:31  - Nginx config switched
T+3:32  - Traffic now on new container ← ZERO DOWNTIME
T+4:02  - Old container stopped (30s drain)
T+4:03  - Deployment complete ✅

Total time: ~4 minutes
User-visible downtime: 0 seconds

Real Deployment Log

[ec2-user@ip-172-31-38-149 ~]$ /home/ec2-user/deploy-messaging.sh

🔍 Checking for dangling images...
✅ No dangling images found.

🎯 Deployment target: messaging-green (port: 3012)

Fetching latest code...
Already on 'main'
HEAD is now at a7f3d21 Fix: Improve batch logging performance

Building new container...
[+] Building 142.3s (18/18) FINISHED
 => [builder 1/6] FROM node:22-alpine                   0.0s
 => [builder 2/6] WORKDIR /app                          0.1s
 => [builder 3/6] COPY package*.json ./                 0.1s
 => [builder 4/6] RUN npm install                      48.2s
 => [builder 5/6] COPY . .                              1.3s
 => [builder 6/6] RUN npm run build                    52.1s
 => [runner 1/4] FROM node:22-alpine                    0.0s
 => [runner 2/4] WORKDIR /app                           0.0s
 => [runner 3/4] COPY --from=builder /app/dist ./dist   0.2s
 => [runner 4/4] RUN apk add --no-cache curl && npm... 28.4s
 => exporting to image                                  11.9s

Starting container...
[+] Running 1/1
 ✔ Container messaging-service-messaging-green-1  Started  2.3s

🟑 Waiting for health check... (0/30)
🟑 Waiting for health check... (1/30)
🟑 Waiting for health check... (2/30)
🟑 Waiting for health check... (3/30)
🟑 Waiting for health check... (4/30)
🟑 Waiting for health check... (5/30)
✅ messaging-green container is healthy!

Switching Nginx configuration...
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
✅ Nginx reloaded. Traffic now routing to messaging-green (port: 3012)

Waiting 30s for in-flight requests to complete...
⏳ Waiting for messaging-blue container to stop...
🔒 messaging-blue container stopped (kept for rollback).

🎉 Deployment complete! Active: messaging-green, Standby: messaging-blue

Monitoring During Deployment

Server metrics during a typical deployment:

CPU Usage:
  Pre-deploy:  18-22% (normal load)
  Building:    65-80% (docker build)
  Post-deploy: 20-25% (slightly higher, new container initializing)
  Steady:      18-22% (back to normal)

Memory Usage:
  Pre-deploy:  3.2 GB / 8 GB (40%)
  Both running: 4.8 GB / 8 GB (60%) ← Both containers alive during switch
  Post-deploy: 3.4 GB / 8 GB (42%)

Request Error Rate:
  During switch: 0.00% ← Zero dropped requests

Response Time:
  Pre-deploy:   avg 145ms, p95 320ms
  During switch: avg 147ms, p95 325ms ← No spike
  Post-deploy:  avg 143ms, p95 318ms

Key observation: Users experience zero impact during deployment.
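
One simple way to back that claim with data is to count 5xx responses in the Nginx access log around the deploy window. This one-liner assumes the default combined log format, where the status code is field 9:

# Count 5xx responses recorded in the access log
awk '$9 ~ /^5/ {c++} END {print c+0 " 5xx responses"}' /var/log/nginx/api-ssl-access.log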

Rollback Strategy

Instant Rollback (< 1 minute)

If the new deployment has issues:

# 1. Identify the problem (monitoring alerts)
echo "🚨 Issues detected in new deployment!"

# 2. Determine current active container
if docker compose -p messaging-service ps -q messaging-green | grep -q .; then
    # Green is active, rollback to Blue
    ACTIVE=messaging-green
    ROLLBACK=messaging-blue
    ACTIVE_PORT=3012
    ROLLBACK_PORT=3011
else
    # Blue is active, rollback to Green
    ACTIVE=messaging-blue
    ROLLBACK=messaging-green
    ACTIVE_PORT=3011
    ROLLBACK_PORT=3012
fi

echo "πŸ”„ Rolling back from $ACTIVE to $ROLLBACK"

# 3. Switch Nginx back to old container
if [ "$ROLLBACK" == "messaging-blue" ]; then
    sudo sed -i "s/server localhost:$ACTIVE_PORT;/server localhost:$ROLLBACK_PORT;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
    sudo sed -i "s/server localhost:$ROLLBACK_PORT backup;/server localhost:$ACTIVE_PORT backup;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
else
    sudo sed -i "s/server localhost:$ACTIVE_PORT;/server localhost:$ROLLBACK_PORT;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
    sudo sed -i "s/server localhost:$ROLLBACK_PORT backup;/server localhost:$ACTIVE_PORT backup;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
fi

# 4. Reload Nginx (instant switch)
sudo nginx -t && sudo nginx -s reload

# 5. Restart old container if stopped
docker compose -p messaging-service start $ROLLBACK

echo "βœ… Rollback complete! Active: $ROLLBACK"

Total rollback time: ~30-60 seconds (most of that is restarting the old container if it was stopped)

Why Blue-Green Excels at Rollbacks

| Strategy | Rollback Time | Complexity | Data Loss Risk |
| --- | --- | --- | --- |
| Rolling | 5-10 min | High | Medium |
| Canary | 2-5 min | High | Low |
| Blue-Green | 30-60s | Low | None |

The secret: Old container is still alive (just stopped). Starting it takes seconds.

Gotchas and Lessons Learned

1. Health Checks Are Non-Negotiable

Initial mistake: No health check, just sleep 30 after starting container.

Problem: Sometimes the app took 40+ seconds to start, leading to:

  • Nginx switched before app ready
  • 502 Bad Gateway errors
  • Partial downtime

Solution: Proper Docker health checks with retries.

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3002/health/live"]
  interval: 10s
  timeout: 5s
  retries: 5        # Unhealthy only after 5 consecutive failures
  start_period: 20s # Grace period

Result: Zero false positives, zero premature switches.

2. Graceful Shutdown Matters

Initial mistake: docker stop immediately after switch.

Problem: In-flight requests aborted mid-processing, causing:

  • Failed push notifications
  • Incomplete database transactions
  • BullMQ jobs marked as failed

Solution: 30-second drain period.

# Wait for requests to complete
sleep 30

# Now safe to stop
docker stop $OLD

Result: Zero dropped requests during deployment.
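
A related knob worth knowing: docker stop sends SIGTERM, then SIGKILL after a grace period (10 seconds by default). If jobs may still need time to wind down after the drain, the timeout can be extended; the 60 seconds here is just an example value:

# Give the old container up to 60s to exit cleanly after SIGTERM
docker compose -p messaging-service stop -t 60 $OLD

Pairing this with a SIGTERM handler that closes the BullMQ workers lets active jobs finish (or be re-queued) instead of being killed mid-run.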

3. Disk Space Monitoring

Problem: Docker images accumulated, filling the disk.

df -h
# /dev/xvda1  100G  92G  8G  92% /  ← Uh oh

Solution: Automated cleanup in deploy script.

# Remove dangling images (layers from old builds)
docker image prune -f

# Optional: Weekly cron job for aggressive cleanup
# crontab -e
# 0 2 * * 0 docker system prune -af --volumes

Result: Disk usage stable at 45-50%.

4. Redis Persistence Configuration

Initial mistake: Redis without persistence.

Problem: Container restart = lost BullMQ job queue.

Solution: Redis AOF (Append-Only File).

redis-messaging:
  image: redis:7-alpine
  command: >            # --appendonly yes enables AOF persistence
    redis-server
    --appendonly yes
    --requirepass password123

Result: Job queue survives deployments.
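
You can confirm AOF is actually enabled on the running instance; the container name and password follow the compose file above:

# Should print: appendonly / yes
docker exec redis-messaging redis-cli -a password123 config get appendonly

# AOF status details
docker exec redis-messaging redis-cli -a password123 info persistence | grep aof_enabled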

5. SELinux Permissions (Amazon Linux)

Problem: Nginx couldn't access Docker containers.

sudo nginx -t
# nginx: [emerg] connect() to localhost:3011 failed (13: Permission denied)

Solution: SELinux policy adjustment.

# Allow Nginx to connect to network
sudo setsebool -P httpd_can_network_connect 1

# Or create custom policy (more secure)
sudo semanage port -a -t http_port_t -p tcp 3011
sudo semanage port -a -t http_port_t -p tcp 3012

Result: Nginx successfully proxies to containers.
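
To verify the boolean stuck (the -P flag makes it persist across reboots), assuming SELinux is in enforcing mode:

# Should print: httpd_can_network_connect --> on
getsebool httpd_can_network_connect

# If you used the port-based policy instead, confirm the ports are labeled
sudo semanage port -l | grep http_port_t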

6. Git Reset Issues

Initial mistake: git pull failing due to local changes.

git pull
# error: Your local changes to the following files would be overwritten

Solution: Hard reset to remote.

git fetch --all
git reset --hard origin/main  # Force match remote

Trade-off: Local changes lost. Solution: Never edit directly on EC2.

Performance Impact Analysis

Resource Usage During Deployment

t3.large specs: 2 vCPU, 8 GB RAM, General Purpose SSD

| Phase | CPU | Memory | Duration |
| --- | --- | --- | --- |
| Normal operation | 18-22% | 3.2 GB | - |
| Docker build | 65-80% | 4.0 GB | 2-3 min |
| Both containers | 35-45% | 4.8 GB | 30-60s |
| New container only | 20-25% | 3.4 GB | - |

Conclusion: t3.large has ample headroom for Blue-Green deployment.
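
Numbers like those in the table are straightforward to sample yourself while a deploy runs:

# Per-container CPU/memory snapshot while both colors are up
docker stats --no-stream

# Host-level view during the build phase
top -b -n 1 | head -15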

Network Traffic

During deployment:

  • Inbound: Normal (users unaffected)
  • Outbound: +50 MB (npm install, git pull)
  • No external Docker registry traffic (build locally)

Database Connections

Concern: Both containers connecting simultaneously?

Reality:

  • Old container: Draining, no new connections
  • New container: Fresh connections
  • Total: Never exceeds normal load

MSSQL connection pool: 50 max; typical usage is 8-12 connections.

Production Metrics: 6 Months Later

After 6 months and 150+ deployments:

| Metric | Result |
| --- | --- |
| Total deployments | 156 |
| Failed deployments | 2 (1.3%) |
| Rollbacks needed | 1 (0.6%) |
| User-reported issues | 0 |
| Average deployment time | 3m 47s |
| Downtime per deployment | 0s |
| Total uptime | 99.97% |

Failure causes:

  1. Docker build failed (TypeScript error) → Caught before switch
  2. Health check timeout (DB connection issue) → Deployment aborted
  3. Rollback: Memory leak in new version → Rolled back in 45 seconds

Key insight: The system is self-healing. Failed deployments never affect users.

Cost Analysis

Infrastructure Costs (Monthly)

| Resource | Specs | Cost |
| --- | --- | --- |
| EC2 t3.large | 2 vCPU, 8 GB RAM | $60.74 |
| EBS SSD | 100 GB | $10.00 |
| Data transfer | ~50 GB/month | $4.50 |
| Total | | $75.24 |

GitHub Actions: Free tier (2,000 minutes/month) - using ~200 minutes/month

Alternative Costs (for comparison)

| Solution | Monthly Cost | Setup Time |
| --- | --- | --- |
| Current (GitHub Actions + EC2) | $75 | 2 hours |
| Jenkins on EC2 | $95 (t3.medium for Jenkins + t3.large for app) | 1-2 days |
| AWS CodeDeploy | $85 (EC2 + ALB + CodeDeploy) | 4-6 hours |
| ECS Fargate Blue-Green | $120 (Fargate + ALB) | 1 day |
| Kubernetes (EKS) | $220 (EKS + nodes + ALB) | 2-3 days |

Winner: Current solution offers best cost/simplicity ratio for small-scale services.

When NOT to Use This Approach

This Blue-Green + Docker + GitHub Actions setup is great for small-to-medium services, but has limits:

Scale Limitations

❌ Don't use this approach when:

  • Traffic > 10,000 RPS: Single EC2 instance bottleneck
  • Multi-region deployment: No built-in geo-routing
  • 10+ services: Manual script maintenance becomes unwieldy
  • Team size > 10: Need centralized CI/CD governance

✅ Better alternatives:

  • High scale: Kubernetes (EKS), AWS ECS
  • Multi-region: AWS Global Accelerator + ALB
  • Microservices: Kubernetes, Service Mesh (Istio)
  • Large teams: Jenkins, GitLab CI, ArgoCD

Complexity Trade-offs

This approach optimizes for:

  • ✅ Simplicity (1 server, shell script)
  • ✅ Cost (minimal infrastructure)
  • ✅ Speed (fast iteration)

If you need:

  • Advanced traffic management (canary %, A/B testing)
  • Auto-scaling (horizontal pod autoscaler)
  • Service mesh (Istio, Linkerd)
  • Observability (Prometheus, Grafana, Jaeger)

Then invest in Kubernetes. The operational complexity pays off at scale.

Future Improvements

Short-term (planned)

  1. Automated smoke tests after deployment
   # After Nginx switch, before old container stop
   curl -f https://messaging.goodtv.co.kr/health/live || rollback

  2. Slack notifications
   curl -X POST $SLACK_WEBHOOK \
     -d "{\"text\": \"✅ Deployment successful: $NEW\"}"

  3. Rollback button (GitHub Actions workflow_dispatch)
   on:
     workflow_dispatch:
       inputs:
         target:
           description: 'Container to switch to (blue/green)'
           required: true

Long-term (if needed)

  1. Canary deployment: Route 10% traffic to new version first
  2. Auto-scaling: Add more EC2 instances behind ALB
  3. Monitoring: Integrate Prometheus + Grafana
  4. Database migrations: Separate migration job before deployment

Conclusion

Building a production-grade CI/CD pipeline doesn't require expensive enterprise tools or complex Kubernetes clusters. With GitHub Actions, Docker multi-stage builds, and a well-crafted shell script, I achieved:

  • True zero-downtime deployments (0 seconds user impact)
  • Fast iteration (4-minute deployments)
  • Easy rollbacks (30-60 seconds)
  • Low cost ($75/month infrastructure)
  • High reliability (99.97% uptime over 6 months)

The key decisions:

  1. GitHub Actions over Jenkins: Managed infrastructure, zero maintenance
  2. Blue-Green over Canary: Simpler for long-running jobs, instant rollback
  3. No Docker Hub: Direct build on EC2, fewer moving parts
  4. Multi-stage Dockerfile: 3.4x smaller images, faster deployments
  5. Health checks: No premature switches, reliable deployments

For small-to-medium services (< 10,000 RPS, single region, small team), this architecture hits the sweet spot of simplicity, reliability, and cost.

Key Takeaways

  • Blue-Green deployment eliminates downtime for services with long-running processes
  • GitHub Actions provides enterprise-grade CI/CD without operational overhead
  • Multi-stage Dockerfiles reduce image size 3-4x
  • Health checks are mandatory for reliable container switching
  • Nginx upstream with backup provides instant traffic switching
  • 30-second drain period prevents dropped in-flight requests
  • Keeping stopped containers enables instant rollback
  • Shell scripts can be production-grade with proper error handling
  • Choose simplicity over features until you outgrow it
