Sangwoo Lee

Building a Zero-Downtime CI/CD Pipeline: Blue-Green Deployments for 100K+ Daily Requests

How I implemented production-grade CI/CD for a high-traffic push notification service handling 100K+ daily requests with zero downtime

When your push notification service handles 100,000+ daily requests and can't afford a single second of downtime, deployment strategy becomes critical. After experimenting with various CI/CD approaches, I settled on a GitHub Actions + Docker + Blue-Green deployment pattern that achieves true zero-downtime releases.

Here's the complete implementation story, including why I chose this specific stack over alternatives like Jenkins, why Blue-Green over Canary or Rolling deployments, and all the gotchas I encountered in production.

The Requirements: Zero Tolerance for Downtime

My push notification service has strict uptime requirements:

  • 24/7 availability: No maintenance windows allowed
  • High traffic: 100,000+ notifications daily
  • Long-running processes: BullMQ workers processing jobs for 5-15 minutes
  • Database connections: Existing connections must drain gracefully
  • No dropped requests: In-flight requests must complete successfully

Traditional deployment approaches all had deal-breakers:

| Strategy | Downtime | Complexity | Issue |
| --- | --- | --- | --- |
| Direct replacement | 30-60s | Low | ❌ Unacceptable downtime |
| Rolling update | ~10s | Medium | ❌ Partial outage during rollout |
| Canary | Minimal | High | ❌ Requires traffic splitting infrastructure |
| Blue-Green | 0s | Medium | ✅ Perfect fit |

Architecture Overview

High-Level Flow

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   GitHub    │────▢│GitHub Actions│────▢│  EC2 SSH   │────▢│  Deploy      β”‚
β”‚   Push      β”‚      β”‚   Workflow   β”‚     β”‚  Trigger    β”‚     β”‚  Script      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
                                                                      β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
      β”‚          Docker Compose Build                β”‚
      β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
      β”‚  β”‚    Multi-Stage Dockerfile              β”‚  β”‚
      β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚  β”‚
      β”‚  β”‚  β”‚Build Stage   │─▢│Runtime Stageβ”‚    β”‚  β”‚
      β”‚  β”‚  β”‚(Node 22)     β”‚  β”‚(Optimized)   β”‚    β”‚  β”‚
      β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚  β”‚
      β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
      β”‚        Blue-Green Container Switch           β”‚
      β”‚                                              β”‚
      β”‚  Before:                    After:           β”‚
      β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
      β”‚  β”‚  Blue    │◀─Primary    β”‚  Green   │◀─Primary β”‚
      β”‚  β”‚  :3011   β”‚              β”‚  :3012   β”‚     β”‚
      β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
      β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
      β”‚  β”‚  Green   │◀─Backup     β”‚  Blue    │◀─Backup  β”‚
      β”‚  β”‚  :3012   β”‚              β”‚  :3011   β”‚     β”‚
      β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
      β”‚          Nginx Config Switch                  β”‚
      β”‚                                               β”‚
      β”‚  server localhost:3011; β†’ :3012;              β”‚
      β”‚  server localhost:3012 backup; β†’ :3011 backup;β”‚
      β”‚                                               β”‚
      β”‚  nginx -s reload (instant switch)             β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                    β”‚
                    β–Ό
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
      β”‚      Graceful Old Container Shutdown         β”‚
      β”‚                                              β”‚
      β”‚  Wait 30s for in-flight requests             β”‚
      β”‚  Stop old container (keep for rollback)      β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Infrastructure Components

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚        AWS EC2 Instance         β”‚
                    β”‚  (Amazon Linux 2023, t3.large)  β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                  β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚                         β”‚                         β”‚
        β–Ό                         β–Ό                         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Nginx        β”‚        β”‚  Docker       β”‚        β”‚  Shell       β”‚
β”‚  (Reverse     β”‚        β”‚  Compose      β”‚        β”‚  Scripts     β”‚
β”‚   Proxy)      β”‚        β”‚               β”‚        β”‚              β”‚
β”‚               β”‚        β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚        β”‚ deploy-      β”‚
β”‚ Port 443      │───────▢││  Blue     β”‚ β”‚        β”‚ messaging.sh β”‚
β”‚ (SSL/TLS)     β”‚        β”‚ β”‚  :3011    β”‚ β”‚        β”‚              β”‚
β”‚               β”‚        β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ Upstream:     β”‚        β”‚               β”‚
β”‚ :3011/:3012   β”‚        β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β”‚ β”‚  Green    β”‚ β”‚
                         β”‚ β”‚  :3012    β”‚ β”‚
                         β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
                         β”‚               β”‚
                         β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
                         β”‚ β”‚  Redis    β”‚ β”‚
                         β”‚ β”‚  :6380    β”‚ β”‚
                         β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why This Stack? Decision Breakdown

GitHub Actions vs Jenkins vs GitLab CI

I evaluated several CI/CD platforms:

Jenkins

Pros:

  • Mature ecosystem
  • Highly customizable
  • Self-hosted (complete control)

Cons:

  • ❌ Requires dedicated server maintenance
  • ❌ Plugin hell (compatibility issues)
  • ❌ Steep learning curve
  • ❌ Need to manage Jenkins updates/security

GitLab CI

Pros:

  • Excellent integration with GitLab
  • Strong container registry
  • Built-in Kubernetes support

Cons:

  • ❌ My code is on GitHub (migration friction)
  • ❌ GitLab Runner setup required
  • ❌ Additional cost for private repos

GitHub Actions (Winner) ✅

Pros:

  • ✅ Native GitHub integration (zero setup)
  • ✅ Free for public repos, generous free tier for private
  • ✅ Managed infrastructure (no server to maintain)
  • ✅ Massive marketplace of actions
  • ✅ YAML-based (simple, declarative)
  • ✅ Built-in secrets management
  • ✅ Excellent documentation

Cons:

  • Vendor lock-in (acceptable trade-off)
  • Minute limits on free tier (not an issue for my usage)

The deciding factor: I could set up the entire pipeline in 30 minutes vs 2-3 days for Jenkins setup, configuration, and maintenance. For a solo developer or small team, this is a no-brainer.

Blue-Green vs Canary vs Rolling Deployments

Rolling Deployment

How it works: Gradually replace instances one-by-one

Pros:

  • Efficient resource usage
  • Gradual rollout

Cons:

  • ❌ Mixed versions running simultaneously (API compatibility issues)
  • ❌ Partial downtime as each instance restarts
  • ❌ Complex rollback (need to track which instances updated)
  • ❌ Long deployment time for multiple instances

Canary Deployment

How it works: Route small percentage of traffic to new version

Pros:

  • Test with real traffic
  • Gradual risk mitigation
  • Easy to catch issues early

Cons:

  • ❌ Requires sophisticated traffic splitting (ALB, Istio, etc.)
  • ❌ Complex metrics/monitoring setup needed
  • ❌ Need automated rollback logic
  • ❌ Infrastructure overhead (traffic routing, health checks)
  • ❌ Overkill for small-scale services

Blue-Green Deployment (Winner) ✅

How it works: Maintain two identical environments, switch instantly

Pros:

  • ✅ True zero-downtime (instant switch)
  • ✅ Easy rollback (just switch back)
  • ✅ Simple implementation (Nginx config change)
  • ✅ Full testing before switch
  • ✅ Clear state: old or new (no mixed versions)
  • ✅ Works great with Docker Compose

Cons:

  • Double resource usage (acceptable on t3.large)
  • Requires health checks

The deciding factor: For a push notification service with long-running jobs, Blue-Green guarantees that:

  1. No in-flight requests are dropped
  2. BullMQ workers complete their jobs
  3. Database connections drain properly
  4. Rollback is instant if issues occur

The resource cost is worth the operational simplicity and reliability.

Why Not Docker Hub?

Many tutorials use Docker Hub as an intermediary:

Build → Push to Docker Hub → Pull on EC2 → Run

Why I skipped it:

  1. Security: No credentials stored in GitHub Actions for registry
  2. Simplicity: Direct build on EC2 (one less moving part)
  3. Speed: No image push/pull over internet
  4. Cost: Docker Hub rate limits on free tier
  5. Privacy: Source code never leaves my infrastructure

Trade-off: EC2 must have enough resources for building. Solution: t3.large with 2 vCPU / 8GB RAM handles builds fine.
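
If you adopt a build-on-the-box setup like this, it's worth sanity-checking headroom before the first deploy. A minimal check using only standard tools (which thresholds matter depends on your build):

# Pre-build headroom check
echo "CPUs:   $(nproc)"
echo "Memory: $(free -h | awk '/^Mem:/ {print $7 " available"}')"
echo "Disk:   $(df -h / | awk 'NR==2 {print $4 " free"}')"
docker system df   # Space consumed by images, containers, volumes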

Implementation: Step by Step

Step 1: Multi-Stage Dockerfile (Optimized)

Key insight: Separate build and runtime stages to minimize image size and attack surface.

# Dockerfile

# ============= Stage 1: Build =============
FROM node:22-alpine AS builder

WORKDIR /app

# Copy dependency files first (layer caching)
COPY package*.json ./
RUN npm install

# Copy source code and build
COPY . .
RUN npm run build

# ============= Stage 2: Runtime =============
FROM node:22-alpine AS runner

WORKDIR /app

# Copy only production artifacts from builder
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/package*.json ./

# Install only production dependencies + curl for health checks
RUN apk add --no-cache curl \
  && npm install --only=production

EXPOSE 3002

CMD ["node", "dist/main"]

Why Multi-Stage?

| Metric | Single-Stage | Multi-Stage | Improvement |
| --- | --- | --- | --- |
| Image size | ~1.2 GB | ~350 MB | 3.4x smaller |
| Build dependencies | Included | Excluded | Attack surface ↓ |
| TypeScript files | Included | Excluded | Security ↑ |
| Layer caching | Poor | Excellent | Build speed ↑ |

Pro tips:

  • node:22-alpine vs node:22: Alpine is 4x smaller (120MB vs 500MB base)
  • npm install --only=production: Excludes devDependencies (TypeScript, Jest, etc.)
  • COPY package*.json before source: Docker layer caching speeds up builds
  • apk add curl: Needed for Docker health checks
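
The size claims above are easy to verify locally after a build (the image name below is a placeholder; Docker Compose v2 typically names images <project>-<service>):

# List local images with sizes
docker images

# See which layers contribute the most weight
docker history messaging-service-messaging-blue | head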

Step 2: Docker Compose with Health Checks

Critical requirement: Know when the container is actually ready (not just "running").

# docker-compose.yml
services:

  ### Blue Container
  messaging-blue:
    build: .
    env_file: .env
    depends_on:
      redis-messaging:
        condition: service_healthy  # Wait for Redis
    networks: [messaging-net]
    ports:
      - "3011:3002"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3002/health/live"]
      interval: 10s      # Check every 10 seconds
      timeout: 5s        # Timeout after 5 seconds
      retries: 5         # Unhealthy after 5 consecutive failures
      start_period: 20s  # Grace period for app startup

  ### Green Container
  messaging-green:
    build: .
    env_file: .env
    depends_on:
      redis-messaging:
        condition: service_healthy
    networks: [messaging-net]
    ports:
      - "3012:3002"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3002/health/live"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 20s

  ### Redis (shared by both containers)
  redis-messaging:
    container_name: redis-messaging
    image: redis:7-alpine
    command: >
      redis-server 
      --appendonly yes 
      --requirepass password123 
      --maxmemory 512mb 
      --maxmemory-policy noeviction
    ports:
      - "6380:6379"
    restart: always
    networks: [messaging-net]
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "password123", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

networks:
  messaging-net:
    driver: bridge

Key Design Decisions:

  1. Separate ports (3011/3012): Blue and Green can run simultaneously
  2. Shared Redis: Both containers use the same Redis instance (job queue persistence)
  3. Health checks: The deploy script waits for healthy status before switching
  4. start_period: 20s: Grace period for NestJS app initialization
  5. retries: 5: A container is marked unhealthy only after 5 consecutive failed checks (tolerates transient blips)

Why /health/live endpoint?

// src/main.ts (NestJS)
async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  // Health check endpoint (simple but effective).
  // Note: INestApplication has no app.get(path, handler) route method,
  // so register the route on the underlying Express instance.
  app.getHttpAdapter().getInstance().get('/health/live', (req, res) => {
    res.status(200).send('OK');
  });

  await app.listen(3002, '0.0.0.0');
}
bootstrap();

This endpoint confirms:

  • ✅ NestJS app started
  • ✅ HTTP server listening
  • ✅ Ready to handle requests
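
When debugging, it helps to exercise the endpoint at each layer; the container name follows the deploy script's naming, and the ports follow the compose file above:

# Inside the container (what the Docker health check runs)
docker exec messaging-service-messaging-blue-1 curl -f http://localhost:3002/health/live

# From the host, through the published ports
curl -f http://localhost:3011/health/live   # Blue
curl -f http://localhost:3012/health/live   # Green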

For production, consider more sophisticated health checks:

// Registered on the Express instance, as above. checkDatabase, checkRedis,
// and checkFirebase are app-specific helpers, not shown here.
app.getHttpAdapter().getInstance().get('/health/ready', async (req, res) => {
  // Check database connection
  const dbOk = await checkDatabase();
  // Check Redis connection
  const redisOk = await checkRedis();
  // Check external API
  const fcmOk = await checkFirebase();

  if (dbOk && redisOk && fcmOk) {
    res.status(200).json({ status: 'ready' });
  } else {
    res.status(503).json({ status: 'not ready' });
  }
});

Step 3: Nginx Configuration (Traffic Router)

Nginx acts as the traffic cop, instantly routing all requests to the active container.

# /etc/nginx/conf.d/messaging.goodtv.co.kr.conf

# Blue-Green upstream definition
upstream messaging-server {
    server localhost:3011;         # Blue (initially primary)
    server localhost:3012 backup;  # Green (initially backup)
}

server {
    listen 443 ssl;
    server_name messaging.goodtv.co.kr;

    # SSL certificates
    ssl_certificate      /etc/nginx/ssl/goodtv.co.kr.pem;
    ssl_certificate_key  /etc/nginx/ssl/WILD.goodtv.co.kr.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_prefer_server_ciphers on;
    ssl_ciphers HIGH:!aNULL:!MD5;

    location / {
        proxy_pass http://messaging-server;
        proxy_http_version 1.1;

        # WebSocket support (for long-polling if needed)
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;

        # Forward client info
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
        proxy_set_header Cookie $http_cookie;

        proxy_redirect off;
    }

    access_log /var/log/nginx/api-ssl-access.log;
    error_log /var/log/nginx/api-ssl-error.log;
}

How Nginx backup works:

Normal state:
  Primary: :3011  β†’  100% traffic
  Backup:  :3012  β†’  0% traffic (only used if :3011 fails)

After deployment switch:
  Primary: :3012  β†’  100% traffic
  Backup:  :3011  β†’  0% traffic (old version, kept for rollback)

Key benefit: Switching traffic is a simple text replacement + Nginx reload (< 10ms).
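
A rough way to convince yourself the backup directive behaves as described, assuming Blue is currently primary: stop it and watch Nginx fail over. Expect the first request after the stop to be slightly slower while Nginx marks :3011 as failed.

# Stop the primary container
docker compose -p messaging-service stop messaging-blue

# Requests still succeed -- now served by :3012 (Green) via the backup entry
curl -sf https://messaging.goodtv.co.kr/health/live && echo "failover OK"

# Bring Blue back when done
docker compose -p messaging-service start messaging-blue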

Step 4: GitHub Actions Workflow

Trigger on main branch push, SSH into EC2, run deploy script.

# .github/workflows/ci-cd-messaging.yml
name: Messaging CI/CD

on:
  push:
    branches: [ main ]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest

    steps:
      # Step 1: Checkout code
      - name: Checkout repository
        uses: actions/checkout@v3

      # Step 2: Setup Node.js (for local build/test if needed)
      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '22'

      # Step 3: Install dependencies (optional, for tests)
      - name: Install dependencies
        run: npm install

      # Step 4: Build project (validates TypeScript)
      - name: Build project
        run: npm run build

      # Step 5: SSH into EC2 and trigger deploy script
      - name: Deploy to EC2
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.SSH_HOST }}
          port: ${{ secrets.SSH_PORT }}
          username: ${{ secrets.SSH_USER }}
          key: ${{ secrets.SSH_KEY }}
          script: |
            /home/ec2-user/deploy-messaging.sh

GitHub Secrets Setup:

Settings → Secrets and variables → Actions → New repository secret

SSH_HOST: 54.123.45.67 (EC2 public IP)
SSH_PORT: 22
SSH_USER: ec2-user
SSH_KEY: <contents of private key>
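
If you prefer the terminal to the web UI, the same secrets can be set with the GitHub CLI (values are the placeholders from above; the key path is hypothetical):

gh secret set SSH_HOST --body "54.123.45.67"
gh secret set SSH_PORT --body "22"
gh secret set SSH_USER --body "ec2-user"
gh secret set SSH_KEY < ~/.ssh/ec2-deploy-key  # private key file (path is an example)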

Why build twice (GitHub Actions + EC2)?

  • GitHub Actions build: Validation only (catch TypeScript errors early)
  • EC2 build: Actual deployment (ensures consistency with production environment)

Alternative approach: Build on GitHub Actions, push to Docker registry, pull on EC2. I skipped this for simplicity.

Step 5: The Deploy Script (The Magic)

This script orchestrates the entire Blue-Green switch.

#!/bin/bash
# /home/ec2-user/deploy-messaging.sh
set -euo pipefail  # Exit on error, undefined variable, or pipe failure

# ============= 0. Cleanup Dangling Images =============
echo "πŸ” Checking for dangling images..."
DANGLING_COUNT=$(docker images -f "dangling=true" -q | wc -l)

if [ "$DANGLING_COUNT" -gt 0 ]; then
    echo "🧹 Found $DANGLING_COUNT dangling images. Cleaning up..."
    docker image prune -f
else
    echo "βœ… No dangling images found."
fi

# Navigate to project directory
cd /home/ec2-user/push-messaging-server


# ============= 1. Remove Old Stopped Containers =============
for OLD_COLOR in messaging-blue messaging-green; do
  CONTAINER_NAME="messaging-service-${OLD_COLOR}-1"
  if docker inspect "$CONTAINER_NAME" >/dev/null 2>&1; then
    STATUS=$(docker inspect --format='{{.State.Status}}' "$CONTAINER_NAME")
    if [ "$STATUS" = "exited" ]; then
      echo "πŸ—‘  Removing old stopped container: $CONTAINER_NAME"
      docker compose -p messaging-service rm -f "$OLD_COLOR"
    fi
  fi
done


# ============= 2. Determine Blue-Green Target =============
CURRENT=$(docker compose -p messaging-service ps -q messaging-blue | wc -l)

if [ "$CURRENT" -gt 0 ]; then
    # Blue is running β†’ deploy Green
    NEW=messaging-green
    OLD=messaging-blue
    NEW_PORT=3012
    OLD_PORT=3011
else
    # Green is running (or nothing) β†’ deploy Blue
    NEW=messaging-blue
    OLD=messaging-green
    NEW_PORT=3011
    OLD_PORT=3012
fi

echo "🎯 Deployment target: $NEW (port: $NEW_PORT)"


# ============= 3. Pull Latest Code & Build New Container =============
git fetch --all
git checkout main
git reset --hard origin/main

docker compose -p messaging-service build $NEW
docker compose -p messaging-service up -d $NEW || {
    echo "🚨 Container startup failed. Showing logs:"
    docker logs messaging-service-$NEW-1
    exit 1
}


# ============= 4. Wait for Health Check =============
MAX_RETRIES=30
COUNT=0

while [ "$(docker inspect --format='{{.State.Health.Status}}' messaging-service-$NEW-1)" != "healthy" ]; do
    if [ "$COUNT" -ge "$MAX_RETRIES" ]; then
        echo "❌ Health check failed: $NEW container not healthy."
        docker logs messaging-service-$NEW-1
        docker compose -p messaging-service stop $NEW
        exit 1
    fi
    echo "🟑 Waiting for health check... ($COUNT/$MAX_RETRIES)"
    sleep 5
    COUNT=$((COUNT + 1))
done

echo "βœ… $NEW container is healthy!"


# ============= 5. Switch Nginx Configuration =============
if [ "$NEW" == "messaging-green" ]; then
    # Switch to Green
    sudo sed -i "s/^ *server localhost:3011;/server localhost:3012;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
    sudo sed -i "s/^ *server localhost:3012 backup;/server localhost:3011 backup;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
else
    # Switch to Blue
    sudo sed -i "s/^ *server localhost:3012;/server localhost:3011;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
    sudo sed -i "s/^ *server localhost:3011 backup;/server localhost:3012 backup;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
fi


# ============= 6. Reload Nginx =============
if ! sudo nginx -t; then
  echo "❌ Nginx configuration test failed. Aborting."
  exit 1
fi

sudo nginx -s reload
echo "βœ… Nginx reloaded. Traffic now routing to $NEW (port: $NEW_PORT)"


# ============= 7. Gracefully Stop Old Container =============
sleep 30  # Wait for in-flight requests to complete

docker compose -p messaging-service stop $OLD || true

if docker inspect messaging-service-$OLD-1 >/dev/null 2>&1; then
  echo "⏳ Waiting for $OLD container to stop..."
  MAX_RETRIES=30
  COUNT=0

  while [ "$(docker inspect --format='{{.State.Status}}' messaging-service-$OLD-1)" != "exited" ]; do
    if [ "$COUNT" -ge "$MAX_RETRIES" ]; then
      echo "❌ Failed to stop $OLD container gracefully."
      docker logs messaging-service-$OLD-1
      break
    fi
    sleep 2
    COUNT=$((COUNT + 1))
  done

  echo "πŸ”’ $OLD container stopped (kept for rollback)."
else
  echo "πŸ’‘ No $OLD container found. Skipping."
fi

echo "πŸŽ‰ Deployment complete! Active: $NEW, Standby: $OLD"
Enter fullscreen mode Exit fullscreen mode

Script Flow Visualization:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Deploy Script Flow                       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

1. Cleanup
   └─ Remove dangling images (free disk space)

2. Determine Target
   └─ Blue running? → Deploy Green
   └─ Green running? → Deploy Blue
   └─ Nothing running? → Deploy Blue

3. Build & Start New Container
   └─ git pull latest code
   └─ docker compose build NEW
   └─ docker compose up -d NEW
   └─ Wait for health check (up to 2.5 min)

4. Switch Traffic (instant)
   └─ sed: Replace port in Nginx config
   └─ nginx -t (validate config)
   └─ nginx -s reload (< 10ms switch)

5. Graceful Shutdown
   └─ sleep 30s (drain in-flight requests)
   └─ docker stop OLD (keep container for rollback)

Result:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Blue        β”‚                    β”‚  Green       β”‚
β”‚  (OLD)       β”‚  Traffic switch    β”‚  (NEW)       β”‚
β”‚  Stopped     │◀───────────────────│  Active      β”‚
β”‚  (Rollback)  β”‚       Instant      β”‚  (Primary)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Critical implementation details:

  1. set -euo pipefail: Fail fast on any error
   set -e  # Exit on error
   set -u  # Exit on undefined variable
   set -o pipefail  # Exit if any command in pipe fails

  2. Health check loop: Don't switch traffic until the container is truly ready
   # Docker health status: starting → healthy → unhealthy
   while [ "$(docker inspect --format='{{.State.Health.Status}}' ...)" != "healthy" ]; do
     # Wait...
   done

  3. sed for config replacement: Fast, atomic text replacement
   # Before: server localhost:3011;
   # After:  server localhost:3012;
   sudo sed -i "s/server localhost:3011;/server localhost:3012;/" config.conf

  4. nginx -s reload: Graceful reload (no dropped connections)

    • New workers start with new config
    • Old workers finish current requests
    • Old workers shut down after requests complete
    • Total switch time: < 10ms

  5. 30-second drain: Wait for in-flight requests to complete
   sleep 30  # Typical request: 1-5s, long-running: up to 30s
   docker stop $OLD  # Now safe to stop

  6. Keep stopped container: Easy rollback if needed
   # Don't delete, just stop
   docker compose stop $OLD
   # To rollback: just reverse the Nginx switch

Production Deployment in Action

Typical Deployment Timeline

T+0:00  - GitHub push to main
T+0:05  - GitHub Actions triggered
T+0:15  - Actions build validates (TypeScript compilation)
T+0:20  - SSH into EC2, deploy script starts
T+0:25  - git pull, docker build starts
T+3:00  - New container starts
T+3:30  - Health check passes (5 retries × 10s)
T+3:31  - Nginx config switched
T+3:32  - Traffic now on new container ← ZERO DOWNTIME
T+4:02  - Old container stopped (30s drain)
T+4:03  - Deployment complete ✅

Total time: ~4 minutes
User-visible downtime: 0 seconds

Real Deployment Log

[ec2-user@ip-172-31-38-149 ~]$ /home/ec2-user/deploy-messaging.sh

🔍 Checking for dangling images...
✅ No dangling images found.

🎯 Deployment target: messaging-green (port: 3012)

Fetching latest code...
Already on 'main'
HEAD is now at a7f3d21 Fix: Improve batch logging performance

Building new container...
[+] Building 142.3s (18/18) FINISHED
 => [builder 1/6] FROM node:22-alpine                   0.0s
 => [builder 2/6] WORKDIR /app                          0.1s
 => [builder 3/6] COPY package*.json ./                 0.1s
 => [builder 4/6] RUN npm install                      48.2s
 => [builder 5/6] COPY . .                              1.3s
 => [builder 6/6] RUN npm run build                    52.1s
 => [runner 1/4] FROM node:22-alpine                    0.0s
 => [runner 2/4] WORKDIR /app                           0.0s
 => [runner 3/4] COPY --from=builder /app/dist ./dist   0.2s
 => [runner 4/4] RUN apk add --no-cache curl && npm... 28.4s
 => exporting to image                                  11.9s

Starting container...
[+] Running 1/1
 ✔ Container messaging-service-messaging-green-1  Started  2.3s

🟑 Waiting for health check... (0/30)
🟑 Waiting for health check... (1/30)
🟑 Waiting for health check... (2/30)
🟑 Waiting for health check... (3/30)
🟑 Waiting for health check... (4/30)
🟑 Waiting for health check... (5/30)
✅ messaging-green container is healthy!

Switching Nginx configuration...
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
✅ Nginx reloaded. Traffic now routing to messaging-green (port: 3012)

Waiting 30s for in-flight requests to complete...
⏳ Waiting for messaging-blue container to stop...
🔒 messaging-blue container stopped (kept for rollback).

🎉 Deployment complete! Active: messaging-green, Standby: messaging-blue

Monitoring During Deployment

Server metrics during a typical deployment:

CPU Usage:
  Pre-deploy:  18-22% (normal load)
  Building:    65-80% (docker build)
  Post-deploy: 20-25% (slightly higher, new container initializing)
  Steady:      18-22% (back to normal)

Memory Usage:
  Pre-deploy:  3.2 GB / 8 GB (40%)
  Both running: 4.8 GB / 8 GB (60%) ← Both containers alive during switch
  Post-deploy: 3.4 GB / 8 GB (42%)

Request Error Rate:
  During switch: 0.00% ← Zero dropped requests

Response Time:
  Pre-deploy:   avg 145ms, p95 320ms
  During switch: avg 147ms, p95 325ms ← No spike
  Post-deploy:  avg 143ms, p95 318ms

Key observation: Users experience zero impact during deployment.
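
One simple way to back that claim with data is to count 5xx responses in the Nginx access log around the deploy window. This one-liner assumes the default combined log format, where the status code is field 9:

# Count 5xx responses recorded in the access log
awk '$9 ~ /^5/ {c++} END {print c+0 " 5xx responses"}' /var/log/nginx/api-ssl-access.log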

Rollback Strategy

Instant Rollback (< 1 minute)

If the new deployment has issues:

# 1. Identify the problem (monitoring alerts)
echo "🚨 Issues detected in new deployment!"

# 2. Determine current active container
if docker compose -p messaging-service ps -q messaging-green | grep -q .; then
    # Green is active, rollback to Blue
    ACTIVE=messaging-green
    ROLLBACK=messaging-blue
    ACTIVE_PORT=3012
    ROLLBACK_PORT=3011
else
    # Blue is active, rollback to Green
    ACTIVE=messaging-blue
    ROLLBACK=messaging-green
    ACTIVE_PORT=3011
    ROLLBACK_PORT=3012
fi

echo "πŸ”„ Rolling back from $ACTIVE to $ROLLBACK"

# 3. Switch Nginx back to old container
if [ "$ROLLBACK" == "messaging-blue" ]; then
    sudo sed -i "s/server localhost:$ACTIVE_PORT;/server localhost:$ROLLBACK_PORT;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
    sudo sed -i "s/server localhost:$ROLLBACK_PORT backup;/server localhost:$ACTIVE_PORT backup;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
else
    sudo sed -i "s/server localhost:$ACTIVE_PORT;/server localhost:$ROLLBACK_PORT;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
    sudo sed -i "s/server localhost:$ROLLBACK_PORT backup;/server localhost:$ACTIVE_PORT backup;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
fi

# 4. Reload Nginx (instant switch)
sudo nginx -t && sudo nginx -s reload

# 5. Restart old container if stopped
docker compose -p messaging-service start $ROLLBACK

echo "βœ… Rollback complete! Active: $ROLLBACK"

Total rollback time: ~30-60 seconds (most of that is restarting the old container if it was stopped)

Why Blue-Green Excels at Rollbacks

| Strategy | Rollback Time | Complexity | Data Loss Risk |
| --- | --- | --- | --- |
| Rolling | 5-10 min | High | Medium |
| Canary | 2-5 min | High | Low |
| Blue-Green | 30-60s | Low | None |

The secret: Old container is still alive (just stopped). Starting it takes seconds.

Gotchas and Lessons Learned

1. Health Checks Are Non-Negotiable

Initial mistake: No health check, just sleep 30 after starting container.

Problem: Sometimes the app took 40+ seconds to start, leading to:

  • Nginx switched before app ready
  • 502 Bad Gateway errors
  • Partial downtime

Solution: Proper Docker health checks with retries.

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3002/health/live"]
  interval: 10s
  timeout: 5s
  retries: 5        # Unhealthy only after 5 consecutive failures
  start_period: 20s # Grace period

Result: Zero false positives, zero premature switches.

2. Graceful Shutdown Matters

Initial mistake: docker stop immediately after switch.

Problem: In-flight requests aborted mid-processing, causing:

  • Failed push notifications
  • Incomplete database transactions
  • BullMQ jobs marked as failed

Solution: 30-second drain period.

# Wait for requests to complete
sleep 30

# Now safe to stop
docker stop $OLD

Result: Zero dropped requests during deployment.
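
A related knob worth knowing: docker stop sends SIGTERM, then SIGKILL after a grace period (10 seconds by default). If jobs may still need time to wind down after the drain, the timeout can be extended; the 60 seconds here is just an example value:

# Give the old container up to 60s to exit cleanly after SIGTERM
docker compose -p messaging-service stop -t 60 $OLD

Pairing this with a SIGTERM handler that closes the BullMQ workers lets active jobs finish (or be re-queued) instead of being killed mid-run.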

3. Disk Space Monitoring

Problem: Docker images accumulated, filling the disk.

df -h
# /dev/xvda1  100G  92G  8G  92% /  ← Uh oh

Solution: Automated cleanup in deploy script.

# Remove dangling images (layers from old builds)
docker image prune -f

# Optional: Weekly cron job for aggressive cleanup
# crontab -e
# 0 2 * * 0 docker system prune -af --volumes

Result: Disk usage stable at 45-50%.

4. Redis Persistence Configuration

Initial mistake: Redis without persistence.

Problem: Container restart = lost BullMQ job queue.

Solution: Redis AOF (Append-Only File).

redis-messaging:
  image: redis:7-alpine
  command: >            # --appendonly yes enables AOF persistence
    redis-server
    --appendonly yes
    --requirepass password123

Result: Job queue survives deployments.
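
You can confirm AOF is actually enabled on the running instance; the container name and password follow the compose file above:

# Should print: appendonly / yes
docker exec redis-messaging redis-cli -a password123 config get appendonly

# AOF status details
docker exec redis-messaging redis-cli -a password123 info persistence | grep aof_enabled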

5. SELinux Permissions (Amazon Linux)

Problem: Nginx couldn't access Docker containers.

sudo nginx -t
# nginx: [emerg] connect() to localhost:3011 failed (13: Permission denied)

Solution: SELinux policy adjustment.

# Allow Nginx to connect to network
sudo setsebool -P httpd_can_network_connect 1

# Or create custom policy (more secure)
sudo semanage port -a -t http_port_t -p tcp 3011
sudo semanage port -a -t http_port_t -p tcp 3012

Result: Nginx successfully proxies to containers.
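
To verify the boolean stuck (the -P flag makes it persist across reboots), assuming SELinux is in enforcing mode:

# Should print: httpd_can_network_connect --> on
getsebool httpd_can_network_connect

# If you used the port-based policy instead, confirm the ports are labeled
sudo semanage port -l | grep http_port_t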

6. Git Reset Issues

Initial mistake: git pull failing due to local changes.

git pull
# error: Your local changes to the following files would be overwritten

Solution: Hard reset to remote.

git fetch --all
git reset --hard origin/main  # Force match remote

Trade-off: Local changes lost. Solution: Never edit directly on EC2.

Performance Impact Analysis

Resource Usage During Deployment

t3.large specs: 2 vCPU, 8 GB RAM, General Purpose SSD

| Phase | CPU | Memory | Duration |
| --- | --- | --- | --- |
| Normal operation | 18-22% | 3.2 GB | - |
| Docker build | 65-80% | 4.0 GB | 2-3 min |
| Both containers | 35-45% | 4.8 GB | 30-60s |
| New container only | 20-25% | 3.4 GB | - |

Conclusion: t3.large has ample headroom for Blue-Green deployment.
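
Numbers like those in the table are straightforward to sample yourself while a deploy runs:

# Per-container CPU/memory snapshot while both colors are up
docker stats --no-stream

# Host-level view during the build phase
top -b -n 1 | head -15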

Network Traffic

During deployment:

  • Inbound: Normal (users unaffected)
  • Outbound: +50 MB (npm install, git pull)
  • No external Docker registry traffic (build locally)

Database Connections

Concern: Both containers connecting simultaneously?

Reality:

  • Old container: Draining, no new connections
  • New container: Fresh connections
  • Total: Never exceeds normal load

MSSQL connection pool: 50 max; typical usage is 8-12 connections.

Production Metrics: 6 Months Later

After 6 months and 150+ deployments:

| Metric | Result |
| --- | --- |
| Total deployments | 156 |
| Failed deployments | 2 (1.3%) |
| Rollbacks needed | 1 (0.6%) |
| User-reported issues | 0 |
| Average deployment time | 3m 47s |
| Downtime per deployment | 0s |
| Total uptime | 99.97% |

Failure causes:

  1. Docker build failed (TypeScript error) → Caught before switch
  2. Health check timeout (DB connection issue) → Deployment aborted
  3. Rollback: Memory leak in new version → Rolled back in 45 seconds

Key insight: The system is self-healing. Failed deployments never affect users.

Cost Analysis

Infrastructure Costs (Monthly)

| Resource | Specs | Cost |
| --- | --- | --- |
| EC2 t3.large | 2 vCPU, 8 GB RAM | $60.74 |
| EBS SSD | 100 GB | $10.00 |
| Data transfer | ~50 GB/month | $4.50 |
| Total | | $75.24 |

GitHub Actions: Free tier (2,000 minutes/month) - using ~200 minutes/month

Alternative Costs (for comparison)

| Solution | Monthly Cost | Setup Time |
| --- | --- | --- |
| Current (GitHub Actions + EC2) | $75 | 2 hours |
| Jenkins on EC2 | $95 (t3.medium for Jenkins + t3.large for app) | 1-2 days |
| AWS CodeDeploy | $85 (EC2 + ALB + CodeDeploy) | 4-6 hours |
| ECS Fargate Blue-Green | $120 (Fargate + ALB) | 1 day |
| Kubernetes (EKS) | $220 (EKS + nodes + ALB) | 2-3 days |

Winner: Current solution offers best cost/simplicity ratio for small-scale services.

When NOT to Use This Approach

This Blue-Green + Docker + GitHub Actions setup is great for small-to-medium services, but has limits:

Scale Limitations

❌ Don't use this approach when:

  • Traffic > 10,000 RPS: Single EC2 instance bottleneck
  • Multi-region deployment: No built-in geo-routing
  • 10+ services: Manual script maintenance becomes unwieldy
  • Team size > 10: Need centralized CI/CD governance

✅ Better alternatives:

  • High scale: Kubernetes (EKS), AWS ECS
  • Multi-region: AWS Global Accelerator + ALB
  • Microservices: Kubernetes, Service Mesh (Istio)
  • Large teams: Jenkins, GitLab CI, ArgoCD

Complexity Trade-offs

This approach optimizes for:

  • ✅ Simplicity (1 server, shell script)
  • ✅ Cost (minimal infrastructure)
  • ✅ Speed (fast iteration)

If you need:

  • Advanced traffic management (canary %, A/B testing)
  • Auto-scaling (horizontal pod autoscaler)
  • Service mesh (Istio, Linkerd)
  • Observability (Prometheus, Grafana, Jaeger)

Then invest in Kubernetes. The operational complexity pays off at scale.

Future Improvements

Short-term (planned)

  1. Automated smoke tests after deployment
   # After Nginx switch, before old container stop
   curl -f https://messaging.goodtv.co.kr/health/live || rollback

  2. Slack notifications
   curl -X POST $SLACK_WEBHOOK \
     -d "{\"text\": \"✅ Deployment successful: $NEW\"}"

  3. Rollback button (GitHub Actions workflow_dispatch)
   on:
     workflow_dispatch:
       inputs:
         target:
           description: 'Container to switch to (blue/green)'
           required: true

Long-term (if needed)

  1. Canary deployment: Route 10% traffic to new version first
  2. Auto-scaling: Add more EC2 instances behind ALB
  3. Monitoring: Integrate Prometheus + Grafana
  4. Database migrations: Separate migration job before deployment

Conclusion

Building a production-grade CI/CD pipeline doesn't require expensive enterprise tools or complex Kubernetes clusters. With GitHub Actions, Docker multi-stage builds, and a well-crafted shell script, I achieved:

  • True zero-downtime deployments (0 seconds user impact)
  • Fast iteration (4-minute deployments)
  • Easy rollbacks (30-60 seconds)
  • Low cost ($75/month infrastructure)
  • High reliability (99.97% uptime over 6 months)

The key decisions:

  1. GitHub Actions over Jenkins: Managed infrastructure, zero maintenance
  2. Blue-Green over Canary: Simpler for long-running jobs, instant rollback
  3. No Docker Hub: Direct build on EC2, fewer moving parts
  4. Multi-stage Dockerfile: 3.4x smaller images, faster deployments
  5. Health checks: No premature switches, reliable deployments

For small-to-medium services (< 10,000 RPS, single region, small team), this architecture hits the sweet spot of simplicity, reliability, and cost.

Key Takeaways

  • Blue-Green deployment eliminates downtime for services with long-running processes
  • GitHub Actions provides enterprise-grade CI/CD without operational overhead
  • Multi-stage Dockerfiles reduce image size 3-4x
  • Health checks are mandatory for reliable container switching
  • Nginx upstream with backup provides instant traffic switching
  • 30-second drain period prevents dropped in-flight requests
  • Keeping stopped containers enables instant rollback
  • Shell scripts can be production-grade with proper error handling
  • Choose simplicity over features until you outgrow it
