How I implemented production-grade CI/CD for a high-traffic push notification service handling 100K+ daily requests with zero downtime
When your push notification service handles 100,000+ daily requests and can't afford a single second of downtime, deployment strategy becomes critical. After experimenting with various CI/CD approaches, I settled on a GitHub Actions + Docker + Blue-Green deployment pattern that achieves true zero-downtime releases.
Here's the complete implementation story, including why I chose this specific stack over alternatives like Jenkins, why Blue-Green over Canary or Rolling deployments, and all the gotchas I encountered in production.
The Requirements: Zero Tolerance for Downtime
My push notification service has strict uptime requirements:
- 24/7 availability: No maintenance windows allowed
- High traffic: 100,000+ notifications daily
- Long-running processes: BullMQ workers processing jobs for 5-15 minutes
- Database connections: Existing connections must drain gracefully
- No dropped requests: In-flight requests must complete successfully
Traditional deployment approaches all had deal-breakers:
| Strategy | Downtime | Complexity | Issue |
|---|---|---|---|
| Direct replacement | 30-60s | Low | ❌ Unacceptable downtime |
| Rolling update | ~10s | Medium | ❌ Partial outage during rollout |
| Canary | Minimal | High | ❌ Requires traffic splitting infrastructure |
| Blue-Green | 0s | Medium | ✅ Perfect fit |
Architecture Overview
High-Level Flow
┌─────────────┐ ┌──────────────┐ ┌─────────────┐ ┌──────────────┐
│ GitHub │────▶│GitHub Actions│────▶│ EC2 SSH │────▶│ Deploy │
│ Push │ │ Workflow │ │ Trigger │ │ Script │
└─────────────┘ └──────────────┘ └─────────────┘ └──────┬───────┘
│
┌─────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────┐
│ Docker Compose Build │
│ ┌────────────────────────────────────────┐ │
│ │ Multi-Stage Dockerfile │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │Build Stage │─▶│Runtime Stage│ │ │
│ │ │(Node 22) │ │(Optimized) │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └────────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────┐
│ Blue-Green Container Switch │
│ │
│ Before: After: │
│ ┌──────────┐ ┌──────────┐ │
│ │ Blue │◀─Primary │ Green │◀─Primary │
│ │ :3011 │ │ :3012 │ │
│ └──────────┘ └──────────┘ │
│ ┌──────────┐ ┌──────────┐ │
│ │ Green │◀─Backup │ Blue │◀─Backup │
│ │ :3012 │ │ :3011 │ │
│ └──────────┘ └──────────┘ │
└──────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────┐
│ Nginx Config Switch │
│ │
│ server localhost:3011; → :3012; │
│ server localhost:3012 backup; → :3011 backup;│
│ │
│ nginx -s reload (instant switch) │
└───────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────┐
│ Graceful Old Container Shutdown │
│ │
│ Wait 30s for in-flight requests │
│ Stop old container (keep for rollback) │
└──────────────────────────────────────────────┘
Infrastructure Components
┌─────────────────────────────────┐
│ AWS EC2 Instance │
│ (Amazon Linux 2023, t3.large) │
└─────────────────────────────────┘
│
┌─────────────────────────┼─────────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌──────────────┐
│ Nginx │ │ Docker │ │ Shell │
│ (Reverse │ │ Compose │ │ Scripts │
│ Proxy) │ │ │ │ │
│ │ │ ┌───────────┐ │ │ deploy- │
│ Port 443 │───────▶││ Blue │ │ │ messaging.sh │
│ (SSL/TLS) │ │ │ :3011 │ │ │ │
│ │ │ └───────────┘ │ └──────────────┘
│ Upstream: │ │ │
│ :3011/:3012 │ │ ┌───────────┐ │
└───────────────┘ │ │ Green │ │
│ │ :3012 │ │
│ └───────────┘ │
│ │
│ ┌───────────┐ │
│ │ Redis │ │
│ │ :6380 │ │
│ └───────────┘ │
└───────────────┘
Why This Stack? Decision Breakdown
GitHub Actions vs Jenkins vs GitLab CI
I evaluated several CI/CD platforms:
Jenkins
Pros:
- Mature ecosystem
- Highly customizable
- Self-hosted (complete control)
Cons:
- ❌ Requires dedicated server maintenance
- ❌ Plugin hell (compatibility issues)
- ❌ Steep learning curve
- ❌ Need to manage Jenkins updates/security
GitLab CI
Pros:
- Excellent integration with GitLab
- Strong container registry
- Built-in Kubernetes support
Cons:
- ❌ My code is on GitHub (migration friction)
- ❌ GitLab Runner setup required
- ❌ Additional cost for private repos
GitHub Actions (Winner) ✅
Pros:
- ✅ Native GitHub integration (zero setup)
- ✅ Free for public repos, generous free tier for private
- ✅ Managed infrastructure (no server to maintain)
- ✅ Massive marketplace of actions
- ✅ YAML-based (simple, declarative)
- ✅ Built-in secrets management
- ✅ Excellent documentation
Cons:
- Vendor lock-in (acceptable trade-off)
- Minute limits on free tier (not an issue for my usage)
The deciding factor: I could set up the entire pipeline in 30 minutes vs 2-3 days for Jenkins setup, configuration, and maintenance. For a solo developer or small team, this is a no-brainer.
Blue-Green vs Canary vs Rolling Deployments
Rolling Deployment
How it works: Gradually replace instances one-by-one
Pros:
- Efficient resource usage
- Gradual rollout
Cons:
- ❌ Mixed versions running simultaneously (API compatibility issues)
- ❌ Partial downtime as each instance restarts
- ❌ Complex rollback (need to track which instances updated)
- ❌ Long deployment time for multiple instances
Canary Deployment
How it works: Route small percentage of traffic to new version
Pros:
- Test with real traffic
- Gradual risk mitigation
- Easy to catch issues early
Cons:
- ❌ Requires sophisticated traffic splitting (ALB, Istio, etc.)
- ❌ Complex metrics/monitoring setup needed
- ❌ Need automated rollback logic
- ❌ Infrastructure overhead (traffic routing, health checks)
- ❌ Overkill for small-scale services
Blue-Green Deployment (Winner) ✅
How it works: Maintain two identical environments, switch instantly
Pros:
- ✅ True zero-downtime (instant switch)
- ✅ Easy rollback (just switch back)
- ✅ Simple implementation (Nginx config change)
- ✅ Full testing before switch
- ✅ Clear state: old or new (no mixed versions)
- ✅ Works great with Docker Compose
Cons:
- Double resource usage (acceptable on t3.large)
- Requires health checks
The deciding factor: For a push notification service with long-running jobs, Blue-Green guarantees that:
- No in-flight requests are dropped
- BullMQ workers complete their jobs
- Database connections drain properly
- Rollback is instant if issues occur
The resource cost is worth the operational simplicity and reliability.
Why Not Docker Hub?
Many tutorials use Docker Hub as an intermediary:
Build → Push to Docker Hub → Pull on EC2 → Run
Why I skipped it:
- Security: No credentials stored in GitHub Actions for registry
- Simplicity: Direct build on EC2 (one less moving part)
- Speed: No image push/pull over internet
- Cost: Docker Hub rate limits on free tier
- Privacy: Source code never leaves my infrastructure
Trade-off: EC2 must have enough resources for building. Solution: t3.large with 2 vCPU / 8GB RAM handles builds fine.
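Since builds run on the instance itself, it is worth failing fast when disk headroom is gone rather than discovering it mid-build. A minimal preflight sketch; the function names and the 85% default threshold are my assumptions, not part of the actual deploy-messaging.sh:

```shell
# Preflight sketch: refuse to start a local Docker build when the disk
# is nearly full. Names and the 85% default are assumptions, not part
# of the real deploy script.
disk_used_pct() {
  # POSIX df: second row, column 5 is "Use%" for the filesystem of $1
  df -P "$1" | awk 'NR==2 { gsub(/%/, ""); print $5 }'
}

preflight_build() {
  limit=${1:-85}
  used=$(disk_used_pct /)
  if [ "$used" -gt "$limit" ]; then
    echo "Disk ${used}% full (> ${limit}%); run 'docker image prune -f' first" >&2
    return 1
  fi
  echo "Disk ${used}% used; OK to build"
}
```

Dropping a `preflight_build || exit 1` before the `docker compose build` step would have caught the 92%-full disk incident described in the gotchas section.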
Implementation: Step by Step
Step 1: Multi-Stage Dockerfile (Optimized)
Key insight: Separate build and runtime stages to minimize image size and attack surface.
# Dockerfile
# ============= Stage 1: Build =============
FROM node:22-alpine AS builder
WORKDIR /app
# Copy dependency files first (layer caching)
COPY package*.json ./
RUN npm install
# Copy source code and build
COPY . .
RUN npm run build
# ============= Stage 2: Runtime =============
FROM node:22-alpine AS runner
WORKDIR /app
# Copy only production artifacts from builder
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/package*.json ./
# Install only production dependencies + curl for health checks
RUN apk add --no-cache curl \
&& npm install --only=production
EXPOSE 3002
CMD ["node", "dist/main"]
Why Multi-Stage?
| Metric | Single-Stage | Multi-Stage | Improvement |
|---|---|---|---|
| Image size | ~1.2 GB | ~350 MB | 3.4x smaller |
| Build dependencies | Included | Excluded | Attack surface ↓ |
| TypeScript files | Included | Excluded | Security ↑ |
| Layer caching | Poor | Excellent | Build speed ↑ |
Pro tips:
- node:22-alpine vs node:22: Alpine is ~4x smaller (~120 MB vs ~500 MB base)
- npm install --only=production: Excludes devDependencies (TypeScript, Jest, etc.); newer npm versions prefer the --omit=dev spelling
- COPY package*.json before source: Docker layer caching speeds up rebuilds
- apk add curl: Needed for the Docker health checks
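One companion the `COPY . .` step relies on implicitly: a .dockerignore keeps node_modules, build output, and secrets out of the build context, which speeds up builds and protects layer caching. A sketch; the entry list is an assumption based on a typical NestJS layout:

```shell
# Write a minimal .dockerignore so `COPY . .` doesn't ship
# node_modules, build output, or local secrets into the build context.
# Entries are assumptions for a typical NestJS project.
cat > .dockerignore <<'EOF'
node_modules
dist
.git
.env
npm-debug.log
EOF
```

Without this, `COPY . .` sends the host's node_modules into the context, bloating the build and invalidating the cached `npm install` layer on every change.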
Step 2: Docker Compose with Health Checks
Critical requirement: Know when container is actually ready (not just "running")
# docker-compose.yml
services:
### Blue Container
messaging-blue:
build: .
env_file: .env
depends_on:
redis-messaging:
condition: service_healthy # Wait for Redis
networks: [messaging-net]
ports:
- "3011:3002"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3002/health/live"]
interval: 10s # Check every 10 seconds
timeout: 5s # Timeout after 5 seconds
retries: 5 # Unhealthy only after 5 consecutive failures
start_period: 20s # Grace period for app startup
### Green Container
messaging-green:
build: .
env_file: .env
depends_on:
redis-messaging:
condition: service_healthy
networks: [messaging-net]
ports:
- "3012:3002"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3002/health/live"]
interval: 10s
timeout: 5s
retries: 5
start_period: 20s
### Redis (shared by both containers)
redis-messaging:
container_name: redis-messaging
image: redis:7-alpine
command: >
redis-server
--appendonly yes
--requirepass password123
--maxmemory 512mb
--maxmemory-policy noeviction
ports:
- "6380:6379"
restart: always
networks: [messaging-net]
healthcheck:
test: ["CMD", "redis-cli", "-a", "password123", "ping"]
interval: 10s
timeout: 5s
retries: 5
networks:
messaging-net:
driver: bridge
Key Design Decisions:
- Separate ports (3011/3012): Blue and Green can run simultaneously
- Shared Redis: Both containers use the same Redis instance (job queue persistence)
- Health checks: Deploy script waits for healthy status before switching
- start_period: 20s: Grace period for NestJS app initialization
- retries: 5: Marks a container unhealthy only after 5 consecutive failures (avoids flapping on one flaky probe)
Why /health/live endpoint?
// src/main.ts (NestJS)
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';
async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  // Register the probe on the underlying Express instance
  // (INestApplication.get() is for DI lookups, not routes)
  app.getHttpAdapter().getInstance().get('/health/live', (req, res) => {
    res.status(200).send('OK');
  });
  await app.listen(3002, '0.0.0.0');
}
bootstrap();
This endpoint confirms:
- ✅ NestJS app started
- ✅ HTTP server listening
- ✅ Ready to handle requests
For production, consider more sophisticated health checks:
const http = app.getHttpAdapter().getInstance();
http.get('/health/ready', async (req, res) => {
  // Check database connection
  const dbOk = await checkDatabase();
  // Check Redis connection
  const redisOk = await checkRedis();
  // Check external push API (FCM)
  const fcmOk = await checkFirebase();
  if (dbOk && redisOk && fcmOk) {
    res.status(200).json({ status: 'ready' });
  } else {
    res.status(503).json({ status: 'not ready' });
  }
});
Step 3: Nginx Configuration (Traffic Router)
Nginx acts as the traffic cop, instantly routing all requests to the active container.
# /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
# Blue-Green upstream definition
upstream messaging-server {
server localhost:3011; # Blue (initially primary)
server localhost:3012 backup; # Green (initially backup)
}
server {
listen 443 ssl;
server_name messaging.goodtv.co.kr;
# SSL certificates
ssl_certificate /etc/nginx/ssl/goodtv.co.kr.pem;
ssl_certificate_key /etc/nginx/ssl/WILD.goodtv.co.kr.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_prefer_server_ciphers on;
ssl_ciphers HIGH:!aNULL:!MD5;
location / {
proxy_pass http://messaging-server;
proxy_http_version 1.1;
# WebSocket support (for long-polling if needed)
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection 'upgrade';
proxy_set_header Host $host;
proxy_cache_bypass $http_upgrade;
# Forward client info
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto https;
proxy_set_header Cookie $http_cookie;
proxy_redirect off;
}
access_log /var/log/nginx/api-ssl-access.log;
error_log /var/log/nginx/api-ssl-error.log;
}
How Nginx backup works:
Normal state:
Primary: :3011 → 100% traffic
Backup: :3012 → 0% traffic (only used if :3011 fails)
After deployment switch:
Primary: :3012 → 100% traffic
Backup: :3011 → 0% traffic (old version, kept for rollback)
Key benefit: Switching traffic is a simple text replacement + Nginx reload (< 10ms).
Step 4: GitHub Actions Workflow
Trigger on main branch push, SSH into EC2, run deploy script.
# .github/workflows/ci-cd-messaging.yml
name: Messaging CI/CD
on:
push:
branches: [ main ]
jobs:
build-and-deploy:
runs-on: ubuntu-latest
steps:
# Step 1: Checkout code
- name: Checkout repository
uses: actions/checkout@v3
# Step 2: Setup Node.js (for local build/test if needed)
- name: Set up Node.js
uses: actions/setup-node@v3
with:
node-version: '22'
# Step 3: Install dependencies (optional, for tests)
- name: Install dependencies
run: npm install
# Step 4: Build project (validates TypeScript)
- name: Build project
run: npm run build
# Step 5: SSH into EC2 and trigger deploy script
- name: Deploy to EC2
uses: appleboy/ssh-action@v1
with:
host: ${{ secrets.SSH_HOST }}
port: ${{ secrets.SSH_PORT }}
username: ${{ secrets.SSH_USER }}
key: ${{ secrets.SSH_KEY }}
script: |
/home/ec2-user/deploy-messaging.sh
GitHub Secrets Setup:
Settings → Secrets and variables → Actions → New repository secret
SSH_HOST: 54.123.45.67 (EC2 public IP)
SSH_PORT: 22
SSH_USER: ec2-user
SSH_KEY: <contents of private key>
Why build twice (GitHub Actions + EC2)?
- GitHub Actions build: Validation only (catch TypeScript errors early)
- EC2 build: Actual deployment (ensures consistency with production environment)
Alternative approach: Build on GitHub Actions, push to Docker registry, pull on EC2. I skipped this for simplicity.
Step 5: The Deploy Script (The Magic)
This script orchestrates the entire Blue-Green switch.
#!/bin/bash
# /home/ec2-user/deploy-messaging.sh
set -euo pipefail # Exit on error, undefined variable, or pipe failure
# ============= 0. Cleanup Dangling Images =============
echo "🔍 Checking for dangling images..."
DANGLING_COUNT=$(docker images -f "dangling=true" -q | wc -l)
if [ "$DANGLING_COUNT" -gt 0 ]; then
echo "🧹 Found $DANGLING_COUNT dangling images. Cleaning up..."
docker image prune -f
else
echo "✅ No dangling images found."
fi
# Navigate to project directory
cd /home/ec2-user/push-messaging-server
# ============= 1. Remove Old Stopped Containers =============
for OLD_COLOR in messaging-blue messaging-green; do
CONTAINER_NAME="messaging-service-${OLD_COLOR}-1"
if docker inspect "$CONTAINER_NAME" >/dev/null 2>&1; then
STATUS=$(docker inspect --format='{{.State.Status}}' "$CONTAINER_NAME")
if [ "$STATUS" = "exited" ]; then
echo "🗑 Removing old stopped container: $CONTAINER_NAME"
docker compose -p messaging-service rm -f "$OLD_COLOR"
fi
fi
done
# ============= 2. Determine Blue-Green Target =============
CURRENT=$(docker compose -p messaging-service ps -q messaging-blue | wc -l)
if [ "$CURRENT" -gt 0 ]; then
# Blue is running → deploy Green
NEW=messaging-green
OLD=messaging-blue
NEW_PORT=3012
OLD_PORT=3011
else
# Green is running (or nothing) → deploy Blue
NEW=messaging-blue
OLD=messaging-green
NEW_PORT=3011
OLD_PORT=3012
fi
echo "🎯 Deployment target: $NEW (port: $NEW_PORT)"
# ============= 3. Pull Latest Code & Build New Container =============
git fetch --all
git checkout main
git reset --hard origin/main
docker compose -p messaging-service build $NEW
docker compose -p messaging-service up -d $NEW || {
echo "🚨 Container startup failed. Showing logs:"
docker logs messaging-service-$NEW-1
exit 1
}
# ============= 4. Wait for Health Check =============
MAX_RETRIES=30
COUNT=0
while [ "$(docker inspect --format='{{.State.Health.Status}}' messaging-service-$NEW-1)" != "healthy" ]; do
if [ "$COUNT" -ge "$MAX_RETRIES" ]; then
echo "❌ Health check failed: $NEW container not healthy."
docker logs messaging-service-$NEW-1
docker compose -p messaging-service stop $NEW
exit 1
fi
echo "🟡 Waiting for health check... ($COUNT/$MAX_RETRIES)"
sleep 5
COUNT=$((COUNT + 1))
done
echo "✅ $NEW container is healthy!"
# ============= 5. Switch Nginx Configuration =============
if [ "$NEW" == "messaging-green" ]; then
# Switch to Green
sudo sed -i "s/^ *server localhost:3011;/server localhost:3012;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
sudo sed -i "s/^ *server localhost:3012 backup;/server localhost:3011 backup;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
else
# Switch to Blue
sudo sed -i "s/^ *server localhost:3012;/server localhost:3011;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
sudo sed -i "s/^ *server localhost:3011 backup;/server localhost:3012 backup;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
fi
# ============= 6. Reload Nginx =============
if ! sudo nginx -t; then
echo "❌ Nginx configuration test failed. Aborting."
exit 1
fi
sudo nginx -s reload
echo "✅ Nginx reloaded. Traffic now routing to $NEW (port: $NEW_PORT)"
# ============= 7. Gracefully Stop Old Container =============
sleep 30 # Wait for in-flight requests to complete
docker compose -p messaging-service stop $OLD || true
if docker inspect messaging-service-$OLD-1 >/dev/null 2>&1; then
echo "⏳ Waiting for $OLD container to stop..."
MAX_RETRIES=30
COUNT=0
while [ "$(docker inspect --format='{{.State.Status}}' messaging-service-$OLD-1)" != "exited" ]; do
if [ "$COUNT" -ge "$MAX_RETRIES" ]; then
echo "❌ Failed to stop $OLD container gracefully."
docker logs messaging-service-$OLD-1
break
fi
sleep 2
COUNT=$((COUNT + 1))
done
echo "🔒 $OLD container stopped (kept for rollback)."
else
echo "💡 No $OLD container found. Skipping."
fi
echo "🎉 Deployment complete! Active: $NEW, Standby: $OLD"
Script Flow Visualization:
┌─────────────────────────────────────────────────────────────┐
│ Deploy Script Flow │
└─────────────────────────────────────────────────────────────┘
1. Cleanup
└─ Remove dangling images (free disk space)
2. Determine Target
└─ Blue running? → Deploy Green
└─ Green running? → Deploy Blue
└─ Nothing running? → Deploy Blue
3. Build & Start New Container
└─ git pull latest code
└─ docker compose build NEW
└─ docker compose up -d NEW
└─ Wait for health check (up to 2.5 min)
4. Switch Traffic (instant)
└─ sed: Replace port in Nginx config
└─ nginx -t (validate config)
└─ nginx -s reload (< 10ms switch)
5. Graceful Shutdown
└─ sleep 30s (drain in-flight requests)
└─ docker stop OLD (keep container for rollback)
Result:
┌──────────────┐ ┌──────────────┐
│ Blue │ │ Green │
│ (OLD) │ Traffic switch │ (NEW) │
│ Stopped │◀───────────────────│ Active │
│ (Rollback) │ Instant │ (Primary) │
└──────────────┘ └──────────────┘
Critical implementation details:
- set -euo pipefail: Fail fast on any error
set -e          # Exit on error
set -u          # Exit on undefined variable
set -o pipefail # Exit if any command in a pipe fails
- Health check loop: Don't switch traffic until container is truly ready
# Docker health status: starting → healthy → unhealthy
while [ "$(docker inspect --format='{{.State.Health.Status}}' ...)" != "healthy" ]; do
# Wait...
done
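The polling pattern shows up twice in the deploy script (health wait, stop wait), so it can be factored into one reusable helper. A sketch; the helper name and the DELAY knob are my own, not part of the original script:

```shell
# Reusable retry loop: wait_until MAX_TRIES DELAY CMD...
# Retries CMD until it succeeds or MAX_TRIES is exhausted.
wait_until() {
  max=$1; delay=$2; shift 2
  i=0
  until "$@"; do
    i=$((i + 1))
    if [ "$i" -ge "$max" ]; then
      echo "gave up after $max attempts" >&2
      return 1
    fi
    sleep "$delay"
  done
}

# In the deploy script, the health wait would then read roughly:
# wait_until 30 5 sh -c \
#   '[ "$(docker inspect --format "{{.State.Health.Status}}" messaging-service-messaging-green-1)" = healthy ]'
```

The same helper covers step 7's "wait for exited" loop by swapping in a different probe command.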
- sed for config replacement: Fast, atomic text replacement
# Before: server localhost:3011;
# After:  server localhost:3012;
sudo sed -i "s/server localhost:3011;/server localhost:3012;/" config.conf
- nginx -s reload: Graceful reload (no dropped connections)
  - New workers start with the new config
  - Old workers finish their current requests, then shut down
  - Total switch time: < 10ms
- 30-second drain: Wait for in-flight requests to complete
sleep 30         # Typical request: 1-5s; long-running: up to 30s
docker stop $OLD # Now safe to stop
- Keep stopped container: Easy rollback if needed
# Don't delete, just stop
docker compose stop $OLD
# To rollback: just reverse the Nginx switch
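The two sed passes in step 5 are order-sensitive, and the reason they never collide is the ` backup` suffix. A dry run against a scratch copy of the upstream block (a mktemp file, not the live /etc/nginx config) makes this concrete:

```shell
# Dry run of the step-5 switch against a scratch copy of the upstream
# block, not the live /etc/nginx config.
conf=$(mktemp)
cat > "$conf" <<'EOF'
upstream messaging-server {
    server localhost:3011;
    server localhost:3012 backup;
}
EOF

# Pass 1 rewrites the primary; pass 2 only matches the line that still
# ends in " backup;", so the two substitutions never collide.
sed -i 's/server localhost:3011;/server localhost:3012;/' "$conf"
sed -i 's/server localhost:3012 backup;/server localhost:3011 backup;/' "$conf"

grep 'server localhost' "$conf"   # primary is now :3012, backup :3011
rm "$conf"
```

Running the passes in the opposite order would also work here, but anchoring one pattern on `backup;` keeps the swap safe regardless of order.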
Production Deployment in Action
Typical Deployment Timeline
T+0:00 - GitHub push to main
T+0:05 - GitHub Actions triggered
T+0:15 - Actions build validates (TypeScript compilation)
T+0:20 - SSH into EC2, deploy script starts
T+0:25 - git pull, docker build starts
T+3:00 - New container starts
T+3:30 - Health check passes (start_period + checks every 10s)
T+3:31 - Nginx config switched
T+3:32 - Traffic now on new container ← ZERO DOWNTIME
T+4:02 - Old container stopped (30s drain)
T+4:03 - Deployment complete ✅
Total time: ~4 minutes
User-visible downtime: 0 seconds
Real Deployment Log
[ec2-user@ip-172-31-38-149 ~]$ /home/ec2-user/deploy-messaging.sh
🔍 Checking for dangling images...
✅ No dangling images found.
🎯 Deployment target: messaging-green (port: 3012)
Fetching latest code...
Already on 'main'
HEAD is now at a7f3d21 Fix: Improve batch logging performance
Building new container...
[+] Building 142.3s (18/18) FINISHED
=> [builder 1/6] FROM node:22-alpine 0.0s
=> [builder 2/6] WORKDIR /app 0.1s
=> [builder 3/6] COPY package*.json ./ 0.1s
=> [builder 4/6] RUN npm install 48.2s
=> [builder 5/6] COPY . . 1.3s
=> [builder 6/6] RUN npm run build 52.1s
=> [runner 1/4] FROM node:22-alpine 0.0s
=> [runner 2/4] WORKDIR /app 0.0s
=> [runner 3/4] COPY --from=builder /app/dist ./dist 0.2s
=> [runner 4/4] RUN apk add --no-cache curl && npm... 28.4s
=> exporting to image 11.9s
Starting container...
[+] Running 1/1
✔ Container messaging-service-messaging-green-1 Started 2.3s
🟡 Waiting for health check... (0/30)
🟡 Waiting for health check... (1/30)
🟡 Waiting for health check... (2/30)
🟡 Waiting for health check... (3/30)
🟡 Waiting for health check... (4/30)
🟡 Waiting for health check... (5/30)
✅ messaging-green container is healthy!
Switching Nginx configuration...
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
✅ Nginx reloaded. Traffic now routing to messaging-green (port: 3012)
Waiting 30s for in-flight requests to complete...
⏳ Waiting for messaging-blue container to stop...
🔒 messaging-blue container stopped (kept for rollback).
🎉 Deployment complete! Active: messaging-green, Standby: messaging-blue
Monitoring During Deployment
Server metrics during a typical deployment:
CPU Usage:
Pre-deploy: 18-22% (normal load)
Building: 65-80% (docker build)
Post-deploy: 20-25% (slightly higher, new container initializing)
Steady: 18-22% (back to normal)
Memory Usage:
Pre-deploy: 3.2 GB / 8 GB (40%)
Both running: 4.8 GB / 8 GB (60%) ← Both containers alive during switch
Post-deploy: 3.4 GB / 8 GB (42%)
Request Error Rate:
During switch: 0.00% ← Zero dropped requests
Response Time:
Pre-deploy: avg 145ms, p95 320ms
During switch: avg 147ms, p95 325ms ← No spike
Post-deploy: avg 143ms, p95 318ms
Key observation: Users experience zero impact during deployment.
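A poor man's version of that error-rate measurement can run from a second terminal during a deploy: hammer the health endpoint and count failures. A sketch; the helper name is mine, and the URL is this service's own endpoint (reachability assumed):

```shell
# probe_loop N CMD...  — run CMD N times, print how many attempts
# failed. Used here as a crude dropped-request counter during a switch.
probe_loop() {
  n=$1; shift
  i=0; fail=0
  while [ "$i" -lt "$n" ]; do
    "$@" >/dev/null 2>&1 || fail=$((fail + 1))
    i=$((i + 1))
  done
  echo "$fail"
}

# During a deploy, a few hundred probes should all succeed:
# probe_loop 300 curl -sf --max-time 2 https://messaging.goodtv.co.kr/health/live
```

If the printed count is anything but 0 across the Nginx switch, the drain or health-check timing needs another look.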
Rollback Strategy
Instant Rollback (< 1 minute)
If the new deployment has issues:
# 1. Identify the problem (monitoring alerts)
echo "🚨 Issues detected in new deployment!"
# 2. Determine current active container
if docker compose -p messaging-service ps -q messaging-green | grep -q .; then
# Green is active, rollback to Blue
ACTIVE=messaging-green
ROLLBACK=messaging-blue
ACTIVE_PORT=3012
ROLLBACK_PORT=3011
else
# Blue is active, rollback to Green
ACTIVE=messaging-blue
ROLLBACK=messaging-green
ACTIVE_PORT=3011
ROLLBACK_PORT=3012
fi
echo "🔄 Rolling back from $ACTIVE to $ROLLBACK"
# 3. Switch Nginx back to old container
if [ "$ROLLBACK" == "messaging-blue" ]; then
sudo sed -i "s/server localhost:$ACTIVE_PORT;/server localhost:$ROLLBACK_PORT;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
sudo sed -i "s/server localhost:$ROLLBACK_PORT backup;/server localhost:$ACTIVE_PORT backup;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
else
sudo sed -i "s/server localhost:$ACTIVE_PORT;/server localhost:$ROLLBACK_PORT;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
sudo sed -i "s/server localhost:$ROLLBACK_PORT backup;/server localhost:$ACTIVE_PORT backup;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
fi
# 4. Reload Nginx (instant switch)
sudo nginx -t && sudo nginx -s reload
# 5. Restart old container if stopped
docker compose -p messaging-service start $ROLLBACK
echo "✅ Rollback complete! Active: $ROLLBACK"
Total rollback time: ~30-60 seconds (most of that is restarting the old container if it was stopped)
Why Blue-Green Excels at Rollbacks
| Strategy | Rollback Time | Complexity | Data Loss Risk |
|---|---|---|---|
| Rolling | 5-10 min | High | Medium |
| Canary | 2-5 min | High | Low |
| Blue-Green | 30-60s | Low | None |
The secret: Old container is still alive (just stopped). Starting it takes seconds.
Gotchas and Lessons Learned
1. Health Checks Are Non-Negotiable
Initial mistake: No health check, just sleep 30 after starting container.
Problem: Sometimes app took 40+ seconds to start, leading to:
- Nginx switched before app ready
- 502 Bad Gateway errors
- Partial downtime
Solution: Proper Docker health checks with retries.
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:3002/health/live"]
interval: 10s
timeout: 5s
retries: 5 # Unhealthy only after 5 consecutive failures
start_period: 20s # Grace period
Result: Zero false positives, zero premature switches.
2. Graceful Shutdown Matters
Initial mistake: docker stop immediately after switch.
Problem: In-flight requests aborted mid-processing, causing:
- Failed push notifications
- Incomplete database transactions
- BullMQ jobs marked as failed
Solution: 30-second drain period.
# Wait for requests to complete
sleep 30
# Now safe to stop
docker stop $OLD
Result: Zero dropped requests during deployment.
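At the process level, the drain works because `docker stop` sends SIGTERM, waits a grace period (10s by default; `docker compose stop -t 60` or `stop_grace_period` in the compose file raises it), then SIGKILLs. A toy shell worker demonstrates the trap-and-finish pattern; this is illustration only, as the real service handles SIGTERM inside Node (NestJS exposes `app.enableShutdownHooks()` for this):

```shell
# Toy worker: on SIGTERM, record that in-flight work was finished,
# then exit cleanly — the same contract docker stop expects within
# its grace period. Demo only; the real drain lives in the Node app.
drain_demo() {
  out=$1
  trap 'echo "drained in-flight work" > "$out"; exit 0' TERM
  while :; do sleep 0.2; done
}
```

Sending TERM to a running `drain_demo` writes the marker and exits 0; a worker that ignores TERM instead eats the grace period and gets SIGKILLed mid-job.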
3. Disk Space Monitoring
Problem: Docker images accumulated, filling disk.
df -h
# /dev/xvda1 100G 92G 8G 92% / ← Uh oh
Solution: Automated cleanup in deploy script.
# Remove dangling images (layers from old builds)
docker image prune -f
# Optional: Weekly cron job for aggressive cleanup
# crontab -e
# 0 2 * * 0 docker system prune -af --volumes
Result: Disk usage stable at 45-50%.
4. Redis Persistence Configuration
Initial mistake: Redis without persistence.
Problem: Container restart = lost BullMQ job queue.
Solution: Redis AOF (Append-Only File).
redis-messaging:
  image: redis:7-alpine
  # --appendonly yes enables AOF persistence. Keep comments out of the
  # command block itself: inside a YAML block scalar, a `#` is passed
  # to redis-server literally as an argument.
  command: >
    redis-server
    --appendonly yes
    --requirepass password123
Result: Job queue survives deployments.
5. SELinux Permissions (Amazon Linux)
Problem: Nginx couldn't access Docker containers.
sudo nginx -t
# nginx: [emerg] connect() to localhost:3011 failed (13: Permission denied)
Solution: SELinux policy adjustment.
# Allow Nginx to connect to network
sudo setsebool -P httpd_can_network_connect 1
# Or create custom policy (more secure)
sudo semanage port -a -t http_port_t -p tcp 3011
sudo semanage port -a -t http_port_t -p tcp 3012
Result: Nginx successfully proxies to containers.
6. Git Reset Issues
Initial mistake: git pull failing due to local changes.
git pull
# error: Your local changes to the following files would be overwritten
Solution: Hard reset to remote.
git fetch --all
git reset --hard origin/main # Force match remote
Trade-off: Local changes lost. Solution: Never edit directly on EC2.
Performance Impact Analysis
Resource Usage During Deployment
t3.large specs: 2 vCPU, 8 GB RAM, General Purpose SSD
| Phase | CPU | Memory | Duration |
|---|---|---|---|
| Normal operation | 18-22% | 3.2 GB | - |
| Docker build | 65-80% | 4.0 GB | 2-3 min |
| Both containers | 35-45% | 4.8 GB | 30-60s |
| New container only | 20-25% | 3.4 GB | - |
Conclusion: t3.large has ample headroom for Blue-Green deployment.
Network Traffic
During deployment:
- Inbound: Normal (users unaffected)
- Outbound: +50 MB (npm install, git pull)
- No external Docker registry traffic (build locally)
Database Connections
Concern: Both containers connecting simultaneously?
Reality:
- Old container: Draining, no new connections
- New container: Fresh connections
- Total: Never exceeds normal load
MSSQL connection pool: 50 max, typically use 8-12.
Production Metrics: 6 Months Later
After 6 months and 150+ deployments:
| Metric | Result |
|---|---|
| Total deployments | 156 |
| Failed deployments | 2 (1.3%) |
| Rollbacks needed | 1 (0.6%) |
| User-reported issues | 0 |
| Average deployment time | 3m 47s |
| Downtime per deployment | 0s |
| Total uptime | 99.97% |
Failure causes:
- Docker build failed (TypeScript error) → Caught before switch
- Health check timeout (DB connection issue) → Deployment aborted
- Rollback: Memory leak in new version → Rolled back in 45 seconds
Key insight: The system is self-healing. Failed deployments never affect users.
Cost Analysis
Infrastructure Costs (Monthly)
| Resource | Specs | Cost |
|---|---|---|
| EC2 t3.large | 2 vCPU, 8 GB RAM | $60.74 |
| EBS SSD | 100 GB | $10.00 |
| Data transfer | ~50 GB/month | $4.50 |
| Total | | $75.24 |
GitHub Actions: Free tier (2,000 minutes/month) - using ~200 minutes/month
Alternative Costs (for comparison)
| Solution | Monthly Cost | Setup Time |
|---|---|---|
| Current (GitHub Actions + EC2) | $75 | 2 hours |
| Jenkins on EC2 | $95 (t3.medium for Jenkins + t3.large for app) | 1-2 days |
| AWS CodeDeploy | $85 (EC2 + ALB + CodeDeploy) | 4-6 hours |
| ECS Fargate Blue-Green | $120 (Fargate + ALB) | 1 day |
| Kubernetes (EKS) | $220 (EKS + nodes + ALB) | 2-3 days |
Winner: Current solution offers best cost/simplicity ratio for small-scale services.
When NOT to Use This Approach
This Blue-Green + Docker + GitHub Actions setup is great for small-to-medium services, but has limits:
Scale Limitations
❌ Don't use this approach when:
- Traffic > 10,000 RPS: Single EC2 instance bottleneck
- Multi-region deployment: No built-in geo-routing
- 10+ services: Manual script maintenance becomes unwieldy
- Team size > 10: Need centralized CI/CD governance
✅ Better alternatives:
- High scale: Kubernetes (EKS), AWS ECS
- Multi-region: AWS Global Accelerator + ALB
- Microservices: Kubernetes, Service Mesh (Istio)
- Large teams: Jenkins, GitLab CI, ArgoCD
Complexity Trade-offs
This approach optimizes for:
- ✅ Simplicity (1 server, shell script)
- ✅ Cost (minimal infrastructure)
- ✅ Speed (fast iteration)
If you need:
- Advanced traffic management (canary %, A/B testing)
- Auto-scaling (horizontal pod autoscaler)
- Service mesh (Istio, Linkerd)
- Observability (Prometheus, Grafana, Jaeger)
Then invest in Kubernetes. The operational complexity pays off at scale.
Future Improvements
Short-term (planned)
- Automated smoke tests after deployment
# After Nginx switch, before old container stop
curl -f https://messaging.goodtv.co.kr/health || rollback
- Slack notifications
curl -X POST $SLACK_WEBHOOK \
-d "{\"text\": \"✅ Deployment successful: $NEW\"}"
- Rollback button (GitHub Actions workflow_dispatch)
on:
workflow_dispatch:
inputs:
target:
description: 'Container to switch to (blue/green)'
required: true
Long-term (if needed)
- Canary deployment: Route 10% traffic to new version first
- Auto-scaling: Add more EC2 instances behind ALB
- Monitoring: Integrate Prometheus + Grafana
- Database migrations: Separate migration job before deployment
Conclusion
Building a production-grade CI/CD pipeline doesn't require expensive enterprise tools or complex Kubernetes clusters. With GitHub Actions, Docker multi-stage builds, and a well-crafted shell script, I achieved:
- True zero-downtime deployments (0 seconds user impact)
- Fast iteration (4-minute deployments)
- Easy rollbacks (30-60 seconds)
- Low cost ($75/month infrastructure)
- High reliability (99.97% uptime over 6 months)
The key decisions:
- GitHub Actions over Jenkins: Managed infrastructure, zero maintenance
- Blue-Green over Canary: Simpler for long-running jobs, instant rollback
- No Docker Hub: Direct build on EC2, fewer moving parts
- Multi-stage Dockerfile: 3.4x smaller images, faster deployments
- Health checks: No premature switches, reliable deployments
For small-to-medium services (< 10,000 RPS, single region, small team), this architecture hits the sweet spot of simplicity, reliability, and cost.
Key Takeaways
- Blue-Green deployment eliminates downtime for services with long-running processes
- GitHub Actions provides enterprise-grade CI/CD without operational overhead
- Multi-stage Dockerfiles reduce image size 3-4x
- Health checks are mandatory for reliable container switching
- Nginx upstream with backup provides instant traffic switching
- 30-second drain period prevents dropped in-flight requests
- Keeping stopped containers enables instant rollback
- Shell scripts can be production-grade with proper error handling
- Choose simplicity over features until you outgrow it