How I implemented production-grade CI/CD for a high-traffic push notification service handling 100K+ daily requests with zero downtime
When your push notification service handles 100,000+ daily requests and can't afford a single second of downtime, deployment strategy becomes critical. After experimenting with various CI/CD approaches, I settled on a GitHub Actions + Docker + Blue-Green deployment pattern that achieves true zero-downtime releases.
Here's the complete implementation story, including why I chose this specific stack over alternatives like Jenkins, why Blue-Green over Canary or Rolling deployments, and all the gotchas I encountered in production.
The Requirements: Zero Tolerance for Downtime
My push notification service has strict uptime requirements:
- 24/7 availability: No maintenance windows allowed
- High traffic: 100,000+ notifications daily
- Long-running processes: BullMQ workers processing jobs for 5-15 minutes
- Database connections: Existing connections must drain gracefully
- No dropped requests: In-flight requests must complete successfully
Traditional deployment approaches all had deal-breakers:
| Strategy | Downtime | Complexity | Issue |
|---|---|---|---|
| Direct replacement | 30-60s | Low | ❌ Unacceptable downtime |
| Rolling update | ~10s | Medium | ❌ Partial outage during rollout |
| Canary | Minimal | High | ❌ Requires traffic splitting infrastructure |
| Blue-Green | 0s | Medium | ✅ Perfect fit |
Architecture Overview
High-Level Flow
```
GitHub Push
  --> GitHub Actions Workflow
  --> EC2 SSH Trigger
  --> Deploy Script
        |
        v
  Docker Compose Build
    Multi-stage Dockerfile: Build Stage (Node 22) --> Runtime Stage (optimized)
        |
        v
  Blue-Green Container Switch
    Before: Blue  :3011 (primary), Green :3012 (backup)
    After:  Green :3012 (primary), Blue  :3011 (backup)
        |
        v
  Nginx Config Switch
    server localhost:3011;        -->  server localhost:3012;
    server localhost:3012 backup; -->  server localhost:3011 backup;
    nginx -s reload (instant switch)
        |
        v
  Graceful Old Container Shutdown
    Wait 30s for in-flight requests
    Stop old container (keep for rollback)
```
Infrastructure Components
```
AWS EC2 Instance (Amazon Linux 2023, t3.large)
|
+-- Nginx (reverse proxy)
|     Port 443 (SSL/TLS)
|     Upstream: :3011 / :3012
|
+-- Docker Compose
|     +-- Blue container   :3011
|     +-- Green container  :3012
|     +-- Redis            :6380
|
+-- Shell scripts
      +-- deploy-messaging.sh
```
Why This Stack? Decision Breakdown
GitHub Actions vs Jenkins vs GitLab CI
I evaluated several CI/CD platforms:
Jenkins
Pros:
- Mature ecosystem
- Highly customizable
- Self-hosted (complete control)
Cons:
- ❌ Requires dedicated server maintenance
- ❌ Plugin hell (compatibility issues)
- ❌ Steep learning curve
- ❌ Need to manage Jenkins updates/security
GitLab CI
Pros:
- Excellent integration with GitLab
- Strong container registry
- Built-in Kubernetes support
Cons:
- ❌ My code is on GitHub (migration friction)
- ❌ GitLab Runner setup required
- ❌ Additional cost for private repos
GitHub Actions (Winner) ✅
Pros:
- ✅ Native GitHub integration (zero setup)
- ✅ Free for public repos, generous free tier for private
- ✅ Managed infrastructure (no server to maintain)
- ✅ Massive marketplace of actions
- ✅ YAML-based (simple, declarative)
- ✅ Built-in secrets management
- ✅ Excellent documentation
Cons:
- Vendor lock-in (acceptable trade-off)
- Minute limits on free tier (not an issue for my usage)
The deciding factor: I could set up the entire pipeline in 30 minutes vs 2-3 days for Jenkins setup, configuration, and maintenance. For a solo developer or small team, this is a no-brainer.
Blue-Green vs Canary vs Rolling Deployments
Rolling Deployment
How it works: Gradually replace instances one-by-one
Pros:
- Efficient resource usage
- Gradual rollout
Cons:
- ❌ Mixed versions running simultaneously (API compatibility issues)
- ❌ Partial downtime as each instance restarts
- ❌ Complex rollback (need to track which instances updated)
- ❌ Long deployment time for multiple instances
Canary Deployment
How it works: Route small percentage of traffic to new version
Pros:
- Test with real traffic
- Gradual risk mitigation
- Easy to catch issues early
Cons:
- ❌ Requires sophisticated traffic splitting (ALB, Istio, etc.)
- ❌ Complex metrics/monitoring setup needed
- ❌ Need automated rollback logic
- ❌ Infrastructure overhead (traffic routing, health checks)
- ❌ Overkill for small-scale services
Blue-Green Deployment (Winner) ✅
How it works: Maintain two identical environments, switch instantly
Pros:
- ✅ True zero-downtime (instant switch)
- ✅ Easy rollback (just switch back)
- ✅ Simple implementation (Nginx config change)
- ✅ Full testing before switch
- ✅ Clear state: old or new (no mixed versions)
- ✅ Works great with Docker Compose
Cons:
- Double resource usage (acceptable on t3.large)
- Requires health checks
The deciding factor: For a push notification service with long-running jobs, Blue-Green guarantees that:
- No in-flight requests are dropped
- BullMQ workers complete their jobs
- Database connections drain properly
- Rollback is instant if issues occur
The resource cost is worth the operational simplicity and reliability.
Why Not Docker Hub?
Many tutorials use Docker Hub as an intermediary:
Build → Push to Docker Hub → Pull on EC2 → Run
Why I skipped it:
- Security: No credentials stored in GitHub Actions for registry
- Simplicity: Direct build on EC2 (one less moving part)
- Speed: No image push/pull over internet
- Cost: Docker Hub rate limits on free tier
- Privacy: Source code never leaves my infrastructure
Trade-off: EC2 must have enough resources for building. Solution: t3.large with 2 vCPU / 8GB RAM handles builds fine.
Implementation: Step by Step
Step 1: Multi-Stage Dockerfile (Optimized)
Key insight: Separate build and runtime stages to minimize image size and attack surface.
```dockerfile
# Dockerfile
# ============= Stage 1: Build =============
FROM node:22-alpine AS builder
WORKDIR /app

# Copy dependency files first (layer caching)
COPY package*.json ./
RUN npm install

# Copy source code and build
COPY . .
RUN npm run build

# ============= Stage 2: Runtime =============
FROM node:22-alpine AS runner
WORKDIR /app

# Copy only production artifacts from builder
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/package*.json ./

# Install only production dependencies + curl for health checks
RUN apk add --no-cache curl \
    && npm install --only=production

EXPOSE 3002
CMD ["node", "dist/main"]
```
Why Multi-Stage?
| Metric | Single-Stage | Multi-Stage | Improvement |
|---|---|---|---|
| Image size | ~1.2 GB | ~350 MB | 3.4x smaller |
| Build dependencies | Included | Excluded | Attack surface ↓ |
| TypeScript files | Included | Excluded | Security ↑ |
| Layer caching | Poor | Excellent | Build speed ↑ |
Pro tips:
- `node:22-alpine` vs `node:22`: the Alpine base is roughly 4x smaller (~120 MB vs ~500 MB)
- `npm install --only=production`: excludes devDependencies (TypeScript, Jest, etc.)
- `COPY package*.json` before the source: Docker layer caching speeds up rebuilds
- `apk add curl`: needed for Docker health checks
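The layer-caching tip only pays off if the build context stays clean. The original setup doesn't show a `.dockerignore`, so treat this as an assumed sketch; without one, `COPY . .` drags `node_modules` and `.git` into the build context and busts the cache on every build:

```
node_modules
dist
.git
.env
*.md
```

A smaller context also shortens the `docker compose build` step on EC2.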
Step 2: Docker Compose with Health Checks
Critical requirement: Know when container is actually ready (not just "running")
```yaml
# docker-compose.yml
services:
  ### Blue Container
  messaging-blue:
    build: .
    env_file: .env
    depends_on:
      redis-messaging:
        condition: service_healthy  # Wait for Redis
    networks: [messaging-net]
    ports:
      - "3011:3002"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3002/health/live"]
      interval: 10s      # Check every 10 seconds
      timeout: 5s        # Timeout after 5 seconds
      retries: 5         # Unhealthy only after 5 consecutive failures
      start_period: 20s  # Grace period for app startup

  ### Green Container
  messaging-green:
    build: .
    env_file: .env
    depends_on:
      redis-messaging:
        condition: service_healthy
    networks: [messaging-net]
    ports:
      - "3012:3002"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3002/health/live"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 20s

  ### Redis (shared by both containers)
  redis-messaging:
    container_name: redis-messaging
    image: redis:7-alpine
    command: >
      redis-server
      --appendonly yes
      --requirepass password123
      --maxmemory 512mb
      --maxmemory-policy noeviction
    ports:
      - "6380:6379"
    restart: always
    networks: [messaging-net]
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "password123", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

networks:
  messaging-net:
    driver: bridge
```
Key Design Decisions:
- Separate ports (3011/3012): Blue and Green can run simultaneously
- Shared Redis: both containers use the same Redis instance (job queue persistence)
- Health checks: the deploy script waits for `healthy` status before switching
- `start_period: 20s`: grace period for NestJS app initialization
- `retries: 5`: a container is only marked unhealthy after 5 consecutive failed checks (avoids flaky switches)
Why /health/live endpoint?
```typescript
// src/main.ts (NestJS)
async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  // Health check endpoint (simple but effective) — registered on the
  // underlying Express instance, since INestApplication has no route method
  const http = app.getHttpAdapter().getInstance();
  http.get('/health/live', (req, res) => {
    res.status(200).send('OK');
  });

  await app.listen(3002, '0.0.0.0');
}
bootstrap();
```
This endpoint confirms:
- ✅ NestJS app started
- ✅ HTTP server listening
- ✅ Ready to handle requests
For production, consider more sophisticated health checks:
```typescript
app.get('/health/ready', async (req, res) => {
  // Check database connection
  const dbOk = await checkDatabase();
  // Check Redis connection
  const redisOk = await checkRedis();
  // Check external API
  const fcmOk = await checkFirebase();

  if (dbOk && redisOk && fcmOk) {
    res.status(200).json({ status: 'ready' });
  } else {
    res.status(503).json({ status: 'not ready' });
  }
});
```
Step 3: Nginx Configuration (Traffic Router)
Nginx acts as the traffic cop, instantly routing all requests to the active container.
```nginx
# /etc/nginx/conf.d/messaging.goodtv.co.kr.conf

# Blue-Green upstream definition
upstream messaging-server {
    server localhost:3011;         # Blue (initially primary)
    server localhost:3012 backup;  # Green (initially backup)
}

server {
    listen 443 ssl;
    server_name messaging.goodtv.co.kr;

    # SSL certificates
    ssl_certificate     /etc/nginx/ssl/goodtv.co.kr.pem;
    ssl_certificate_key /etc/nginx/ssl/WILD.goodtv.co.kr.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_prefer_server_ciphers on;
    ssl_ciphers HIGH:!aNULL:!MD5;

    location / {
        proxy_pass http://messaging-server;
        proxy_http_version 1.1;

        # WebSocket support (for long-polling if needed)
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;

        # Forward client info
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
        proxy_set_header Cookie $http_cookie;
        proxy_redirect off;
    }

    access_log /var/log/nginx/api-ssl-access.log;
    error_log  /var/log/nginx/api-ssl-error.log;
}
```
How Nginx backup works:
```
Normal state:
  Primary: :3011 → 100% traffic
  Backup:  :3012 → 0% traffic (only used if :3011 fails)

After deployment switch:
  Primary: :3012 → 100% traffic
  Backup:  :3011 → 0% traffic (old version, kept for rollback)
```
Key benefit: Switching traffic is a simple text replacement + Nginx reload (< 10ms).
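The switch mechanics are easy to verify offline. A minimal sketch (scratch file via `mktemp`, sample upstream block copied from above, not the live config) that applies the same substitution pair the deploy script uses:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch copy of the upstream block (illustration only)
conf=$(mktemp)
cat > "$conf" <<'EOF'
upstream messaging-server {
    server localhost:3011;
    server localhost:3012 backup;
}
EOF

# Promote Green, demote Blue (same sed pair as the deploy script)
sed -i 's/server localhost:3011;/server localhost:3012;/' "$conf"
sed -i 's/server localhost:3012 backup;/server localhost:3011 backup;/' "$conf"

switched=$(cat "$conf")
echo "$switched"
rm -f "$conf"
```

Order matters here: promoting the primary first means the second `sed` can only ever match the original backup line.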
Step 4: GitHub Actions Workflow
Trigger on main branch push, SSH into EC2, run deploy script.
```yaml
# .github/workflows/ci-cd-messaging.yml
name: Messaging CI/CD

on:
  push:
    branches: [ main ]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      # Step 1: Checkout code
      - name: Checkout repository
        uses: actions/checkout@v3

      # Step 2: Setup Node.js (for local build/test if needed)
      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '22'

      # Step 3: Install dependencies (optional, for tests)
      - name: Install dependencies
        run: npm install

      # Step 4: Build project (validates TypeScript)
      - name: Build project
        run: npm run build

      # Step 5: SSH into EC2 and trigger deploy script
      - name: Deploy to EC2
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.SSH_HOST }}
          port: ${{ secrets.SSH_PORT }}
          username: ${{ secrets.SSH_USER }}
          key: ${{ secrets.SSH_KEY }}
          script: |
            /home/ec2-user/deploy-messaging.sh
```
GitHub Secrets Setup:
Settings → Secrets and variables → Actions → New repository secret

```
SSH_HOST: 54.123.45.67   (EC2 public IP)
SSH_PORT: 22
SSH_USER: ec2-user
SSH_KEY:  <contents of private key>
```
Why build twice (GitHub Actions + EC2)?
- GitHub Actions build: Validation only (catch TypeScript errors early)
- EC2 build: Actual deployment (ensures consistency with production environment)
Alternative approach: Build on GitHub Actions, push to Docker registry, pull on EC2. I skipped this for simplicity.
Step 5: The Deploy Script (The Magic)
This script orchestrates the entire Blue-Green switch.
```bash
#!/bin/bash
# /home/ec2-user/deploy-messaging.sh

set -euo pipefail  # Exit on error, undefined variable, or pipe failure

# ============= 0. Cleanup Dangling Images =============
echo "🔍 Checking for dangling images..."
DANGLING_COUNT=$(docker images -f "dangling=true" -q | wc -l)
if [ "$DANGLING_COUNT" -gt 0 ]; then
    echo "🧹 Found $DANGLING_COUNT dangling images. Cleaning up..."
    docker image prune -f
else
    echo "✅ No dangling images found."
fi

# Navigate to project directory
cd /home/ec2-user/push-messaging-server

# ============= 1. Remove Old Stopped Containers =============
for OLD_COLOR in messaging-blue messaging-green; do
    CONTAINER_NAME="messaging-service-${OLD_COLOR}-1"
    if docker inspect "$CONTAINER_NAME" >/dev/null 2>&1; then
        STATUS=$(docker inspect --format='{{.State.Status}}' "$CONTAINER_NAME")
        if [ "$STATUS" = "exited" ]; then
            echo "🗑 Removing old stopped container: $CONTAINER_NAME"
            docker compose -p messaging-service rm -f "$OLD_COLOR"
        fi
    fi
done

# ============= 2. Determine Blue-Green Target =============
CURRENT=$(docker compose -p messaging-service ps -q messaging-blue | wc -l)
if [ "$CURRENT" -gt 0 ]; then
    # Blue is running -> deploy Green
    NEW=messaging-green
    OLD=messaging-blue
    NEW_PORT=3012
    OLD_PORT=3011
else
    # Green is running (or nothing) -> deploy Blue
    NEW=messaging-blue
    OLD=messaging-green
    NEW_PORT=3011
    OLD_PORT=3012
fi
echo "🎯 Deployment target: $NEW (port: $NEW_PORT)"

# ============= 3. Pull Latest Code & Build New Container =============
git fetch --all
git checkout main
git reset --hard origin/main

docker compose -p messaging-service build $NEW
docker compose -p messaging-service up -d $NEW || {
    echo "🚨 Container startup failed. Showing logs:"
    docker logs messaging-service-$NEW-1
    exit 1
}

# ============= 4. Wait for Health Check =============
MAX_RETRIES=30
COUNT=0
while [ "$(docker inspect --format='{{.State.Health.Status}}' messaging-service-$NEW-1)" != "healthy" ]; do
    if [ "$COUNT" -ge "$MAX_RETRIES" ]; then
        echo "❌ Health check failed: $NEW container not healthy."
        docker logs messaging-service-$NEW-1
        docker compose -p messaging-service stop $NEW
        exit 1
    fi
    echo "💡 Waiting for health check... ($COUNT/$MAX_RETRIES)"
    sleep 5
    COUNT=$((COUNT + 1))
done
echo "✅ $NEW container is healthy!"

# ============= 5. Switch Nginx Configuration =============
if [ "$NEW" == "messaging-green" ]; then
    # Switch to Green
    sudo sed -i "s/^ *server localhost:3011;/server localhost:3012;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
    sudo sed -i "s/^ *server localhost:3012 backup;/server localhost:3011 backup;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
else
    # Switch to Blue
    sudo sed -i "s/^ *server localhost:3012;/server localhost:3011;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
    sudo sed -i "s/^ *server localhost:3011 backup;/server localhost:3012 backup;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
fi

# ============= 6. Reload Nginx =============
if ! sudo nginx -t; then
    echo "❌ Nginx configuration test failed. Aborting."
    exit 1
fi
sudo nginx -s reload
echo "✅ Nginx reloaded. Traffic now routing to $NEW (port: $NEW_PORT)"

# ============= 7. Gracefully Stop Old Container =============
sleep 30  # Wait for in-flight requests to complete
docker compose -p messaging-service stop $OLD || true

if docker inspect messaging-service-$OLD-1 >/dev/null 2>&1; then
    echo "⏳ Waiting for $OLD container to stop..."
    MAX_RETRIES=30
    COUNT=0
    while [ "$(docker inspect --format='{{.State.Status}}' messaging-service-$OLD-1)" != "exited" ]; do
        if [ "$COUNT" -ge "$MAX_RETRIES" ]; then
            echo "❌ Failed to stop $OLD container gracefully."
            docker logs messaging-service-$OLD-1
            break
        fi
        sleep 2
        COUNT=$((COUNT + 1))
    done
    echo "🛑 $OLD container stopped (kept for rollback)."
else
    echo "💡 No $OLD container found. Skipping."
fi

echo "🎉 Deployment complete! Active: $NEW, Standby: $OLD"
```
Script Flow Visualization:
```
Deploy Script Flow

1. Cleanup
   └─ Remove dangling images (free disk space)

2. Determine Target
   ├─ Blue running?    → Deploy Green
   ├─ Green running?   → Deploy Blue
   └─ Nothing running? → Deploy Blue

3. Build & Start New Container
   ├─ git pull latest code
   ├─ docker compose build NEW
   ├─ docker compose up -d NEW
   └─ Wait for health check (up to 2.5 min)

4. Switch Traffic (instant)
   ├─ sed: replace port in Nginx config
   ├─ nginx -t (validate config)
   └─ nginx -s reload (< 10ms switch)

5. Graceful Shutdown
   ├─ sleep 30s (drain in-flight requests)
   └─ docker stop OLD (keep container for rollback)

Result:
   Blue (OLD)                          Green (NEW)
   Stopped, kept for rollback  <────>  Active primary, serving traffic
```
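Step 2's toggle reduces to a pure function of whether Blue is currently running. Isolated, with the container count passed in instead of read from `docker compose ps` (so this sketch runs anywhere), it looks like:

```shell
#!/usr/bin/env bash
set -euo pipefail

# blue_running would normally come from:
#   docker compose -p messaging-service ps -q messaging-blue | wc -l
pick_target() {
  local blue_running=$1
  if [ "$blue_running" -gt 0 ]; then
    echo "messaging-green 3012"  # Blue is live -> deploy Green
  else
    echo "messaging-blue 3011"   # Green (or nothing) is live -> deploy Blue
  fi
}

read -r NEW NEW_PORT <<< "$(pick_target 1)"
echo "target=$NEW port=$NEW_PORT"
```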
Critical implementation details:

1. `set -euo pipefail`: fail fast on any error

   ```bash
   set -e           # Exit on error
   set -u           # Exit on undefined variable
   set -o pipefail  # Exit if any command in a pipe fails
   ```

2. Health check loop: don't switch traffic until the container is truly ready

   ```bash
   # Docker health status: starting -> healthy -> unhealthy
   while [ "$(docker inspect --format='{{.State.Health.Status}}' ...)" != "healthy" ]; do
     sleep 5  # Wait...
   done
   ```

3. `sed` for config replacement: fast, in-place text replacement

   ```bash
   # Before: server localhost:3011;
   # After:  server localhost:3012;
   sudo sed -i "s/server localhost:3011;/server localhost:3012;/" config.conf
   ```

4. `nginx -s reload`: graceful reload (no dropped connections)
   - New workers start with the new config
   - Old workers finish current requests
   - Old workers shut down after requests complete
   - Total switch time: < 10ms

5. 30-second drain: wait for in-flight requests to complete

   ```bash
   sleep 30          # Typical request: 1-5s; long-running: up to 30s
   docker stop $OLD  # Now safe to stop
   ```

6. Keep the stopped container: easy rollback if needed

   ```bash
   # Don't delete, just stop
   docker compose stop $OLD
   # To roll back: just reverse the Nginx switch
   ```
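The health-check wait loop can also be exercised without Docker by stubbing the status query. In this sketch, `health_status` (a hypothetical stand-in for the `docker inspect` call) reports `starting` twice before turning `healthy`, using a temp file as a call counter:

```shell
#!/usr/bin/env bash
set -euo pipefail

calls_file=$(mktemp)
echo 0 > "$calls_file"

# Stub for: docker inspect --format='{{.State.Health.Status}}' <container>
health_status() {
  local n
  n=$(( $(cat "$calls_file") + 1 ))
  echo "$n" > "$calls_file"
  if [ "$n" -ge 3 ]; then echo "healthy"; else echo "starting"; fi
}

MAX_RETRIES=5
COUNT=0
while [ "$(health_status)" != "healthy" ]; do
  if [ "$COUNT" -ge "$MAX_RETRIES" ]; then
    echo "timed out" >&2
    exit 1
  fi
  COUNT=$((COUNT + 1))
done
echo "healthy after $COUNT retries"
rm -f "$calls_file"
```

The file-based counter matters: `$(health_status)` runs in a subshell, so a plain shell variable would not survive between polls.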
Production Deployment in Action
Typical Deployment Timeline
```
T+0:00 - GitHub push to main
T+0:05 - GitHub Actions triggered
T+0:15 - Actions build validates (TypeScript compilation)
T+0:20 - SSH into EC2, deploy script starts
T+0:25 - git pull, docker build starts
T+3:00 - New container starts
T+3:30 - Health check passes (5 retries × 10s)
T+3:31 - Nginx config switched
T+3:32 - Traffic now on new container  <- ZERO DOWNTIME
T+4:02 - Old container stopped (30s drain)
T+4:03 - Deployment complete ✅

Total time: ~4 minutes
User-visible downtime: 0 seconds
```
Real Deployment Log
```
[ec2-user@ip-172-31-38-149 ~]$ /home/ec2-user/deploy-messaging.sh
🔍 Checking for dangling images...
✅ No dangling images found.
🎯 Deployment target: messaging-green (port: 3012)

Fetching latest code...
Already on 'main'
HEAD is now at a7f3d21 Fix: Improve batch logging performance

Building new container...
[+] Building 142.3s (18/18) FINISHED
 => [builder 1/6] FROM node:22-alpine                   0.0s
 => [builder 2/6] WORKDIR /app                          0.1s
 => [builder 3/6] COPY package*.json ./                 0.1s
 => [builder 4/6] RUN npm install                      48.2s
 => [builder 5/6] COPY . .                              1.3s
 => [builder 6/6] RUN npm run build                    52.1s
 => [runner 1/4] FROM node:22-alpine                    0.0s
 => [runner 2/4] WORKDIR /app                           0.0s
 => [runner 3/4] COPY --from=builder /app/dist ./dist   0.2s
 => [runner 4/4] RUN apk add --no-cache curl && npm... 28.4s
 => exporting to image                                 11.9s

Starting container...
[+] Running 1/1
 ✔ Container messaging-service-messaging-green-1  Started  2.3s

💡 Waiting for health check... (0/30)
💡 Waiting for health check... (1/30)
💡 Waiting for health check... (2/30)
💡 Waiting for health check... (3/30)
💡 Waiting for health check... (4/30)
💡 Waiting for health check... (5/30)
✅ messaging-green container is healthy!

Switching Nginx configuration...
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
✅ Nginx reloaded. Traffic now routing to messaging-green (port: 3012)

Waiting 30s for in-flight requests to complete...
⏳ Waiting for messaging-blue container to stop...
🛑 messaging-blue container stopped (kept for rollback).
🎉 Deployment complete! Active: messaging-green, Standby: messaging-blue
```
Monitoring During Deployment
Server metrics during a typical deployment:
```
CPU Usage:
  Pre-deploy:  18-22%  (normal load)
  Building:    65-80%  (docker build)
  Post-deploy: 20-25%  (slightly higher, new container initializing)
  Steady:      18-22%  (back to normal)

Memory Usage:
  Pre-deploy:   3.2 GB / 8 GB (40%)
  Both running: 4.8 GB / 8 GB (60%)  <- Both containers alive during switch
  Post-deploy:  3.4 GB / 8 GB (42%)

Request Error Rate:
  During switch: 0.00%  <- Zero dropped requests

Response Time:
  Pre-deploy:    avg 145ms, p95 320ms
  During switch: avg 147ms, p95 325ms  <- No spike
  Post-deploy:   avg 143ms, p95 318ms
```
Key observation: Users experience zero impact during deployment.
Rollback Strategy
Instant Rollback (< 1 minute)
If the new deployment has issues:
```bash
# 1. Identify the problem (monitoring alerts)
echo "🚨 Issues detected in new deployment!"

# 2. Determine the current active container
if docker compose -p messaging-service ps -q messaging-green | grep -q .; then
    # Green is active, roll back to Blue
    ACTIVE=messaging-green
    ROLLBACK=messaging-blue
    ACTIVE_PORT=3012
    ROLLBACK_PORT=3011
else
    # Blue is active, roll back to Green
    ACTIVE=messaging-blue
    ROLLBACK=messaging-green
    ACTIVE_PORT=3011
    ROLLBACK_PORT=3012
fi
echo "🔄 Rolling back from $ACTIVE to $ROLLBACK"

# 3. Restart the old container first if it was stopped
# (so traffic never points at a dead port)
docker compose -p messaging-service start $ROLLBACK

# 4. Switch Nginx back to the old container
# (both directions use the same substitution pair, so no branching needed)
sudo sed -i "s/server localhost:$ACTIVE_PORT;/server localhost:$ROLLBACK_PORT;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf
sudo sed -i "s/server localhost:$ROLLBACK_PORT backup;/server localhost:$ACTIVE_PORT backup;/" /etc/nginx/conf.d/messaging.goodtv.co.kr.conf

# 5. Reload Nginx (instant switch)
sudo nginx -t && sudo nginx -s reload

echo "✅ Rollback complete! Active: $ROLLBACK"
```
Total rollback time: ~30-60 seconds (most of that is restarting the old container if it was stopped)
Why Blue-Green Excels at Rollbacks
| Strategy | Rollback Time | Complexity | Data Loss Risk |
|---|---|---|---|
| Rolling | 5-10 min | High | Medium |
| Canary | 2-5 min | High | Low |
| Blue-Green | 30-60s | Low | None |
The secret: Old container is still alive (just stopped). Starting it takes seconds.
Gotchas and Lessons Learned
1. Health Checks Are Non-Negotiable
Initial mistake: No health check, just sleep 30 after starting container.
Problem: Sometimes app took 40+ seconds to start, leading to:
- Nginx switched before app ready
- 502 Bad Gateway errors
- Partial downtime
Solution: Proper Docker health checks with retries.
```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3002/health/live"]
  interval: 10s
  timeout: 5s
  retries: 5         # Tolerate up to 5 consecutive failures
  start_period: 20s  # Grace period
```
Result: Zero false positives, zero premature switches.
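A related sanity check on the numbers: the deploy script caps the wait at `MAX_RETRIES=30` polls with a 5-second sleep, which is where the "up to 2.5 min" ceiling comes from:

```shell
# 30 polls x 5s sleep = ceiling in seconds before the deploy aborts
CEILING=$((30 * 5))
echo "${CEILING}s"  # 150s = 2.5 minutes
```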
2. Graceful Shutdown Matters
Initial mistake: docker stop immediately after switch.
Problem: In-flight requests aborted mid-processing, causing:
- Failed push notifications
- Incomplete database transactions
- BullMQ jobs marked as failed
Solution: 30-second drain period.
```bash
# Wait for requests to complete
sleep 30
# Now safe to stop
docker stop $OLD
```
Result: Zero dropped requests during deployment.
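One more knob in the same vein: `docker stop` sends SIGTERM and, after Docker's default 10-second grace period, SIGKILL. If a worker may need longer to wind down after the 30s drain, the per-service grace period can be raised in Compose (the value below is an assumption, not part of the original setup):

```yaml
services:
  messaging-blue:
    stop_grace_period: 60s  # Time Docker waits between SIGTERM and SIGKILL (default: 10s)
```

This only helps if the app actually handles SIGTERM and exits cleanly once its in-flight work is done.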
3. Disk Space Monitoring
Problem: Docker images accumulated, filling disk.
```
df -h
# /dev/xvda1  100G   92G  8.0G  92% /   <- Uh oh
```
Solution: Automated cleanup in deploy script.
```bash
# Remove dangling images (layers from old builds)
docker image prune -f

# Optional: weekly cron job for aggressive cleanup
# crontab -e
# 0 2 * * 0 docker system prune -af --volumes
```
Result: Disk usage stable at 45-50%.
4. Redis Persistence Configuration
Initial mistake: Redis without persistence.
Problem: Container restart = lost BullMQ job queue.
Solution: Redis AOF (Append-Only File).
```yaml
redis-messaging:
  image: redis:7-alpine
  # --appendonly yes enables AOF persistence (a comment inside the folded
  # `>` scalar would be passed to redis-server as a literal argument)
  command: >
    redis-server
    --appendonly yes
    --requirepass password123
```
Result: Job queue survives deployments.
5. SELinux Permissions (Amazon Linux)
Problem: Nginx couldn't access Docker containers.
```
sudo nginx -t
# nginx: [emerg] connect() to localhost:3011 failed (13: Permission denied)
```
Solution: SELinux policy adjustment.
```bash
# Allow Nginx to connect to the network
sudo setsebool -P httpd_can_network_connect 1

# Or create a custom policy (more secure)
sudo semanage port -a -t http_port_t -p tcp 3011
sudo semanage port -a -t http_port_t -p tcp 3012
```
Result: Nginx successfully proxies to containers.
6. Git Reset Issues
Initial mistake: git pull failing due to local changes.
```
git pull
# error: Your local changes to the following files would be overwritten
```
Solution: Hard reset to remote.
```bash
git fetch --all
git reset --hard origin/main  # Force match remote
```
Trade-off: local changes on the server are lost. Mitigation: never edit code directly on EC2.
Performance Impact Analysis
Resource Usage During Deployment
t3.large specs: 2 vCPU, 8 GB RAM, General Purpose SSD
| Phase | CPU | Memory | Duration |
|---|---|---|---|
| Normal operation | 18-22% | 3.2 GB | - |
| Docker build | 65-80% | 4.0 GB | 2-3 min |
| Both containers | 35-45% | 4.8 GB | 30-60s |
| New container only | 20-25% | 3.4 GB | - |
Conclusion: t3.large has ample headroom for Blue-Green deployment.
Network Traffic
During deployment:
- Inbound: Normal (users unaffected)
- Outbound: +50 MB (npm install, git pull)
- No external Docker registry traffic (build locally)
Database Connections
Concern: Both containers connecting simultaneously?
Reality:
- Old container: Draining, no new connections
- New container: Fresh connections
- Total: Never exceeds normal load
MSSQL connection pool: 50 max, typically use 8-12.
Production Metrics: 6 Months Later
After 6 months and 150+ deployments:
| Metric | Result |
|---|---|
| Total deployments | 156 |
| Failed deployments | 2 (1.3%) |
| Rollbacks needed | 1 (0.6%) |
| User-reported issues | 0 |
| Average deployment time | 3m 47s |
| Downtime per deployment | 0s |
| Total uptime | 99.97% |
Failure causes:
- Docker build failed (TypeScript error) → caught before the switch
- Health check timeout (DB connection issue) → deployment aborted
- Rollback: memory leak in the new version → rolled back in 45 seconds
Key insight: The system is self-healing. Failed deployments never affect users.
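For scale on that uptime figure: 99.97% over six months still budgets a small amount of downtime, and quick arithmetic shows how little:

```shell
# Downtime budget: 0.03% of ~182 days, in minutes
MINUTES=$(awk 'BEGIN { printf "%.0f", 182 * 24 * 60 * 0.0003 }')
echo "~${MINUTES} minutes over six months"
```

Per the table above, none of that budget was consumed by deployments themselves.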
Cost Analysis
Infrastructure Costs (Monthly)
| Resource | Specs | Cost |
|---|---|---|
| EC2 t3.large | 2 vCPU, 8 GB RAM | $60.74 |
| EBS SSD | 100 GB | $10.00 |
| Data transfer | ~50 GB/month | $4.50 |
| **Total** | | **$75.24** |
GitHub Actions: Free tier (2,000 minutes/month) - using ~200 minutes/month
Alternative Costs (for comparison)
| Solution | Monthly Cost | Setup Time |
|---|---|---|
| Current (GitHub Actions + EC2) | $75 | 2 hours |
| Jenkins on EC2 | $95 (t3.medium for Jenkins + t3.large for app) | 1-2 days |
| AWS CodeDeploy | $85 (EC2 + ALB + CodeDeploy) | 4-6 hours |
| ECS Fargate Blue-Green | $120 (Fargate + ALB) | 1 day |
| Kubernetes (EKS) | $220 (EKS + nodes + ALB) | 2-3 days |
Winner: Current solution offers best cost/simplicity ratio for small-scale services.
When NOT to Use This Approach
This Blue-Green + Docker + GitHub Actions setup is great for small-to-medium services, but has limits:
Scale Limitations
❌ Don't use this approach when:
- Traffic > 10,000 RPS: Single EC2 instance bottleneck
- Multi-region deployment: No built-in geo-routing
- 10+ services: Manual script maintenance becomes unwieldy
- Team size > 10: Need centralized CI/CD governance
✅ Better alternatives:
- High scale: Kubernetes (EKS), AWS ECS
- Multi-region: AWS Global Accelerator + ALB
- Microservices: Kubernetes, Service Mesh (Istio)
- Large teams: Jenkins, GitLab CI, ArgoCD
Complexity Trade-offs
This approach optimizes for:
- ✅ Simplicity (1 server, shell script)
- ✅ Cost (minimal infrastructure)
- ✅ Speed (fast iteration)
If you need:
- Advanced traffic management (canary %, A/B testing)
- Auto-scaling (horizontal pod autoscaler)
- Service mesh (Istio, Linkerd)
- Observability (Prometheus, Grafana, Jaeger)
Then invest in Kubernetes. The operational complexity pays off at scale.
Future Improvements
Short-term (planned)
- Automated smoke tests after deployment

  ```bash
  # After the Nginx switch, before stopping the old container
  curl -f https://messaging.goodtv.co.kr/health || rollback
  ```

- Slack notifications

  ```bash
  curl -X POST $SLACK_WEBHOOK \
    -d "{\"text\": \"✅ Deployment successful: $NEW\"}"
  ```

- Rollback button (GitHub Actions workflow_dispatch)

  ```yaml
  on:
    workflow_dispatch:
      inputs:
        target:
          description: 'Container to switch to (blue/green)'
          required: true
  ```
Long-term (if needed)
- Canary deployment: Route 10% traffic to new version first
- Auto-scaling: Add more EC2 instances behind ALB
- Monitoring: Integrate Prometheus + Grafana
- Database migrations: Separate migration job before deployment
Conclusion
Building a production-grade CI/CD pipeline doesn't require expensive enterprise tools or complex Kubernetes clusters. With GitHub Actions, Docker multi-stage builds, and a well-crafted shell script, I achieved:
- True zero-downtime deployments (0 seconds user impact)
- Fast iteration (4-minute deployments)
- Easy rollbacks (30-60 seconds)
- Low cost ($75/month infrastructure)
- High reliability (99.97% uptime over 6 months)
The key decisions:
- GitHub Actions over Jenkins: Managed infrastructure, zero maintenance
- Blue-Green over Canary: Simpler for long-running jobs, instant rollback
- No Docker Hub: Direct build on EC2, fewer moving parts
- Multi-stage Dockerfile: 3.4x smaller images, faster deployments
- Health checks: No premature switches, reliable deployments
For small-to-medium services (< 10,000 RPS, single region, small team), this architecture hits the sweet spot of simplicity, reliability, and cost.
Key Takeaways
- Blue-Green deployment eliminates downtime for services with long-running processes
- GitHub Actions provides enterprise-grade CI/CD without operational overhead
- Multi-stage Dockerfiles reduce image size 3-4x
- Health checks are mandatory for reliable container switching
- Nginx upstream with backup provides instant traffic switching
- 30-second drain period prevents dropped in-flight requests
- Keeping stopped containers enables instant rollback
- Shell scripts can be production-grade with proper error handling
- Choose simplicity over features until you outgrow it