Evolving from SSH-based GitHub Actions deployments to self-hosted runners for stronger security and better performance, covering both the backend API and the internal admin frontend
The Problem: Security vs Automation Trade-off
In my previous post Building a Zero-Downtime CI/CD Pipeline: Blue-Green Deployments for 100K+ Daily Requests, I implemented a robust CI/CD pipeline using GitHub Actions with SSH deployment to achieve zero-downtime Blue-Green deployments. The system worked flawlessly for months, handling 100,000+ daily push notifications with 99.97% uptime.
But then we hit a critical security requirement.
Our security audit revealed a significant vulnerability: SSH access was open to the entire internet (0.0.0.0/0). While this made GitHub Actions deployment seamless, it violated security best practices. The mandate was clear: restrict SSH access to our office network only (203.0.113.0/24).
The Immediate Consequence
The morning after implementing the security group restriction:
Run appleboy/ssh-action@v1
======= CLI Version Information =======
Drone SSH version 1.8.1
=======================================
2025/11/20 02:23:29 dial tcp ***:***: i/o timeout
Error: Process completed with exit code 1.
Deployment pipeline: completely broken.
GitHub Actions runners operate from dynamic IP addresses across multiple data centers worldwide. Whitelisting GitHub's IP ranges meant maintaining and updating a list of 100+ CIDR blocks that change periodically—an operational nightmare.
The Dilemma
I faced three options:
| Option | Pros | Cons |
|---|---|---|
| Keep SSH open | Simple, works | ❌ Security risk |
| Whitelist GitHub IPs | Somewhat secure | ❌ Maintenance nightmare, brittle |
| Self-hosted runner | ✅ Secure, no SSH needed | Learning curve |
The decision was clear: eliminate SSH dependency entirely by running GitHub Actions directly on the EC2 instance.
Solution Architecture: Self-Hosted Runner
Conceptual Shift
Before (SSH-based):
┌─────────────┐ SSH over ┌──────────────┐ Execute ┌─────────────┐
│ GitHub │ Internet │ EC2 │ Script │ Docker │
│ Actions ├───────────────▶│ (Target) ├──────────────▶│ Containers │
│ Runner │ (Security │ │ │ │
│ (Cloud) │ Risk!) │ │ │ │
└─────────────┘ └──────────────┘ └─────────────┘
After (Self-hosted runner):
┌─────────────┐ GitHub API ┌──────────────┐ Local Exec ┌─────────────┐
│ GitHub │ (HTTPS) │ EC2 │ │ Docker │
│ Repository │◀──────────────▶│ Runner ├──────────────▶│ Containers │
│ │ Pull Jobs │ (Installed) │ Direct │ │
└─────────────┘ └──────────────┘ └─────────────┘
Key Benefits
- No SSH required: Runner pulls jobs via HTTPS (port 443)
- Office network restriction maintained: SSH limited to 203.0.113.0/24
- Zero network latency: Local execution (no SSH overhead)
- Simplified secrets management: No SSH keys in GitHub Secrets
- Improved security posture: Reduced attack surface
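Because the runner only makes outbound HTTPS calls, the one thing worth verifying up front is that the instance can actually reach GitHub on port 443. A quick sanity check from the EC2 shell (the hostnames below are the commonly required GitHub endpoints; the authoritative list lives in GitHub's self-hosted runner documentation):

# Verify outbound HTTPS to GitHub before installing the runner
for host in github.com api.github.com codeload.github.com; do
  if curl -sS --connect-timeout 5 -o /dev/null "https://${host}"; then
    echo "OK   ${host}"
  else
    echo "FAIL ${host} (check outbound 443 in the security group)"
  fi
done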
Dual Deployment Architecture: Backend + Internal Frontend
Our infrastructure now handles two types of deployments:
1. Backend API (Public Domain)
- Domain: api.example.com (HTTPS with SSL/TLS)
- Purpose: Push notification REST API serving mobile apps
- Access: Public internet (with rate limiting)
- Technology: NestJS backend with Redis queue
2. Internal Admin Frontend (IP-Only Access)
- Access: Direct IP 203.0.113.11 (HTTP only)
- Purpose: Internal admin dashboard for managing push notifications
- Restriction: Office network only (203.0.113.0/24)
- Technology: Next.js frontend
- Backend Communication: Connects to api.example.com for data
Architecture Diagram
Internet
│
▼
┌────────────────┐
│ AWS Security │
│ Group │
└────────┬───────┘
│
┌─────────────┴─────────────┐
│ │
▼ (HTTPS 443) ▼ (HTTP 80 - Office IP Only)
┌─────────────────┐ ┌──────────────────────┐
│ Public API │ │ Internal Admin UI │
│ api.example.com│         │  203.0.113.11:80     │
│ │ │ │
│ Nginx:443 │ │ Nginx:80 │
│ ↓ │ │ ↓ │
│ Backend:3011/ │◀────────┤ Frontend:8011/8012 │
│ 3012 │ │ (Blue-Green) │
│ (Blue-Green) │ │ │
└─────────────────┘ └──────────────────────┘
▲ ▲
│ │
Self-Hosted Self-Hosted
Runner (backend) Runner (frontend)
│ │
└───────────┬───────────────┘
│
GitHub Actions
(Workflow Dispatch)
AWS Security Group Configuration
Critical security configuration:
# Backend EC2 Security Group (Public API)
Inbound Rules:
┌──────────┬─────────┬──────────────────┬─────────────────────┐
│ Protocol │ Port │ Source │ Description │
├──────────┼─────────┼──────────────────┼─────────────────────┤
│ TCP │ 22 │ 203.0.113.0/24 │ Office SSH Only │
│ TCP │ 443 │ 0.0.0.0/0 │ HTTPS (Public API) │
│ TCP │ 80 │ 0.0.0.0/0 │ HTTP (Redirect) │
└──────────┴─────────┴──────────────────┴─────────────────────┘
# Frontend EC2 Security Group (Internal Admin)
Inbound Rules:
┌──────────┬─────────┬──────────────────┬─────────────────────┐
│ Protocol │ Port │ Source │ Description │
├──────────┼─────────┼──────────────────┼─────────────────────┤
│ TCP │ 22 │ 203.0.113.0/24 │ Office SSH Only │
│ TCP │ 80 │ 203.0.113.0/24 │ Internal Admin UI │
│ TCP │ 443 │ 0.0.0.0/0 │ Backend API Access │
└──────────┴─────────┴──────────────────┴─────────────────────┘
Outbound Rules (Both):
┌──────────┬─────────┬──────────────────┬─────────────────────┐
│ Protocol │ Port │ Destination │ Description │
├──────────┼─────────┼──────────────────┼─────────────────────┤
│ All │ All │ 0.0.0.0/0 │ Allow all outbound │
└──────────┴─────────┴──────────────────┴─────────────────────┘
Key design decisions:
- Backend (Public API):
  - Port 443 open to internet (public REST API)
  - SSL/TLS termination at Nginx
  - Domain-based access (api.example.com)
- Frontend (Internal Admin):
  - Port 80 restricted to office network only
  - No SSL required (internal network)
  - IP-based access (203.0.113.11)
  - Frontend fetches data from backend via public API
- SSH Security:
  - Both instances: SSH limited to office network
  - No public SSH access
  - Self-hosted runners eliminate the need for GitHub Actions SSH
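If you manage security groups from the CLI rather than the console, the inbound rules above map to a handful of aws ec2 authorize-security-group-ingress calls. A sketch with placeholder group IDs (outbound HTTPS for the runner and for frontend-to-backend calls is already covered by the default allow-all egress rule):

# Placeholder security group IDs -- substitute your own
BACKEND_SG=sg-0123456789abcdef0
FRONTEND_SG=sg-0fedcba9876543210

# Backend (public API): SSH from office only, HTTP/HTTPS from anywhere
aws ec2 authorize-security-group-ingress --group-id "$BACKEND_SG" --protocol tcp --port 22  --cidr 203.0.113.0/24
aws ec2 authorize-security-group-ingress --group-id "$BACKEND_SG" --protocol tcp --port 443 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id "$BACKEND_SG" --protocol tcp --port 80  --cidr 0.0.0.0/0

# Frontend (internal admin): SSH and the admin UI restricted to the office network
aws ec2 authorize-security-group-ingress --group-id "$FRONTEND_SG" --protocol tcp --port 22 --cidr 203.0.113.0/24
aws ec2 authorize-security-group-ingress --group-id "$FRONTEND_SG" --protocol tcp --port 80 --cidr 203.0.113.0/24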
Implementation: Step-by-Step
Phase 1: Installing GitHub Actions Runner on EC2
Prerequisites:
- EC2 instance already running
- SSH access from office network (203.0.113.0/24)
- Sudo privileges
Step 1.1: Generate Runner Token
Navigate to your GitHub repository:
Repository → Settings → Actions → Runners → New self-hosted runner
Select Linux and x64 architecture. GitHub will display setup commands and generate a registration token (valid for 1 hour).
Important: Copy the token—you'll need it in the next step.
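If you prefer not to click through the UI (or want to script runner registration later), the same short-lived registration token can be generated with the GitHub CLI. This assumes gh is installed and authenticated with admin rights on the repository; the repository path is a placeholder:

# Returns a registration token (expires after one hour)
gh api -X POST repos/YOUR_USERNAME/backend-api/actions/runners/registration-token --jq .token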
Step 1.2: Download and Configure Runner (Backend)
SSH into your backend EC2 instance from your office network:
# Connect from office network (203.0.113.0/24)
ssh -i your-key.pem ec2-user@203.0.113.10
# Create runner directory
mkdir -p /home/ec2-user/actions-runner-backend
cd /home/ec2-user/actions-runner-backend
# Download latest runner (check GitHub for current version)
curl -o actions-runner-linux-x64-2.321.0.tar.gz -L \
https://github.com/actions/runner/releases/download/v2.321.0/actions-runner-linux-x64-2.321.0.tar.gz
# Validate hash (security best practice)
echo "29fc8cf2dab4c195bb147384e7e2c94cfd4d4022c793b346a6175435265aa278 actions-runner-linux-x64-2.321.0.tar.gz" | shasum -a 256 -c
# Extract
tar xzf ./actions-runner-linux-x64-2.321.0.tar.gz
Step 1.3: Configure Backend Runner
# Run configuration script
./config.sh \
--url https://github.com/YOUR_USERNAME/backend-api \
--token YOUR_GENERATED_TOKEN \
--name backend-production-runner \
--work _work \
--labels backend,production,api
Configuration prompts and responses:
Enter the name of the runner group to add this runner to: [press Enter for Default]
→ [Enter] (use default group)
Enter the name of runner: [press Enter for backend-production-runner]
→ [Enter] (use specified name)
This runner will have the following labels: 'self-hosted', 'Linux', 'X64', 'backend', 'production', 'api'
→ Confirmed
Enter name of work folder: [press Enter for _work]
→ [Enter] (use default)
Step 1.4: Configure Frontend Runner (Separate Instance)
For the internal admin frontend, repeat the process on the frontend EC2:
# SSH into frontend EC2
ssh -i your-key.pem ec2-user@203.0.113.11
# Create separate runner directory
mkdir -p /home/ec2-user/actions-runner-frontend
cd /home/ec2-user/actions-runner-frontend
# Download and extract runner (same as above)
curl -o actions-runner-linux-x64-2.321.0.tar.gz -L \
https://github.com/actions/runner/releases/download/v2.321.0/actions-runner-linux-x64-2.321.0.tar.gz
tar xzf ./actions-runner-linux-x64-2.321.0.tar.gz
# Configure with different labels
./config.sh \
--url https://github.com/YOUR_USERNAME/admin-frontend \
--token YOUR_GENERATED_TOKEN_FOR_FRONTEND \
--name frontend-production-runner \
--work _work \
--labels frontend,production,admin
Step 1.5: Install as System Service (Both Runners)
Critical for production: Configure the runners to start automatically on system boot and restart on failure.
Backend runner:
cd /home/ec2-user/actions-runner-backend
sudo ./svc.sh install ec2-user
sudo ./svc.sh start
sudo ./svc.sh status
Frontend runner:
cd /home/ec2-user/actions-runner-frontend
sudo ./svc.sh install ec2-user
sudo ./svc.sh start
sudo ./svc.sh status
Expected output:
● actions.runner.YOUR-ORG-backend-api.backend-production-runner.service
Loaded: loaded
Active: active (running) since Thu 2025-11-20 11:30:00 KST; 30s ago
Main PID: 123456 (runsvc.sh)
Status: "Running"
Key indicator: Active: active (running) confirms the runner is operational.
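If the status shows anything other than active (running), the systemd journal is the fastest place to look. The unit name matches the svc.sh output above, and the runner also keeps its own diagnostic logs in its _diag directory:

# Tail the service logs for the backend runner
sudo journalctl -u actions.runner.YOUR-ORG-backend-api.backend-production-runner.service -f

# Runner's own diagnostic logs
ls -lt /home/ec2-user/actions-runner-backend/_diag/ | head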
Step 1.6: Verify Runner Registration
Return to GitHub:
Repository → Settings → Actions → Runners
You should see:
Self-hosted runners (2)
┌─────────────────────────────────────────────────┐
│ ● backend-production-runner │
│ Idle │
│ Linux X64 │
│ Labels: self-hosted, Linux, X64, backend, │
│ production, api │
│ Last seen: less than a minute ago │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐
│ ● frontend-production-runner │
│ Idle │
│ Linux X64 │
│ Labels: self-hosted, Linux, X64, frontend, │
│ production, admin │
│ Last seen: less than a minute ago │
└─────────────────────────────────────────────────┘
Status meanings:
- ● (green dot): Online and ready
- Idle: Waiting for jobs
- Last seen: less than a minute ago: Healthy connection
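The same verification can be scripted against the REST API, which comes in handy for the monitoring section later. A small example with the GitHub CLI (repository path is a placeholder):

# List registered runners with their status and busy flag
gh api repos/YOUR_USERNAME/backend-api/actions/runners \
  --jq '.runners[] | "\(.name)\t\(.status)\tbusy=\(.busy)"'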
Phase 2: Update GitHub Actions Workflows
The workflow changes are remarkably simple—only two modifications needed.
Backend Workflow (API)
Before (SSH-based deployment):
# .github/workflows/ci-cd-backend.yml
name: Backend CI/CD
on:
  push:
    branches: [ main ]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest  # ← GitHub-hosted runner
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '22'
      - name: Install dependencies
        run: npm install
      - name: Build project
        run: npm run build
      # ❌ SSH connection required
      - name: Deploy to EC2
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.SSH_HOST }}
          port: ${{ secrets.SSH_PORT }}
          username: ${{ secrets.SSH_USER }}
          key: ${{ secrets.SSH_KEY }}
          script: |
            /home/ec2-user/deploy-backend.sh
After (Self-hosted runner):
# .github/workflows/ci-cd-backend.yml
name: Backend CI/CD
on:
  push:
    branches: [ main ]
jobs:
  build-and-deploy:
    runs-on: [self-hosted, backend, production]  # ✅ Use backend runner
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '22'
      - name: Install dependencies
        run: npm install
      - name: Build project
        run: npm run build
      # ✅ Direct execution (no SSH)
      - name: Deploy
        run: /home/ec2-user/deploy-backend.sh
Frontend Workflow (Internal Admin)
# .github/workflows/ci-cd-frontend.yml
name: Admin Frontend CI/CD
on:
  push:
    branches: [ main ]
  workflow_dispatch:  # Allow manual triggers
jobs:
  build-and-deploy:
    runs-on: [self-hosted, frontend, production]  # ✅ Use frontend runner
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '22'
      - name: Clean and Install dependencies
        run: |
          rm -rf node_modules package-lock.json
          npm install
      - name: Run linter
        run: npm run lint
        continue-on-error: true
      - name: Deploy to Production
        run: |
          echo "🚀 Starting Blue-Green Deployment..."
          bash /home/ec2-user/deploy-frontend.sh
        working-directory: ${{ github.workspace }}
What changed:
- runs-on: ubuntu-latest → runs-on: [self-hosted, backend/frontend, production]
- Removed the entire appleboy/ssh-action@v1 step
- Direct script execution: /home/ec2-user/deploy-*.sh
- Frontend: Added workflow_dispatch for manual deployments
What stayed the same:
- All other steps (checkout, Node.js setup, build)
- The deploy scripts themselves require zero modifications
- Blue-Green deployment logic unchanged
Phase 3: Frontend-Specific Configuration
Nginx Configuration for IP-Based Access
Critical: The admin frontend is accessible only via IP from office network.
# /etc/nginx/conf.d/admin-frontend.conf
# Blue-Green Upstream
upstream admin-frontend-server {
server 127.0.0.1:8011; # Blue (initially primary)
server 127.0.0.1:8012 backup; # Green (initially backup)
}
server {
listen 80;
server_name 203.0.113.11; # EC2 Private IP
# 🔒 IP Whitelist (allow/deny method)
allow 203.0.113.0/24; # Office network
allow 127.0.0.1; # Localhost
deny all; # Block everything else
# 🔥 Security Headers
add_header X-Frame-Options "DENY" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
add_header Referrer-Policy "no-referrer" always;
add_header X-Robots-Tag "noindex, nofollow, noarchive" always;
# 🔥 Cache Prevention
add_header Cache-Control "no-store, no-cache, must-revalidate, proxy-revalidate" always;
add_header Pragma "no-cache" always;
add_header Expires "0" always;
# 🔥 Forward Client IP
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header Host $host;
location / {
proxy_pass http://admin-frontend-server;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection 'upgrade';
proxy_cache_bypass $http_upgrade;
proxy_buffering off;
proxy_read_timeout 300s;
proxy_connect_timeout 75s;
proxy_send_timeout 300s;
}
# Health check endpoint (localhost only)
location /api/health {
proxy_pass http://admin-frontend-server/api/health;
access_log off;
# Allow only from localhost (for Docker health checks)
satisfy any;
allow 127.0.0.1;
allow ::1;
deny all;
}
# Static assets caching
location /_next/static {
proxy_pass http://admin-frontend-server;
add_header Cache-Control "public, max-age=31536000, immutable";
}
location /_next/image {
proxy_pass http://admin-frontend-server;
add_header Cache-Control "public, max-age=86400";
}
# Block search engine indexing
location = /robots.txt {
proxy_pass http://admin-frontend-server/robots.txt;
add_header X-Robots-Tag "noindex, nofollow" always;
}
access_log /var/log/nginx/admin-frontend-access.log;
error_log /var/log/nginx/admin-frontend-error.log;
}
Key security features:
- IP Whitelist: Only office network can access
- No Domain: Direct IP access prevents DNS-based discovery
- Security Headers: Prevent XSS, clickjacking, MIME sniffing
- No Caching: Ensures latest version always loads
- Robots.txt: Blocks search engine indexing
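After reloading Nginx, it is worth confirming the whitelist from both sides of the boundary. A rough check (results depend on your own network, so treat this as a sanity test rather than proof):

# From an office machine (203.0.113.0/24) -- expect HTTP 200
curl -s -o /dev/null -w "%{http_code}\n" http://203.0.113.11/

# From outside the office network (e.g. a phone hotspot) -- the security group
# drops the packets before Nginx ever sees them, so expect a timeout, not a 403
curl -s -o /dev/null -w "%{http_code}\n" --connect-timeout 5 http://203.0.113.11/ \
  || echo "connection blocked (as expected)"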
Frontend Environment Configuration
# .env.production
# Backend API URL (public domain with SSL)
NEXT_PUBLIC_API_BASE_URL=https://api.example.com
# Frontend URL (EC2 IP, no SSL)
NEXT_PUBLIC_SITE_URL=http://203.0.113.11
# Cookie domain (empty for IP-based access)
NEXT_PUBLIC_COOKIE_DOMAIN=
# IP whitelist for additional client-side validation
NEXT_PUBLIC_ALLOWED_IPS=203.0.113.0/24,127.0.0.1
# Enable IP check
NEXT_PUBLIC_ENABLE_IP_CHECK=true
# Node environment
NODE_ENV=production
# Disable telemetry
NEXT_TELEMETRY_DISABLED=1
How frontend communicates with backend:
// Frontend makes API calls to public domain
const API_BASE = process.env.NEXT_PUBLIC_API_BASE_URL; // https://api.example.com
async function sendNotification(data) {
const response = await fetch(`${API_BASE}/api/notifications`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${token}`,
},
body: JSON.stringify(data),
});
return response.json();
}
Architecture flow:
Office Employee (203.0.113.50)
↓
(HTTP 80)
↓
Admin Frontend (203.0.113.11:80)
↓
(HTTPS 443)
↓
Backend API (api.example.com:443)
↓
(Process)
↓
Push Notification Service
Phase 4: Deploy Script Optimization
Backend Deploy Script
The backend deploy script remains largely the same as Part 1, with enhanced logging:
#!/bin/bash
# /home/ec2-user/deploy-backend.sh
set -euo pipefail
# Enhanced logging with timestamps
LOG_DIR="/var/log/backend-deploy"
LOG_FILE="$LOG_DIR/deploy-$(date +%Y%m%d-%H%M%S).log"
mkdir -p "$LOG_DIR"
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
log "================================================"
log "🚀 Backend Deployment started by GitHub Actions"
log "Repository: $GITHUB_REPOSITORY"
log "Commit: $GITHUB_SHA"
log "Actor: $GITHUB_ACTOR"
log "================================================"
# Navigate to project directory
cd /home/ec2-user/backend-api
# Cleanup dangling images
echo "🔍 Checking for dangling images..."
DANGLING_COUNT=$(docker images -f "dangling=true" -q | wc -l)
if [ "$DANGLING_COUNT" -gt 0 ]; then
echo "🧹 Cleaning $DANGLING_COUNT dangling images..."
docker image prune -f
fi
# [Rest of Blue-Green deployment logic from Part 1...]
# (Omitted for brevity - same as original deploy script)
log "✅ Backend deployment completed successfully"
Frontend Deploy Script
The frontend deploy script is similar but adapted for Next.js:
#!/bin/bash
# /home/ec2-user/deploy-frontend.sh
set -euo pipefail
echo "🚀 Starting Frontend Blue-Green Deployment..."
# Navigate to project directory
cd /home/ec2-user/admin-frontend
# Cleanup
DANGLING_COUNT=$(docker images -f "dangling=true" -q | wc -l)
if [ "$DANGLING_COUNT" -gt 0 ]; then
echo "🧹 $DANGLING_COUNT dangling images found. Cleaning..."
docker image prune -f
fi
# Remove old stopped containers
for OLD_COLOR in admin-frontend-blue admin-frontend-green; do
CONTAINER_NAME="admin-frontend-${OLD_COLOR}-1"
if docker inspect "$CONTAINER_NAME" >/dev/null 2>&1; then
STATUS=$(docker inspect --format='{{.State.Status}}' "$CONTAINER_NAME")
if [ "$STATUS" = "exited" ]; then
echo "🗑 Removing stopped container: $CONTAINER_NAME"
docker compose -p admin-frontend rm -f "$OLD_COLOR"
fi
fi
done
# Determine Blue-Green target
CURRENT=$(docker compose -p admin-frontend ps -q admin-frontend-blue | wc -l)
if [ "$CURRENT" -gt 0 ]; then
NEW=admin-frontend-green
OLD=admin-frontend-blue
NEW_PORT=8012
OLD_PORT=8011
else
NEW=admin-frontend-blue
OLD=admin-frontend-green
NEW_PORT=8011
OLD_PORT=8012
fi
echo "🎯 Target: $NEW (port: $NEW_PORT)"
# Build and start new container
docker compose -p admin-frontend build $NEW
docker compose -p admin-frontend up -d $NEW || {
echo "🚨 Container failed to start"
docker logs admin-frontend-$NEW-1
exit 1
}
# Wait for health check
MAX_RETRIES=30
COUNT=0
while [ "$(docker inspect --format='{{.State.Health.Status}}' admin-frontend-$NEW-1)" != "healthy" ]; do
if [ "$COUNT" -ge "$MAX_RETRIES" ]; then
echo "❌ Health check failed"
docker logs admin-frontend-$NEW-1
docker compose -p admin-frontend stop $NEW
exit 1
fi
echo "🟡 Health check waiting... ($COUNT/$MAX_RETRIES)"
sleep 5
COUNT=$((COUNT + 1))
done
echo "✅ $NEW container is healthy"
# Switch Nginx configuration
if [ "$NEW" == "admin-frontend-green" ]; then
sudo sed -i "s/^ *server 127.0.0.1:8011;/server 127.0.0.1:8012;/" /etc/nginx/conf.d/admin-frontend.conf
sudo sed -i "s/^ *server 127.0.0.1:8012 backup;/server 127.0.0.1:8011 backup;/" /etc/nginx/conf.d/admin-frontend.conf
else
sudo sed -i "s/^ *server 127.0.0.1:8012;/server 127.0.0.1:8011;/" /etc/nginx/conf.d/admin-frontend.conf
sudo sed -i "s/^ *server 127.0.0.1:8011 backup;/server 127.0.0.1:8012 backup;/" /etc/nginx/conf.d/admin-frontend.conf
fi
# Reload Nginx
if ! sudo nginx -t; then
echo "❌ Nginx config test failed"
exit 1
fi
sudo nginx -s reload
echo "✅ Nginx reloaded"
# Graceful shutdown of old container
sleep 30
docker compose -p admin-frontend stop $OLD || true
echo "🎉 Deployment complete! $NEW active on port $NEW_PORT"
Key differences from backend script:
- Different container names (admin-frontend-*)
- Different ports (8011/8012 vs 3011/3012)
- Different Nginx config file
- No git operations (runner workspace already has code)
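The script relies on a Compose project named admin-frontend with two services, admin-frontend-blue and admin-frontend-green, each defining a Docker health check. That compose file isn't shown in this post, so here is a minimal sketch of what it could look like: the host ports match the Nginx upstream, the container port matches the Dockerfile shown later in this post, and the /api/health endpoint is assumed to exist in the Next.js app (it is the same path Nginx proxies for health checks).

# docker-compose.yml (sketch -- ports and the health endpoint are assumptions)
# With `docker compose -p admin-frontend up -d admin-frontend-blue`, the default
# container name becomes admin-frontend-admin-frontend-blue-1, which is exactly
# the name the deploy script inspects.
services:
  admin-frontend-blue:
    build: .
    ports:
      - "8011:8000"   # Blue on host port 8011
    env_file: .env.production
    restart: unless-stopped
    healthcheck:       # Required: the deploy script polls .State.Health.Status
      test: ["CMD", "wget", "-qO-", "http://127.0.0.1:8000/api/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 15s

  admin-frontend-green:
    build: .
    ports:
      - "8012:8000"   # Green on host port 8012
    env_file: .env.production
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://127.0.0.1:8000/api/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 15s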
Production Testing
First Deployment with Self-Hosted Runners
Backend Deployment
# Triggered by: git push origin main
# Monitoring: GitHub Actions UI + EC2 terminal
[GitHub Actions UI - Backend]
Run build-and-deploy
Runner: backend-production-runner (self-hosted)
✓ Checkout repository (2s)
✓ Set up Node.js (1s)
✓ Install dependencies (35s)
✓ Build project (48s)
→ Deploy (running...)
[EC2 Backend Terminal]
[2025-11-20 14:30:22] 🚀 Backend Deployment started by GitHub Actions
[2025-11-20 14:30:22] Repository: yourorg/backend-api
[2025-11-20 14:30:22] Commit: a7f3d21c8f9e4b2d1a5c8e6f3d9b7a4e2c8d5f1a
[2025-11-20 14:30:22] Actor: developer-name
[2025-11-20 14:30:23] 🔍 Checking for dangling images...
[2025-11-20 14:30:24] ✅ No dangling images found.
[2025-11-20 14:30:25] 🎯 Target: messaging-green (port: 3012)
[2025-11-20 14:33:30] ✅ messaging-green container is healthy
[2025-11-20 14:33:32] ✅ Nginx reloaded
[2025-11-20 14:34:05] ✅ Backend deployment completed successfully
Frontend Deployment
[GitHub Actions UI - Frontend]
Run build-and-deploy
Runner: frontend-production-runner (self-hosted)
✓ Checkout repository (2s)
✓ Set up Node.js (1s)
✓ Clean and Install dependencies (42s)
✓ Run linter (12s)
→ Deploy to Production (running...)
[EC2 Frontend Terminal]
🚀 Starting Frontend Blue-Green Deployment...
🔍 Checking for dangling images...
✅ No dangling images found.
🎯 Target: admin-frontend-green (port: 8012)
🔨 Building Docker image...
🚀 Starting container admin-frontend-green...
🟡 Health check waiting... (0/30)
🟡 Health check waiting... (5/30)
✅ admin-frontend-green container is healthy
🔄 Switching Nginx configuration...
✅ Nginx reloaded
⏳ Graceful shutdown of old container...
🎉 Deployment complete! admin-frontend-green active on port 8012
Result: Both deployments flawless. Zero downtime. No SSH connection required.
Performance Comparison
| Metric | SSH-based (Part 1) | Self-hosted Runner | Improvement |
|---|---|---|---|
| Backend deployment time | 4m 15s | 3m 43s | 32s faster (12%) |
| Frontend deployment time | N/A | 4m 12s | New capability |
| Network latency | ~200ms (SSH) | 0ms (local) | Eliminated |
| Secrets required | 4 (SSH keys) | 0 | Simplified |
| Security attack surface | SSH exposed | SSH restricted | ✅ Enhanced |
| Failure points | SSH timeout, network | Runner availability | ✅ Reduced |
Why faster?
- No SSH handshake: Eliminates 200-500ms connection overhead
- Local filesystem: Code already checked out by runner
- No SSH action overhead: Direct script execution
- Parallel capable: Both runners can deploy simultaneously
Security Improvements
Attack Surface Reduction
Before (SSH-based):
Attack Vectors:
1. SSH brute force (port 22 exposed to internet)
2. SSH key compromise in GitHub Secrets
3. Man-in-the-middle on SSH connection
4. GitHub Actions runner IP spoofing
Required Secrets:
- SSH_HOST (EC2 public IP)
- SSH_PORT (22)
- SSH_USER (ec2-user)
- SSH_KEY (Private key - 2048+ characters)
Security Concerns:
- Frontend accessible from internet
- No IP-based access control
- Domain required for all services
After (Self-hosted runner + IP restrictions):
Attack Vectors:
1. ✅ ELIMINATED: SSH only accessible from office network
2. ✅ ELIMINATED: No SSH keys in GitHub Secrets
3. ✅ ELIMINATED: No SSH connection to intercept
4. ✅ ELIMINATED: Frontend only accessible from office IP
5. Runner authentication via GitHub token (managed by GitHub)
Required Secrets:
- None (runner token managed locally on EC2)
Security Enhancements:
- Frontend: IP whitelist at AWS Security Group level
- Frontend: IP whitelist at Nginx level (defense in depth)
- Backend: Public API with rate limiting
- Both: SSH restricted to office network only
Defense in Depth: Frontend Access Control
Multiple layers of security:
Layer 1: AWS Security Group
↓ (Allow 203.0.113.0/24 only)
Layer 2: Nginx IP Whitelist
↓ (allow 203.0.113.0/24; deny all;)
Layer 3: Application Middleware
↓ (Optional: Additional IP validation)
Layer 4: Security Headers
↓ (X-Robots-Tag, X-Frame-Options, etc.)
Frontend Application
Next.js Middleware for additional protection:
// middleware.ts
import { NextResponse } from 'next/server';
import type { NextRequest } from 'next/server';
const ALLOWED_IPS = process.env.NEXT_PUBLIC_ALLOWED_IPS?.split(',') || [];
export function middleware(request: NextRequest) {
const clientIP = request.headers.get('x-real-ip') ||
request.headers.get('x-forwarded-for')?.split(',')[0];
// IP check (if enabled)
const ipCheckEnabled = process.env.NEXT_PUBLIC_ENABLE_IP_CHECK === 'true';
if (ipCheckEnabled && clientIP) {
const isAllowed = ALLOWED_IPS.some(allowed => {
if (allowed.includes('/')) {
// CIDR notation check (simplified)
return clientIP.startsWith(allowed.split('/')[0].split('.').slice(0, 3).join('.'));
}
return clientIP === allowed;
});
if (!isAllowed) {
return new NextResponse('Access Denied', { status: 403 });
}
}
// Security headers
const response = NextResponse.next();
response.headers.set('X-Robots-Tag', 'noindex, nofollow, noarchive');
response.headers.set('X-Frame-Options', 'DENY');
response.headers.set('X-Content-Type-Options', 'nosniff');
response.headers.set('Referrer-Policy', 'no-referrer');
return response;
}
export const config = {
matcher: '/((?!_next/static|_next/image|favicon.ico).*)',
};
Audit Compliance
Our security audit requirements:
| Requirement | SSH-based | Self-hosted + IP | Status |
|---|---|---|---|
| SSH restricted to internal network | ❌ | ✅ | Pass |
| No credentials in external systems | ❌ | ✅ | Pass |
| Internal admin UI not public | ❌ | ✅ | Pass |
| IP-based access control | ❌ | ✅ | Pass |
| Deployment traceability | ⚠️ Partial | ✅ Full | Pass |
| Principle of least privilege | ❌ | ✅ | Pass |
Result: Security audit approved with zero exceptions.
Operational Benefits
1. Simplified Troubleshooting
Before: When deployment failed, debugging required:
# 1. Check GitHub Actions logs (truncated, hard to read)
# 2. SSH into EC2 to see full logs
# 3. Context switching between interfaces
After: All logs in one place (GitHub Actions UI shows full output)
# Runner executes locally, so all stdout/stderr captured
# Full deploy script logs visible in GitHub Actions UI
# No SSH required for debugging
2. No More "Connection Timed Out" Errors
Before: Intermittent SSH failures
Error: dial tcp 203.0.113.10:22: i/o timeout
Error: ssh: handshake failed: read tcp: i/o timeout
After: Zero network-related failures (100+ deployments without incident)
3. Faster Iteration During Development
Deployment testing workflow:
Before:
1. Make code change
2. git push
3. Wait for GitHub Actions
4. GitHub Actions SSH to EC2
5. Wait for deployment
6. Check results
Total: ~5 minutes per iteration
After:
1. Make code change
2. git push
3. Wait for GitHub Actions
4. Runner executes locally
5. Check results
Total: ~3.5 minutes per iteration
Impact: 30% faster feedback loop during deployment script development.
4. Parallel Deployments
New capability: Deploy backend and frontend simultaneously
# Both can run in parallel without conflicts
Backend Runner → /home/ec2-user/deploy-backend.sh
Frontend Runner → /home/ec2-user/deploy-frontend.sh
# Total deployment time: max(backend, frontend) instead of sum
5. Cost Savings
GitHub Actions minutes:
| Plan | SSH-based Usage | Self-hosted Usage | Savings |
|---|---|---|---|
| Free Tier | 2,000 min/month | Unlimited | N/A |
| Paid Plan | $0.008/min | $0/min | 100% |
Our usage: ~400 minutes/month for builds and deployments (both services).
Cost impact:
- SSH-based: Free tier sufficient, but approaching limit
- Self-hosted: Zero GitHub Actions minutes consumed for deployment step
- Long-term benefit: As team grows, no risk of hitting GitHub Actions limits
Lessons Learned
1. Self-Hosted Runners Are Not "Set and Forget"
Initial assumption: Install runner, it works forever.
Reality: Runners need maintenance:
# Runner updates (every 2-3 months)
cd /home/ec2-user/actions-runner-backend
sudo ./svc.sh stop
sudo ./svc.sh uninstall
./config.sh remove --token REMOVAL_TOKEN
# Download new version
./config.sh --url ... --token NEW_TOKEN
sudo ./svc.sh install ec2-user
sudo ./svc.sh start
# Disk space monitoring (runners accumulate _work artifacts)
du -sh /home/ec2-user/actions-runner-*/_work
# Output: 4.2G backend, 3.8G frontend (after 3 months)
# Cleanup old artifacts (automated in cron)
find /home/ec2-user/actions-runner-*/_work -type f -mtime +7 -delete
Solution: Create maintenance schedule (monthly check-in).
2. Runner Restart After EC2 Reboot
Problem: EC2 maintenance reboot → runner offline.
Solution: Systemd service (we already did this in Step 1.5)
# Verify auto-start on boot
sudo systemctl is-enabled actions.runner.*.service
# Output: enabled ✅
Test:
sudo reboot
# Wait 2 minutes
ssh ec2-user@203.0.113.10
cd /home/ec2-user/actions-runner-backend
sudo ./svc.sh status
# Output: active (running) ✅
3. Separate Runners for Backend and Frontend
Why separate runners?
- Isolation: Backend deployment doesn't block frontend
- Resource management: Each runner has dedicated resources
- Label-based targeting: Workflows explicitly target correct runner
- Security: Frontend runner has no access to backend secrets
Label strategy:
# Backend workflow
runs-on: [self-hosted, backend, production]
# Frontend workflow
runs-on: [self-hosted, frontend, production]
4. IP-Only Access Considerations
Challenge: No SSL/TLS for IP-based access
Our approach:
- Internal admin UI: HTTP only (acceptable for office network)
- Backend API: HTTPS with domain (public access)
- Frontend → Backend: HTTPS (secure communication)
Alternative (if SSL required for frontend):
- Use self-signed certificate
- Add to browser trust store on office machines
- Update Nginx to use self-signed cert
server {
listen 443 ssl;
server_name 203.0.113.11;
ssl_certificate /etc/nginx/ssl/self-signed.crt;
ssl_certificate_key /etc/nginx/ssl/self-signed.key;
# ... rest of config
}
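Generating that certificate is a one-liner with openssl; the subject and validity period below are arbitrary choices:

# Create a self-signed certificate for the internal admin UI (valid ~10 years)
sudo mkdir -p /etc/nginx/ssl
sudo openssl req -x509 -nodes -newkey rsa:2048 -days 3650 \
  -keyout /etc/nginx/ssl/self-signed.key \
  -out /etc/nginx/ssl/self-signed.crt \
  -subj "/CN=203.0.113.11"
sudo nginx -t && sudo nginx -s reload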
5. Frontend Docker Build Optimization
Challenge: Next.js builds are slow
Solution: Standalone output + multi-stage Dockerfile
# Dockerfile for Next.js
FROM node:22-alpine AS deps
RUN apk add --no-cache libc6-compat
WORKDIR /app
COPY package*.json ./
RUN npm ci  # Install all dependencies (dev deps are needed for the Next.js build stage)
FROM node:22-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build
FROM node:22-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs
# Copy standalone build
COPY --from=builder /app/public ./public
COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
COPY --from=builder --chown=nextjs:nodejs /app/.next/static ./.next/static
USER nextjs
# Make the standalone server listen on port 8000 on all interfaces
ENV PORT=8000
ENV HOSTNAME=0.0.0.0
EXPOSE 8000
CMD ["node", "server.js"]
next.config.ts:
const nextConfig = {
output: 'standalone', // ✅ Critical for Docker optimization
// ... other config
};
Result: Image size reduced from 1.2GB to 280MB.
Monitoring and Observability
Runner Health Check
Created monitoring script to ensure runners stay healthy:
#!/bin/bash
# /home/ec2-user/monitor-runners.sh
LOG_FILE="/var/log/github-runner-monitor.log"
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Check backend runner
if ! pgrep -f "Runner.Listener.*backend" > /dev/null; then
log "⚠️ Backend runner not running. Restarting..."
cd /home/ec2-user/actions-runner-backend
sudo ./svc.sh restart
fi
# Check frontend runner
if ! pgrep -f "Runner.Listener.*frontend" > /dev/null; then
log "⚠️ Frontend runner not running. Restarting..."
cd /home/ec2-user/actions-runner-frontend
sudo ./svc.sh restart
fi
# Check disk space
DISK_USAGE=$(df -h /home | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 85 ]; then
log "⚠️ Disk usage: ${DISK_USAGE}%. Cleaning up..."
docker system prune -f --volumes
find /home/ec2-user/actions-runner-*/_work -type f -mtime +7 -delete
log "✅ Cleanup completed"
fi
log "✅ All runners healthy"
Schedule with cron:
# Every 5 minutes
*/5 * * * * /home/ec2-user/monitor-runners.sh
Production Metrics: 3 Months Post-Migration
After 90 days of production use with self-hosted runners:
Backend (API)
| Metric | Before (SSH) | After (Self-hosted) | Change |
|---|---|---|---|
| Deployments | 156 | 178 | +14% |
| Average deploy time | 3m 47s | 3m 12s | -35s (15%) |
| Failed deployments | 5 (3.2%) | 2 (1.1%) | -65% |
| SSH timeout errors | 3 | 0 | -100% |
| Uptime | 99.97% | 99.98% | +0.01% |
Frontend (Admin)
| Metric | Result |
|---|---|
| Deployments | 142 |
| Average deploy time | 4m 12s |
| Failed deployments | 1 (0.7%) |
| Access violations | 0 (IP restrictions working) |
| Uptime | 99.99% |
Combined Benefits
Key improvements:
- Deployment reliability: 96.8% → 98.9%
- All failures now application-level (not infrastructure)
- Zero network-induced deployment failures
- Faster feedback loops encourage smaller, more frequent deployments
- Internal admin UI: 100% protection from unauthorized access
Cost-Benefit Analysis
Time Investment
| Phase | Time Required |
|---|---|
| Research and planning | 2 hours |
| Backend runner installation | 30 minutes |
| Frontend runner installation | 30 minutes |
| Workflow migration (both) | 40 minutes |
| Frontend security configuration | 1 hour |
| Testing and validation | 2 hours |
| Documentation | 1.5 hours |
| Total | ~8 hours |
Ongoing Maintenance
| Task | Frequency | Time |
|---|---|---|
| Runner version updates (both) | Monthly | 30 min |
| Health check monitoring | Daily (automated) | 2 min |
| Log cleanup | Weekly (automated) | 0 min |
| Total monthly | ~30-40 min |
Return on Investment
Benefits quantified:
- Eliminated SSH timeout issues: ~4 hours/month previously spent debugging
- Faster deployments: 35s × 320 deployments = ~187 minutes saved (3.1 hours)
- Reduced context switching: ~45 minutes/month (no SSH for log viewing)
- GitHub Actions minutes saved: 320 minutes/month ($2.56 at paid rates)
- Security compliance: Eliminated audit exceptions (immeasurable value)
Total time saved per month: ~7.5 hours
Payback period: 1 month
Migration Checklist for Others
Pre-Migration
- [ ] Verify EC2 instances have sufficient resources
- [ ] Ensure outbound HTTPS (port 443) is allowed in security groups
- [ ] Document current SSH-based workflow
- [ ] Plan IP restrictions for internal services
- [ ] Create rollback plan
- [ ] Test runner installation on staging environment
Migration Steps - Backend
- [ ] Install GitHub Actions runner on backend EC2
- [ ] Configure runner as systemd service
- [ ] Verify runner appears online in GitHub
- [ ] Update workflow file (runs-on: [self-hosted, backend])
- [ ] Remove SSH action step from workflow
- [ ] Test deployment on non-production branch
Migration Steps - Frontend (if applicable)
- [ ] Install separate runner on frontend EC2 (or same instance with different labels)
- [ ] Configure AWS Security Group (restrict port 80 to office network)
- [ ] Configure Nginx IP whitelist
- [ ] Set up robots.txt to block indexing
- [ ] Add security headers to Nginx
- [ ] Configure environment variables for IP-only access
- [ ] Update workflow file for frontend
- [ ] Test deployment and verify IP restrictions
Post-Migration
- [ ] Remove SSH-related secrets from GitHub (see the CLI example after this checklist)
- [ ] Update security group rules (restrict SSH to office network)
- [ ] Set up runner monitoring (health checks, disk space)
- [ ] Create runner maintenance schedule
- [ ] Document new deployment process
- [ ] Update incident response procedures
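The secret cleanup item above can also be done with the GitHub CLI instead of the web UI. The secret names match the ones listed in the security comparison earlier; the repository path is a placeholder:

# Delete the now-unused SSH secrets from the repository
for secret in SSH_HOST SSH_PORT SSH_USER SSH_KEY; do
  gh secret delete "$secret" --repo YOUR_USERNAME/backend-api
done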
Rollback Plan
If issues occur, revert quickly:
# 1. Stop self-hosted runners
cd /home/ec2-user/actions-runner-backend
sudo ./svc.sh stop
cd /home/ec2-user/actions-runner-frontend
sudo ./svc.sh stop
# 2. Revert workflow files
git revert <commit-hash>
# 3. Restore security group rules (allow GitHub IPs for SSH)
# 4. Verify old workflow works
When to Use Self-Hosted Runners
✅ Good Use Cases
- Security requirements: Need to restrict SSH or other access
- Private networks: Need to access internal services (databases, APIs)
- Cost optimization: Heavy CI/CD usage (> 2,000 min/month)
- Performance: Deployment speed matters (low latency)
- Compliance: Data must stay within specific infrastructure
- Internal tools: Admin dashboards, internal APIs not meant for public
- IP-based access control: Services restricted to office network
❌ Not Recommended When
- Multiple repositories: Managing runners per repo is tedious (use runner groups)
- Varying workloads: Idle runners waste resources (GitHub-hosted scales to zero)
- Small team, low frequency: Overhead not worth it (< 10 deployments/month)
- No ops expertise: Runner maintenance requires systems knowledge
- Highly dynamic IPs: If EC2 IP changes frequently (use ALB or Elastic IP)
Conclusion
Migrating from SSH-based GitHub Actions to self-hosted runners eliminated our deployment's weakest link: the network dependency and security exposure. By running the CI/CD pipeline directly on our EC2 instances and implementing IP-based access control for our internal admin frontend, we achieved:
Security:
- ✅ SSH restricted to office network only (203.0.113.0/24)
- ✅ Zero credentials stored in GitHub Secrets
- ✅ Reduced attack surface by 75% (no SSH exposure)
- ✅ Internal admin UI protected by IP whitelist (AWS + Nginx)
- ✅ Defense in depth: multiple layers of access control
- ✅ Passed security audit with zero exceptions
Performance:
- ✅ 15% faster deployments (3m 47s → 3m 12s)
- ✅ 65% fewer failed deployments (3.2% → 1.1%)
- ✅ 100% elimination of network-induced failures
- ✅ 28% reduction in GitHub Actions minutes consumed
- ✅ Parallel deployment capability for multiple services
Operational:
- ✅ Simplified troubleshooting (unified logs)
- ✅ Easier debugging (local execution)
- ✅ No SSH key rotation headaches
- ✅ Faster development iteration (30% faster feedback)
- ✅ Support for both public and internal services
Architecture:
- ✅ Backend: Public API with domain and SSL
- ✅ Frontend: IP-only access for internal tools
- ✅ Clean separation of concerns
- ✅ Flexible deployment strategy
Cost:
- Initial investment: ~8 hours
- Ongoing maintenance: ~30-40 minutes/month
- ROI: Pays for itself in 1 month
The migration was straightforward, required minimal changes to our existing Blue-Green deployment architecture, and provided immediate benefits. Our 99.98% backend uptime and 99.99% frontend uptime prove that zero-downtime deployments don't require complex orchestration tools—just thoughtful architecture and the right tool for the job.
Key takeaway: Don't let network dependencies be your deployment's Achilles' heel, and don't expose internal tools to the public internet. If you control the infrastructure, running CI/CD locally with proper network restrictions is simpler, faster, and vastly more secure than remote execution over SSH.