Evolving from SSH-based GitHub Actions deployments to self-hosted runners for stronger security and better performance, covering both the backend API and the internal admin frontend
The Problem: Security vs Automation Trade-off
In my previous post Building a Zero-Downtime CI/CD Pipeline: Blue-Green Deployments for 100K+ Daily Requests, I implemented a robust CI/CD pipeline using GitHub Actions with SSH deployment to achieve zero-downtime Blue-Green deployments. The system worked flawlessly for months, handling 100,000+ daily push notifications with 99.97% uptime.
But then we hit a critical security requirement.
Our security audit revealed a significant vulnerability: SSH access was open to the entire internet (0.0.0.0/0). While this made GitHub Actions deployment seamless, it violated security best practices. The mandate was clear: restrict SSH access to our office network only (203.0.113.0/24).
The Immediate Consequence
The morning after implementing the security group restriction:
Run appleboy/ssh-action@v1
======= CLI Version Information =======
Drone SSH version 1.8.1
=======================================
2025/11/20 02:23:29 dial tcp ***:***: i/o timeout
Error: Process completed with exit code 1.
Deployment pipeline: completely broken.
GitHub Actions runners operate from dynamic IP addresses across multiple data centers worldwide. Whitelisting GitHub's IP ranges meant maintaining and updating a list of 100+ CIDR blocks that change periodically—an operational nightmare.
The Dilemma
I faced three options:
| Option | Pros | Cons |
|---|---|---|
| Keep SSH open | Simple, works | ❌ Security risk |
| Whitelist GitHub IPs | Somewhat secure | ❌ Maintenance nightmare, brittle |
| Self-hosted runner | ✅ Secure, no SSH needed | Learning curve |
The decision was clear: eliminate SSH dependency entirely by running GitHub Actions directly on the EC2 instance.
Solution Architecture: Self-Hosted Runner
Conceptual Shift
Before (SSH-based):
┌─────────────┐ SSH over ┌──────────────┐ Execute ┌─────────────┐
│ GitHub │ Internet │ EC2 │ Script │ Docker │
│ Actions ├───────────────▶│ (Target) ├──────────────▶│ Containers │
│ Runner │ (Security │ │ │ │
│ (Cloud) │ Risk!) │ │ │ │
└─────────────┘ └──────────────┘ └─────────────┘
After (Self-hosted runner):
┌─────────────┐ GitHub API ┌──────────────┐ Local Exec ┌─────────────┐
│ GitHub │ (HTTPS) │ EC2 │ │ Docker │
│ Repository │◀──────────────▶│ Runner ├──────────────▶│ Containers │
│ │ Pull Jobs │ (Installed) │ Direct │ │
└─────────────┘ └──────────────┘ └─────────────┘
Key Benefits
- No SSH required: Runner pulls jobs via HTTPS (port 443)
- Office network restriction maintained: SSH limited to 203.0.113.0/24
- Zero network latency: Local execution (no SSH overhead)
- Simplified secrets management: No SSH keys in GitHub Secrets
- Improved security posture: Reduced attack surface
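Because the runner only makes outbound HTTPS calls, the one thing worth verifying up front is that the instance can actually reach GitHub on port 443. A quick sanity check from the EC2 shell (the hostnames below are the commonly required GitHub endpoints; the authoritative list lives in GitHub's self-hosted runner documentation):

# Verify outbound HTTPS to GitHub before installing the runner
for host in github.com api.github.com codeload.github.com; do
  if curl -sS --connect-timeout 5 -o /dev/null "https://${host}"; then
    echo "OK   ${host}"
  else
    echo "FAIL ${host} (check outbound 443 in the security group)"
  fi
done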
Dual Deployment Architecture: Backend + Internal Frontend
Our infrastructure now handles two types of deployments:
1. Backend API (Public Domain)
- Domain: api.example.com (HTTPS with SSL/TLS)
- Purpose: Push notification REST API serving mobile apps
- Access: Public internet (with rate limiting)
- Technology: NestJS backend with Redis queue
2. Internal Admin Frontend (IP-Only Access)
- Access: Direct IP 203.0.113.11 (HTTP only)
- Purpose: Internal admin dashboard for managing push notifications
- Restriction: Office network only (203.0.113.0/24)
- Technology: Next.js frontend
- Backend Communication: Connects to api.example.com for data
Architecture Diagram
Internet
│
▼
┌────────────────┐
│ AWS Security │
│ Group │
└────────┬───────┘
│
┌─────────────┴─────────────┐
│ │
▼ (HTTPS 443) ▼ (HTTP 80 - Office IP Only)
┌─────────────────┐ ┌──────────────────────┐
│ Public API │ │ Internal Admin UI │
│ api.example.com│         │  203.0.113.11:80     │
│ │ │ │
│ Nginx:443 │ │ Nginx:80 │
│ ↓ │ │ ↓ │
│ Backend:3011/ │◀────────┤ Frontend:8011/8012 │
│ 3012 │ │ (Blue-Green) │
│ (Blue-Green) │ │ │
└─────────────────┘ └──────────────────────┘
▲ ▲
│ │
Self-Hosted Self-Hosted
Runner (backend) Runner (frontend)
│ │
└───────────┬───────────────┘
│
GitHub Actions
(Workflow Dispatch)
AWS Security Group Configuration
Critical security configuration:
# Backend EC2 Security Group (Public API)
Inbound Rules:
┌──────────┬─────────┬──────────────────┬─────────────────────┐
│ Protocol │ Port │ Source │ Description │
├──────────┼─────────┼──────────────────┼─────────────────────┤
│ TCP │ 22 │ 203.0.113.0/24 │ Office SSH Only │
│ TCP │ 443 │ 0.0.0.0/0 │ HTTPS (Public API) │
│ TCP │ 80 │ 0.0.0.0/0 │ HTTP (Redirect) │
└──────────┴─────────┴──────────────────┴─────────────────────┘
# Frontend EC2 Security Group (Internal Admin)
Inbound Rules:
┌──────────┬─────────┬──────────────────┬─────────────────────┐
│ Protocol │ Port │ Source │ Description │
├──────────┼─────────┼──────────────────┼─────────────────────┤
│ TCP │ 22 │ 203.0.113.0/24 │ Office SSH Only │
│ TCP │ 80 │ 203.0.113.0/24 │ Internal Admin UI │
│ TCP │ 443 │ 0.0.0.0/0 │ Backend API Access │
└──────────┴─────────┴──────────────────┴─────────────────────┘
Outbound Rules (Both):
┌──────────┬─────────┬──────────────────┬─────────────────────┐
│ Protocol │ Port │ Destination │ Description │
├──────────┼─────────┼──────────────────┼─────────────────────┤
│ All │ All │ 0.0.0.0/0 │ Allow all outbound │
└──────────┴─────────┴──────────────────┴─────────────────────┘
Key design decisions:
- Backend (Public API):
  - Port 443 open to internet (public REST API)
  - SSL/TLS termination at Nginx
  - Domain-based access (api.example.com)
- Frontend (Internal Admin):
  - Port 80 restricted to office network only
  - No SSL required (internal network)
  - IP-based access (203.0.113.11)
  - Frontend fetches data from backend via public API
- SSH Security:
  - Both instances: SSH limited to office network
  - No public SSH access
  - Self-hosted runners eliminate the need for GitHub Actions SSH
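If you manage security groups from the CLI rather than the console, the inbound rules above map to a handful of aws ec2 authorize-security-group-ingress calls. A sketch with placeholder group IDs (outbound HTTPS for the runner and for frontend-to-backend calls is already covered by the default allow-all egress rule):

# Placeholder security group IDs -- substitute your own
BACKEND_SG=sg-0123456789abcdef0
FRONTEND_SG=sg-0fedcba9876543210

# Backend (public API): SSH from office only, HTTP/HTTPS from anywhere
aws ec2 authorize-security-group-ingress --group-id "$BACKEND_SG" --protocol tcp --port 22  --cidr 203.0.113.0/24
aws ec2 authorize-security-group-ingress --group-id "$BACKEND_SG" --protocol tcp --port 443 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id "$BACKEND_SG" --protocol tcp --port 80  --cidr 0.0.0.0/0

# Frontend (internal admin): SSH and the admin UI restricted to the office network
aws ec2 authorize-security-group-ingress --group-id "$FRONTEND_SG" --protocol tcp --port 22 --cidr 203.0.113.0/24
aws ec2 authorize-security-group-ingress --group-id "$FRONTEND_SG" --protocol tcp --port 80 --cidr 203.0.113.0/24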
Implementation: Step-by-Step
Phase 1: Installing GitHub Actions Runner on EC2
Prerequisites:
- EC2 instance already running
- SSH access from office network (203.0.113.0/24)
- Sudo privileges
Step 1.1: Generate Runner Token
Navigate to your GitHub repository:
Repository → Settings → Actions → Runners → New self-hosted runner
Select Linux and x64 architecture. GitHub will display setup commands and generate a registration token (valid for 1 hour).
Important: Copy the token—you'll need it in the next step.
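If you prefer not to click through the UI (or want to script runner registration later), the same short-lived registration token can be generated with the GitHub CLI. This assumes gh is installed and authenticated with admin rights on the repository; the repository path is a placeholder:

# Returns a registration token (expires after one hour)
gh api -X POST repos/YOUR_USERNAME/backend-api/actions/runners/registration-token --jq .token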
Step 1.2: Download and Configure Runner (Backend)
SSH into your backend EC2 instance from your office network:
# Connect from office network (203.0.113.0/24)
ssh -i your-key.pem ec2-user@203.0.113.10
# Create runner directory
mkdir -p /home/ec2-user/actions-runner-backend
cd /home/ec2-user/actions-runner-backend
# Download latest runner (check GitHub for current version)
curl -o actions-runner-linux-x64-2.321.0.tar.gz -L \
https://github.com/actions/runner/releases/download/v2.321.0/actions-runner-linux-x64-2.321.0.tar.gz
# Validate hash (security best practice)
echo "29fc8cf2dab4c195bb147384e7e2c94cfd4d4022c793b346a6175435265aa278 actions-runner-linux-x64-2.321.0.tar.gz" | shasum -a 256 -c
# Extract
tar xzf ./actions-runner-linux-x64-2.321.0.tar.gz
Step 1.3: Configure Backend Runner
# Run configuration script
./config.sh \
--url https://github.com/YOUR_USERNAME/backend-api \
--token YOUR_GENERATED_TOKEN \
--name backend-production-runner \
--work _work \
--labels backend,production,api
Configuration prompts and responses:
Enter the name of the runner group to add this runner to: [press Enter for Default]
→ [Enter] (use default group)
Enter the name of runner: [press Enter for backend-production-runner]
→ [Enter] (use specified name)
This runner will have the following labels: 'self-hosted', 'Linux', 'X64', 'backend', 'production', 'api'
→ Confirmed
Enter name of work folder: [press Enter for _work]
→ [Enter] (use default)
Step 1.4: Configure Frontend Runner (Separate Instance)
For the internal admin frontend, repeat the process on the frontend EC2:
# SSH into frontend EC2
ssh -i your-key.pem ec2-user@203.0.113.11
# Create separate runner directory
mkdir -p /home/ec2-user/actions-runner-frontend
cd /home/ec2-user/actions-runner-frontend
# Download and extract runner (same as above)
curl -o actions-runner-linux-x64-2.321.0.tar.gz -L \
https://github.com/actions/runner/releases/download/v2.321.0/actions-runner-linux-x64-2.321.0.tar.gz
tar xzf ./actions-runner-linux-x64-2.321.0.tar.gz
# Configure with different labels
./config.sh \
--url https://github.com/YOUR_USERNAME/admin-frontend \
--token YOUR_GENERATED_TOKEN_FOR_FRONTEND \
--name frontend-production-runner \
--work _work \
--labels frontend,production,admin
Step 1.5: Install as System Service (Both Runners)
Critical for production: Configure the runners to start automatically on system boot and restart on failure.
Backend runner:
cd /home/ec2-user/actions-runner-backend
sudo ./svc.sh install ec2-user
sudo ./svc.sh start
sudo ./svc.sh status
Frontend runner:
cd /home/ec2-user/actions-runner-frontend
sudo ./svc.sh install ec2-user
sudo ./svc.sh start
sudo ./svc.sh status
Expected output:
● actions.runner.YOUR-ORG-backend-api.backend-production-runner.service
Loaded: loaded
Active: active (running) since Thu 2025-11-20 11:30:00 KST; 30s ago
Main PID: 123456 (runsvc.sh)
Status: "Running"
Key indicator: Active: active (running) confirms the runner is operational.
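If the status shows anything other than active (running), the systemd journal is the fastest place to look. The unit name matches the svc.sh output above, and the runner also keeps its own diagnostic logs in its _diag directory:

# Tail the service logs for the backend runner
sudo journalctl -u actions.runner.YOUR-ORG-backend-api.backend-production-runner.service -f

# Runner's own diagnostic logs
ls -lt /home/ec2-user/actions-runner-backend/_diag/ | head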
Step 1.6: Verify Runner Registration
Return to GitHub:
Repository → Settings → Actions → Runners
You should see:
Self-hosted runners (2)
┌─────────────────────────────────────────────────┐
│ ● backend-production-runner │
│ Idle │
│ Linux X64 │
│ Labels: self-hosted, Linux, X64, backend, │
│ production, api │
│ Last seen: less than a minute ago │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐
│ ● frontend-production-runner │
│ Idle │
│ Linux X64 │
│ Labels: self-hosted, Linux, X64, frontend, │
│ production, admin │
│ Last seen: less than a minute ago │
└─────────────────────────────────────────────────┘
Status meanings:
- ● (green dot): Online and ready
- Idle: Waiting for jobs
- Last seen: less than a minute ago: Healthy connection
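The same verification can be scripted against the REST API, which comes in handy for the monitoring section later. A small example with the GitHub CLI (repository path is a placeholder):

# List registered runners with their status and busy flag
gh api repos/YOUR_USERNAME/backend-api/actions/runners \
  --jq '.runners[] | "\(.name)\t\(.status)\tbusy=\(.busy)"'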
Phase 2: Update GitHub Actions Workflows
The workflow changes are remarkably simple—only two modifications needed.
Backend Workflow (API)
Before (SSH-based deployment):
# .github/workflows/ci-cd-backend.yml
name: Backend CI/CD
on:
  push:
    branches: [ main ]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest  # ← GitHub-hosted runner
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '22'
      - name: Install dependencies
        run: npm install
      - name: Build project
        run: npm run build
      # ❌ SSH connection required
      - name: Deploy to EC2
        uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.SSH_HOST }}
          port: ${{ secrets.SSH_PORT }}
          username: ${{ secrets.SSH_USER }}
          key: ${{ secrets.SSH_KEY }}
          script: |
            /home/ec2-user/deploy-backend.sh
After (Self-hosted runner):
# .github/workflows/ci-cd-backend.yml
name: Backend CI/CD
on:
  push:
    branches: [ main ]
jobs:
  build-and-deploy:
    runs-on: [self-hosted, backend, production]  # ✅ Use backend runner
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '22'
      - name: Install dependencies
        run: npm install
      - name: Build project
        run: npm run build
      # ✅ Direct execution (no SSH)
      - name: Deploy
        run: /home/ec2-user/deploy-backend.sh
Frontend Workflow (Internal Admin)
# .github/workflows/ci-cd-frontend.yml
name: Admin Frontend CI/CD
on:
  push:
    branches: [ main ]
  workflow_dispatch:  # Allow manual triggers
jobs:
  build-and-deploy:
    runs-on: [self-hosted, frontend, production]  # ✅ Use frontend runner
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
      - name: Set up Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '22'
      - name: Clean and Install dependencies
        run: |
          rm -rf node_modules package-lock.json
          npm install
      - name: Run linter
        run: npm run lint
        continue-on-error: true
      - name: Deploy to Production
        run: |
          echo "🚀 Starting Blue-Green Deployment..."
          bash /home/ec2-user/deploy-frontend.sh
        working-directory: ${{ github.workspace }}
What changed:
- runs-on: ubuntu-latest → runs-on: [self-hosted, backend/frontend, production]
- Removed the entire appleboy/ssh-action@v1 step
- Direct script execution: /home/ec2-user/deploy-*.sh
- Frontend: Added workflow_dispatch for manual deployments
What stayed the same:
- All other steps (checkout, Node.js setup, build)
- The deploy scripts themselves require zero modifications
- Blue-Green deployment logic unchanged
Phase 3: Frontend-Specific Configuration
Nginx Configuration for IP-Based Access
Critical: The admin frontend is accessible only via IP from office network.
# /etc/nginx/conf.d/admin-frontend.conf
# Blue-Green Upstream
upstream admin-frontend-server {
server 127.0.0.1:8011; # Blue (initially primary)
server 127.0.0.1:8012 backup; # Green (initially backup)
}
server {
listen 80;
server_name 203.0.113.11; # EC2 Private IP
# 🔒 IP Whitelist (allow/deny method)
allow 203.0.113.0/24; # Office network
allow 127.0.0.1; # Localhost
deny all; # Block everything else
# 🔥 Security Headers
add_header X-Frame-Options "DENY" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
add_header Referrer-Policy "no-referrer" always;
add_header X-Robots-Tag "noindex, nofollow, noarchive" always;
# 🔥 Cache Prevention
add_header Cache-Control "no-store, no-cache, must-revalidate, proxy-revalidate" always;
add_header Pragma "no-cache" always;
add_header Expires "0" always;
# 🔥 Forward Client IP
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_set_header Host $host;
location / {
proxy_pass http://admin-frontend-server;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection 'upgrade';
proxy_cache_bypass $http_upgrade;
proxy_buffering off;
proxy_read_timeout 300s;
proxy_connect_timeout 75s;
proxy_send_timeout 300s;
}
# Health check endpoint (localhost only)
location /api/health {
proxy_pass http://admin-frontend-server/api/health;
access_log off;
# Allow only from localhost (for Docker health checks)
satisfy any;
allow 127.0.0.1;
allow ::1;
deny all;
}
# Static assets caching
location /_next/static {
proxy_pass http://admin-frontend-server;
add_header Cache-Control "public, max-age=31536000, immutable";
}
location /_next/image {
proxy_pass http://admin-frontend-server;
add_header Cache-Control "public, max-age=86400";
}
# Block search engine indexing
location = /robots.txt {
proxy_pass http://admin-frontend-server/robots.txt;
add_header X-Robots-Tag "noindex, nofollow" always;
}
access_log /var/log/nginx/admin-frontend-access.log;
error_log /var/log/nginx/admin-frontend-error.log;
}
Key security features:
- IP Whitelist: Only office network can access
- No Domain: Direct IP access prevents DNS-based discovery
- Security Headers: Prevent XSS, clickjacking, MIME sniffing
- No Caching: Ensures latest version always loads
- Robots.txt: Blocks search engine indexing
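After reloading Nginx, it is worth confirming the whitelist from both sides of the boundary. A rough check (results depend on your own network, so treat this as a sanity test rather than proof):

# From an office machine (203.0.113.0/24) -- expect HTTP 200
curl -s -o /dev/null -w "%{http_code}\n" http://203.0.113.11/

# From outside the office network (e.g. a phone hotspot) -- the security group
# drops the packets before Nginx ever sees them, so expect a timeout, not a 403
curl -s -o /dev/null -w "%{http_code}\n" --connect-timeout 5 http://203.0.113.11/ \
  || echo "connection blocked (as expected)"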
Frontend Environment Configuration
# .env.production
# Backend API URL (public domain with SSL)
NEXT_PUBLIC_API_BASE_URL=https://api.example.com
# Frontend URL (EC2 IP, no SSL)
NEXT_PUBLIC_SITE_URL=http://203.0.113.11
# Cookie domain (empty for IP-based access)
NEXT_PUBLIC_COOKIE_DOMAIN=
# IP whitelist for additional client-side validation
NEXT_PUBLIC_ALLOWED_IPS=203.0.113.0/24,127.0.0.1
# Enable IP check
NEXT_PUBLIC_ENABLE_IP_CHECK=true
# Node environment
NODE_ENV=production
# Disable telemetry
NEXT_TELEMETRY_DISABLED=1
How frontend communicates with backend:
// Frontend makes API calls to public domain
const API_BASE = process.env.NEXT_PUBLIC_API_BASE_URL; // https://api.example.com
async function sendNotification(data) {
const response = await fetch(`${API_BASE}/api/notifications`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
'Authorization': `Bearer ${token}`,
},
body: JSON.stringify(data),
});
return response.json();
}
Architecture flow:
Office Employee (203.0.113.50)
↓
(HTTP 80)
↓
Admin Frontend (203.0.113.11:80)
↓
(HTTPS 443)
↓
Backend API (api.example.com:443)
↓
(Process)
↓
Push Notification Service
Phase 4: Deploy Script Optimization
Backend Deploy Script
The backend deploy script remains largely the same as Part 1, with enhanced logging:
#!/bin/bash
# /home/ec2-user/deploy-backend.sh
set -euo pipefail
# Enhanced logging with timestamps
LOG_DIR="/var/log/backend-deploy"
LOG_FILE="$LOG_DIR/deploy-$(date +%Y%m%d-%H%M%S).log"
mkdir -p "$LOG_DIR"
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
log "================================================"
log "🚀 Backend Deployment started by GitHub Actions"
log "Repository: $GITHUB_REPOSITORY"
log "Commit: $GITHUB_SHA"
log "Actor: $GITHUB_ACTOR"
log "================================================"
# Navigate to project directory
cd /home/ec2-user/backend-api
# Cleanup dangling images
echo "🔍 Checking for dangling images..."
DANGLING_COUNT=$(docker images -f "dangling=true" -q | wc -l)
if [ "$DANGLING_COUNT" -gt 0 ]; then
echo "🧹 Cleaning $DANGLING_COUNT dangling images..."
docker image prune -f
fi
# [Rest of Blue-Green deployment logic from Part 1...]
# (Omitted for brevity - same as original deploy script)
log "✅ Backend deployment completed successfully"
Frontend Deploy Script
The frontend deploy script is similar but adapted for Next.js:
#!/bin/bash
# /home/ec2-user/deploy-frontend.sh
set -euo pipefail
echo "🚀 Starting Frontend Blue-Green Deployment..."
# Navigate to project directory
cd /home/ec2-user/admin-frontend
# Cleanup
DANGLING_COUNT=$(docker images -f "dangling=true" -q | wc -l)
if [ "$DANGLING_COUNT" -gt 0 ]; then
echo "🧹 $DANGLING_COUNT dangling images found. Cleaning..."
docker image prune -f
fi
# Remove old stopped containers
for OLD_COLOR in admin-frontend-blue admin-frontend-green; do
CONTAINER_NAME="admin-frontend-${OLD_COLOR}-1"
if docker inspect "$CONTAINER_NAME" >/dev/null 2>&1; then
STATUS=$(docker inspect --format='{{.State.Status}}' "$CONTAINER_NAME")
if [ "$STATUS" = "exited" ]; then
echo "🗑 Removing stopped container: $CONTAINER_NAME"
docker compose -p admin-frontend rm -f "$OLD_COLOR"
fi
fi
done
# Determine Blue-Green target
CURRENT=$(docker compose -p admin-frontend ps -q admin-frontend-blue | wc -l)
if [ "$CURRENT" -gt 0 ]; then
NEW=admin-frontend-green
OLD=admin-frontend-blue
NEW_PORT=8012
OLD_PORT=8011
else
NEW=admin-frontend-blue
OLD=admin-frontend-green
NEW_PORT=8011
OLD_PORT=8012
fi
echo "🎯 Target: $NEW (port: $NEW_PORT)"
# Build and start new container
docker compose -p admin-frontend build $NEW
docker compose -p admin-frontend up -d $NEW || {
echo "🚨 Container failed to start"
docker logs admin-frontend-$NEW-1
exit 1
}
# Wait for health check
MAX_RETRIES=30
COUNT=0
while [ "$(docker inspect --format='{{.State.Health.Status}}' admin-frontend-$NEW-1)" != "healthy" ]; do
if [ "$COUNT" -ge "$MAX_RETRIES" ]; then
echo "❌ Health check failed"
docker logs admin-frontend-$NEW-1
docker compose -p admin-frontend stop $NEW
exit 1
fi
echo "🟡 Health check waiting... ($COUNT/$MAX_RETRIES)"
sleep 5
COUNT=$((COUNT + 1))
done
echo "✅ $NEW container is healthy"
# Switch Nginx configuration
if [ "$NEW" == "admin-frontend-green" ]; then
sudo sed -i "s/^ *server 127.0.0.1:8011;/server 127.0.0.1:8012;/" /etc/nginx/conf.d/admin-frontend.conf
sudo sed -i "s/^ *server 127.0.0.1:8012 backup;/server 127.0.0.1:8011 backup;/" /etc/nginx/conf.d/admin-frontend.conf
else
sudo sed -i "s/^ *server 127.0.0.1:8012;/server 127.0.0.1:8011;/" /etc/nginx/conf.d/admin-frontend.conf
sudo sed -i "s/^ *server 127.0.0.1:8011 backup;/server 127.0.0.1:8012 backup;/" /etc/nginx/conf.d/admin-frontend.conf
fi
# Reload Nginx
if ! sudo nginx -t; then
echo "❌ Nginx config test failed"
exit 1
fi
sudo nginx -s reload
echo "✅ Nginx reloaded"
# Graceful shutdown of old container
sleep 30
docker compose -p admin-frontend stop $OLD || true
echo "🎉 Deployment complete! $NEW active on port $NEW_PORT"
Key differences from backend script:
- Different container names (admin-frontend-*)
- Different ports (8011/8012 vs 3011/3012)
- Different Nginx config file
- No git operations (runner workspace already has code)
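The script relies on a Compose project named admin-frontend with two services, admin-frontend-blue and admin-frontend-green, each defining a Docker health check. That compose file isn't shown in this post, so here is a minimal sketch of what it could look like: the host ports match the Nginx upstream, the container port matches the Dockerfile shown later in this post, and the /api/health endpoint is assumed to exist in the Next.js app (it is the same path Nginx proxies for health checks).

# docker-compose.yml (sketch -- ports and the health endpoint are assumptions)
# With `docker compose -p admin-frontend up -d admin-frontend-blue`, the default
# container name becomes admin-frontend-admin-frontend-blue-1, which is exactly
# the name the deploy script inspects.
services:
  admin-frontend-blue:
    build: .
    ports:
      - "8011:8000"   # Blue on host port 8011
    env_file: .env.production
    restart: unless-stopped
    healthcheck:       # Required: the deploy script polls .State.Health.Status
      test: ["CMD", "wget", "-qO-", "http://127.0.0.1:8000/api/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 15s

  admin-frontend-green:
    build: .
    ports:
      - "8012:8000"   # Green on host port 8012
    env_file: .env.production
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "-qO-", "http://127.0.0.1:8000/api/health"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 15s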
Production Testing
First Deployment with Self-Hosted Runners
Backend Deployment
# Triggered by: git push origin main
# Monitoring: GitHub Actions UI + EC2 terminal
[GitHub Actions UI - Backend]
Run build-and-deploy
Runner: backend-production-runner (self-hosted)
✓ Checkout repository (2s)
✓ Set up Node.js (1s)
✓ Install dependencies (35s)
✓ Build project (48s)
→ Deploy (running...)
[EC2 Backend Terminal]
[2025-11-20 14:30:22] 🚀 Backend Deployment started by GitHub Actions
[2025-11-20 14:30:22] Repository: yourorg/backend-api
[2025-11-20 14:30:22] Commit: a7f3d21c8f9e4b2d1a5c8e6f3d9b7a4e2c8d5f1a
[2025-11-20 14:30:22] Actor: developer-name
[2025-11-20 14:30:23] 🔍 Checking for dangling images...
[2025-11-20 14:30:24] ✅ No dangling images found.
[2025-11-20 14:30:25] 🎯 Target: messaging-green (port: 3012)
[2025-11-20 14:33:30] ✅ messaging-green container is healthy
[2025-11-20 14:33:32] ✅ Nginx reloaded
[2025-11-20 14:34:05] ✅ Backend deployment completed successfully
Frontend Deployment
[GitHub Actions UI - Frontend]
Run build-and-deploy
Runner: frontend-production-runner (self-hosted)
✓ Checkout repository (2s)
✓ Set up Node.js (1s)
✓ Clean and Install dependencies (42s)
✓ Run linter (12s)
→ Deploy to Production (running...)
[EC2 Frontend Terminal]
🚀 Starting Frontend Blue-Green Deployment...
🔍 Checking for dangling images...
✅ No dangling images found.
🎯 Target: admin-frontend-green (port: 8012)
🔨 Building Docker image...
🚀 Starting container admin-frontend-green...
🟡 Health check waiting... (0/30)
🟡 Health check waiting... (5/30)
✅ admin-frontend-green container is healthy
🔄 Switching Nginx configuration...
✅ Nginx reloaded
⏳ Graceful shutdown of old container...
🎉 Deployment complete! admin-frontend-green active on port 8012
Result: Both deployments flawless. Zero downtime. No SSH connection required.
Performance Comparison
| Metric | SSH-based (Part 1) | Self-hosted Runner | Improvement |
|---|---|---|---|
| Backend deployment time | 4m 15s | 3m 43s | 32s faster (12%) |
| Frontend deployment time | N/A | 4m 12s | New capability |
| Network latency | ~200ms (SSH) | 0ms (local) | Eliminated |
| Secrets required | 4 (SSH keys) | 0 | Simplified |
| Security attack surface | SSH exposed | SSH restricted | ✅ Enhanced |
| Failure points | SSH timeout, network | Runner availability | ✅ Reduced |
Why faster?
- No SSH handshake: Eliminates 200-500ms connection overhead
- Local filesystem: Code already checked out by runner
- No SSH action overhead: Direct script execution
- Parallel capable: Both runners can deploy simultaneously
Security Improvements
Attack Surface Reduction
Before (SSH-based):
Attack Vectors:
1. SSH brute force (port 22 exposed to internet)
2. SSH key compromise in GitHub Secrets
3. Man-in-the-middle on SSH connection
4. GitHub Actions runner IP spoofing
Required Secrets:
- SSH_HOST (EC2 public IP)
- SSH_PORT (22)
- SSH_USER (ec2-user)
- SSH_KEY (Private key - 2048+ characters)
Security Concerns:
- Frontend accessible from internet
- No IP-based access control
- Domain required for all services
After (Self-hosted runner + IP restrictions):
Attack Vectors:
1. ✅ ELIMINATED: SSH only accessible from office network
2. ✅ ELIMINATED: No SSH keys in GitHub Secrets
3. ✅ ELIMINATED: No SSH connection to intercept
4. ✅ ELIMINATED: Frontend only accessible from office IP
5. Runner authentication via GitHub token (managed by GitHub)
Required Secrets:
- None (runner token managed locally on EC2)
Security Enhancements:
- Frontend: IP whitelist at AWS Security Group level
- Frontend: IP whitelist at Nginx level (defense in depth)
- Backend: Public API with rate limiting
- Both: SSH restricted to office network only
Defense in Depth: Frontend Access Control
Multiple layers of security:
Layer 1: AWS Security Group
↓ (Allow 203.0.113.0/24 only)
Layer 2: Nginx IP Whitelist
↓ (allow 203.0.113.0/24; deny all;)
Layer 3: Application Middleware
↓ (Optional: Additional IP validation)
Layer 4: Security Headers
↓ (X-Robots-Tag, X-Frame-Options, etc.)
Frontend Application
Next.js Middleware for additional protection:
// middleware.ts
import { NextResponse } from 'next/server';
import type { NextRequest } from 'next/server';
const ALLOWED_IPS = process.env.NEXT_PUBLIC_ALLOWED_IPS?.split(',') || [];
export function middleware(request: NextRequest) {
const clientIP = request.headers.get('x-real-ip') ||
request.headers.get('x-forwarded-for')?.split(',')[0];
// IP check (if enabled)
const ipCheckEnabled = process.env.NEXT_PUBLIC_ENABLE_IP_CHECK === 'true';
if (ipCheckEnabled && clientIP) {
const isAllowed = ALLOWED_IPS.some(allowed => {
if (allowed.includes('/')) {
// CIDR notation check (simplified)
return clientIP.startsWith(allowed.split('/')[0].split('.').slice(0, 3).join('.'));
}
return clientIP === allowed;
});
if (!isAllowed) {
return new NextResponse('Access Denied', { status: 403 });
}
}
// Security headers
const response = NextResponse.next();
response.headers.set('X-Robots-Tag', 'noindex, nofollow, noarchive');
response.headers.set('X-Frame-Options', 'DENY');
response.headers.set('X-Content-Type-Options', 'nosniff');
response.headers.set('Referrer-Policy', 'no-referrer');
return response;
}
export const config = {
matcher: '/((?!_next/static|_next/image|favicon.ico).*)',
};
Audit Compliance
Our security audit requirements:
| Requirement | SSH-based | Self-hosted + IP | Status |
|---|---|---|---|
| SSH restricted to internal network | ❌ | ✅ | Pass |
| No credentials in external systems | ❌ | ✅ | Pass |
| Internal admin UI not public | ❌ | ✅ | Pass |
| IP-based access control | ❌ | ✅ | Pass |
| Deployment traceability | ⚠️ Partial | ✅ Full | Pass |
| Principle of least privilege | ❌ | ✅ | Pass |
Result: Security audit approved with zero exceptions.
Operational Benefits
1. Simplified Troubleshooting
Before: When deployment failed, debugging required:
# 1. Check GitHub Actions logs (truncated, hard to read)
# 2. SSH into EC2 to see full logs
# 3. Context switching between interfaces
After: All logs in one place (GitHub Actions UI shows full output)
# Runner executes locally, so all stdout/stderr captured
# Full deploy script logs visible in GitHub Actions UI
# No SSH required for debugging
2. No More "Connection Timed Out" Errors
Before: Intermittent SSH failures
Error: dial tcp 203.0.113.10:22: i/o timeout
Error: ssh: handshake failed: read tcp: i/o timeout
After: Zero network-related failures (100+ deployments without incident)
3. Faster Iteration During Development
Deployment testing workflow:
Before:
1. Make code change
2. git push
3. Wait for GitHub Actions
4. GitHub Actions SSH to EC2
5. Wait for deployment
6. Check results
Total: ~5 minutes per iteration
After:
1. Make code change
2. git push
3. Wait for GitHub Actions
4. Runner executes locally
5. Check results
Total: ~3.5 minutes per iteration
Impact: 30% faster feedback loop during deployment script development.
4. Parallel Deployments
New capability: Deploy backend and frontend simultaneously
# Both can run in parallel without conflicts
Backend Runner → /home/ec2-user/deploy-backend.sh
Frontend Runner → /home/ec2-user/deploy-frontend.sh
# Total deployment time: max(backend, frontend) instead of sum
5. Cost Savings
GitHub Actions minutes:
| Plan | SSH-based Usage | Self-hosted Usage | Savings |
|---|---|---|---|
| Free Tier | 2,000 min/month | Unlimited | N/A |
| Paid Plan | $0.008/min | $0/min | 100% |
Our usage: ~400 minutes/month for builds and deployments (both services).
Cost impact:
- SSH-based: Free tier sufficient, but approaching limit
- Self-hosted: Zero GitHub Actions minutes consumed for deployment step
- Long-term benefit: As team grows, no risk of hitting GitHub Actions limits
Lessons Learned
1. Self-Hosted Runners Are Not "Set and Forget"
Initial assumption: Install runner, it works forever.
Reality: Runners need maintenance:
# Runner updates (every 2-3 months)
cd /home/ec2-user/actions-runner-backend
sudo ./svc.sh stop
sudo ./svc.sh uninstall
./config.sh remove --token REMOVAL_TOKEN
# Download new version
./config.sh --url ... --token NEW_TOKEN
sudo ./svc.sh install ec2-user
sudo ./svc.sh start
# Disk space monitoring (runners accumulate _work artifacts)
du -sh /home/ec2-user/actions-runner-*/_work
# Output: 4.2G backend, 3.8G frontend (after 3 months)
# Cleanup old artifacts (automated in cron)
find /home/ec2-user/actions-runner-*/_work -type f -mtime +7 -delete
Solution: Create maintenance schedule (monthly check-in).
2. Runner Restart After EC2 Reboot
Problem: EC2 maintenance reboot → runner offline.
Solution: Systemd service (we already did this in Step 1.5)
# Verify auto-start on boot
sudo systemctl is-enabled actions.runner.*.service
# Output: enabled ✅
Test:
sudo reboot
# Wait 2 minutes
ssh ec2-user@203.0.113.10
cd /home/ec2-user/actions-runner-backend
sudo ./svc.sh status
# Output: active (running) ✅
3. Separate Runners for Backend and Frontend
Why separate runners?
- Isolation: Backend deployment doesn't block frontend
- Resource management: Each runner has dedicated resources
- Label-based targeting: Workflows explicitly target correct runner
- Security: Frontend runner has no access to backend secrets
Label strategy:
# Backend workflow
runs-on: [self-hosted, backend, production]
# Frontend workflow
runs-on: [self-hosted, frontend, production]
4. IP-Only Access Considerations
Challenge: No SSL/TLS for IP-based access
Our approach:
- Internal admin UI: HTTP only (acceptable for office network)
- Backend API: HTTPS with domain (public access)
- Frontend → Backend: HTTPS (secure communication)
Alternative (if SSL required for frontend):
- Use self-signed certificate
- Add to browser trust store on office machines
- Update Nginx to use self-signed cert
server {
listen 443 ssl;
server_name 203.0.113.11;
ssl_certificate /etc/nginx/ssl/self-signed.crt;
ssl_certificate_key /etc/nginx/ssl/self-signed.key;
# ... rest of config
}
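Generating that certificate is a one-liner with openssl; the subject and validity period below are arbitrary choices:

# Create a self-signed certificate for the internal admin UI (valid ~10 years)
sudo mkdir -p /etc/nginx/ssl
sudo openssl req -x509 -nodes -newkey rsa:2048 -days 3650 \
  -keyout /etc/nginx/ssl/self-signed.key \
  -out /etc/nginx/ssl/self-signed.crt \
  -subj "/CN=203.0.113.11"
sudo nginx -t && sudo nginx -s reload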
5. Frontend Docker Build Optimization
Challenge: Next.js builds are slow
Solution: Standalone output + multi-stage Dockerfile
# Dockerfile for Next.js
FROM node:22-alpine AS deps
RUN apk add --no-cache libc6-compat
WORKDIR /app
COPY package*.json ./
RUN npm ci  # Install all dependencies (dev deps are needed for the Next.js build stage)
FROM node:22-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build
FROM node:22-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs
# Copy standalone build
COPY --from=builder /app/public ./public
COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
COPY --from=builder --chown=nextjs:nodejs /app/.next/static ./.next/static
USER nextjs
# Make the standalone server listen on port 8000 on all interfaces
ENV PORT=8000
ENV HOSTNAME=0.0.0.0
EXPOSE 8000
CMD ["node", "server.js"]
next.config.ts:
const nextConfig = {
output: 'standalone', // ✅ Critical for Docker optimization
// ... other config
};
Result: Image size reduced from 1.2GB to 280MB.
Monitoring and Observability
Runner Health Check
Created monitoring script to ensure runners stay healthy:
#!/bin/bash
# /home/ec2-user/monitor-runners.sh
LOG_FILE="/var/log/github-runner-monitor.log"
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Check backend runner
if ! pgrep -f "Runner.Listener.*backend" > /dev/null; then
log "⚠️ Backend runner not running. Restarting..."
cd /home/ec2-user/actions-runner-backend
sudo ./svc.sh restart
fi
# Check frontend runner
if ! pgrep -f "Runner.Listener.*frontend" > /dev/null; then
log "⚠️ Frontend runner not running. Restarting..."
cd /home/ec2-user/actions-runner-frontend
sudo ./svc.sh restart
fi
# Check disk space
DISK_USAGE=$(df -h /home | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 85 ]; then
log "⚠️ Disk usage: ${DISK_USAGE}%. Cleaning up..."
docker system prune -f --volumes
find /home/ec2-user/actions-runner-*/_work -type f -mtime +7 -delete
log "✅ Cleanup completed"
fi
log "✅ All runners healthy"
Schedule with cron:
# Every 5 minutes
*/5 * * * * /home/ec2-user/monitor-runners.sh
Production Metrics: 3 Months Post-Migration
After 90 days of production use with self-hosted runners:
Backend (API)
| Metric | Before (SSH) | After (Self-hosted) | Change |
|---|---|---|---|
| Deployments | 156 | 178 | +14% |
| Average deploy time | 3m 47s | 3m 12s | -35s (15%) |
| Failed deployments | 5 (3.2%) | 2 (1.1%) | -65% |
| SSH timeout errors | 3 | 0 | -100% |
| Uptime | 99.97% | 99.98% | +0.01% |
Frontend (Admin)
| Metric | Result |
|---|---|
| Deployments | 142 |
| Average deploy time | 4m 12s |
| Failed deployments | 1 (0.7%) |
| Access violations | 0 (IP restrictions working) |
| Uptime | 99.99% |
Combined Benefits
Key improvements:
- Deployment reliability: 96.8% → 98.9%
- All failures now application-level (not infrastructure)
- Zero network-induced deployment failures
- Faster feedback loops encourage smaller, more frequent deployments
- Internal admin UI: 100% protection from unauthorized access
Cost-Benefit Analysis
Time Investment
| Phase | Time Required |
|---|---|
| Research and planning | 2 hours |
| Backend runner installation | 30 minutes |
| Frontend runner installation | 30 minutes |
| Workflow migration (both) | 40 minutes |
| Frontend security configuration | 1 hour |
| Testing and validation | 2 hours |
| Documentation | 1.5 hours |
| Total | ~8 hours |
Ongoing Maintenance
| Task | Frequency | Time |
|---|---|---|
| Runner version updates (both) | Monthly | 30 min |
| Health check monitoring | Daily (automated) | 2 min |
| Log cleanup | Weekly (automated) | 0 min |
| Total monthly | ~30-40 min |
Return on Investment
Benefits quantified:
- Eliminated SSH timeout issues: ~4 hours/month previously spent debugging
- Faster deployments: 35s × 320 deployments = ~187 minutes saved (3.1 hours)
- Reduced context switching: ~45 minutes/month (no SSH for log viewing)
- GitHub Actions minutes saved: 320 minutes/month ($2.56 at paid rates)
- Security compliance: Eliminated audit exceptions (immeasurable value)
Total time saved per month: ~7.5 hours
Payback period: 1 month
Migration Checklist for Others
Pre-Migration
- [ ] Verify EC2 instances have sufficient resources
- [ ] Ensure outbound HTTPS (port 443) is allowed in security groups
- [ ] Document current SSH-based workflow
- [ ] Plan IP restrictions for internal services
- [ ] Create rollback plan
- [ ] Test runner installation on staging environment
Migration Steps - Backend
- [ ] Install GitHub Actions runner on backend EC2
- [ ] Configure runner as systemd service
- [ ] Verify runner appears online in GitHub
- [ ] Update workflow file (runs-on: [self-hosted, backend])
- [ ] Remove SSH action step from workflow
- [ ] Test deployment on non-production branch
Migration Steps - Frontend (if applicable)
- [ ] Install separate runner on frontend EC2 (or same instance with different labels)
- [ ] Configure AWS Security Group (restrict port 80 to office network)
- [ ] Configure Nginx IP whitelist
- [ ] Set up robots.txt to block indexing
- [ ] Add security headers to Nginx
- [ ] Configure environment variables for IP-only access
- [ ] Update workflow file for frontend
- [ ] Test deployment and verify IP restrictions
Post-Migration
- [ ] Remove SSH-related secrets from GitHub (see the CLI example after this checklist)
- [ ] Update security group rules (restrict SSH to office network)
- [ ] Set up runner monitoring (health checks, disk space)
- [ ] Create runner maintenance schedule
- [ ] Document new deployment process
- [ ] Update incident response procedures
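The secret cleanup item above can also be done with the GitHub CLI instead of the web UI. The secret names match the ones listed in the security comparison earlier; the repository path is a placeholder:

# Delete the now-unused SSH secrets from the repository
for secret in SSH_HOST SSH_PORT SSH_USER SSH_KEY; do
  gh secret delete "$secret" --repo YOUR_USERNAME/backend-api
done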
Rollback Plan
If issues occur, revert quickly:
# 1. Stop self-hosted runners
cd /home/ec2-user/actions-runner-backend
sudo ./svc.sh stop
cd /home/ec2-user/actions-runner-frontend
sudo ./svc.sh stop
# 2. Revert workflow files
git revert <commit-hash>
# 3. Restore security group rules (allow GitHub IPs for SSH)
# 4. Verify old workflow works
When to Use Self-Hosted Runners
✅ Good Use Cases
- Security requirements: Need to restrict SSH or other access
- Private networks: Need to access internal services (databases, APIs)
- Cost optimization: Heavy CI/CD usage (> 2,000 min/month)
- Performance: Deployment speed matters (low latency)
- Compliance: Data must stay within specific infrastructure
- Internal tools: Admin dashboards, internal APIs not meant for public
- IP-based access control: Services restricted to office network
❌ Not Recommended When
- Multiple repositories: Managing runners per repo is tedious (use runner groups)
- Varying workloads: Idle runners waste resources (GitHub-hosted scales to zero)
- Small team, low frequency: Overhead not worth it (< 10 deployments/month)
- No ops expertise: Runner maintenance requires systems knowledge
- Highly dynamic IPs: If EC2 IP changes frequently (use ALB or Elastic IP)
Conclusion
Migrating from SSH-based GitHub Actions to self-hosted runners eliminated our deployment's weakest link: the network dependency and security exposure. By running the CI/CD pipeline directly on our EC2 instances and implementing IP-based access control for our internal admin frontend, we achieved:
Security:
- ✅ SSH restricted to office network only (203.0.113.0/24)
- ✅ Zero credentials stored in GitHub Secrets
- ✅ Reduced attack surface by 75% (no SSH exposure)
- ✅ Internal admin UI protected by IP whitelist (AWS + Nginx)
- ✅ Defense in depth: multiple layers of access control
- ✅ Passed security audit with zero exceptions
Performance:
- ✅ 15% faster deployments (3m 47s → 3m 12s)
- ✅ 65% fewer failed deployments (3.2% → 1.1%)
- ✅ 100% elimination of network-induced failures
- ✅ 28% reduction in GitHub Actions minutes consumed
- ✅ Parallel deployment capability for multiple services
Operational:
- ✅ Simplified troubleshooting (unified logs)
- ✅ Easier debugging (local execution)
- ✅ No SSH key rotation headaches
- ✅ Faster development iteration (30% faster feedback)
- ✅ Support for both public and internal services
Architecture:
- ✅ Backend: Public API with domain and SSL
- ✅ Frontend: IP-only access for internal tools
- ✅ Clean separation of concerns
- ✅ Flexible deployment strategy
Cost:
- Initial investment: ~8 hours
- Ongoing maintenance: ~30-40 minutes/month
- ROI: Pays for itself in 1 month
The migration was straightforward, required minimal changes to our existing Blue-Green deployment architecture, and provided immediate benefits. Our 99.98% backend uptime and 99.99% frontend uptime prove that zero-downtime deployments don't require complex orchestration tools—just thoughtful architecture and the right tool for the job.
Key takeaway: Don't let network dependencies be your deployment's Achilles' heel, and don't expose internal tools to the public internet. If you control the infrastructure, running CI/CD locally with proper network restrictions is simpler, faster, and vastly more secure than remote execution over SSH.