Sherifdeen Adebayo

From Zero to DevOps Hero: A Complete Guide to Blue/Green Deployments and Microservices Automation

A comprehensive walkthrough of two production-ready DevOps projects, perfect for beginners looking to level up their deployment game.


Table of Contents

  1. Introduction
  2. Project 1: Blue/Green Deployment with Nginx Auto-Failover
  3. Project 2: Containerized Microservices with Full Automation
  4. Key Takeaways
  5. Next Steps

Introduction

Ever wonder how major tech companies achieve zero-downtime deployments? How do they update their applications without users noticing? In this guide, we'll walk through two real-world DevOps projects that answer these questions.

What You'll Learn:

  • Blue/Green deployment strategies for zero-downtime releases
  • Nginx configuration for automatic failover
  • Docker containerization for multiple programming languages
  • Infrastructure as Code with Terraform
  • Configuration management with Ansible
  • CI/CD pipeline design with drift detection
  • Production-ready security practices

Prerequisites:

  • Basic understanding of Docker and Docker Compose
  • Familiarity with command-line operations
  • A code editor (VS Code, Vim, etc.)
  • Git installed on your machine
  • Access to a cloud provider (AWS, DigitalOcean, Hetzner, etc.) for Project 2

Time Investment:

  • Project 1: 2-3 hours
  • Project 2: 8-12 hours

Let's dive in!


Project 1: Blue/Green Deployment with Nginx Auto-Failover

What is Blue/Green Deployment?

Imagine you're running a restaurant. You have two identical kitchens: Kitchen Blue (currently serving customers) and Kitchen Green (on standby). When you want to update the menu or equipment, you:

  1. Update Kitchen Green while Kitchen Blue serves customers
  2. Test Kitchen Green thoroughly
  3. Switch all orders to Kitchen Green
  4. Now Kitchen Blue is on standby for the next update

That's exactly how blue/green deployment works in software! You maintain two identical production environments and can switch between them instantly.

Why Use This Approach?

Traditional deployment problems:

  • Downtime during updates (users see "Site under maintenance")
  • No easy rollback if something breaks
  • Risk of breaking production with untested changes

Blue/Green deployment benefits:

  • Zero downtime - users never notice the switch
  • Instant rollback - just switch back if issues arise
  • Safe testing - new version runs in production environment before switch

Project Architecture

                          ┌─────────────┐
                          │   Nginx     │
                          │   :8080     │
                          └──────┬──────┘
                                 │
          ┌──────────────────────┴──────────────────────┐
          │                                              │
┌─────────▼──────────┐                        ┌─────────▼──────────┐
│   Blue (Primary)   │                        │  Green (Backup)    │
│      :8081         │                        │      :8082         │
└────────────────────┘                        └────────────────────┘

How it works:

  1. Nginx sits at port 8080 (your main entry point)
  2. Blue service at port 8081 (currently handling all traffic)
  3. Green service at port 8082 (backup, ready to take over)
  4. When Blue fails, Nginx automatically routes traffic to Green
  5. No requests are dropped during the transition

Setting Up the Project

Step 1: Project Structure

Create your project directory:

mkdir blue-green-deployment
cd blue-green-deployment

Your final structure will look like this:

blue-green-deployment/
├── docker-compose.yml      # Orchestrates all services
├── nginx.conf.template     # Nginx configuration
├── entrypoint.sh          # Nginx startup script
├── env.example            # Environment variables template
├── .env                   # Your actual environment variables
├── Makefile               # Convenient commands
├── test-failover.sh       # Automated testing script
├── README.md              # Documentation
├── DECISION.md            # Technical decisions
└── PART_B_RESEARCH.md     # Infrastructure research

Step 2: Docker Compose Configuration

The docker-compose.yml file is the heart of this setup. Let's break it down:

services:
  # Nginx acts as the load balancer with automatic failover
  nginx:
    image: nginx:alpine
    container_name: nginx-lb
    ports:
      - "8080:80"           # Main entry point for users
    volumes:
      - ./nginx.conf.template:/etc/nginx/nginx.conf:ro
      - ./entrypoint.sh:/entrypoint.sh:ro
    environment:
      - ACTIVE_POOL=${ACTIVE_POOL:-blue}    # Which pool is primary?
      - BLUE_UPSTREAM=app_blue:${PORT:-3000}
      - GREEN_UPSTREAM=app_green:${PORT:-3000}
    depends_on:
      - app_blue
      - app_green
    entrypoint: ["/bin/sh", "/entrypoint.sh"]
    networks:
      - app-network
    restart: unless-stopped

  # Blue pool - primary by default
  app_blue:
    image: ${BLUE_IMAGE}
    container_name: app_blue
    ports:
      - "8081:${PORT:-3000}"    # Direct access for testing
    environment:
      - APP_POOL=blue
      - RELEASE_ID=${RELEASE_ID_BLUE:-v1.0.0-blue}
      - PORT=${PORT:-3000}
    networks:
      - app-network
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider",
             "http://localhost:${PORT:-3000}/healthz"]
      interval: 5s
      timeout: 3s
      retries: 3

  # Green pool - backup by default
  app_green:
    image: ${GREEN_IMAGE}
    container_name: app_green
    ports:
      - "8082:${PORT:-3000}"    # Direct access for testing
    environment:
      - APP_POOL=green
      - RELEASE_ID=${RELEASE_ID_GREEN:-v1.0.0-green}
      - PORT=${PORT:-3000}
    networks:
      - app-network
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider",
             "http://localhost:${PORT:-3000}/healthz"]
      interval: 5s
      timeout: 3s
      retries: 3

networks:
  app-network:
    driver: bridge

Key points to understand:

  1. Port Strategy:

    • Port 8080: Public-facing Nginx (users hit this)
    • Port 8081: Direct access to Blue (for chaos testing)
    • Port 8082: Direct access to Green (for chaos testing)
  2. Health Checks:

    • Every 5 seconds, Docker checks if the service is healthy
    • Uses the /healthz endpoint
    • After 3 consecutive failures (3s timeout each), Docker marks the container unhealthy
  3. Environment Variables:

    • ACTIVE_POOL: Which service gets traffic first (blue/green)
    • RELEASE_ID: Tracks which version is running
    • PORT: Application port (default: 3000)
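
You can watch these health checks from the host at any time:

# Current health status of both pools (container names from docker-compose.yml)
docker inspect --format '{{.State.Health.Status}}' app_blue app_green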

Step 3: Nginx Configuration Magic

The nginx.conf.template is where the failover magic happens:

events {
    worker_connections 1024;
}

http {
    # Logging to help debug issues
    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log warn;

    # Combined upstream with failover logic
    upstream app_backend {
        # Active pool (primary)
        server ${ACTIVE_UPSTREAM} max_fails=2 fail_timeout=5s;

        # Backup pool - only used if primary fails
        server ${BACKUP_UPSTREAM} max_fails=2 fail_timeout=5s backup;
    }

    server {
        listen 80;
        server_name localhost;

        # Aggressive timeouts for quick failover detection
        proxy_connect_timeout 2s;
        proxy_send_timeout 3s;
        proxy_read_timeout 3s;

        # Retry logic - crucial for zero-downtime failover
        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        proxy_next_upstream_tries 2;
        proxy_next_upstream_timeout 10s;

        # Don't buffer - we want real-time responses
        proxy_buffering off;

        # Forward original client info
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        location / {
            proxy_pass http://app_backend;
            proxy_pass_request_headers on;
        }
    }
}

Let's decode this configuration:

1. Upstream Block:

upstream app_backend {
    server ${ACTIVE_UPSTREAM} max_fails=2 fail_timeout=5s;
    server ${BACKUP_UPSTREAM} max_fails=2 fail_timeout=5s backup;
}
  • max_fails=2: After 2 failed requests, mark server as down
  • fail_timeout=5s: Server is down for 5 seconds before retry
  • backup: This server only receives traffic when primary fails

2. Timeout Settings:

proxy_connect_timeout 2s;    # Can't connect? Fail fast
proxy_read_timeout 3s;       # No response? Move to backup

These are aggressive timeouts for quick failure detection. In production, you might increase these based on your app's response times.

3. Retry Logic:

proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 2;

This tells Nginx: "If you get an error, timeout, or 5xx status code, try the next server (backup) automatically."

Step 4: The Entrypoint Script

Nginx doesn't support environment variables natively in its config. We use envsubst to template it at runtime:

#!/bin/sh
set -e

# Figure out which upstream is active and which is backup
if [ "$ACTIVE_POOL" = "blue" ]; then
    ACTIVE_UPSTREAM="$BLUE_UPSTREAM"
    BACKUP_UPSTREAM="$GREEN_UPSTREAM"
else
    ACTIVE_UPSTREAM="$GREEN_UPSTREAM"
    BACKUP_UPSTREAM="$BLUE_UPSTREAM"
fi

echo "==> Setting up Nginx with active pool: $ACTIVE_POOL"
echo "    Active upstream: $ACTIVE_UPSTREAM"
echo "    Backup upstream: $BACKUP_UPSTREAM"

# Use envsubst to replace variables in the template
export ACTIVE_UPSTREAM
export BACKUP_UPSTREAM

envsubst '${ACTIVE_UPSTREAM} ${BACKUP_UPSTREAM}' \
    < /etc/nginx/nginx.conf > /tmp/nginx.conf

# Test the config before starting (ALWAYS do this!)
nginx -t -c /tmp/nginx.conf

# Start nginx in foreground
echo "==> Starting Nginx..."
exec nginx -g 'daemon off;' -c /tmp/nginx.conf

What this script does:

  1. Reads the ACTIVE_POOL environment variable
  2. Sets up which upstream is active vs backup
  3. Replaces placeholders in nginx.conf
  4. Tests the configuration (prevents broken configs from starting)
  5. Starts Nginx with the processed config
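
You can preview the substitution locally, without Docker (envsubst ships with the gettext package):

# Render the template with sample values and inspect the output
ACTIVE_UPSTREAM=app_blue:3000 BACKUP_UPSTREAM=app_green:3000 \
  envsubst '${ACTIVE_UPSTREAM} ${BACKUP_UPSTREAM}' < nginx.conf.template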

Step 5: Environment Configuration

Create your .env file from the example:

cp env.example .env

The .env file contents:

# Docker images for blue and green pools
BLUE_IMAGE=your-registry/your-app:blue
GREEN_IMAGE=your-registry/your-app:green

# Which pool should be active? (blue or green)
ACTIVE_POOL=blue

# Release identifiers - show up in X-Release-Id header
RELEASE_ID_BLUE=v1.0.0-blue-20250129
RELEASE_ID_GREEN=v1.0.0-green-20250129

# Application port
PORT=3000

Pro tip: In a real deployment, BLUE_IMAGE and GREEN_IMAGE might point to different versions of your app:

  • Blue: myapp:v1.2.3
  • Green: myapp:v1.2.4 (new version being tested)

Step 6: Makefile for Convenience

The Makefile provides friendly commands:

.PHONY: help up down restart logs test clean

help:  ## Show available commands
    @grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | \
    awk 'BEGIN {FS = ":.*?## "}; {printf "  %-15s %s\n", $$1, $$2}'

up:  ## Start all services
    @if [ ! -f .env ]; then \
        echo "Creating .env from env.example..."; \
        cp env.example .env; \
    fi
    docker-compose up -d
    @echo "Services started!"
    @echo "Nginx:  http://localhost:8080"
    @echo "Blue:   http://localhost:8081"
    @echo "Green:  http://localhost:8082"

down:  ## Stop all services
    docker-compose down

test:  ## Run failover test
    ./test-failover.sh

blue:  ## Switch to blue as active
    @sed -i.bak 's/ACTIVE_POOL=.*/ACTIVE_POOL=blue/' .env
    @$(MAKE) restart

green:  ## Switch to green as active
    @sed -i.bak 's/ACTIVE_POOL=.*/ACTIVE_POOL=green/' .env
    @$(MAKE) restart

Usage:

make help      # See all commands
make up        # Start everything
make test      # Run failover tests
make green     # Switch to green pool

Running the Project

Start the Stack

Option 1: Using Make (recommended)

make up

Option 2: Manual

cp env.example .env
docker-compose up -d

Verify It's Working

# Check service status
docker-compose ps

# Hit the main endpoint
curl -i http://localhost:8080/version

You should see headers like:

HTTP/1.1 200 OK
X-App-Pool: blue
X-Release-Id: v1.0.0-blue-20250129

Testing the Failover

This is where it gets exciting! Let's break the blue service and watch Nginx automatically switch to green.

Automated Test Script

The test-failover.sh script automates the entire test:

#!/bin/bash
set -e

echo "🔵 Blue/Green Failover Test"
echo "============================"
echo ""

# Step 1: Baseline check
echo "📊 Step 1: Checking baseline (should be blue)..."
for i in {1..3}; do
    response=$(curl -s -i http://localhost:8080/version)
    pool=$(echo "$response" | grep -i "X-App-Pool:" | cut -d: -f2 | tr -d ' \r')
    echo "  Request $i: Pool=$pool"
done

# Step 2: Trigger chaos on blue
echo "💥 Step 2: Triggering chaos on blue..."
curl -s -X POST "http://localhost:8081/chaos/start?mode=error"
sleep 1

# Step 3: Test failover
echo "🔄 Step 3: Testing failover (should switch to green)..."
success_count=0
green_count=0
for i in {1..10}; do
    http_code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/version)
    if [ "$http_code" = "200" ]; then
        success_count=$((success_count + 1))
        response=$(curl -s -i http://localhost:8080/version)
        pool=$(echo "$response" | grep -i "X-App-Pool:" | cut -d: -f2 | tr -d ' \r')
        if [ "$pool" = "green" ]; then
            green_count=$((green_count + 1))
        fi
        echo "  Request $i: HTTP $http_code, Pool=$pool ✓"
    else
        echo "  Request $i: HTTP $http_code ✗ FAILED"
    fi
    sleep 0.5
done

# Step 4: Stop chaos
echo "🛑 Step 4: Stopping chaos..."
curl -s -X POST "http://localhost:8081/chaos/stop"

# Results
echo "📈 Results:"
echo "  ├─ Total requests: 10"
echo "  ├─ Successful (200): $success_count"
echo "  └─ Routed to green: $green_count"

if [ $success_count -eq 10 ] && [ $green_count -ge 9 ]; then
    echo "✅ Test PASSED - Failover working correctly!"
else
    echo "❌ Test FAILED - Check the logs"
fi

Run the test:

chmod +x test-failover.sh
./test-failover.sh

Expected output:

🔵 Blue/Green Failover Test
============================

📊 Step 1: Checking baseline (should be blue)...
  Request 1: Pool=blue
  Request 2: Pool=blue
  Request 3: Pool=blue

💥 Step 2: Triggering chaos on blue...
  Chaos initiated: {"status":"chaos_started","mode":"error"}

🔄 Step 3: Testing failover (should switch to green)...
  Request 1: HTTP 200, Pool=green ✓
  Request 2: HTTP 200, Pool=green ✓
  Request 3: HTTP 200, Pool=green ✓
  ...
  Request 10: HTTP 200, Pool=green ✓

📈 Results:
  ├─ Total requests: 10
  ├─ Successful (200): 10
  └─ Routed to green: 10

✅ Test PASSED - Failover working correctly!

What just happened?

  1. All requests initially went to blue
  2. We triggered chaos mode (blue starts returning 500 errors)
  3. Nginx detected blue was failing
  4. Zero requests failed - Nginx automatically retried on green
  5. All subsequent requests went to green

Manual Testing (Understanding Each Step)

Let's do it manually to understand what's happening:

1. Check baseline:

for i in {1..5}; do
  curl -s http://localhost:8080/version | grep -E "pool|release"
done

All responses show "pool": "blue".

2. Break blue service:

# Trigger chaos mode - makes blue return 500 errors
curl -X POST "http://localhost:8081/chaos/start?mode=error"

3. Watch the magic:

# Keep hitting the endpoint
for i in {1..10}; do
  curl -s -w "\nStatus: %{http_code}\n" http://localhost:8080/version | \
    grep -E "pool|release|Status"
  sleep 1
done

You'll see:

  • All requests still return 200 (no failures!)
  • Pool changes from "blue" to "green"
  • Headers now show "pool": "green"

4. Fix blue:

curl -X POST "http://localhost:8081/chaos/stop"

After a few seconds, traffic goes back to blue (it's the primary).

Key Concepts Explained

Why Aggressive Timeouts?

proxy_connect_timeout 2s;
proxy_read_timeout 3s;

Scenario: Your blue service starts hanging (taking 10+ seconds to respond).

With loose timeouts (10s):

  • User makes request → Nginx waits 10s on blue → Fails → Retries on green
  • User waited 10+ seconds (bad experience)

With tight timeouts (2-3s):

  • User makes request → Nginx waits 2s on blue → Fails fast → Retries on green
  • User gets response in ~2-3 seconds (much better!)

Trade-off: If your app legitimately takes 5 seconds to respond, 3s timeouts will cause false failures. Tune based on your app's performance.

The Backup Directive

server ${ACTIVE_UPSTREAM} max_fails=2 fail_timeout=5s;
server ${BACKUP_UPSTREAM} max_fails=2 fail_timeout=5s backup;

The backup keyword is crucial:

  • Without it: Nginx load-balances requests 50/50 between blue and green
  • With it: Nginx sends all traffic to blue; green only receives traffic when blue is down

This is what makes it true blue/green deployment (not load balancing).
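
You can verify this empirically: with both pools healthy, every response should come from blue.

# Count which pool served 20 consecutive requests - expect 20x blue
for i in {1..20}; do
  curl -s -i http://localhost:8080/version | grep -i "X-App-Pool" | tr -d '\r'
done | sort | uniq -c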

Health Checks vs Failover

Docker health checks:

healthcheck:
  test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:3000/healthz"]
  interval: 5s

Nginx failover:

max_fails=2 fail_timeout=5s

These serve different purposes:

  • Docker health checks: Tell Docker orchestration layer if container is healthy
  • Nginx failover: Actual traffic routing decisions

Both are important, but Nginx failover is what keeps users happy during failures.

Production Considerations

What Would Change in Production?

1. Timeouts would be tuned:

  • Measure your app's real response times
  • Set timeouts slightly above p95 latency
  • Balance fast failover vs false positives
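
As a rough sketch, you can sample p95 latency with curl before picking timeouts (real monitoring data is better in practice):

# Take 100 timing samples and print the 95th-smallest as a p95 estimate
for i in {1..100}; do
  curl -s -o /dev/null -w "%{time_total}\n" http://localhost:8080/version
done | sort -n | awk 'NR==95 {print "p95 ~", $1, "s"}'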

2. Monitoring & Alerting:

# Add Prometheus metrics
location /metrics {
    stub_status on;
}

3. Structured Logging:

log_format json escape=json '{'
  '"time":"$time_iso8601",'
  '"remote_addr":"$remote_addr",'
  '"request":"$request",'
  '"status":$status,'
  '"upstream":"$upstream_addr"'
'}';

access_log /var/log/nginx/access.log json;

4. SSL/TLS:

server {
    listen 443 ssl http2;
    ssl_certificate /etc/ssl/certs/cert.pem;
    ssl_certificate_key /etc/ssl/private/key.pem;
    # ... rest of config
}

5. Rate Limiting:

limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location / {
    limit_req zone=api burst=20 nodelay;
    proxy_pass http://app_backend;
}

Troubleshooting Guide

Problem: All requests failing

Check if containers are running:

docker-compose ps

Expected output:

NAME        STATE    PORTS
nginx-lb    Up       0.0.0.0:8080->80/tcp
app_blue    Up       0.0.0.0:8081->3000/tcp
app_green   Up       0.0.0.0:8082->3000/tcp

If containers are down:

docker-compose logs

Problem: Not seeing failover

Check Nginx logs:

docker logs nginx-lb

# Or live tail:
docker logs -f nginx-lb

Look for:

[error] ... upstream timed out ... while connecting to upstream
[warn] ... marking server app_blue:3000 as down

Problem: Services won't start

Check if ports are already in use:

lsof -i :8080
lsof -i :8081
lsof -i :8082

If something is using these ports:

# Kill the process
kill -9 <PID>

# Or change ports in docker-compose.yml
ports:
  - "9080:80"  # Use 9080 instead of 8080

Problem: Changes to .env not taking effect

You need to recreate containers:

docker-compose down
docker-compose up -d

Just restarting isn't enough - environment variables are set at container creation time.
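
A quick way to confirm the recreated container picked up the new value:

# Inspect the environment inside the running container
docker exec nginx-lb env | grep ACTIVE_POOL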

Real-World Use Cases

Scenario 1: Deploying a New Version

Without blue/green:

# Take site down
docker-compose down

# Update to new version
docker-compose up -d

# Hope nothing broke (users see downtime)

With blue/green:

# Update green to new version while blue serves traffic
sed -i 's/GREEN_IMAGE=.*/GREEN_IMAGE=myapp:v2.0.0/' .env

# Start green with new version
docker-compose up -d app_green

# Test green directly
curl http://localhost:8082/version

# Switch traffic to green (instant, zero downtime)
make green

# If something's wrong, instant rollback
make blue

Scenario 2: Database Migration

Challenge: New version needs schema changes.

Approach:

  1. Make schema changes backward-compatible
  2. Deploy new version to green (with migration)
  3. Test thoroughly
  4. Switch traffic
  5. Keep blue running for quick rollback
  6. After validation, update blue too

Example:

# Update green with new code + migrations
GREEN_IMAGE=myapp:v2.0.0 docker-compose up -d app_green

# Run migrations on green
docker exec app_green npm run migrate

# Test green
curl http://localhost:8082/test-endpoint

# Switch traffic
make green

# Monitor for issues
docker logs -f app_green

# If problems occur, instant rollback
make blue

Project 2: Containerized Microservices with Full Automation

Now that we understand blue/green deployments, let's scale up to a complete microservices application with full infrastructure automation.

The Big Picture

What we're building:
A production-ready TODO application with:

  • Multiple microservices (5 different programming languages!)
  • Automated infrastructure provisioning (Terraform)
  • Automated configuration management (Ansible)
  • CI/CD pipelines with drift detection
  • Automatic HTTPS with Traefik
  • Zero-trust security model

Architecture:

                           ┌──────────────┐
                           │   Traefik    │
                           │ (HTTPS/Proxy)│
                           └───────┬──────┘
                                   │
            ┌──────────────────────┼──────────────────────┐
            │                      │                      │
    ┌───────▼────────┐    ┌───────▼────────┐    ┌───────▼────────┐
    │   Frontend     │    │   Auth API     │    │   Todos API    │
    │   (Vue.js)     │    │     (Go)       │    │   (Node.js)    │
    └────────────────┘    └────────────────┘    └────────────────┘
                                   │
                    ┌──────────────┼──────────────┐
                    │              │              │
            ┌───────▼────────┐ ┌──▼────┐  ┌─────▼──────┐
            │   Users API    │ │ Redis │  │ Log        │
            │ (Java Spring)  │ │ Queue │  │ Processor  │
            └────────────────┘ └───────┘  │ (Python)   │
                                           └────────────┘

Understanding the Application

The Services

1. Frontend (Vue.js)

  • User interface
  • Login page → TODO dashboard
  • Communicates with backend APIs
  • Port: 80/443 (via Traefik)

2. Auth API (Go)

  • Handles user authentication
  • Issues JWT tokens
  • Endpoint: /api/auth

3. Todos API (Node.js)

  • Manages TODO items
  • CRUD operations
  • Requires valid JWT token
  • Endpoint: /api/todos

4. Users API (Java Spring Boot)

  • User management
  • Profile operations
  • Endpoint: /api/users

5. Log Processor (Python)

  • Processes background tasks
  • Consumes from Redis queue
  • Writes audit logs

6. Redis Queue

  • Message broker
  • Task queue for async operations
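
To see how these services fit together, here's a sketch of a typical request flow with curl (the login path, payload fields, and token field name are assumptions - adjust to the actual API; jq is required):

# 1. Authenticate against the Auth API and capture the JWT
TOKEN=$(curl -s -X POST https://your-domain.com/api/auth/login \
  -H "Content-Type: application/json" \
  -d '{"username":"admin","password":"admin"}' | jq -r '.accessToken')

# 2. Use the token to call the Todos API
curl -s -H "Authorization: Bearer $TOKEN" https://your-domain.com/api/todos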

Phase 1: Containerization

Let's containerize each service. The key is understanding that each language has its own best practices.

Frontend Dockerfile (Vue.js)

# Multi-stage build for optimized production image

# Stage 1: Build the application
FROM node:18-alpine AS builder

WORKDIR /app

# Copy package files
COPY package*.json ./

# Install all dependencies (the build step needs devDependencies too)
RUN npm ci

# Copy source code
COPY . .

# Build for production
RUN npm run build

# Stage 2: Serve with nginx
FROM nginx:alpine

# Copy built assets from builder stage
COPY --from=builder /app/dist /usr/share/nginx/html

# Copy nginx configuration
COPY nginx.conf /etc/nginx/conf.d/default.conf

# Health check
HEALTHCHECK --interval=30s --timeout=3s \
  CMD wget --quiet --tries=1 --spider http://localhost/ || exit 1

EXPOSE 80

CMD ["nginx", "-g", "daemon off;"]

Why multi-stage builds?

  • Builder stage: 800MB (includes build tools)
  • Final stage: 25MB (only nginx + static files)
  • 97% size reduction!
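
You can measure this yourself by building each stage and comparing sizes (tag names here are illustrative):

# Build the intermediate stage and the final image, then compare
docker build --target builder -t frontend:builder ./frontend
docker build -t frontend:latest ./frontend
docker images | grep frontend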

Frontend nginx config:

server {
    listen 80;
    root /usr/share/nginx/html;
    index index.html;

    # SPA routing - send all requests to index.html
    location / {
        try_files $uri $uri/ /index.html;
    }

    # Cache static assets
    location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg)$ {
        expires 1y;
        add_header Cache-Control "public, immutable";
    }

    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;
}

Auth API Dockerfile (Go)

# Multi-stage build for Go

# Stage 1: Build the binary
FROM golang:1.21-alpine AS builder

WORKDIR /app

# Copy go mod files
COPY go.mod go.sum ./

# Download dependencies
RUN go mod download

# Copy source code
COPY . .

# Build the binary
# CGO_ENABLED=0 creates a static binary (no external dependencies)
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o main .

# Stage 2: Create minimal runtime image
FROM alpine:latest

# Add ca-certificates for HTTPS calls
RUN apk --no-cache add ca-certificates

WORKDIR /root/

# Copy the binary from builder
COPY --from=builder /app/main .

# Health check
HEALTHCHECK --interval=30s --timeout=3s \
  CMD wget --quiet --tries=1 --spider http://localhost:8080/health || exit 1

EXPOSE 8080

CMD ["./main"]

Why this approach?

  • Builder stage: 400MB
  • Final stage: 15MB (just Alpine + binary)
  • Static binary = no runtime dependencies
  • Faster startup, smaller attack surface
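
If you want to confirm the binary really is static (assuming Go is installed locally on a Linux machine):

# A statically linked binary has no dynamic loader
CGO_ENABLED=0 GOOS=linux go build -o main .
ldd main   # expect: "not a dynamic executable"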

Todos API Dockerfile (Node.js)

FROM node:18-alpine

WORKDIR /app

# Install dependencies first (better caching)
COPY package*.json ./
RUN npm ci --only=production

# Copy application code
COPY . .

# Create non-root user for security
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001 && \
    chown -R nodejs:nodejs /app

USER nodejs

# Health check
HEALTHCHECK --interval=30s --timeout=3s \
  CMD node healthcheck.js || exit 1

EXPOSE 3000

CMD ["node", "server.js"]

Security note: Running as non-root user limits damage if container is compromised.
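
The healthcheck.js script referenced above isn't shown here. If you'd rather not maintain a separate file, a minimal alternative (assuming the API exposes /healthz on port 3000; wget comes with Alpine's busybox) is:

HEALTHCHECK --interval=30s --timeout=3s \
  CMD wget --quiet --tries=1 --spider http://localhost:3000/healthz || exit 1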

Users API Dockerfile (Java Spring Boot)

# Multi-stage build for Java

# Stage 1: Build with Maven
FROM maven:3.9-eclipse-temurin-17 AS builder

WORKDIR /app

# Copy pom.xml first (dependency caching)
COPY pom.xml ./
RUN mvn dependency:go-offline

# Copy source and build
COPY src ./src
RUN mvn clean package -DskipTests

# Stage 2: Runtime
FROM eclipse-temurin:17-jre-alpine

WORKDIR /app

# Copy JAR from builder
COPY --from=builder /app/target/*.jar app.jar

# Health check
HEALTHCHECK --interval=30s --timeout=3s \
  CMD wget --quiet --tries=1 --spider http://localhost:8080/actuator/health || exit 1

EXPOSE 8080

# Use exec form to ensure proper signal handling
ENTRYPOINT ["java", "-jar", "/app/app.jar"]

Java-specific optimizations:

# Production optimization flags
ENTRYPOINT ["java", \
  "-XX:+UseContainerSupport", \
  "-XX:MaxRAMPercentage=75.0", \
  "-XX:+ExitOnOutOfMemoryError", \
  "-jar", "/app/app.jar"]

Log Processor Dockerfile (Python)

FROM python:3.11-slim

WORKDIR /app

# Install system deps (procps provides pgrep for the health check) and Python deps
RUN apt-get update && \
    apt-get install -y --no-install-recommends procps && \
    rm -rf /var/lib/apt/lists/*
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Create non-root user
RUN useradd -m -u 1001 processor && \
    chown -R processor:processor /app

USER processor

# Health check: verify the worker process is running
# (pgrep avoids the classic "ps | grep" pitfall of matching the grep itself)
HEALTHCHECK --interval=30s --timeout=3s \
  CMD pgrep -f processor.py || exit 1

CMD ["python", "processor.py"]

Docker Compose - Orchestrating Everything

Now let's tie it all together with docker-compose.yml:

version: '3.8'

services:
  # Traefik reverse proxy
  traefik:
    image: traefik:v2.10
    container_name: traefik
    command:
      # API and dashboard (insecure mode exposes the dashboard on :8080;
      # fine for testing, but disable or protect it in production)
      - "--api.dashboard=true"
      - "--api.insecure=true"

      # Docker provider
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"

      # Entrypoints
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"

      # HTTP to HTTPS redirect
      - "--entrypoints.web.http.redirections.entrypoint.to=websecure"
      - "--entrypoints.web.http.redirections.entrypoint.scheme=https"

      # Let's Encrypt
      - "--certificatesresolvers.letsencrypt.acme.tlschallenge=true"
      - "--certificatesresolvers.letsencrypt.acme.email=${ACME_EMAIL}"
      - "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
    ports:
      - "80:80"
      - "443:443"
      - "8080:8080"  # Dashboard
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
      - "./letsencrypt:/letsencrypt"
    networks:
      - web
    restart: unless-stopped

  # Frontend
  frontend:
    build:
      context: ./frontend
      dockerfile: Dockerfile
    container_name: frontend
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.frontend.rule=Host(`${DOMAIN}`)"
      - "traefik.http.routers.frontend.entrypoints=websecure"
      - "traefik.http.routers.frontend.tls.certresolver=letsencrypt"
      - "traefik.http.services.frontend.loadbalancer.server.port=80"
    networks:
      - web
    restart: unless-stopped

  # Auth API
  auth:
    build:
      context: ./auth-api
      dockerfile: Dockerfile
    container_name: auth-api
    environment:
      - DB_HOST=postgres
      - DB_PORT=5432
      - DB_NAME=${DB_NAME}
      - DB_USER=${DB_USER}
      - DB_PASSWORD=${DB_PASSWORD}
      - JWT_SECRET=${JWT_SECRET}
      - REDIS_URL=redis://redis:6379
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.auth.rule=Host(`${DOMAIN}`) && PathPrefix(`/api/auth`)"
      - "traefik.http.routers.auth.entrypoints=websecure"
      - "traefik.http.routers.auth.tls.certresolver=letsencrypt"
      - "traefik.http.services.auth.loadbalancer.server.port=8080"
    depends_on:
      - postgres
      - redis
    networks:
      - web
      - backend
    restart: unless-stopped

  # Todos API
  todos:
    build:
      context: ./todos-api
      dockerfile: Dockerfile
    container_name: todos-api
    environment:
      - DB_HOST=postgres
      - DB_PORT=5432
      - DB_NAME=${DB_NAME}
      - DB_USER=${DB_USER}
      - DB_PASSWORD=${DB_PASSWORD}
      - REDIS_URL=redis://redis:6379
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.todos.rule=Host(`${DOMAIN}`) && PathPrefix(`/api/todos`)"
      - "traefik.http.routers.todos.entrypoints=websecure"
      - "traefik.http.routers.todos.tls.certresolver=letsencrypt"
      - "traefik.http.services.todos.loadbalancer.server.port=3000"
    depends_on:
      - postgres
      - redis
    networks:
      - web
      - backend
    restart: unless-stopped

  # Users API
  users:
    build:
      context: ./users-api
      dockerfile: Dockerfile
    container_name: users-api
    environment:
      - SPRING_DATASOURCE_URL=jdbc:postgresql://postgres:5432/${DB_NAME}
      - SPRING_DATASOURCE_USERNAME=${DB_USER}
      - SPRING_DATASOURCE_PASSWORD=${DB_PASSWORD}
      - SPRING_REDIS_HOST=redis
      - SPRING_REDIS_PORT=6379
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.users.rule=Host(`${DOMAIN}`) && PathPrefix(`/api/users`)"
      - "traefik.http.routers.users.entrypoints=websecure"
      - "traefik.http.routers.users.tls.certresolver=letsencrypt"
      - "traefik.http.services.users.loadbalancer.server.port=8080"
    depends_on:
      - postgres
      - redis
    networks:
      - web
      - backend
    restart: unless-stopped

  # Log Processor
  log-processor:
    build:
      context: ./log-processor
      dockerfile: Dockerfile
    container_name: log-processor
    environment:
      - REDIS_URL=redis://redis:6379
      - LOG_PATH=/logs
    volumes:
      - ./logs:/logs
    depends_on:
      - redis
    networks:
      - backend
    restart: unless-stopped

  # Redis
  redis:
    image: redis:7-alpine
    container_name: redis
    command: redis-server --appendonly yes
    volumes:
      - redis-data:/data
    networks:
      - backend
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 3

  # PostgreSQL
  postgres:
    image: postgres:15-alpine
    container_name: postgres
    environment:
      - POSTGRES_DB=${DB_NAME}
      - POSTGRES_USER=${DB_USER}
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres-data:/var/lib/postgresql/data
    networks:
      - backend
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER}"]
      interval: 10s
      timeout: 3s
      retries: 3

networks:
  web:
    driver: bridge
  backend:
    driver: bridge

volumes:
  postgres-data:
  redis-data:

Key concepts in this compose file:

1. Networks:

networks:
  web:      # Public-facing services
  backend:  # Internal services only
  • Frontend, APIs → web network (accessible via Traefik)
  • Database, Redis → backend network only (isolated)
  • This provides network-level security
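
Once the stack is up, you can verify the isolation (container names come from the compose file; busybox ping is available in the Alpine-based images):

# From the web-only frontend, backend hosts should not even resolve
docker exec frontend ping -c 1 postgres    # expect "bad address 'postgres'"

# From a service attached to both networks, they should
docker exec todos-api ping -c 1 postgres   # expect replies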

2. Traefik Labels:

labels:
  - "traefik.enable=true"
  - "traefik.http.routers.auth.rule=Host(`${DOMAIN}`) && PathPrefix(`/api/auth`)"
  - "traefik.http.routers.auth.tls.certresolver=letsencrypt"

These labels tell Traefik how to route traffic:

  • Route requests to yourdomain.com/api/auth → auth service
  • Automatically get SSL certificate from Let's Encrypt
  • Handle HTTPS termination

3. Environment Variables:

environment:
  - DB_HOST=postgres
  - JWT_SECRET=${JWT_SECRET}

Secrets come from .env file (never committed to git!).

Environment Configuration

Create .env file:

# Domain configuration
DOMAIN=your-domain.com
ACME_EMAIL=your-email@example.com

# Database
DB_NAME=todoapp
DB_USER=todouser
DB_PASSWORD=change-this-strong-password

# Security
JWT_SECRET=change-this-to-random-string-min-32-chars

# Optional: Docker registry
DOCKER_REGISTRY=ghcr.io/yourusername

Security checklist for .env:

  • [ ] Never commit .env to git
  • [ ] Add .env to .gitignore
  • [ ] Use strong passwords (20+ characters)
  • [ ] Use different passwords for each service
  • [ ] Rotate secrets regularly
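
One easy way to generate values that satisfy this checklist:

# 32 random bytes, base64-encoded; suitable for JWT_SECRET or DB_PASSWORD
openssl rand -base64 32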

Phase 2: Infrastructure as Code with Terraform

Now let's provision the cloud infrastructure automatically.

Project Structure

infra/
├── terraform/
│   ├── main.tf              # Main configuration
│   ├── variables.tf         # Input variables
│   ├── outputs.tf           # Output values
│   ├── provider.tf          # Provider configuration
│   └── backend.tf           # Remote state configuration
├── ansible/
│   ├── inventory/           # Dynamic inventory
│   ├── roles/
│   │   ├── dependencies/    # Install Docker, etc.
│   │   └── deploy/          # Deploy application
│   ├── playbook.yml         # Main playbook
│   └── ansible.cfg          # Ansible configuration
└── scripts/
    ├── deploy.sh            # Deployment orchestration
    └── drift-check.sh       # Drift detection

Terraform Configuration

provider.tf:

# Provider configuration
terraform {
  required_version = ">= 1.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }

    local = {
      source  = "hashicorp/local"
      version = "~> 2.0"
    }

    null = {
      source  = "hashicorp/null"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Project     = "todo-app"
      Environment = var.environment
      ManagedBy   = "terraform"
    }
  }
}

backend.tf:

# Remote state storage - crucial for team collaboration
terraform {
  backend "s3" {
    bucket         = "your-terraform-state-bucket"
    key            = "todo-app/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

Why remote state?

  • Team collaboration - everyone sees same state
  • State locking - prevents concurrent modifications
  • Backup - state is backed up in S3
  • Encryption - sensitive data encrypted at rest
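
The bucket and lock table must exist before terraform init will succeed. A one-time bootstrap might look like this (names must match backend.tf):

aws s3api create-bucket --bucket your-terraform-state-bucket --region us-east-1

aws s3api put-bucket-versioning \
  --bucket your-terraform-state-bucket \
  --versioning-configuration Status=Enabled

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST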

variables.tf:

variable "aws_region" {
  description = "AWS region to deploy resources"
  type        = string
  default     = "us-east-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  default     = "production"
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.medium"
}

variable "ssh_public_key" {
  description = "SSH public key for access"
  type        = string
}

variable "domain_name" {
  description = "Domain name for the application"
  type        = string
}

variable "alert_email" {
  description = "Email for drift detection alerts"
  type        = string
}

variable "app_port" {
  description = "Application port"
  type        = number
  default     = 80
}

main.tf:

# VPC for network isolation
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "todo-app-vpc"
  }
}

# Internet Gateway
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "todo-app-igw"
  }
}

# Public Subnet
resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.main.id
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "${var.aws_region}a"
  map_public_ip_on_launch = true

  tags = {
    Name = "todo-app-public-subnet"
  }
}

# Route Table
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = {
    Name = "todo-app-public-rt"
  }
}

# Route Table Association
resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}

# Security Group
resource "aws_security_group" "app" {
  name        = "todo-app-sg"
  description = "Security group for TODO application"
  vpc_id      = aws_vpc.main.id

  # SSH
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "SSH access"
  }

  # HTTP
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "HTTP access"
  }

  # HTTPS
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "HTTPS access"
  }

  # Outbound - allow all
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
    description = "Allow all outbound"
  }

  tags = {
    Name = "todo-app-sg"
  }
}

# SSH Key Pair
resource "aws_key_pair" "deployer" {
  key_name   = "todo-app-deployer"
  public_key = var.ssh_public_key
}

# Latest Ubuntu AMI
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"]  # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

# EC2 Instance
resource "aws_instance" "app" {
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = var.instance_type
  key_name               = aws_key_pair.deployer.key_name
  subnet_id              = aws_subnet.public.id
  vpc_security_group_ids = [aws_security_group.app.id]

  root_block_device {
    volume_size = 30
    volume_type = "gp3"
    encrypted   = true
  }

  user_data = <<-EOF
              #!/bin/bash
              apt-get update
              apt-get install -y python3 python3-pip
              EOF

  tags = {
    Name = "todo-app-server"
  }

  # Lifecycle rule for idempotency
  lifecycle {
    ignore_changes = [
      user_data,  # Don't recreate if user_data changes
      ami,        # Don't recreate on AMI updates unless forced
    ]
  }
}

# Elastic IP for stable public IP
resource "aws_eip" "app" {
  instance = aws_instance.app.id
  domain   = "vpc"

  tags = {
    Name = "todo-app-eip"
  }
}

# Generate Ansible inventory
resource "local_file" "ansible_inventory" {
  content = templatefile("${path.module}/templates/inventory.tpl", {
    app_server_ip = aws_eip.app.public_ip
    ssh_key_path  = "~/.ssh/id_rsa"
    ssh_user      = "ubuntu"
  })

  filename = "${path.module}/../ansible/inventory/hosts"

  # Only regenerate if values change
  lifecycle {
    create_before_destroy = true
  }
}

# Trigger Ansible after provisioning
resource "null_resource" "ansible_provisioner" {
  # Re-run on every apply (timestamp() always changes) or when the instance changes
  triggers = {
    instance_id = aws_instance.app.id
    timestamp   = timestamp()
  }

  # Wait for instance to be ready
  provisioner "local-exec" {
    command = <<-EOT
      echo "Waiting for SSH to be ready..."
      until ssh -o StrictHostKeyChecking=no -o ConnectTimeout=2 ubuntu@${aws_eip.app.public_ip} echo "SSH Ready"; do
        sleep 5
      done

      echo "Running Ansible playbook..."
      cd ${path.module}/../ansible
      ansible-playbook -i inventory/hosts playbook.yml
    EOT
  }

  depends_on = [
    local_file.ansible_inventory,
    aws_eip.app
  ]
}

templates/inventory.tpl:

[app_servers]
todo-app ansible_host=${app_server_ip} ansible_user=${ssh_user} ansible_ssh_private_key_file=${ssh_key_path}

[app_servers:vars]
ansible_python_interpreter=/usr/bin/python3

outputs.tf:

output "instance_public_ip" {
  description = "Public IP of the application server"
  value       = aws_eip.app.public_ip
}

output "instance_id" {
  description = "ID of the EC2 instance"
  value       = aws_instance.app.id
}

output "domain_name" {
  description = "Domain name for the application"
  value       = var.domain_name
}

output "ssh_command" {
  description = "SSH command to connect to the server"
  value       = "ssh ubuntu@${aws_eip.app.public_ip}"
}

Understanding Terraform Idempotency

What is idempotency?
Running the same Terraform code multiple times produces the same result without creating duplicates.

Example - Non-idempotent (bad):

resource "aws_instance" "app" {
  ami           = "ami-12345"
  instance_type = "t3.medium"

  # This causes recreation on every apply!
  tags = {
    Timestamp = timestamp()
  }
}

Idempotent (good):

resource "aws_instance" "app" {
  ami           = "ami-12345"
  instance_type = "t3.medium"

  tags = {
    Name = "todo-app-server"
  }

  lifecycle {
    ignore_changes = [
      tags["Timestamp"],
      user_data
    ]
  }
}

Drift Detection

What is drift?
Drift occurs when actual infrastructure differs from Terraform state (manual changes, external tools, etc.).

drift-check.sh:

#!/bin/bash
set -e

echo "Checking for infrastructure drift..."

# Run terraform plan and capture output
EXIT_CODE=0
PLAN_OUTPUT=$(terraform plan -detailed-exitcode -no-color 2>&1) || EXIT_CODE=$?

# Exit codes:
# 0 = no changes
# 1 = error
# 2 = changes detected (drift!)

if [ $EXIT_CODE -eq 0 ]; then
    echo "✅ No drift detected - infrastructure matches desired state"
    exit 0

elif [ $EXIT_CODE -eq 2 ]; then
    echo "⚠️  DRIFT DETECTED - infrastructure has changed!"
    echo ""
    echo "$PLAN_OUTPUT"
    echo ""

    # Send email alert
    ./send-drift-alert.sh "$PLAN_OUTPUT"

    # In CI/CD, pause for manual approval
    if [ "$CI" = "true" ]; then
        echo "Pausing for manual approval..."
        # GitHub Actions, GitLab CI, etc. have approval mechanisms
        exit 2
    fi

else
    echo "❌ Error running terraform plan"
    echo "$PLAN_OUTPUT"
    exit 1
fi

send-drift-alert.sh:

#!/bin/bash

DRIFT_DETAILS="$1"
ALERT_EMAIL="${ALERT_EMAIL:-admin@example.com}"

# Using AWS SES (send-email takes --destination/--message, not --to/--text;
# the shorthand syntax breaks on commas, so keep the body brief)
aws ses send-email \
  --from "terraform@example.com" \
  --destination "ToAddresses=$ALERT_EMAIL" \
  --message "Subject={Data=Terraform Drift Detected},Body={Text={Data=Drift detected. See the CI run for the full plan.}}"

# Alternatively, using curl with Mailgun, SendGrid, etc. (pick one delivery method)
curl -s --user "api:$MAILGUN_API_KEY" \
  https://api.mailgun.net/v3/$MAILGUN_DOMAIN/messages \
  -F from="terraform@example.com" \
  -F to="$ALERT_EMAIL" \
  -F subject="⚠️ Terraform Drift Detected" \
  -F text="$DRIFT_DETAILS"

Phase 3: Configuration Management with Ansible

Terraform provisions infrastructure, Ansible configures it.

Ansible Project Structure

ansible/
├── inventory/
│   └── hosts                 # Generated by Terraform
├── roles/
│   ├── dependencies/
│   │   ├── tasks/
│   │   │   └── main.yml
│   │   └── handlers/
│   │       └── main.yml
│   └── deploy/
│       ├── tasks/
│       │   └── main.yml
│       ├── templates/
│       │   └── .env.j2
│       └── handlers/
│           └── main.yml
├── playbook.yml
└── ansible.cfg

ansible.cfg

[defaults]
inventory = inventory/hosts
remote_user = ubuntu
private_key_file = ~/.ssh/id_rsa
host_key_checking = False
retry_files_enabled = False

# Faster execution
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 3600

# Better output
stdout_callback = yaml
bin_ansible_callbacks = True

[ssh_connection]
pipelining = True

roles/dependencies/tasks/main.yml

---
# Install required system dependencies

- name: Update apt cache
  apt:
    update_cache: yes
    cache_valid_time: 3600
  become: yes

- name: Install required packages
  apt:
    name:
      - apt-transport-https
      - ca-certificates
      - curl
      - gnupg
      - lsb-release
      - python3-pip
      - git
      - ufw
    state: present
  become: yes

- name: Add Docker GPG key
  apt_key:
    url: https://download.docker.com/linux/ubuntu/gpg
    state: present
  become: yes

- name: Add Docker repository
  apt_repository:
    repo: "deb [arch=amd64] https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable"
    state: present
  become: yes

- name: Install Docker
  apt:
    name:
      - docker-ce
      - docker-ce-cli
      - containerd.io
      - docker-buildx-plugin
      - docker-compose-plugin
    state: present
  become: yes
  notify: Restart Docker

- name: Add user to docker group
  user:
    name: "{{ ansible_user }}"
    groups: docker
    append: yes
  become: yes

- name: Install Docker Compose (standalone)
  get_url:
    url: "https://github.com/docker/compose/releases/download/v2.23.0/docker-compose-linux-x86_64"
    dest: /usr/local/bin/docker-compose
    mode: '0755'
  become: yes

- name: Configure UFW firewall
  ufw:
    rule: "{{ item.rule }}"
    port: "{{ item.port }}"
    proto: "{{ item.proto }}"
  loop:
    - { rule: 'allow', port: '22', proto: 'tcp' }
    - { rule: 'allow', port: '80', proto: 'tcp' }
    - { rule: 'allow', port: '443', proto: 'tcp' }
  become: yes

- name: Enable UFW
  ufw:
    state: enabled
  become: yes

roles/dependencies/handlers/main.yml

---
- name: Restart Docker
  systemd:
    name: docker
    state: restarted
    enabled: yes
  become: yes

roles/deploy/tasks/main.yml

---
# Deploy the application

- name: Create application directory
  file:
    path: /opt/todo-app
    state: directory
    owner: "{{ ansible_user }}"
    group: "{{ ansible_user }}"
    mode: '0755'
  become: yes

- name: Clone application repository
  git:
    repo: "{{ app_repo_url }}"
    dest: /opt/todo-app
    version: "{{ app_branch | default('main') }}"
    force: yes
  register: git_clone

- name: Create environment file from template
  template:
    src: .env.j2
    dest: /opt/todo-app/.env
    owner: "{{ ansible_user }}"
    mode: '0600'
  no_log: yes  # Don't log sensitive env vars

- name: Create letsencrypt directory
  file:
    path: /opt/todo-app/letsencrypt
    state: directory
    mode: '0755'

- name: Pull latest Docker images
  community.docker.docker_compose:
    project_src: /opt/todo-app
    pull: yes
  when: git_clone.changed

- name: Start application with Docker Compose
  community.docker.docker_compose:
    project_src: /opt/todo-app
    state: present
    restarted: "{{ git_clone.changed }}"
  register: compose_output

- name: Wait for application to be healthy
  uri:
    url: "https://{{ domain_name }}/health"
    status_code: 200
    validate_certs: no
  retries: 10
  delay: 10
  register: health_check
  until: health_check.status == 200

- name: Display deployment status
  debug:
    msg: "Application deployed successfully at https://{{ domain_name }}"

roles/deploy/templates/.env.j2

# Auto-generated by Ansible - DO NOT EDIT MANUALLY

# Domain configuration
DOMAIN={{ domain_name }}
ACME_EMAIL={{ acme_email }}

# Database
DB_NAME={{ db_name }}
DB_USER={{ db_user }}
DB_PASSWORD={{ db_password }}

# Security
JWT_SECRET={{ jwt_secret }}

# Application
NODE_ENV=production
LOG_LEVEL=info

playbook.yml

---
- name: Deploy TODO Application
  hosts: app_servers
  become: no

  vars:
    app_repo_url: "https://github.com/yourusername/todo-app.git"
    app_branch: "main"
    domain_name: "{{ lookup('env', 'DOMAIN') }}"
    acme_email: "{{ lookup('env', 'ACME_EMAIL') }}"
    db_name: "{{ lookup('env', 'DB_NAME') }}"
    db_user: "{{ lookup('env', 'DB_USER') }}"
    db_password: "{{ lookup('env', 'DB_PASSWORD') }}"
    jwt_secret: "{{ lookup('env', 'JWT_SECRET') }}"

  roles:
    - dependencies
    - deploy

  post_tasks:
    - name: Verify deployment
      uri:
        url: "https://{{ domain_name }}"
        status_code: 200
        validate_certs: yes
      delegate_to: localhost

    - name: Display application URL
      debug:
        msg: "Application is live at https://{{ domain_name }}"

Phase 4: CI/CD Pipeline

Now let's automate everything with GitHub Actions.

.github/workflows/infrastructure.yml

name: Infrastructure Deployment

on:
  push:
    branches: [main]
    paths:
      - 'infra/terraform/**'
      - 'infra/ansible/**'
      - '.github/workflows/infrastructure.yml'
  workflow_dispatch:  # Manual trigger

env:
  TF_VERSION: '1.6.0'
  AWS_REGION: 'us-east-1'

jobs:
  terraform-plan:
    name: Terraform Plan & Drift Detection
    runs-on: ubuntu-latest
    outputs:
      has_changes: ${{ steps.plan.outputs.has_changes }}

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Init
        run: |
          cd infra/terraform
          terraform init

      - name: Terraform Plan
        id: plan
        run: |
          cd infra/terraform
          EXIT_CODE=0
          terraform plan -detailed-exitcode -out=tfplan || EXIT_CODE=$?

          if [ $EXIT_CODE -eq 0 ]; then
            echo "has_changes=false" >> $GITHUB_OUTPUT
            echo "✅ No infrastructure changes detected"
          elif [ $EXIT_CODE -eq 2 ]; then
            echo "has_changes=true" >> $GITHUB_OUTPUT
            echo "⚠️  Infrastructure drift detected!"
          else
            echo "❌ Terraform plan failed"
            exit 1
          fi

      - name: Save plan
        if: steps.plan.outputs.has_changes == 'true'
        uses: actions/upload-artifact@v3
        with:
          name: tfplan
          path: infra/terraform/tfplan

      - name: Send drift alert email
        if: steps.plan.outputs.has_changes == 'true'
        uses: dawidd6/action-send-mail@v3
        with:
          server_address: smtp.gmail.com
          server_port: 465
          username: ${{ secrets.MAIL_USERNAME }}
          password: ${{ secrets.MAIL_PASSWORD }}
          subject: ⚠️ Terraform Drift Detected - TODO App
          to: ${{ secrets.ALERT_EMAIL }}
          from: Terraform CI/CD
          body: |
            Infrastructure drift has been detected!

            Review the changes and approve the workflow to apply:
            ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}

  terraform-apply:
    name: Terraform Apply
    runs-on: ubuntu-latest
    needs: terraform-plan
    if: needs.terraform-plan.outputs.has_changes == 'true'
    environment: production  # Requires manual approval

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Download plan
        uses: actions/download-artifact@v3
        with:
          name: tfplan
          path: infra/terraform/

      - name: Terraform Init
        run: |
          cd infra/terraform
          terraform init

      - name: Terraform Apply
        run: |
          cd infra/terraform
          terraform apply tfplan

      - name: Save outputs
        run: |
          cd infra/terraform
          terraform output -json > outputs.json

      - name: Upload outputs
        uses: actions/upload-artifact@v3
        with:
          name: terraform-outputs
          path: infra/terraform/outputs.json

  ansible-deploy:
    name: Ansible Deployment
    runs-on: ubuntu-latest
    needs: terraform-apply
    if: always() && (needs.terraform-apply.result == 'success' || needs.terraform-plan.outputs.has_changes == 'false')

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install Ansible
        run: |
          pip install ansible

      - name: Setup SSH key
        run: |
          mkdir -p ~/.ssh
          echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
          chmod 600 ~/.ssh/id_rsa
          ssh-keyscan -H ${{ secrets.SERVER_IP }} >> ~/.ssh/known_hosts

      - name: Run Ansible playbook
        env:
          DOMAIN: ${{ secrets.DOMAIN }}
          ACME_EMAIL: ${{ secrets.ACME_EMAIL }}
          DB_NAME: ${{ secrets.DB_NAME }}
          DB_USER: ${{ secrets.DB_USER }}
          DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
          JWT_SECRET: ${{ secrets.JWT_SECRET }}
        run: |
          cd infra/ansible
          ansible-playbook -i inventory/hosts playbook.yml

      - name: Verify deployment
        run: |
          sleep 30  # Wait for services to stabilize
          curl -f https://${{ secrets.DOMAIN }}/health || exit 1
          echo "✅ Deployment verified!"
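
One gotcha worth double-checking in the full workflow: `needs.terraform-plan.outputs.has_changes` (used by both downstream jobs) only resolves if the `terraform-plan` job maps the step output to a job-level output in its definition:

```yaml
# on the terraform-plan job, alongside runs-on:
outputs:
  has_changes: ${{ steps.plan.outputs.has_changes }}
```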

.github/workflows/application.yml

name: Application Deployment

on:
  push:
    branches: [main]
    paths:
      - 'frontend/**'
      - 'auth-api/**'
      - 'todos-api/**'
      - 'users-api/**'
      - 'log-processor/**'
      - 'docker-compose.yml'
  workflow_dispatch:

jobs:
  deploy:
    name: Deploy Application
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Setup SSH key
        run: |
          mkdir -p ~/.ssh
          echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
          chmod 600 ~/.ssh/id_rsa
          ssh-keyscan -H ${{ secrets.SERVER_IP }} >> ~/.ssh/known_hosts

      - name: Deploy to server
        run: |
          ssh ubuntu@${{ secrets.SERVER_IP }} << 'EOF'
            cd /opt/todo-app
            git pull origin main
            docker-compose pull
            docker-compose up -d --build
          EOF

      - name: Wait for deployment
        run: sleep 30

      - name: Health check
        run: |
          curl -f https://${{ secrets.DOMAIN }}/health || exit 1
          echo "✅ Application deployed successfully!"

Understanding the CI/CD Flow

Infrastructure changes (Terraform/Ansible):

1. Push to main
   ↓
2. Run terraform plan
   ↓
3. Detect drift? → Send email
   ↓
4. Pause for manual approval (GitHub Environment protection)
   ↓
5. Apply changes
   ↓
6. Run Ansible
   ↓
7. Verify deployment
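
To watch the drift path fire end to end, you can create drift deliberately and re-run the plan. A minimal sketch, assuming an AWS setup; the instance ID is a placeholder, and any out-of-band change to a Terraform-managed resource will do:

```bash
# make a manual change Terraform doesn't know about (placeholder instance ID)
aws ec2 create-tags \
  --resources i-0123456789abcdef0 \
  --tags Key=Environment,Value=edited-by-hand

# re-plan: with -detailed-exitcode, exit code 2 signals pending changes
cd infra/terraform
terraform plan -detailed-exitcode -out=tfplan
echo "exit code: $?"
```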

Application changes:

1. Push to main
   ↓
2. SSH to server
   ↓
3. Git pull
   ↓
4. docker-compose pull
   ↓
5. docker-compose up
   ↓
6. Health check
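
A small refinement worth considering for the final step: replacing the fixed `sleep 30` with a polling loop fails fast on broken deploys and tolerates slow starts. A sketch (endpoint, attempt count, and interval are illustrative):

```bash
# poll the health endpoint instead of sleeping for a fixed 30 seconds
for attempt in $(seq 1 10); do
  if curl -fsS "https://your-domain.com/health" > /dev/null; then
    echo "✅ healthy after ${attempt} attempt(s)"
    exit 0
  fi
  sleep 5
done
echo "❌ health check failed after 10 attempts" >&2
exit 1
```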

Testing the Complete Setup

Local Testing

1. Test containers locally:

# Start everything
docker-compose up -d

# Check status
docker-compose ps

# View logs
docker-compose logs -f

# Test frontend
curl http://localhost

# Test APIs
curl http://localhost/api/auth/health
curl http://localhost/api/todos/health
curl http://localhost/api/users/health

# Stop everything
docker-compose down

2. Test Terraform:

cd infra/terraform

# Initialize
terraform init

# Validate
terraform validate

# Plan (dry run)
terraform plan

# Apply (create infrastructure)
terraform apply

# Show outputs
terraform output

# Destroy (cleanup)
terraform destroy
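
One assumption hiding behind both the CI jobs and these local commands: they only see the same state if the project uses a remote backend. A typical S3 configuration (bucket, table, and file names are placeholders) looks like:

```hcl
# infra/terraform/backend.tf (hypothetical file name)
terraform {
  backend "s3" {
    bucket         = "todo-app-tfstate"      # placeholder bucket
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "todo-app-tf-locks"     # provides the state locking seen under Troubleshooting
    encrypt        = true
  }
}
```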

3. Test Ansible:

cd infra/ansible

# Test connection (pass the inventory explicitly, matching the CI workflow)
ansible all -i inventory/hosts -m ping

# Check syntax
ansible-playbook -i inventory/hosts playbook.yml --syntax-check

# Dry run
ansible-playbook -i inventory/hosts playbook.yml --check

# Run for real
ansible-playbook -i inventory/hosts playbook.yml

# Run only tasks tagged 'deploy'
ansible-playbook -i inventory/hosts playbook.yml --tags deploy

Production Deployment

Complete deployment from scratch:

# 1. Clone the repository
git clone https://github.com/yourusername/todo-app.git
cd todo-app

# 2. Configure secrets
cp .env.example .env
# Edit .env with your values

# 3. Initialize Terraform
cd infra/terraform
terraform init

# 4. Create infrastructure
terraform plan
terraform apply

# In CI, the ansible-deploy job runs automatically after apply;
# for a purely local bootstrap, run the Ansible playbook from infra/ansible

# 5. Configure DNS
# Point your domain to the Elastic IP shown in terraform outputs

# 6. Verify deployment
curl https://your-domain.com

Expected result:

  • Login page loads at https://your-domain.com
  • HTTPS works (automatic certificate from Let's Encrypt)
  • APIs respond at /api/auth, /api/todos, /api/users
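
A quick smoke test checks all three expectations in one pass (it uses the /health routes exercised in local testing; the domain is a placeholder):

```bash
# verify the frontend and each API health route
for path in "" "api/auth/health" "api/todos/health" "api/users/health"; do
  if curl -fsS "https://your-domain.com/${path}" > /dev/null; then
    echo "✅ /${path} OK"
  else
    echo "❌ /${path} failed"
  fi
done
```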

Troubleshooting

Issue: Terraform fails with "state locked"

# Force-release the lock using the lock ID from the error message
terraform force-unlock <LOCK_ID>

# Or simply wait for the operation holding the lock to finish

Issue: Ansible can't connect to server

# Test SSH manually
ssh -i ~/.ssh/id_rsa ubuntu@<SERVER_IP>

# Check inventory
ansible-inventory --list -i inventory/hosts

# Verbose output
ansible-playbook playbook.yml -vvv

Issue: Containers won't start

# Check logs
docker-compose logs <service-name>

# Check disk space
df -h

# Check memory
free -h

# Restart specific service
docker-compose restart <service-name>

Issue: HTTPS not working

# Check Traefik logs
docker logs traefik

# Verify DNS points to server
dig your-domain.com

# Check certificate
docker exec traefik cat /letsencrypt/acme.json

# Force certificate re-issuance (Let's Encrypt rate limits apply, so use sparingly)
docker-compose down
rm -f letsencrypt/acme.json
docker-compose up -d

Key Takeaways

Blue/Green Deployment Lessons

  1. Nginx's backup directive is powerful - Simple yet effective failover (see the sketch after this list)
  2. Tight timeouts enable fast failover - But tune based on your app
  3. Health checks != failover - They serve different purposes
  4. Chaos testing is essential - Test failures before they happen in production
  5. Idempotency prevents surprises - Re-running should be safe
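
For reference, the heart of that failover behavior is a single backup-flagged server in the upstream block (ports match the Project 1 architecture; the upstream name and timeout values are illustrative):

```nginx
upstream app_pool {
    # blue is primary: one failed request within 5s marks it down
    server 127.0.0.1:8081 max_fails=1 fail_timeout=5s;
    # green receives traffic only while blue is marked down
    server 127.0.0.1:8082 backup;
}
```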

Microservices Deployment Lessons

  1. Multi-stage Docker builds save space - 97% reduction possible
  2. Traefik simplifies routing - Labels replace complex nginx configs (example after this list)
  3. Terraform + Ansible separation works - Provision vs configure
  4. Drift detection prevents disasters - Catch manual changes early
  5. CI/CD approval gates add safety - Human oversight for infrastructure
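
To make lesson 2 concrete, Traefik routing for one service is just a few Compose labels. A sketch (the router name and certresolver are illustrative and must match your Traefik static configuration):

```yaml
services:
  todos-api:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.todos.rule=Host(`your-domain.com`) && PathPrefix(`/api/todos`)"
      - "traefik.http.routers.todos.entrypoints=websecure"
      - "traefik.http.routers.todos.tls.certresolver=letsencrypt"
```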

Production-Ready Checklist

Before going live, ensure:

Security:

  • [ ] All secrets in environment variables, not code
  • [ ] SSL/TLS configured (Traefik handles this)
  • [ ] Firewall rules in place (UFW + security groups)
  • [ ] Containers run as non-root users (Dockerfile sketch below)
  • [ ] Regular security updates automated
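
The non-root item is the one most often skipped, and it's only two Dockerfile lines (Debian/Ubuntu base-image syntax; the user name is arbitrary):

```dockerfile
# create an unprivileged system user and drop root before the app runs
RUN groupadd --system app && useradd --system --gid app --no-create-home app
USER app
```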

Monitoring:

  • [ ] Health checks on all services
  • [ ] Centralized logging (consider ELK stack)
  • [ ] Metrics collection (Prometheus)
  • [ ] Alerting configured (PagerDuty, Opsgenie)
  • [ ] Uptime monitoring (UptimeRobot)

Reliability:

  • [ ] Database backups automated (example cron entry after this list)
  • [ ] Disaster recovery plan documented
  • [ ] Rollback procedures tested
  • [ ] Load testing completed
  • [ ] Chaos engineering practiced
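
For the backups item, even a simple cron entry beats nothing. A hypothetical example, assuming a Postgres container named `db` and placeholder credentials:

```bash
# /etc/cron.d entry: nightly dump at 03:00 (note the escaped % required in crontab)
0 3 * * * root docker exec db pg_dump -U todo_user todo_db | gzip > /opt/backups/todo-$(date +\%F).sql.gz
```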

Operations:

  • [ ] Documentation up to date
  • [ ] Runbooks for common issues
  • [ ] On-call rotation defined
  • [ ] Incident response process
  • [ ] Post-mortem template ready

Next Steps

Beginner Level

  1. Set up Project 1 locally

    • Get familiar with Docker Compose
    • Understand nginx configuration
    • Run the failover tests
  2. Modify the setup

    • Add a third color (red)
    • Implement weighted load balancing (see the sketch below)
    • Add custom health check endpoints
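
For the weighted load balancing exercise, the nginx change is small; extending the upstream sketch from Key Takeaways (port 8083 for the new red service follows the existing pattern):

```nginx
upstream app_pool {
    server 127.0.0.1:8081 weight=3;  # blue: ~60% of requests
    server 127.0.0.1:8082 weight=1;  # green: ~20%
    server 127.0.0.1:8083 weight=1;  # red: ~20%
}
```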

Intermediate Level

  1. Implement Project 2

    • Start with containerization only
    • Add Traefik incrementally
    • Test locally before cloud deployment
  2. Add observability

    • Integrate Prometheus metrics
    • Set up Grafana dashboards
    • Implement distributed tracing

Advanced Level

  1. Production hardening

    • Set up multi-region deployment
    • Implement auto-scaling
    • Add CDN (CloudFlare)
    • Configure WAF rules
  2. Advanced automation

    • GitOps with ArgoCD or Flux
    • Infrastructure testing with Terratest
    • Policy as Code with Open Policy Agent

Resources

Official Documentation:

  • Docker: https://docs.docker.com/
  • Nginx: https://nginx.org/en/docs/
  • Terraform: https://developer.hashicorp.com/terraform/docs
  • Ansible: https://docs.ansible.com/
  • Traefik: https://doc.traefik.io/traefik/
  • GitHub Actions: https://docs.github.com/en/actions


Conclusion

We've covered two comprehensive DevOps projects:

Project 1 taught us:

  • Zero-downtime deployments with blue/green strategy
  • Nginx automatic failover configuration
  • Chaos engineering for resilience testing

Project 2 showed us:

  • Containerizing multi-language microservices
  • Infrastructure as Code with Terraform
  • Configuration management with Ansible
  • Production-grade CI/CD pipelines
  • Drift detection and alerting

The skills you've learned here form the foundation of modern DevOps practices. Start with the basics, experiment fearlessly, and gradually add complexity as you grow.

Remember: the best infrastructure is the one that works reliably, fails gracefully, and lets you sleep peacefully at night.

Happy deploying! 🚀


Questions or feedback? Drop a comment below or reach out on Twitter / LinkedIn.

Found this helpful? Give it a ❤️ and share with fellow developers!
