A comprehensive walkthrough of two production-ready DevOps projects, perfect for beginners looking to level up their deployment game.
Table of Contents
- Introduction
- Project 1: Blue/Green Deployment with Nginx Auto-Failover
- Project 2: Containerized Microservices with Full Automation
- Key Takeaways
- Next Steps
Introduction
Ever wonder how major tech companies achieve zero-downtime deployments? How do they update their applications without users noticing? In this guide, we'll walk through two real-world DevOps projects that answer these questions.
What You'll Learn:
- Blue/Green deployment strategies for zero-downtime releases
- Nginx configuration for automatic failover
- Docker containerization for multiple programming languages
- Infrastructure as Code with Terraform
- Configuration management with Ansible
- CI/CD pipeline design with drift detection
- Production-ready security practices
Prerequisites:
- Basic understanding of Docker and Docker Compose
- Familiarity with command-line operations
- A code editor (VS Code, Vim, etc.)
- Git installed on your machine
- Access to a cloud provider (AWS, DigitalOcean, Hetzner, etc.) for Project 2
Time Investment:
- Project 1: 2-3 hours
- Project 2: 8-12 hours
Let's dive in!
Project 1: Blue/Green Deployment with Nginx Auto-Failover
What is Blue/Green Deployment?
Imagine you're running a restaurant. You have two identical kitchens: Kitchen Blue (currently serving customers) and Kitchen Green (on standby). When you want to update the menu or equipment, you:
- Update Kitchen Green while Kitchen Blue serves customers
- Test Kitchen Green thoroughly
- Switch all orders to Kitchen Green
- Now Kitchen Blue is on standby for the next update
That's exactly how blue/green deployment works in software! You maintain two identical production environments and can switch between them instantly.
Why Use This Approach?
Traditional deployment problems:
- Downtime during updates (users see "Site under maintenance")
- No easy rollback if something breaks
- Risk of breaking production with untested changes
Blue/Green deployment benefits:
- Zero downtime - users never notice the switch
- Instant rollback - just switch back if issues arise
- Safe testing - new version runs in production environment before switch
Project Architecture
          ┌─────────────┐
          │    Nginx    │
          │    :8080    │
          └──────┬──────┘
                 │
        ┌────────┴─────────┐
        │                  │
┌───────▼────────┐  ┌──────▼─────────┐
│ Blue (Primary) │  │ Green (Backup) │
│     :8081      │  │     :8082      │
└────────────────┘  └────────────────┘
How it works:
- Nginx sits at port 8080 (your main entry point)
- Blue service at port 8081 (currently handling all traffic)
- Green service at port 8082 (backup, ready to take over)
- When Blue fails, Nginx automatically routes traffic to Green
- No requests are dropped during the transition
Setting Up the Project
Step 1: Project Structure
Create your project directory:
mkdir blue-green-deployment
cd blue-green-deployment
Your final structure will look like this:
blue-green-deployment/
├── docker-compose.yml # Orchestrates all services
├── nginx.conf.template # Nginx configuration
├── entrypoint.sh # Nginx startup script
├── env.example # Environment variables template
├── .env # Your actual environment variables
├── Makefile # Convenient commands
├── test-failover.sh # Automated testing script
├── README.md # Documentation
├── DECISION.md # Technical decisions
└── PART_B_RESEARCH.md # Infrastructure research
Step 2: Docker Compose Configuration
The docker-compose.yml file is the heart of this setup. Let's break it down:
services:
# Nginx acts as the load balancer with automatic failover
nginx:
image: nginx:alpine
container_name: nginx-lb
ports:
- "8080:80" # Main entry point for users
volumes:
- ./nginx.conf.template:/etc/nginx/nginx.conf:ro
- ./entrypoint.sh:/entrypoint.sh:ro
environment:
- ACTIVE_POOL=${ACTIVE_POOL:-blue} # Which pool is primary?
- BLUE_UPSTREAM=app_blue:${PORT:-3000}
- GREEN_UPSTREAM=app_green:${PORT:-3000}
depends_on:
- app_blue
- app_green
entrypoint: ["/bin/sh", "/entrypoint.sh"]
networks:
- app-network
restart: unless-stopped
# Blue pool - primary by default
app_blue:
image: ${BLUE_IMAGE}
container_name: app_blue
ports:
- "8081:${PORT:-3000}" # Direct access for testing
environment:
- APP_POOL=blue
- RELEASE_ID=${RELEASE_ID_BLUE:-v1.0.0-blue}
- PORT=${PORT:-3000}
networks:
- app-network
restart: unless-stopped
healthcheck:
test: ["CMD", "wget", "--quiet", "--tries=1", "--spider",
"http://localhost:${PORT:-3000}/healthz"]
interval: 5s
timeout: 3s
retries: 3
# Green pool - backup by default
app_green:
image: ${GREEN_IMAGE}
container_name: app_green
ports:
- "8082:${PORT:-3000}" # Direct access for testing
environment:
- APP_POOL=green
- RELEASE_ID=${RELEASE_ID_GREEN:-v1.0.0-green}
- PORT=${PORT:-3000}
networks:
- app-network
restart: unless-stopped
healthcheck:
test: ["CMD", "wget", "--quiet", "--tries=1", "--spider",
"http://localhost:${PORT:-3000}/healthz"]
interval: 5s
timeout: 3s
retries: 3
networks:
app-network:
driver: bridge
Key points to understand:
1. Port Strategy:
- Port 8080: public-facing Nginx (users hit this)
- Port 8081: direct access to Blue (for chaos testing)
- Port 8082: direct access to Green (for chaos testing)
2. Health Checks:
- Every 5 seconds, Docker checks if the service is healthy
- Uses the /healthz endpoint
- After 3 failed attempts (3s timeout each), the container is marked unhealthy
3. Environment Variables:
- ACTIVE_POOL: which pool gets traffic first (blue or green)
- RELEASE_ID: tracks which version is running
- PORT: application port (default: 3000)
Step 3: Nginx Configuration Magic
The nginx.conf.template is where the failover magic happens:
events {
worker_connections 1024;
}
http {
# Logging to help debug issues
access_log /var/log/nginx/access.log;
error_log /var/log/nginx/error.log warn;
# Combined upstream with failover logic
upstream app_backend {
# Active pool (primary)
server ${ACTIVE_UPSTREAM} max_fails=2 fail_timeout=5s;
# Backup pool - only used if primary fails
server ${BACKUP_UPSTREAM} max_fails=2 fail_timeout=5s backup;
}
server {
listen 80;
server_name localhost;
# Aggressive timeouts for quick failover detection
proxy_connect_timeout 2s;
proxy_send_timeout 3s;
proxy_read_timeout 3s;
# Retry logic - crucial for zero-downtime failover
proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 2;
proxy_next_upstream_timeout 10s;
# Don't buffer - we want real-time responses
proxy_buffering off;
# Forward original client info
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
location / {
proxy_pass http://app_backend;
proxy_pass_request_headers on;
}
}
}
Let's decode this configuration:
1. Upstream Block:
upstream app_backend {
server ${ACTIVE_UPSTREAM} max_fails=2 fail_timeout=5s;
server ${BACKUP_UPSTREAM} max_fails=2 fail_timeout=5s backup;
}
- max_fails=2: after 2 failed requests, mark the server as down
- fail_timeout=5s: keep the server marked down for 5 seconds before retrying
- backup: this server only receives traffic when the primary fails
2. Timeout Settings:
proxy_connect_timeout 2s; # Can't connect? Fail fast
proxy_read_timeout 3s; # No response? Move to backup
These are aggressive timeouts for quick failure detection. In production, you might increase these based on your app's response times.
3. Retry Logic:
proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 2;
This tells Nginx: "If you get an error, timeout, or 5xx status code, try the next server (backup) automatically."
Step 4: The Entrypoint Script
Nginx doesn't support environment variables natively in its config. We use envsubst to template it at runtime:
#!/bin/sh
set -e
# Figure out which upstream is active and which is backup
if [ "$ACTIVE_POOL" = "blue" ]; then
ACTIVE_UPSTREAM="$BLUE_UPSTREAM"
BACKUP_UPSTREAM="$GREEN_UPSTREAM"
else
ACTIVE_UPSTREAM="$GREEN_UPSTREAM"
BACKUP_UPSTREAM="$BLUE_UPSTREAM"
fi
echo "==> Setting up Nginx with active pool: $ACTIVE_POOL"
echo " Active upstream: $ACTIVE_UPSTREAM"
echo " Backup upstream: $BACKUP_UPSTREAM"
# Use envsubst to replace variables in the template
export ACTIVE_UPSTREAM
export BACKUP_UPSTREAM
envsubst '${ACTIVE_UPSTREAM} ${BACKUP_UPSTREAM}' \
< /etc/nginx/nginx.conf > /tmp/nginx.conf
# Test the config before starting (ALWAYS do this!)
nginx -t -c /tmp/nginx.conf
# Start nginx in foreground
echo "==> Starting Nginx..."
exec nginx -g 'daemon off;' -c /tmp/nginx.conf
What this script does:
- Reads the ACTIVE_POOL environment variable
- Sets up which upstream is active vs backup
- Replaces the placeholders in nginx.conf (see the quick check below)
- Tests the configuration (prevents broken configs from starting)
- Starts Nginx with the processed config
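If you want to see the templating step in isolation, here is a minimal sketch you can run on your host from the project directory, assuming envsubst is installed (it ships with the gettext package):
export ACTIVE_UPSTREAM="app_blue:3000"
export BACKUP_UPSTREAM="app_green:3000"
# Only the two upstream server lines contain max_fails, so this shows the substitution result
envsubst '${ACTIVE_UPSTREAM} ${BACKUP_UPSTREAM}' < nginx.conf.template | grep max_fails
# Expected output (roughly):
#   server app_blue:3000 max_fails=2 fail_timeout=5s;
#   server app_green:3000 max_fails=2 fail_timeout=5s backup;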
Step 5: Environment Configuration
Create your .env file from the example:
cp env.example .env
The .env file contents:
# Docker images for blue and green pools
BLUE_IMAGE=your-registry/your-app:blue
GREEN_IMAGE=your-registry/your-app:green
# Which pool should be active? (blue or green)
ACTIVE_POOL=blue
# Release identifiers - show up in X-Release-Id header
RELEASE_ID_BLUE=v1.0.0-blue-20250129
RELEASE_ID_GREEN=v1.0.0-green-20250129
# Application port
PORT=3000
Pro tip: In a real deployment, BLUE_IMAGE and GREEN_IMAGE might point to different versions of your app:
- Blue: myapp:v1.2.3
- Green: myapp:v1.2.4 (new version being tested)
Step 6: Makefile for Convenience
The Makefile provides friendly commands:
.PHONY: help up down restart logs test clean
help: ## Show available commands
@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | \
awk 'BEGIN {FS = ":.*?## "}; {printf " %-15s %s\n", $$1, $$2}'
up: ## Start all services
@if [ ! -f .env ]; then \
echo "Creating .env from env.example..."; \
cp env.example .env; \
fi
docker-compose up -d
@echo "Services started!"
@echo "Nginx: http://localhost:8080"
@echo "Blue: http://localhost:8081"
@echo "Green: http://localhost:8082"
down: ## Stop all services
docker-compose down
test: ## Run failover test
./test-failover.sh
blue: ## Switch to blue as active
@sed -i.bak 's/ACTIVE_POOL=.*/ACTIVE_POOL=blue/' .env
@$(MAKE) restart
green: ## Switch to green as active
@sed -i.bak 's/ACTIVE_POOL=.*/ACTIVE_POOL=green/' .env
@$(MAKE) restart
Usage:
make help # See all commands
make up # Start everything
make test # Run failover tests
make green # Switch to green pool
Running the Project
Start the Stack
Option 1: Using Make (recommended)
make up
Option 2: Manual
cp env.example .env
docker-compose up -d
Verify It's Working
# Check service status
docker-compose ps
# Hit the main endpoint
curl -i http://localhost:8080/version
You should see headers like:
HTTP/1.1 200 OK
X-App-Pool: blue
X-Release-Id: v1.0.0-blue-20250129
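Before breaking anything, it is also worth confirming both pools respond on their direct ports (the X-App-Pool header comes from the sample app used in this project):
# Blue should answer on 8081, green on 8082
curl -si http://localhost:8081/version | grep -i x-app-pool   # expect: blue
curl -si http://localhost:8082/version | grep -i x-app-pool   # expect: green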
Testing the Failover
This is where it gets exciting! Let's break the blue service and watch Nginx automatically switch to green.
Automated Test Script
The test-failover.sh script automates the entire test:
#!/bin/bash
set -e
echo "🔵 Blue/Green Failover Test"
echo "============================"
echo ""
# Step 1: Baseline check
echo "📊 Step 1: Checking baseline (should be blue)..."
for i in {1..3}; do
response=$(curl -s -i http://localhost:8080/version)
pool=$(echo "$response" | grep -i "X-App-Pool:" | cut -d: -f2 | tr -d ' \r')
echo " Request $i: Pool=$pool"
done
# Step 2: Trigger chaos on blue
echo "💥 Step 2: Triggering chaos on blue..."
curl -s -X POST "http://localhost:8081/chaos/start?mode=error"
sleep 1
# Step 3: Test failover
echo "🔄 Step 3: Testing failover (should switch to green)..."
success_count=0
green_count=0
for i in {1..10}; do
http_code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/version)
if [ "$http_code" = "200" ]; then
success_count=$((success_count + 1))
response=$(curl -s -i http://localhost:8080/version)
pool=$(echo "$response" | grep -i "X-App-Pool:" | cut -d: -f2 | tr -d ' \r')
if [ "$pool" = "green" ]; then
green_count=$((green_count + 1))
fi
echo " Request $i: HTTP $http_code, Pool=$pool ✓"
else
echo " Request $i: HTTP $http_code ✗ FAILED"
fi
sleep 0.5
done
# Step 4: Stop chaos
echo "🛑 Step 4: Stopping chaos..."
curl -s -X POST "http://localhost:8081/chaos/stop"
# Results
echo "📈 Results:"
echo " ├─ Total requests: 10"
echo " ├─ Successful (200): $success_count"
echo " └─ Routed to green: $green_count"
if [ $success_count -eq 10 ] && [ $green_count -ge 9 ]; then
echo "✅ Test PASSED - Failover working correctly!"
else
echo "❌ Test FAILED - Check the logs"
fi
Run the test:
chmod +x test-failover.sh
./test-failover.sh
Expected output:
🔵 Blue/Green Failover Test
============================
📊 Step 1: Checking baseline (should be blue)...
Request 1: Pool=blue
Request 2: Pool=blue
Request 3: Pool=blue
💥 Step 2: Triggering chaos on blue...
Chaos initiated: {"status":"chaos_started","mode":"error"}
🔄 Step 3: Testing failover (should switch to green)...
Request 1: HTTP 200, Pool=green ✓
Request 2: HTTP 200, Pool=green ✓
Request 3: HTTP 200, Pool=green ✓
...
Request 10: HTTP 200, Pool=green ✓
📈 Results:
├─ Total requests: 10
├─ Successful (200): 10
└─ Routed to green: 10
✅ Test PASSED - Failover working correctly!
What just happened?
- All requests initially went to blue
- We triggered chaos mode (blue starts returning 500 errors)
- Nginx detected blue was failing
- Zero requests failed - Nginx automatically retried on green
- All subsequent requests went to green
Manual Testing (Understanding Each Step)
Let's do it manually to understand what's happening:
1. Check baseline:
for i in {1..5}; do
curl -s http://localhost:8080/version | grep -E "pool|release"
done
All responses show "pool": "blue".
2. Break blue service:
# Trigger chaos mode - makes blue return 500 errors
curl -X POST "http://localhost:8081/chaos/start?mode=error"
3. Watch the magic:
# Keep hitting the endpoint
for i in {1..10}; do
curl -s -w "\nStatus: %{http_code}\n" http://localhost:8080/version | \
grep -E "pool|release|Status"
sleep 1
done
You'll see:
- All requests still return 200 (no failures!)
- Pool changes from "blue" to "green"
- Headers now show
"pool": "green"
4. Fix blue:
curl -X POST "http://localhost:8081/chaos/stop"
After a few seconds, traffic goes back to blue (it's the primary).
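To watch the switch-back happen in real time, you can poll the pool header in a simple loop (a sketch; the header name is specific to the sample app):
# Prints the serving pool once per second - stop with Ctrl+C
while true; do
curl -si http://localhost:8080/version | grep -i x-app-pool
sleep 1
done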
Key Concepts Explained
Why Aggressive Timeouts?
proxy_connect_timeout 2s;
proxy_read_timeout 3s;
Scenario: Your blue service starts hanging (taking 10+ seconds to respond).
With loose timeouts (10s):
- User makes request → Nginx waits 10s on blue → Fails → Retries on green
- User waited 10+ seconds (bad experience)
With tight timeouts (2-3s):
- User makes request → Nginx waits 2s on blue → Fails fast → Retries on green
- User gets response in ~2-3 seconds (much better!)
Trade-off: If your app legitimately takes 5 seconds to respond, 3s timeouts will cause false failures. Tune based on your app's performance.
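A quick way to sanity-check your numbers before touching the timeouts is to sample end-to-end latency with curl (a rough sketch, not a load test):
# Send 50 requests and print the 5 slowest response times in seconds
for _ in $(seq 1 50); do
curl -s -o /dev/null -w '%{time_total}\n' http://localhost:8080/version
done | sort -n | tail -5
# Set proxy_read_timeout comfortably above the slowest values you see here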
The Backup Directive
server ${ACTIVE_UPSTREAM} max_fails=2 fail_timeout=5s;
server ${BACKUP_UPSTREAM} max_fails=2 fail_timeout=5s backup;
The backup keyword is crucial:
- Without it: Nginx load-balances 50/50 between blue and green
- With it: Nginx sends all traffic to blue, green only gets traffic when blue is down
This is what makes it true blue/green deployment (not load balancing).
Health Checks vs Failover
Docker health checks:
healthcheck:
test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:3000/healthz"]
interval: 5s
Nginx failover:
max_fails=2 fail_timeout=5s
These serve different purposes:
- Docker health checks: Tell Docker orchestration layer if container is healthy
- Nginx failover: Actual traffic routing decisions
Both are important, but Nginx failover is what keeps users happy during failures.
Production Considerations
What Would Change in Production?
1. Timeouts would be tuned:
- Measure your app's real response times
- Set timeouts slightly above p95 latency
- Balance fast failover vs false positives
2. Monitoring & Alerting:
# Expose basic Nginx status for a Prometheus exporter (e.g., nginx-prometheus-exporter) to scrape
location /metrics {
stub_status on;
}
3. Structured Logging:
log_format json escape=json '{'
'"time":"$time_iso8601",'
'"remote_addr":"$remote_addr",'
'"request":"$request",'
'"status":$status,'
'"upstream":"$upstream_addr"'
'}';
access_log /var/log/nginx/access.log json;
4. SSL/TLS:
server {
listen 443 ssl http2;
ssl_certificate /etc/ssl/certs/cert.pem;
ssl_certificate_key /etc/ssl/private/key.pem;
# ... rest of config
}
5. Rate Limiting:
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
location / {
limit_req zone=api burst=20 nodelay;
proxy_pass http://app_backend;
}
Troubleshooting Guide
Problem: All requests failing
Check if containers are running:
docker-compose ps
Expected output:
NAME STATE PORTS
nginx-lb Up 0.0.0.0:8080->80/tcp
app_blue Up 0.0.0.0:8081->3000/tcp
app_green Up 0.0.0.0:8082->3000/tcp
If containers are down:
docker-compose logs
Problem: Not seeing failover
Check Nginx logs:
docker logs nginx-lb
# Or live tail:
docker logs -f nginx-lb
Look for:
[error] ... upstream timed out ... while connecting to upstream
[warn] ... marking server app_blue:3000 as down
Problem: Services won't start
Check if ports are already in use:
lsof -i :8080
lsof -i :8081
lsof -i :8082
If something is using these ports:
# Kill the process
kill -9 <PID>
# Or change ports in docker-compose.yml
ports:
- "9080:80" # Use 9080 instead of 8080
Problem: Changes to .env not taking effect
You need to recreate containers:
docker-compose down
docker-compose up -d
Just restarting isn't enough - environment variables are set at container creation time.
Real-World Use Cases
Scenario 1: Deploying a New Version
Without blue/green:
# Take site down
docker-compose down
# Update to new version
docker-compose up -d
# Hope nothing broke (users see downtime)
With blue/green:
# Update green to new version while blue serves traffic
sed -i 's/GREEN_IMAGE=.*/GREEN_IMAGE=myapp:v2.0.0/' .env
# Start green with new version
docker-compose up -d app_green
# Test green directly
curl http://localhost:8082/version
# Switch traffic to green (instant, zero downtime)
make green
# If something's wrong, instant rollback
make blue
Scenario 2: Database Migration
Challenge: New version needs schema changes.
Approach:
- Make schema changes backward-compatible
- Deploy new version to green (with migration)
- Test thoroughly
- Switch traffic
- Keep blue running for quick rollback
- After validation, update blue too
Example:
# Update green with new code + migrations
GREEN_IMAGE=myapp:v2.0.0 docker-compose up -d app_green
# Run migrations on green
docker exec app_green npm run migrate
# Test green
curl http://localhost:8082/test-endpoint
# Switch traffic
make green
# Monitor for issues
docker logs -f app_green
# If problems occur, instant rollback
make blue
Project 2: Containerized Microservices with Full Automation
Now that we understand blue/green deployments, let's scale up to a complete microservices application with full infrastructure automation.
The Big Picture
What we're building:
A production-ready TODO application with:
- Multiple microservices (5 different programming languages!)
- Automated infrastructure provisioning (Terraform)
- Automated configuration management (Ansible)
- CI/CD pipelines with drift detection
- Automatic HTTPS with Traefik
- Zero-trust security model
Architecture:
                     ┌──────────────┐
                     │   Traefik    │
                     │ (HTTPS/Proxy)│
                     └───────┬──────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
┌───────▼────────┐   ┌───────▼────────┐   ┌───────▼────────┐
│    Frontend    │   │    Auth API    │   │   Todos API    │
│    (Vue.js)    │   │      (Go)      │   │   (Node.js)    │
└────────────────┘   └────────────────┘   └────────────────┘
                             │
           ┌─────────────────┼───────────────┐
           │                 │               │
   ┌───────▼────────┐    ┌───▼───┐    ┌──────▼──────┐
   │   Users API    │    │ Redis │    │     Log     │
   │ (Java Spring)  │    │ Queue │    │  Processor  │
   └────────────────┘    └───────┘    │  (Python)   │
                                      └─────────────┘
Understanding the Application
The Services
1. Frontend (Vue.js)
- User interface
- Login page → TODO dashboard
- Communicates with backend APIs
- Port: 80/443 (via Traefik)
2. Auth API (Go)
- Handles user authentication
- Issues JWT tokens
- Endpoint: /api/auth
3. Todos API (Node.js)
- Manages TODO items
- CRUD operations
- Requires valid JWT token
- Endpoint: /api/todos
4. Users API (Java Spring Boot)
- User management
- Profile operations
- Endpoint: /api/users
5. Log Processor (Python)
- Processes background tasks
- Consumes from Redis queue
- Writes audit logs
6. Redis Queue
- Message broker
- Task queue for async operations
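To get a feel for the queue pattern once the stack is running, you can push and pop a message by hand with redis-cli (the container name matches the compose file later in this article; the log_queue key name is just an illustrative assumption):
# Push a fake log event onto the queue, then pop it the way the processor would
docker exec redis redis-cli LPUSH log_queue '{"event":"todo_created","user":"demo"}'
docker exec redis redis-cli BRPOP log_queue 5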
Phase 1: Containerization
Let's containerize each service. The key is understanding that each language has its own best practices.
Frontend Dockerfile (Vue.js)
# Multi-stage build for optimized production image
# Stage 1: Build the application
FROM node:18-alpine AS builder
WORKDIR /app
# Copy package files
COPY package*.json ./
# Install all dependencies (dev dependencies are needed for the build step)
RUN npm ci
# Copy source code
COPY . .
# Build for production
RUN npm run build
# Stage 2: Serve with nginx
FROM nginx:alpine
# Copy built assets from builder stage
COPY --from=builder /app/dist /usr/share/nginx/html
# Copy nginx configuration
COPY nginx.conf /etc/nginx/conf.d/default.conf
# Health check
HEALTHCHECK --interval=30s --timeout=3s \
CMD wget --quiet --tries=1 --spider http://localhost/ || exit 1
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]
Why multi-stage builds?
- Builder stage: 800MB (includes build tools)
- Final stage: 25MB (only nginx + static files)
- 97% size reduction!
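You can verify the size difference yourself by building the intermediate stage explicitly (the tag names and ./frontend path are examples for this layout):
# Build just the builder stage, then the full image, and compare sizes
docker build --target builder -t frontend:builder ./frontend
docker build -t frontend:final ./frontend
docker images frontend   # compare the SIZE column for the two tags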
Frontend nginx config:
server {
listen 80;
root /usr/share/nginx/html;
index index.html;
# SPA routing - send all requests to index.html
location / {
try_files $uri $uri/ /index.html;
}
# Cache static assets
location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg)$ {
expires 1y;
add_header Cache-Control "public, immutable";
}
# Security headers
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
}
Auth API Dockerfile (Go)
# Multi-stage build for Go
# Stage 1: Build the binary
FROM golang:1.21-alpine AS builder
WORKDIR /app
# Copy go mod files
COPY go.mod go.sum ./
# Download dependencies
RUN go mod download
# Copy source code
COPY . .
# Build the binary
# CGO_ENABLED=0 creates a static binary (no external dependencies)
RUN CGO_ENABLED=0 GOOS=linux go build -a -installsuffix cgo -o main .
# Stage 2: Create minimal runtime image
FROM alpine:latest
# Add ca-certificates for HTTPS calls
RUN apk --no-cache add ca-certificates
WORKDIR /root/
# Copy the binary from builder
COPY --from=builder /app/main .
# Health check
HEALTHCHECK --interval=30s --timeout=3s \
CMD wget --quiet --tries=1 --spider http://localhost:8080/health || exit 1
EXPOSE 8080
CMD ["./main"]
Why this approach?
- Builder stage: 400MB
- Final stage: 15MB (just Alpine + binary)
- Static binary = no runtime dependencies
- Faster startup, smaller attack surface
Todos API Dockerfile (Node.js)
FROM node:18-alpine
WORKDIR /app
# Install dependencies first (better caching)
COPY package*.json ./
RUN npm ci --only=production
# Copy application code
COPY . .
# Create non-root user for security
RUN addgroup -g 1001 -S nodejs && \
adduser -S nodejs -u 1001 && \
chown -R nodejs:nodejs /app
USER nodejs
# Health check
HEALTHCHECK --interval=30s --timeout=3s \
CMD node healthcheck.js || exit 1
EXPOSE 3000
CMD ["node", "server.js"]
Security note: Running as non-root user limits damage if container is compromised.
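Once the container is up, a quick check confirms it really runs as the unprivileged user (container name as defined later in the compose file):
docker exec todos-api id
# Expected: uid=1001(nodejs) gid=1001(nodejs) groups=1001(nodejs)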
Users API Dockerfile (Java Spring Boot)
# Multi-stage build for Java
# Stage 1: Build with Maven
FROM maven:3.9-eclipse-temurin-17 AS builder
WORKDIR /app
# Copy pom.xml first (dependency caching)
COPY pom.xml ./
RUN mvn dependency:go-offline
# Copy source and build
COPY src ./src
RUN mvn clean package -DskipTests
# Stage 2: Runtime
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app
# Copy JAR from builder
COPY --from=builder /app/target/*.jar app.jar
# Health check
HEALTHCHECK --interval=30s --timeout=3s \
CMD wget --quiet --tries=1 --spider http://localhost:8080/actuator/health || exit 1
EXPOSE 8080
# Use exec form to ensure proper signal handling
ENTRYPOINT ["java", "-jar", "/app/app.jar"]
Java-specific optimizations:
# Production optimization flags
ENTRYPOINT ["java", \
"-XX:+UseContainerSupport", \
"-XX:MaxRAMPercentage=75.0", \
"-XX:+ExitOnOutOfMemoryError", \
"-jar", "/app/app.jar"]
Log Processor Dockerfile (Python)
FROM python:3.11-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
# Copy application
COPY . .
# Create non-root user
RUN useradd -m -u 1001 processor && \
chown -R processor:processor /app
USER processor
# Health check (check if the processor process is running)
# The [p] bracket trick stops grep from matching its own process.
# Note: ps comes from the procps package, which slim images may not include by default.
HEALTHCHECK --interval=30s --timeout=3s \
CMD ps aux | grep "[p]rocessor.py" || exit 1
CMD ["python", "processor.py"]
Docker Compose - Orchestrating Everything
Now let's tie it all together with docker-compose.yml:
version: '3.8'
services:
# Traefik reverse proxy
traefik:
image: traefik:v2.10
container_name: traefik
command:
# API and dashboard
- "--api.dashboard=true"
- "--api.insecure=true"
# Docker provider
- "--providers.docker=true"
- "--providers.docker.exposedbydefault=false"
# Entrypoints
- "--entrypoints.web.address=:80"
- "--entrypoints.websecure.address=:443"
# HTTP to HTTPS redirect
- "--entrypoints.web.http.redirections.entrypoint.to=websecure"
- "--entrypoints.web.http.redirections.entrypoint.scheme=https"
# Let's Encrypt
- "--certificatesresolvers.letsencrypt.acme.tlschallenge=true"
- "--certificatesresolvers.letsencrypt.acme.email=${ACME_EMAIL}"
- "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
ports:
- "80:80"
- "443:443"
- "8080:8080" # Dashboard
volumes:
- "/var/run/docker.sock:/var/run/docker.sock:ro"
- "./letsencrypt:/letsencrypt"
networks:
- web
restart: unless-stopped
# Frontend
frontend:
build:
context: ./frontend
dockerfile: Dockerfile
container_name: frontend
labels:
- "traefik.enable=true"
- "traefik.http.routers.frontend.rule=Host(`${DOMAIN}`)"
- "traefik.http.routers.frontend.entrypoints=websecure"
- "traefik.http.routers.frontend.tls.certresolver=letsencrypt"
- "traefik.http.services.frontend.loadbalancer.server.port=80"
networks:
- web
restart: unless-stopped
# Auth API
auth:
build:
context: ./auth-api
dockerfile: Dockerfile
container_name: auth-api
environment:
- DB_HOST=postgres
- DB_PORT=5432
- DB_NAME=${DB_NAME}
- DB_USER=${DB_USER}
- DB_PASSWORD=${DB_PASSWORD}
- JWT_SECRET=${JWT_SECRET}
- REDIS_URL=redis://redis:6379
labels:
- "traefik.enable=true"
- "traefik.http.routers.auth.rule=Host(`${DOMAIN}`) && PathPrefix(`/api/auth`)"
- "traefik.http.routers.auth.entrypoints=websecure"
- "traefik.http.routers.auth.tls.certresolver=letsencrypt"
- "traefik.http.services.auth.loadbalancer.server.port=8080"
depends_on:
- postgres
- redis
networks:
- web
- backend
restart: unless-stopped
# Todos API
todos:
build:
context: ./todos-api
dockerfile: Dockerfile
container_name: todos-api
environment:
- DB_HOST=postgres
- DB_PORT=5432
- DB_NAME=${DB_NAME}
- DB_USER=${DB_USER}
- DB_PASSWORD=${DB_PASSWORD}
- REDIS_URL=redis://redis:6379
labels:
- "traefik.enable=true"
- "traefik.http.routers.todos.rule=Host(`${DOMAIN}`) && PathPrefix(`/api/todos`)"
- "traefik.http.routers.todos.entrypoints=websecure"
- "traefik.http.routers.todos.tls.certresolver=letsencrypt"
- "traefik.http.services.todos.loadbalancer.server.port=3000"
depends_on:
- postgres
- redis
networks:
- web
- backend
restart: unless-stopped
# Users API
users:
build:
context: ./users-api
dockerfile: Dockerfile
container_name: users-api
environment:
- SPRING_DATASOURCE_URL=jdbc:postgresql://postgres:5432/${DB_NAME}
- SPRING_DATASOURCE_USERNAME=${DB_USER}
- SPRING_DATASOURCE_PASSWORD=${DB_PASSWORD}
- SPRING_REDIS_HOST=redis
- SPRING_REDIS_PORT=6379
labels:
- "traefik.enable=true"
- "traefik.http.routers.users.rule=Host(`${DOMAIN}`) && PathPrefix(`/api/users`)"
- "traefik.http.routers.users.entrypoints=websecure"
- "traefik.http.routers.users.tls.certresolver=letsencrypt"
- "traefik.http.services.users.loadbalancer.server.port=8080"
depends_on:
- postgres
- redis
networks:
- web
- Nginx's backup directive is powerful - simple yet effective failover
- Tight timeouts enable fast failover - but tune them based on your app
log-processor:
build:
context: ./log-processor
dockerfile: Dockerfile
container_name: log-processor
environment:
- REDIS_URL=redis://redis:6379
- LOG_PATH=/logs
volumes:
- ./logs:/logs
depends_on:
- redis
networks:
- backend
restart: unless-stopped
# Redis
redis:
image: redis:7-alpine
container_name: redis
command: redis-server --appendonly yes
volumes:
- redis-data:/data
networks:
- backend
restart: unless-stopped
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 3s
retries: 3
# PostgreSQL
postgres:
image: postgres:15-alpine
container_name: postgres
environment:
- POSTGRES_DB=${DB_NAME}
- POSTGRES_USER=${DB_USER}
- POSTGRES_PASSWORD=${DB_PASSWORD}
volumes:
- postgres-data:/var/lib/postgresql/data
networks:
- backend
restart: unless-stopped
healthcheck:
test: ["CMD-SHELL", "pg_isready -U ${DB_USER}"]
interval: 10s
timeout: 3s
retries: 3
networks:
web:
driver: bridge
backend:
driver: bridge
volumes:
postgres-data:
redis-data:
Key concepts in this compose file:
1. Networks:
networks:
web: # Public-facing services
backend: # Internal services only
- Frontend, APIs → web network (accessible via Traefik)
- Database, Redis → backend network only (isolated)
- This split provides network-level security (a quick isolation check follows this list)
2. Traefik Labels:
labels:
- "traefik.enable=true"
- "traefik.http.routers.auth.rule=Host(`${DOMAIN}`) && PathPrefix(`/api/auth`)"
- "traefik.http.routers.auth.tls.certresolver=letsencrypt"
These labels tell Traefik how to route traffic:
- Route requests to yourdomain.com/api/auth → the auth service
- Automatically get an SSL certificate from Let's Encrypt
- Handle HTTPS termination
3. Environment Variables:
environment:
- DB_HOST=postgres
- JWT_SECRET=${JWT_SECRET}
Secrets come from .env file (never committed to git!).
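Returning to the network split above, you can spot-check the isolation once the stack is up. The actual network names depend on your compose project name (typically <directory>_web and <directory>_backend), so the todo-app prefix below is an assumption:
# Containers attached to the backend network - postgres and redis should be listed here
docker network inspect todo-app_backend --format '{{range .Containers}}{{.Name}} {{end}}'
# Containers attached to the web network - postgres and redis should NOT appear here
docker network inspect todo-app_web --format '{{range .Containers}}{{.Name}} {{end}}'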
Environment Configuration
Create .env file:
# Domain configuration
DOMAIN=your-domain.com
ACME_EMAIL=your-email@example.com
# Database
DB_NAME=todoapp
DB_USER=todouser
DB_PASSWORD=change-this-strong-password
# Security
JWT_SECRET=change-this-to-random-string-min-32-chars
# Optional: Docker registry
DOCKER_REGISTRY=ghcr.io/yourusername
Security checklist for .env:
- [ ] Never commit .env to git
- [ ] Add .env to .gitignore
- [ ] Use strong passwords (20+ characters)
- [ ] Use different passwords for each service
- [ ] Rotate secrets regularly
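A simple way to satisfy the password and secret requirements is to generate values rather than invent them (assumes openssl is available on your machine):
echo "DB_PASSWORD=$(openssl rand -base64 24)"
echo "JWT_SECRET=$(openssl rand -hex 32)"   # 64 hex characters, well above the 32-char minimum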
Phase 2: Infrastructure as Code with Terraform
Now let's provision the cloud infrastructure automatically.
Project Structure
infra/
├── terraform/
│ ├── main.tf # Main configuration
│ ├── variables.tf # Input variables
│ ├── outputs.tf # Output values
│ ├── provider.tf # Provider configuration
│ └── backend.tf # Remote state configuration
├── ansible/
│ ├── inventory/ # Dynamic inventory
│ ├── roles/
│ │ ├── dependencies/ # Install Docker, etc.
│ │ └── deploy/ # Deploy application
│ ├── playbook.yml # Main playbook
│ └── ansible.cfg # Ansible configuration
└── scripts/
├── deploy.sh # Deployment orchestration
└── drift-check.sh # Drift detection
Terraform Configuration
provider.tf:
# Provider configuration
terraform {
required_version = ">= 1.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
local = {
source = "hashicorp/local"
version = "~> 2.0"
}
null = {
source = "hashicorp/null"
version = "~> 3.0"
}
}
}
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Project = "todo-app"
Environment = var.environment
ManagedBy = "terraform"
}
}
}
backend.tf:
# Remote state storage - crucial for team collaboration
terraform {
backend "s3" {
bucket = "your-terraform-state-bucket"
key = "todo-app/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-state-lock"
}
}
Why remote state?
- Team collaboration - everyone sees same state
- State locking - prevents concurrent modifications
- Backup - state is backed up in S3
- Encryption - sensitive data encrypted at rest
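The bucket and lock table referenced in backend.tf have to exist before the first terraform init. A one-time bootstrap might look like this (names and region must match backend.tf; adjust them to your own values):
# Create the state bucket (us-east-1 needs no LocationConstraint) and enable versioning
aws s3api create-bucket --bucket your-terraform-state-bucket --region us-east-1
aws s3api put-bucket-versioning --bucket your-terraform-state-bucket \
--versioning-configuration Status=Enabled
# Create the DynamoDB table Terraform uses for state locking (LockID is the required key)
aws dynamodb create-table --table-name terraform-state-lock \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--billing-mode PAY_PER_REQUEST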
variables.tf:
variable "aws_region" {
description = "AWS region to deploy resources"
type = string
default = "us-east-1"
}
variable "environment" {
description = "Environment name"
type = string
default = "production"
}
variable "instance_type" {
description = "EC2 instance type"
type = string
default = "t3.medium"
}
variable "ssh_public_key" {
description = "SSH public key for access"
type = string
}
variable "domain_name" {
description = "Domain name for the application"
type = string
}
variable "alert_email" {
description = "Email for drift detection alerts"
type = string
}
variable "app_port" {
description = "Application port"
type = number
default = 80
}
main.tf:
# VPC for network isolation
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "todo-app-vpc"
}
}
# Internet Gateway
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = {
Name = "todo-app-igw"
}
}
# Public Subnet
resource "aws_subnet" "public" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.1.0/24"
availability_zone = "${var.aws_region}a"
map_public_ip_on_launch = true
tags = {
Name = "todo-app-public-subnet"
}
}
# Route Table
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
tags = {
Name = "todo-app-public-rt"
}
}
# Route Table Association
resource "aws_route_table_association" "public" {
subnet_id = aws_subnet.public.id
route_table_id = aws_route_table.public.id
}
# Security Group
resource "aws_security_group" "app" {
name = "todo-app-sg"
description = "Security group for TODO application"
vpc_id = aws_vpc.main.id
# SSH
ingress {
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
description = "SSH access"
}
# HTTP
ingress {
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
description = "HTTP access"
}
# HTTPS
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
description = "HTTPS access"
}
# Outbound - allow all
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
description = "Allow all outbound"
}
tags = {
Name = "todo-app-sg"
}
}
# SSH Key Pair
resource "aws_key_pair" "deployer" {
key_name = "todo-app-deployer"
public_key = var.ssh_public_key
}
# Latest Ubuntu AMI
data "aws_ami" "ubuntu" {
most_recent = true
owners = ["099720109477"] # Canonical
filter {
name = "name"
values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
}
filter {
name = "virtualization-type"
values = ["hvm"]
}
}
# EC2 Instance
resource "aws_instance" "app" {
ami = data.aws_ami.ubuntu.id
instance_type = var.instance_type
key_name = aws_key_pair.deployer.key_name
subnet_id = aws_subnet.public.id
vpc_security_group_ids = [aws_security_group.app.id]
root_block_device {
volume_size = 30
volume_type = "gp3"
encrypted = true
}
user_data = <<-EOF
#!/bin/bash
apt-get update
apt-get install -y python3 python3-pip
EOF
tags = {
Name = "todo-app-server"
}
# Lifecycle rule for idempotency
lifecycle {
ignore_changes = [
user_data, # Don't recreate if user_data changes
ami, # Don't recreate on AMI updates unless forced
]
}
}
# Elastic IP for stable public IP
resource "aws_eip" "app" {
instance = aws_instance.app.id
domain = "vpc"
tags = {
Name = "todo-app-eip"
}
}
# Generate Ansible inventory
resource "local_file" "ansible_inventory" {
content = templatefile("${path.module}/templates/inventory.tpl", {
app_server_ip = aws_eip.app.public_ip
ssh_key_path = "~/.ssh/id_rsa"
ssh_user = "ubuntu"
})
filename = "${path.module}/../ansible/inventory/hosts"
# Only regenerate if values change
lifecycle {
create_before_destroy = true
}
}
# Trigger Ansible after provisioning
resource "null_resource" "ansible_provisioner" {
# Re-run when the instance changes (the timestamp trigger also forces a run on every apply)
triggers = {
instance_id = aws_instance.app.id
timestamp = timestamp()
}
# Wait for instance to be ready
provisioner "local-exec" {
command = <<-EOT
echo "Waiting for SSH to be ready..."
until ssh -o StrictHostKeyChecking=no -o ConnectTimeout=2 ubuntu@${aws_eip.app.public_ip} echo "SSH Ready"; do
sleep 5
done
echo "Running Ansible playbook..."
cd ${path.module}/../ansible
ansible-playbook -i inventory/hosts playbook.yml
EOT
}
depends_on = [
local_file.ansible_inventory,
aws_eip.app
]
}
templates/inventory.tpl:
[app_servers]
todo-app ansible_host=${app_server_ip} ansible_user=${ssh_user} ansible_ssh_private_key_file=${ssh_key_path}
[app_servers:vars]
ansible_python_interpreter=/usr/bin/python3
outputs.tf:
output "instance_public_ip" {
description = "Public IP of the application server"
value = aws_eip.app.public_ip
}
output "instance_id" {
description = "ID of the EC2 instance"
value = aws_instance.app.id
}
output "domain_name" {
description = "Domain name for the application"
value = var.domain_name
}
output "ssh_command" {
description = "SSH command to connect to the server"
value = "ssh ubuntu@${aws_eip.app.public_ip}"
}
Understanding Terraform Idempotency
What is idempotency?
Running the same Terraform code multiple times produces the same result without creating duplicates.
Example - Non-idempotent (bad):
resource "aws_instance" "app" {
ami = "ami-12345"
instance_type = "t3.medium"
# This causes recreation on every apply!
tags = {
Timestamp = timestamp()
}
}
Idempotent (good):
resource "aws_instance" "app" {
ami = "ami-12345"
instance_type = "t3.medium"
tags = {
Name = "todo-app-server"
}
lifecycle {
ignore_changes = [
tags["Timestamp"],
user_data
]
}
}
Drift Detection
What is drift?
Drift occurs when actual infrastructure differs from Terraform state (manual changes, external tools, etc.).
drift-check.sh:
#!/bin/bash
set -e
echo "Checking for infrastructure drift..."
# Run terraform plan and capture output
# (initialize EXIT_CODE so the checks below work when there are no changes)
EXIT_CODE=0
PLAN_OUTPUT=$(terraform plan -detailed-exitcode -no-color 2>&1) || EXIT_CODE=$?
# Exit codes:
# 0 = no changes
# 1 = error
# 2 = changes detected (drift!)
if [ $EXIT_CODE -eq 0 ]; then
echo "✅ No drift detected - infrastructure matches desired state"
exit 0
elif [ $EXIT_CODE -eq 2 ]; then
echo "⚠️ DRIFT DETECTED - infrastructure has changed!"
echo ""
echo "$PLAN_OUTPUT"
echo ""
# Send email alert
./send-drift-alert.sh "$PLAN_OUTPUT"
# In CI/CD, pause for manual approval
if [ "$CI" = "true" ]; then
echo "Pausing for manual approval..."
# GitHub Actions, GitLab CI, etc. have approval mechanisms
exit 2
fi
else
echo "❌ Error running terraform plan"
echo "$PLAN_OUTPUT"
exit 1
fi
send-drift-alert.sh:
#!/bin/bash
DRIFT_DETAILS="$1"
ALERT_EMAIL="${ALERT_EMAIL:-admin@example.com}"
# Using AWS SES
aws ses send-email \
--from "terraform@example.com" \
--to "$ALERT_EMAIL" \
--subject "⚠️ Terraform Drift Detected" \
--text "$DRIFT_DETAILS"
# Or using curl with Mailgun, SendGrid, etc.
curl -s --user "api:$MAILGUN_API_KEY" \
https://api.mailgun.net/v3/$MAILGUN_DOMAIN/messages \
-F from="terraform@example.com" \
-F to="$ALERT_EMAIL" \
-F subject="⚠️ Terraform Drift Detected" \
-F text="$DRIFT_DETAILS"
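To see the detection end to end, you can introduce drift deliberately and watch the script flag it (use the security group ID Terraform created; restore the rule afterwards with authorize-security-group-ingress):
# Remove the HTTPS ingress rule behind Terraform's back
aws ec2 revoke-security-group-ingress --group-id <sg-id> \
--protocol tcp --port 443 --cidr 0.0.0.0/0
# drift-check.sh should now exit with code 2 and report the missing rule
./drift-check.sh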
Phase 3: Configuration Management with Ansible
Terraform provisions infrastructure, Ansible configures it.
Ansible Project Structure
ansible/
├── inventory/
│ └── hosts # Generated by Terraform
├── roles/
│ ├── dependencies/
│ │ ├── tasks/
│ │ │ └── main.yml
│ │ └── handlers/
│ │ └── main.yml
│ └── deploy/
│ ├── tasks/
│ │ └── main.yml
│ ├── templates/
│ │ └── .env.j2
│ └── handlers/
│ └── main.yml
├── playbook.yml
└── ansible.cfg
ansible.cfg
[defaults]
inventory = inventory/hosts
remote_user = ubuntu
private_key_file = ~/.ssh/id_rsa
host_key_checking = False
retry_files_enabled = False
# Faster execution
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 3600
# Better output
stdout_callback = yaml
bin_ansible_callbacks = True
[ssh_connection]
pipelining = True
roles/dependencies/tasks/main.yml
---
# Install required system dependencies
- name: Update apt cache
apt:
update_cache: yes
cache_valid_time: 3600
become: yes
- name: Install required packages
apt:
name:
- apt-transport-https
- ca-certificates
- curl
- gnupg
- lsb-release
- python3-pip
- git
- ufw
state: present
become: yes
- name: Add Docker GPG key
apt_key:
url: https://download.docker.com/linux/ubuntu/gpg
state: present
become: yes
- name: Add Docker repository
apt_repository:
repo: "deb [arch=amd64] https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable"
state: present
become: yes
- name: Install Docker
apt:
name:
- docker-ce
- docker-ce-cli
- containerd.io
- docker-buildx-plugin
- docker-compose-plugin
state: present
become: yes
notify: Restart Docker
- name: Add user to docker group
user:
name: "{{ ansible_user }}"
groups: docker
append: yes
become: yes
- name: Install Docker Compose (standalone)
get_url:
url: "https://github.com/docker/compose/releases/download/v2.23.0/docker-compose-linux-x86_64"
dest: /usr/local/bin/docker-compose
mode: '0755'
become: yes
- name: Configure UFW firewall
ufw:
rule: "{{ item.rule }}"
port: "{{ item.port }}"
proto: "{{ item.proto }}"
loop:
- { rule: 'allow', port: '22', proto: 'tcp' }
- { rule: 'allow', port: '80', proto: 'tcp' }
- { rule: 'allow', port: '443', proto: 'tcp' }
become: yes
- name: Enable UFW
ufw:
state: enabled
become: yes
roles/dependencies/handlers/main.yml
---
- name: Restart Docker
systemd:
name: docker
state: restarted
enabled: yes
become: yes
roles/deploy/tasks/main.yml
---
# Deploy the application
- name: Create application directory
file:
path: /opt/todo-app
state: directory
owner: "{{ ansible_user }}"
group: "{{ ansible_user }}"
mode: '0755'
become: yes
- name: Clone application repository
git:
repo: "{{ app_repo_url }}"
dest: /opt/todo-app
version: "{{ app_branch | default('main') }}"
force: yes
register: git_clone
- name: Create environment file from template
template:
src: .env.j2
dest: /opt/todo-app/.env
owner: "{{ ansible_user }}"
mode: '0600'
no_log: yes # Don't log sensitive env vars
- name: Create letsencrypt directory
file:
path: /opt/todo-app/letsencrypt
state: directory
mode: '0755'
- name: Pull latest Docker images
community.docker.docker_compose:
project_src: /opt/todo-app
pull: yes
when: git_clone.changed
- name: Start application with Docker Compose
community.docker.docker_compose:
project_src: /opt/todo-app
state: present
restarted: "{{ git_clone.changed }}"
register: compose_output
- name: Wait for application to be healthy
uri:
url: "https://{{ domain_name }}/health"
status_code: 200
validate_certs: no
retries: 10
delay: 10
register: health_check
until: health_check.status == 200
- name: Display deployment status
debug:
msg: "Application deployed successfully at https://{{ domain_name }}"
roles/deploy/templates/.env.j2
# Auto-generated by Ansible - DO NOT EDIT MANUALLY
# Domain configuration
DOMAIN={{ domain_name }}
ACME_EMAIL={{ acme_email }}
# Database
DB_NAME={{ db_name }}
DB_USER={{ db_user }}
DB_PASSWORD={{ db_password }}
# Security
JWT_SECRET={{ jwt_secret }}
# Application
NODE_ENV=production
LOG_LEVEL=info
playbook.yml
---
- name: Deploy TODO Application
hosts: app_servers
become: no
vars:
app_repo_url: "https://github.com/yourusername/todo-app.git"
app_branch: "main"
domain_name: "{{ lookup('env', 'DOMAIN') }}"
acme_email: "{{ lookup('env', 'ACME_EMAIL') }}"
db_name: "{{ lookup('env', 'DB_NAME') }}"
db_user: "{{ lookup('env', 'DB_USER') }}"
db_password: "{{ lookup('env', 'DB_PASSWORD') }}"
jwt_secret: "{{ lookup('env', 'JWT_SECRET') }}"
roles:
- dependencies
- deploy
post_tasks:
- name: Verify deployment
uri:
url: "https://{{ domain_name }}"
status_code: 200
validate_certs: yes
delegate_to: localhost
- name: Display application URL
debug:
msg: "Application is live at https://{{ domain_name }}"
Phase 4: CI/CD Pipeline
Now let's automate everything with GitHub Actions.
.github/workflows/infrastructure.yml
name: Infrastructure Deployment
on:
push:
branches: [main]
paths:
- 'infra/terraform/**'
- 'infra/ansible/**'
- '.github/workflows/infrastructure.yml'
workflow_dispatch: # Manual trigger
env:
TF_VERSION: '1.6.0'
AWS_REGION: 'us-east-1'
jobs:
terraform-plan:
name: Terraform Plan & Drift Detection
runs-on: ubuntu-latest
outputs:
has_changes: ${{ steps.plan.outputs.has_changes }}
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Setup Terraform
uses: hashicorp/setup-terraform@v2
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v2
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.AWS_REGION }}
- name: Terraform Init
run: |
cd infra/terraform
terraform init
- name: Terraform Plan
id: plan
run: |
cd infra/terraform
EXIT_CODE=0
terraform plan -detailed-exitcode -out=tfplan || EXIT_CODE=$?
if [ $EXIT_CODE -eq 0 ]; then
echo "has_changes=false" >> $GITHUB_OUTPUT
echo "✅ No infrastructure changes detected"
elif [ $EXIT_CODE -eq 2 ]; then
echo "has_changes=true" >> $GITHUB_OUTPUT
echo "⚠️ Infrastructure drift detected!"
else
echo "❌ Terraform plan failed"
exit 1
fi
- name: Save plan
if: steps.plan.outputs.has_changes == 'true'
uses: actions/upload-artifact@v3
with:
name: tfplan
path: infra/terraform/tfplan
- name: Send drift alert email
if: steps.plan.outputs.has_changes == 'true'
uses: dawidd6/action-send-mail@v3
with:
server_address: smtp.gmail.com
server_port: 465
username: ${{ secrets.MAIL_USERNAME }}
password: ${{ secrets.MAIL_PASSWORD }}
subject: ⚠️ Terraform Drift Detected - TODO App
to: ${{ secrets.ALERT_EMAIL }}
from: Terraform CI/CD
body: |
Infrastructure drift has been detected!
Review the changes and approve the workflow to apply:
${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
terraform-apply:
name: Terraform Apply
runs-on: ubuntu-latest
needs: terraform-plan
if: needs.terraform-plan.outputs.has_changes == 'true'
environment: production # Requires manual approval
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Setup Terraform
uses: hashicorp/setup-terraform@v2
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v2
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ env.AWS_REGION }}
- name: Download plan
uses: actions/download-artifact@v3
with:
name: tfplan
path: infra/terraform/
- name: Terraform Init
run: |
cd infra/terraform
terraform init
- name: Terraform Apply
run: |
cd infra/terraform
terraform apply tfplan
- name: Save outputs
run: |
cd infra/terraform
terraform output -json > outputs.json
- name: Upload outputs
uses: actions/upload-artifact@v3
with:
name: terraform-outputs
path: infra/terraform/outputs.json
ansible-deploy:
name: Ansible Deployment
runs-on: ubuntu-latest
needs: terraform-apply
if: always() && (needs.terraform-apply.result == 'success' || needs.terraform-plan.outputs.has_changes == 'false')
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Setup Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install Ansible
run: |
pip install ansible
- name: Setup SSH key
run: |
mkdir -p ~/.ssh
echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
ssh-keyscan -H ${{ secrets.SERVER_IP }} >> ~/.ssh/known_hosts
- name: Run Ansible playbook
env:
DOMAIN: ${{ secrets.DOMAIN }}
ACME_EMAIL: ${{ secrets.ACME_EMAIL }}
DB_NAME: ${{ secrets.DB_NAME }}
DB_USER: ${{ secrets.DB_USER }}
DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
JWT_SECRET: ${{ secrets.JWT_SECRET }}
run: |
cd infra/ansible
ansible-playbook -i inventory/hosts playbook.yml
- name: Verify deployment
run: |
sleep 30 # Wait for services to stabilize
curl -f https://${{ secrets.DOMAIN }}/health || exit 1
echo "✅ Deployment verified!"
.github/workflows/application.yml
name: Application Deployment
on:
push:
branches: [main]
paths:
- 'frontend/**'
- 'auth-api/**'
- 'todos-api/**'
- 'users-api/**'
- 'log-processor/**'
- 'docker-compose.yml'
workflow_dispatch:
jobs:
deploy:
name: Deploy Application
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Setup SSH key
run: |
mkdir -p ~/.ssh
echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
ssh-keyscan -H ${{ secrets.SERVER_IP }} >> ~/.ssh/known_hosts
- name: Deploy to server
run: |
ssh ubuntu@${{ secrets.SERVER_IP }} << 'EOF'
cd /opt/todo-app
git pull origin main
docker-compose pull
docker-compose up -d --build
EOF
- name: Wait for deployment
run: sleep 30
- name: Health check
run: |
curl -f https://${{ secrets.DOMAIN }}/health || exit 1
echo "✅ Application deployed successfully!"
Understanding the CI/CD Flow
Infrastructure changes (Terraform/Ansible):
1. Push to main
↓
2. Run terraform plan
↓
3. Detect drift? → Send email
↓
4. Pause for manual approval (GitHub Environment protection)
↓
5. Apply changes
↓
6. Run Ansible
↓
7. Verify deployment
Application changes:
1. Push to main
↓
2. SSH to server
↓
3. Git pull
↓
4. docker-compose pull
↓
5. docker-compose up
↓
6. Health check
Testing the Complete Setup
Local Testing
1. Test containers locally:
# Start everything
docker-compose up -d
# Check status
docker-compose ps
# View logs
docker-compose logs -f
# Test frontend
curl http://localhost
# Test APIs
curl http://localhost/api/auth/health
curl http://localhost/api/todos/health
curl http://localhost/api/users/health
# Stop everything
docker-compose down
2. Test Terraform:
cd infra/terraform
# Initialize
terraform init
# Validate
terraform validate
# Plan (dry run)
terraform plan
# Apply (create infrastructure)
terraform apply
# Show outputs
terraform output
# Destroy (cleanup)
terraform destroy
3. Test Ansible:
cd infra/ansible
# Test connection
ansible all -m ping
# Check syntax
ansible-playbook playbook.yml --syntax-check
# Dry run
ansible-playbook playbook.yml --check
# Run for real
ansible-playbook playbook.yml
# Run specific role
ansible-playbook playbook.yml --tags deploy
Production Deployment
Complete deployment from scratch:
# 1. Clone the repository
git clone https://github.com/yourusername/todo-app.git
cd todo-app
# 2. Configure secrets
cp .env.example .env
# Edit .env with your values
# 3. Initialize Terraform
cd infra/terraform
terraform init
# 4. Create infrastructure
terraform plan
terraform apply
# Wait for Ansible to complete (triggered automatically)
# 5. Configure DNS
# Point your domain to the Elastic IP shown in terraform outputs
# 6. Verify deployment
curl https://your-domain.com
Expected result:
- Login page loads at https://your-domain.com
- HTTPS works (automatic certificate from Let's Encrypt)
- APIs respond at /api/auth, /api/todos, /api/users
Troubleshooting
Issue: Terraform fails with "state locked"
# Check lock info
terraform force-unlock <LOCK_ID>
# Or wait for other operation to complete
Issue: Ansible can't connect to server
# Test SSH manually
ssh -i ~/.ssh/id_rsa ubuntu@<SERVER_IP>
# Check inventory
ansible-inventory --list -i inventory/hosts
# Verbose output
ansible-playbook playbook.yml -vvv
Issue: Containers won't start
# Check logs
docker-compose logs <service-name>
# Check disk space
df -h
# Check memory
free -h
# Restart specific service
docker-compose restart <service-name>
Issue: HTTPS not working
# Check Traefik logs
docker logs traefik
# Verify DNS points to server
dig your-domain.com
# Check certificate
docker exec traefik cat /letsencrypt/acme.json
# Force certificate renewal
docker-compose down
rm -rf letsencrypt/acme.json
docker-compose up -d
Key Takeaways
Blue/Green Deployment Lessons
-
Nginx's
backupdirective is powerful - Simple yet effective failover - Tight timeouts enable fast failover - But tune based on your app
- Health checks != failover - They serve different purposes
- Chaos testing is essential - Test failures before they happen in production
- Idempotency prevents surprises - Re-running should be safe
Microservices Deployment Lessons
- Multi-stage Docker builds save space - 97% reduction possible
- Traefik simplifies routing - Labels replace complex nginx configs
- Terraform + Ansible separation works - Provision vs configure
- Drift detection prevents disasters - Catch manual changes early
- CI/CD approval gates add safety - Human oversight for infrastructure
Production-Ready Checklist
Before going live, ensure:
Security:
- [ ] All secrets in environment variables, not code
- [ ] SSL/TLS configured (Traefik handles this)
- [ ] Firewall rules in place (UFW + security groups)
- [ ] Containers run as non-root users
- [ ] Regular security updates automated
Monitoring:
- [ ] Health checks on all services
- [ ] Centralized logging (consider ELK stack)
- [ ] Metrics collection (Prometheus)
- [ ] Alerting configured (PagerDuty, Opsgenie)
- [ ] Uptime monitoring (UptimeRobot)
Reliability:
- [ ] Database backups automated
- [ ] Disaster recovery plan documented
- [ ] Rollback procedures tested
- [ ] Load testing completed
- [ ] Chaos engineering practiced
Operations:
- [ ] Documentation up to date
- [ ] Runbooks for common issues
- [ ] On-call rotation defined
- [ ] Incident response process
- [ ] Post-mortem template ready
Next Steps
Beginner Level
1. Set up Project 1 locally
- Get familiar with Docker Compose
- Understand nginx configuration
- Run the failover tests
2. Modify the setup
- Add a third color (red)
- Implement weighted load balancing
- Add custom health check endpoints
Intermediate Level
1. Implement Project 2
- Start with containerization only
- Add Traefik incrementally
- Test locally before cloud deployment
2. Add observability
- Integrate Prometheus metrics
- Set up Grafana dashboards
- Implement distributed tracing
Advanced Level
1. Production hardening
- Set up multi-region deployment
- Implement auto-scaling
- Add CDN (Cloudflare)
- Configure WAF rules
2. Advanced automation
- GitOps with ArgoCD or Flux
- Infrastructure testing with Terratest
- Policy as Code with Open Policy Agent
Conclusion
We've covered two comprehensive DevOps projects:
Project 1 taught us:
- Zero-downtime deployments with blue/green strategy
- Nginx automatic failover configuration
- Chaos engineering for resilience testing
Project 2 showed us:
- Containerizing multi-language microservices
- Infrastructure as Code with Terraform
- Configuration management with Ansible
- Production-grade CI/CD pipelines
- Drift detection and alerting
The skills you've learned here form the foundation of modern DevOps practices. Start with the basics, experiment fearlessly, and gradually add complexity as you grow.
Remember: the best infrastructure is the one that works reliably, fails gracefully, and lets you sleep peacefully at night.
Happy deploying! 🚀
Questions or feedback? Drop a comment below or reach out on Twitter / LinkedIn.
Found this helpful? Give it a ❤️ and share with fellow developers!