Mart Young

From Zero to Production: A Complete Guide to Deploying Microservices with Terraform, Ansible and CI/CD

How I built a production-ready DevOps pipeline for a microservices TODO application - and how you can too, even if you're just starting out.


Introduction: Why This Matters

If you're reading this, you've probably heard terms like "DevOps," "Infrastructure as Code," and "CI/CD" thrown around, but maybe you're not entirely sure what they mean or how they fit together. That's exactly where I was when I started.

This guide isn't just about completing a task - it's about understanding the why behind each decision, learning from common mistakes, and building something you can be proud of. By the end, you'll have deployed a real application to the cloud with automated infrastructure, proper security, and a professional workflow.

What you'll build:

  • A microservices application with 5 different services (Vue.js, Go, Node.js, Java, Python)
  • Automated cloud infrastructure using Terraform
  • Server configuration and deployment with Ansible
  • CI/CD pipelines that detect when things go wrong
  • Multi-environment setup (dev, staging, production)
  • Secure HTTPS with automatic SSL certificates
  • A single command that deploys everything

What you'll learn:

  • How containerization actually works (beyond just "docker run")
  • Why infrastructure as code matters (and how it saves you from disasters)
  • How to think about security in a cloud environment
  • The importance of automation and what happens when you skip it

Let's dive in.


Part 1: Understanding What We're Building

Before we start writing code, let's understand what we're actually building. This isn't just a TODO app - it's a microservices architecture, which means instead of one big application, we have multiple small services that work together.

The Architecture

Think of it like a restaurant:

  • Frontend (Vue.js) - The dining room where customers interact
  • Auth API (Go) - The host who checks if you have a reservation (authentication)
  • Todos API (Node.js) - The waiter who takes your order (manages your todos)
  • Users API (Java) - The manager who knows all the customers (user management)
  • Log Processor (Python) - The kitchen staff who process orders (background processing)
  • Redis - The order board where everyone can see what's happening (message queue)

Each service runs in its own container, which is like giving each part of the restaurant its own kitchen. If the waiter (Todos API) has a problem, it doesn't crash the whole restaurant.

Why Containerization?

You might be thinking: "Why not just run everything on one server?" Great question! Here's why containers matter:

  1. Isolation: If one service crashes, others keep running
  2. Consistency: "It works on my machine" becomes "it works everywhere"
  3. Scalability: Need more power? Spin up more containers
  4. Portability: Move from AWS to Azure? Just change where you run containers

The Challenge

The real challenge isn't just getting containers to run - it's:

  • Making sure they can talk to each other
  • Securing them with HTTPS
  • Automating deployment so you don't manually SSH into servers
  • Detecting when someone changes things manually (drift detection)
  • Managing multiple environments without chaos

That's what makes this a real-world project.


Part 2: Setting Up Your Development Environment

Before we write any code, let's make sure you have everything you need. Don't worry if some of these are new - I'll explain what each one does.

Required Accounts

GitHub Account (Free)

  • This is where your code lives and where CI/CD runs
  • Think of it as your code's home and your automation's brain
  • Sign up at github.com if you don't have one

AWS Account (Free tier available)

  • This is where your servers will run
  • AWS has a free tier that's perfect for learning
  • You'll need a credit card; the free tier covers a lot, but the t3.small/t3.medium instances used later in this guide are not free-tier eligible, so expect a small bill (or substitute a free-tier size such as t2.micro while learning)
  • Sign up at aws.amazon.com

Domain Name (Optional but recommended - ~$10-15/year)

  • This is your website's address (like yourname.com)
  • You can use services like Namecheap, GoDaddy, or Cloudflare
  • Why you need it: Let's Encrypt (free SSL) requires a real domain
  • Alternative: You can test with localhost but won't get real SSL

Installing Required Tools

Docker & Docker Compose

# On Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker.io docker-compose-plugin

# Verify installation
docker --version
docker compose version
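
If you'd rather not prefix every command with sudo, a common post-install step is to add your user to the docker group (log out and back in for it to take effect):

# Allow your user to run docker without sudo
sudo usermod -aG docker $USER

# Start a shell with the new group applied, then test
newgrp docker
docker run hello-world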

What is Docker? Think of it as a shipping container for software. Just like shipping containers standardize how goods are transported, Docker standardizes how applications run.

Terraform (Version 1.5.0 or higher)

# Download from hashicorp.com or use package manager
wget https://releases.hashicorp.com/terraform/1.5.0/terraform_1.5.0_linux_amd64.zip
unzip terraform_1.5.0_linux_amd64.zip
sudo mv terraform /usr/local/bin/

# Verify
terraform version

What is Terraform? It's like a blueprint for your cloud infrastructure. Instead of clicking buttons in AWS console (which you'll forget), you write code that describes what you want, and Terraform makes it happen.

Ansible

sudo apt-get install ansible

# Verify
ansible --version

What is Ansible? Think of it as a remote control for servers. Instead of SSHing into each server and typing commands, you write a "playbook" that tells Ansible what to do, and it does it on all your servers.

AWS CLI

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Configure with your credentials
aws configure

What is AWS CLI? It's a command-line interface to AWS. Instead of using the web console, you can control AWS from your terminal.

Setting Up AWS

This is where many beginners get stuck, so let's go through it step by step.

Step 1: Create an IAM User

Why not use your root account? Security best practice - root account has unlimited power. If it gets compromised, your entire AWS account is at risk.

  1. Go to AWS Console → IAM → Users
  2. Click "Create user"
  3. Name it something like terraform-user
  4. Important: For Terraform you only need programmatic access (an Access Key ID and Secret Access Key); only check "Provide user access to the AWS Management Console" if you also want to log in to the console as this user
  5. Attach policies:
    • AmazonEC2FullAccess (for creating servers)
    • AmazonS3FullAccess (for storing Terraform state)
    • AmazonDynamoDBFullAccess (for state locking)
    • AmazonSESFullAccess (for email notifications)
  6. Save the Access Key ID and Secret Access Key - you'll need these!

Step 2: Create S3 Bucket for Terraform State

What is Terraform state? Terraform needs to remember what it created. This "memory" is stored in a state file. We put it in S3 so it's:

  • Backed up automatically
  • Accessible from anywhere
  • Versioned (can see history)

  1. Go to S3 → Create bucket
  2. Name it something like yourname-terraform-state
  3. Important settings:
    • Region: Choose one (remember which one!)
    • Block Public Access: Keep all enabled (security)
    • Versioning: Enable this (so you can recover if state gets corrupted)
  4. Click Create
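
If you prefer the terminal, a rough CLI equivalent of those console clicks looks like this (use your own bucket name - they're globally unique - and note that regions other than us-east-1 also need a --create-bucket-configuration LocationConstraint):

# Create the bucket
aws s3api create-bucket --bucket yourname-terraform-state --region us-east-1

# Enable versioning so old state files can be recovered
aws s3api put-bucket-versioning \
  --bucket yourname-terraform-state \
  --versioning-configuration Status=Enabled

# Keep all public access blocked
aws s3api put-public-access-block \
  --bucket yourname-terraform-state \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true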

Step 3: Create DynamoDB Table for State Locking

Why do we need locking? Imagine two people trying to deploy at the same time. Without locking, they might both try to create the same server, causing conflicts. DynamoDB prevents this.

  1. Go to DynamoDB → Create table
  2. Table name: terraform-state-lock
  3. Partition key: LockID (type: String)
  4. Table settings: Use default
  5. Capacity: On-demand (pay per request - perfect for this use case)
  6. Click Create
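
The CLI equivalent, if you'd rather script it (same table name and key as above):

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST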

Step 4: Create EC2 Key Pair

What is a key pair? It's like a password, but more secure. Instead of typing a password, you use a private key file to authenticate.

  1. Go to EC2 → Key Pairs → Create key pair
  2. Name: my-terraform-key (or whatever you prefer)
  3. Key pair type: RSA
  4. Private key file format: .pem
  5. Click Create
  6. IMPORTANT: The .pem file downloads automatically. Save it somewhere safe! You'll need it to SSH into your servers.
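
Once the .pem file is downloaded, lock down its permissions or SSH will refuse to use it. A quick sketch, assuming you keep the key in ~/.ssh/ (adjust the paths and server IP to your setup):

# Move the key somewhere safe and restrict permissions (SSH requires this)
mv ~/Downloads/my-terraform-key.pem ~/.ssh/
chmod 400 ~/.ssh/my-terraform-key.pem

# Later, once Terraform has created the server, connect like this
ssh -i ~/.ssh/my-terraform-key.pem ubuntu@<your-server-ip>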

Step 5: Verify AWS CLI Works

# Test your credentials
aws sts get-caller-identity

# Should show your user ARN

If this works, you're all set! If not, check your aws configure settings.


Part 3: Containerizing Your Application

Now that your environment is set up, let's containerize the application. This is where the magic happens.

Understanding Dockerfiles

A Dockerfile is like a recipe. It tells Docker:

  1. What base image to start with (like choosing an operating system)
  2. What files to copy
  3. What commands to run
  4. What port to expose
  5. What command to run when the container starts

Creating Dockerfiles for All Services

Now, you might be wondering: "How do I know what Dockerfile to create for each service?" Great question! Let me show you the pattern.

The rule of thumb: Each service folder needs its own Dockerfile. Look at your project structure:

DevOps-deployment/
├── frontend/          → Needs Dockerfile (Vue.js)
├── auth-api/          → Needs Dockerfile (Go) 
├── todos-api/         → Needs Dockerfile (Node.js)
├── users-api/         → Needs Dockerfile (Java)
└── log-message-processor/ → Needs Dockerfile (Python)

How to figure out what each service needs:

  1. Check what language/framework it uses (look for package.json, pom.xml, requirements.txt, go.mod)
  2. Find the entry point (usually server.js, main.go, main.py, or a compiled JAR)
  3. Determine the port it runs on (check the code or config files)
  4. Follow the pattern for that language

Frontend Dockerfile (Vue.js)

First, check the service:

cd frontend
ls -la
# You'll see: package.json, src/, public/
# This tells you: It's a Vue.js app that needs to be built

Vue.js apps are special - they compile to static HTML/CSS/JS files that need a web server. We use a "multi-stage build":

  1. Build stage: Use Node.js to compile the Vue app
  2. Runtime stage: Use nginx (lightweight web server) to serve the compiled files

# Step 1: Build the application
FROM node:18-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build

# Step 2: Serve it with nginx
FROM nginx:alpine
COPY --from=build /app/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/conf.d/default.conf
EXPOSE 80

Breaking it down:

  • FROM node:18-alpine AS build - Start with Node.js for building (the AS build names this stage)
  • COPY package*.json ./ - Copy dependency files first (Docker caching optimization!)
  • RUN npm install - Install dependencies
  • RUN npm run build - Compile Vue.js to static files (creates dist/ folder)
  • FROM nginx:alpine - Start a NEW stage with nginx (much smaller image)
  • COPY --from=build - Copy the built files from the build stage
  • EXPOSE 80 - nginx serves on port 80

Why two stages? The build stage has Node.js + all build tools (~500MB). The runtime stage only has nginx + static files (~20MB). This makes the final image 25x smaller!
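
The Dockerfile above copies an nginx.conf into the image but doesn't show it. Here's a minimal sketch of what that file might contain for a single-page app - an assumption, since your actual config may differ (for example, if nginx also proxies API paths instead of leaving routing to Traefik):

server {
    listen 80;
    root /usr/share/nginx/html;
    index index.html;

    # Serve the SPA: fall back to index.html for client-side routes
    location / {
        try_files $uri $uri/ /index.html;
    }
}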

Auth API Dockerfile (Go)

Check the service:

cd auth-api
ls -la
# You'll see: go.mod, main.go
# This tells you: It's Go, entry point is main.go

Go is special - it compiles to a single binary file. No runtime needed! We also use multi-stage build:

# Build stage
FROM golang:1.21-alpine AS build
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o auth-api

# Runtime stage
FROM alpine:latest
RUN apk --no-cache add ca-certificates
COPY --from=build /app/auth-api /auth-api
EXPOSE 8081
CMD ["/auth-api"]

Breaking it down:

  • FROM golang:1.21-alpine AS build - Go compiler for building
  • RUN go mod download - Download Go dependencies
  • RUN go build -o auth-api - Compile to a single binary file
  • FROM alpine:latest - Tiny Linux (only 5MB!)
  • COPY --from=build /app/auth-api - Copy the compiled binary
  • CMD ["/auth-api"] - Run the binary

Key differences from Vue.js:

  • Go compiles to a single binary (no runtime needed!)
  • Final image is super small (~10MB vs ~500MB for Node.js)
  • CGO_ENABLED=0 creates a static binary (no external dependencies)
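
You can see the size win for yourself after building (run this from the project root; the image tag is just a throwaway name):

docker build -t auth-api-test ./auth-api
docker images auth-api-test   # the SIZE column should be in the ~10MB range mentioned above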

Todos API Dockerfile (Node.js)

Check the service:

cd todos-api
ls -la
# You'll see: package.json, server.js, routes.js
# This tells you: It's Node.js, entry point is server.js

Check package.json for the start command:

{
  "scripts": {
    "start": "node server.js"
  }
}

Node.js API Dockerfile (simpler than Vue.js - no build step needed):

FROM node:18-alpine
WORKDIR /app

# Copy dependency files first (Docker caching optimization)
COPY package*.json ./

# Install dependencies (production only for smaller image)
RUN npm ci --only=production

# Copy application code
COPY . .

# Expose the port (check server.js to see which port)
EXPOSE 8082

# Start the application
CMD ["node", "server.js"]

Why this pattern?

  • npm ci --only=production - Faster, more reliable than npm install, and skips dev dependencies
  • Copy package*.json first - If dependencies don't change, Docker reuses the cached layer
  • node:18-alpine - Lightweight Node.js image

How to test it:

# Build the image
docker build -t todos-api ./todos-api

# Run it
docker run -p 8082:8082 todos-api

# Test it
curl http://localhost:8082

Users API Dockerfile (Java Spring Boot)

Check the service:

cd users-api
ls -la
# You'll see: pom.xml, src/
# This tells you: It's Java with Maven, needs to be compiled

Java services need two stages:

  1. Build stage - Compile the code
  2. Runtime stage - Run the compiled JAR

# Stage 1: Build
FROM maven:3.9-eclipse-temurin-17 AS build
WORKDIR /app

# Copy Maven config first (caching optimization)
COPY pom.xml .
# Download dependencies (cached if pom.xml doesn't change)
RUN mvn dependency:go-offline

# Copy source code
COPY src ./src

# Build the application
RUN mvn clean package -DskipTests

# Stage 2: Runtime
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app

# Install JAXB dependencies (needed for Java 17+)
RUN apk add --no-cache wget && \
    mkdir -p /app/lib && \
    wget -q -O /app/lib/jaxb-api.jar https://repo1.maven.org/maven2/javax/xml/bind/jaxb-api/2.3.1/jaxb-api-2.3.1.jar && \
    wget -q -O /app/lib/jaxb-runtime.jar https://repo1.maven.org/maven2/org/glassfish/jaxb/jaxb-runtime/2.3.1/jaxb-runtime-2.3.1.jar && \
    apk del wget

# Copy the built JAR from build stage
COPY --from=build /app/target/*.jar app.jar

EXPOSE 8083

# Run the Spring Boot application
ENTRYPOINT ["java", \
    "--add-opens", "java.base/java.lang=ALL-UNNAMED", \
    "--add-opens", "java.base/java.lang.reflect=ALL-UNNAMED", \
    "--add-opens", "java.base/java.util=ALL-UNNAMED", \
    "-cp", "app.jar:/app/lib/*", \
    "org.springframework.boot.loader.JarLauncher"]

Why this is complex:

  • Java needs compilation (Maven does this)
  • Spring Boot creates a "fat JAR" (includes everything)
  • Java 17+ removed some libraries (JAXB), so we add them back
  • The --add-opens flags are needed for Java 17+ module system

Don't worry if this looks complicated - Java Dockerfiles are the most complex. The pattern is always:

  1. Build stage: Install dependencies, compile
  2. Runtime stage: Copy compiled artifact, run it

Log Message Processor Dockerfile (Python)

Check the service:

cd log-message-processor
ls -la
# You'll see: requirements.txt, main.py
# This tells you: It's Python, entry point is main.py

Python Dockerfile:

FROM python:3.11-slim
WORKDIR /app

# Copy the Python dependency list first (Docker caching optimization)
COPY requirements.txt .

# Install build tools, install Python dependencies, then remove the build
# tools - all in ONE layer; Docker layers are additive, so purging in a
# separate RUN would not make the final image any smaller
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    g++ \
    python3-dev \
    && pip install --no-cache-dir -r requirements.txt \
    && apt-get purge -y gcc g++ python3-dev \
    && apt-get autoremove -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Copy application code
COPY . .

# Run the application
CMD ["python", "main.py"]

Why install then remove build dependencies?

  • Some Python packages need to compile C extensions
  • We install gcc and g++ to compile them, then purge them once the packages are installed (saves ~200MB)
  • The install, pip install, and purge all happen in a single RUN instruction on purpose: Docker image layers are additive, so removing the compilers in a separate RUN would leave them baked into an earlier layer and the image wouldn't actually shrink
  • The compiled packages still work without the compilers

Simpler alternative (if no C extensions needed):

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]

The Pattern: How to Create Any Dockerfile

Here's the mental model:

  1. Identify the language → Check for language-specific files

    • package.json → Node.js
    • pom.xml or build.gradle → Java
    • requirements.txt → Python
    • go.mod → Go
    • Cargo.toml → Rust
  2. Find the base image → Use official images

    • Node.js → node:18-alpine
    • Java → eclipse-temurin:17-jre-alpine (runtime) or maven:3.9-eclipse-temurin-17 (build)
    • Python → python:3.11-slim
    • Go → golang:1.21-alpine (build) or alpine:latest (runtime)
  3. Copy dependencies first → Docker caching optimization

    • Copy package.json / pom.xml / requirements.txt / go.mod
    • Install dependencies
    • Then copy application code
  4. Expose the port → Check the code for which port it uses

  5. Set the command → How to start the application
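
One more habit that pairs well with this pattern: add a .dockerignore next to each Dockerfile so COPY . . doesn't drag local build artifacts or secrets into the image. A minimal sketch (trim it per service):

# .dockerignore - keep local junk out of the build context
node_modules
dist
target
__pycache__
*.pyc
.git
.env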

Testing Each Dockerfile

Before adding to docker-compose, test each one:

# Test Frontend
cd frontend
docker build -t frontend-test .
docker run -p 8080:80 frontend-test
curl http://localhost:8080

# Test Auth API
cd ../auth-api
docker build -t auth-api-test .
docker run -p 8081:8081 auth-api-test
curl http://localhost:8081/health

# Test Todos API
cd ../todos-api
docker build -t todos-api-test .
docker run -p 8082:8082 todos-api-test
curl http://localhost:8082

# Test Users API
cd ../users-api
docker build -t users-api-test .
docker run -p 8083:8083 users-api-test
curl http://localhost:8083/health

# Test Log Processor
cd ../log-message-processor
docker build -t log-processor-test .
docker run log-processor-test
# (This might not have HTTP endpoint, check logs)

Common issues and fixes:

  1. "Module not found" or "Package not found"

    • Make sure you copied dependency files before installing
    • Check that requirements.txt / package.json is in the right place
  2. "Port already in use"

    • Another container is using that port
    • Use docker ps to see what's running
    • Stop it with docker stop <container-id>
  3. "Cannot connect to database" or "Connection refused"

    • Services need to be in the same Docker network
    • Use service names (e.g., redis) not localhost
    • Wait for dependencies to be ready (use depends_on with a healthcheck in docker-compose; see the sketch after this list)
  4. Image too large

    • Use multi-stage builds (build in one stage, copy artifacts to smaller runtime stage)
    • Use alpine or slim base images
    • Remove build dependencies after installation
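
Here's a sketch of that healthcheck fix from issue 3: give Redis a healthcheck and use the long form of depends_on so dependents wait until Redis actually answers (standard Compose syntax; the service names match the ones used in this guide):

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  todos-api:
    # ...rest of the service definition stays the same...
    depends_on:
      redis:
        condition: service_healthy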

Creating docker-compose.yml

Now we need to orchestrate all these containers. That's where Docker Compose comes in.

Think of docker-compose.yml as a conductor's score - it tells all the musicians (containers) when to play, how to play together, and in what order.

Let's build it piece by piece:

Step 1: The Reverse Proxy (Traefik)

Traefik is like a smart receptionist at a hotel:

  • It receives all incoming requests (guests)
  • It looks at the URL and decides which service should handle it (which room)
  • It automatically gets SSL certificates from Let's Encrypt (security badges)
  • It handles HTTPS redirects (escorts HTTP guests to HTTPS)

services:
  # Reverse Proxy - Routes traffic to the right service
  traefik:
    image: traefik:latest
    container_name: traefik
    command:
      - "--api.insecure=true"  # Enable dashboard (for debugging)
      - "--providers.docker=true"  # Watch Docker containers
      - "--providers.docker.exposedbydefault=false"  # Only expose containers with labels
      - "--entrypoints.web.address=:80"  # HTTP entry point
      - "--entrypoints.websecure.address=:443"  # HTTPS entry point
      - "--certificatesresolvers.letsencrypt.acme.httpchallenge=true"  # Use HTTP challenge
      - "--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web"  # Challenge on port 80
      - "--certificatesresolvers.letsencrypt.acme.email=${LETSENCRYPT_EMAIL:-your-email@example.com}"  # Email for cert
      - "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"  # Where to store certs
    ports:
      - "80:80"    # HTTP
      - "443:443"  # HTTPS
      - "8080:8080"  # Traefik dashboard (for debugging)
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro  # Let Traefik see other containers
      - ./letsencrypt:/letsencrypt  # Store SSL certificates
    networks:
      - app-network
    restart: unless-stopped  # Auto-restart if it crashes

Key concepts:

  • ports: "80:80" means "map host port 80 to container port 80"
  • volumes: /var/run/docker.sock lets Traefik discover other containers automatically
  • networks: app-network puts Traefik on the same network as other services

Step 2: The Frontend Service

The Vue.js frontend needs to be built and served:

  # Frontend - Vue.js application
  frontend:
    build:
      context: ./frontend  # Where the Dockerfile is
      dockerfile: Dockerfile
    container_name: frontend
    environment:
      - PORT=80  # Port the app runs on inside container
      - AUTH_API_ADDRESS=http://auth-api:8081  # Use service name, not localhost!
      - TODOS_API_ADDRESS=http://todos-api:8082
    labels:
      # Tell Traefik to route traffic to this service
      - "traefik.enable=true"
      # Route requests for your domain to this service
      - "traefik.http.routers.frontend.rule=Host(`${DOMAIN:-yourdomain.com}`)"
      # Use HTTPS
      - "traefik.http.routers.frontend.entrypoints=websecure"
      # Get SSL certificate automatically
      - "traefik.http.routers.frontend.tls.certresolver=letsencrypt"
      # Frontend runs on port 80 inside container
      - "traefik.http.services.frontend.loadbalancer.server.port=80"
      # Redirect HTTP to HTTPS
      - "traefik.http.routers.frontend-redirect.rule=Host(`${DOMAIN:-yourdomain.com}`)"
      - "traefik.http.routers.frontend-redirect.entrypoints=web"
      - "traefik.http.routers.frontend-redirect.middlewares=redirect-to-https"
      - "traefik.http.middlewares.redirect-to-https.redirectscheme.scheme=https"
    networks:
      - app-network
    depends_on:
      - auth-api
      - todos-api
      - users-api

Important points:

  • build: context: ./frontend tells Docker to build from the frontend folder
  • environment: sets variables the app can read
  • AUTH_API_ADDRESS=http://auth-api:8081 - Notice we use auth-api (the service name), not localhost!
  • depends_on: ensures these services start before frontend
  • Labels tell Traefik how to route traffic

Step 3: The Auth API (Go)

  # Auth API - Handles authentication
  auth-api:
    build:
      context: ./auth-api
      dockerfile: Dockerfile
    container_name: auth-api
    environment:
      - AUTH_API_PORT=8081
      - USERS_API_ADDRESS=http://users-api:8083
      - JWT_SECRET=${JWT_SECRET:-myfancysecret}  # Secret key for tokens
      - REDIS_URL=redis://redis:6379  # Redis connection
    labels:
      - "traefik.enable=true"
      # Route /api/auth requests to this service
      - "traefik.http.routers.auth.rule=Host(`${DOMAIN:-yourdomain.com}`) && PathPrefix(`/api/auth`)"
      - "traefik.http.routers.auth.entrypoints=websecure"
      - "traefik.http.routers.auth.tls.certresolver=letsencrypt"
      - "traefik.http.services.auth.loadbalancer.server.port=8081"
      # Also handle /login route (frontend calls this)
      - "traefik.http.routers.auth-login.rule=Host(`${DOMAIN:-yourdomain.com}`) && (Path(`/login`) || PathPrefix(`/login/`))"
      - "traefik.http.routers.auth-login.entrypoints=websecure"
      - "traefik.http.routers.auth-login.tls.certresolver=letsencrypt"
      - "traefik.http.routers.auth-login.service=auth"
    networks:
      - app-network
    depends_on:
      - redis  # Needs Redis for session storage

Routing explained:

  • PathPrefix(/api/auth) means any URL starting with /api/auth goes here
  • Example: https://your-domain.com/api/auth/login → auth-api
  • Path(/login) means exactly /login goes here

Step 4: The Todos API (Node.js)

  # Todos API - Manages todo items
  todos-api:
    build:
      context: ./todos-api
      dockerfile: Dockerfile
    container_name: todos-api
    environment:
      - PORT=8082
      - AUTH_API_URL=http://auth-api:8081  # To validate tokens
      - JWT_SECRET=${JWT_SECRET:-myfancysecret}
      - REDIS_URL=redis://redis:6379
    labels:
      - "traefik.enable=true"
      # Route /api/todos requests here
      - "traefik.http.routers.todos.rule=Host(`${DOMAIN:-yourdomain.com}`) && PathPrefix(`/api/todos`)"
      - "traefik.http.routers.todos.entrypoints=websecure"
      - "traefik.http.routers.todos.tls.certresolver=letsencrypt"
      - "traefik.http.services.todos.loadbalancer.server.port=8082"
    networks:
      - app-network
    depends_on:
      - redis
      - auth-api  # Needs auth-api to validate tokens

Step 5: The Users API (Java)

  # Users API - Manages user accounts
  users-api:
    build:
      context: ./users-api
      dockerfile: Dockerfile
    container_name: users-api
    environment:
      - SERVER_PORT=8083
      - JWT_SECRET=${JWT_SECRET:-myfancysecret}
      - REDIS_URL=redis://redis:6379
    labels:
      - "traefik.enable=true"
      # Route /api/users requests here
      - "traefik.http.routers.users.rule=Host(`${DOMAIN:-yourdomain.com}`) && PathPrefix(`/api/users`)"
      - "traefik.http.routers.users.entrypoints=websecure"
      - "traefik.http.routers.users.tls.certresolver=letsencrypt"
      - "traefik.http.services.users.loadbalancer.server.port=8083"
    networks:
      - app-network
    depends_on:
      - redis

Step 6: The Log Processor (Python)

This service doesn't need Traefik routing - it's a background worker:

  # Log Processor - Background worker that processes messages
  log-message-processor:
    build:
      context: ./log-message-processor
      dockerfile: Dockerfile
    container_name: log-message-processor
    environment:
      - REDIS_HOST=redis  # Use service name
      - REDIS_PORT=6379
      - REDIS_CHANNEL=log-messages
    networks:
      - app-network
    depends_on:
      - redis
    restart: unless-stopped  # Keep it running

Why no Traefik labels? This service doesn't serve HTTP requests - it just listens to Redis for messages.

Step 7: Supporting Services

  # Redis - Message queue and cache
  redis:
    image: redis:7-alpine  # Use pre-built image, no Dockerfile needed
    container_name: redis
    ports:
      - "6379:6379"  # Expose for debugging (optional)
    volumes:
      - redis-data:/data  # Persist data
    networks:
      - app-network
    restart: unless-stopped

  # Zipkin handler - Service for /zipkin endpoint
  zipkin-handler:
    image: nginx:alpine
    container_name: zipkin-handler
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.zipkin.rule=Host(`${DOMAIN:-yourdomain.com}`) && PathPrefix(`/zipkin`)"
      - "traefik.http.routers.zipkin.entrypoints=websecure"
      - "traefik.http.routers.zipkin.tls.certresolver=letsencrypt"
      - "traefik.http.services.zipkin.loadbalancer.server.port=80"
    networks:
      - app-network
    command: >
      sh -c "echo 'server {
        listen 80;
        location / {
          return 200 \"OK\";
          add_header Content-Type text/plain;
        }
      }' > /etc/nginx/conf.d/default.conf && nginx -g 'daemon off;'"

Step 8: Networks and Volumes

At the end of the file, define shared resources:

# Networks - How containers communicate
networks:
  app-network:
    driver: bridge  # Default network type

# Volumes - Persistent storage
volumes:
  redis-data:  # Named volume for Redis data

Why networks? Containers on the same network can talk to each other using service names (like auth-api instead of IP addresses).

Why volumes? Data in containers is lost when they're removed. Volumes persist data.

Complete docker-compose.yml Structure

Here's the mental model:

docker-compose.yml
├── services (all your containers)
│   ├── traefik (reverse proxy)
│   ├── frontend (Vue.js app)
│   ├── auth-api (Go service)
│   ├── todos-api (Node.js service)
│   ├── users-api (Java service)
│   ├── log-message-processor (Python worker)
│   ├── redis (database/queue)
│   └── zipkin-handler (dummy endpoint)
├── networks (how they connect)
│   └── app-network
└── volumes (persistent storage)
    └── redis-data

Understanding Traefik Labels (Deep Dive)

Labels are how you tell Traefik what to do. Let's break down a complex example:

labels:
  - "traefik.enable=true"  # Step 1: Enable Traefik for this service
  - "traefik.http.routers.auth.rule=Host(`example.com`) && PathPrefix(`/api/auth`)"  # Step 2: Define routing rule
  - "traefik.http.routers.auth.entrypoints=websecure"  # Step 3: Use HTTPS
  - "traefik.http.routers.auth.tls.certresolver=letsencrypt"  # Step 4: Get SSL cert
  - "traefik.http.services.auth.loadbalancer.server.port=8081"  # Step 5: Which port to forward to

Breaking it down:

  1. Router = A set of rules for routing traffic
  2. Rule = Conditions that must match (domain + path)
  3. Entrypoint = Which port/protocol (web = HTTP, websecure = HTTPS)
  4. Service = The actual container and port
  5. Middleware = Transformations (redirects, rewrites, etc.)

Example flow:

  1. User visits https://example.com/api/auth/login
  2. Traefik receives request on port 443 (websecure entrypoint)
  3. Traefik checks rules: "Does this match Host(example.com) && PathPrefix(/api/auth)?" → Yes!
  4. Traefik forwards to auth service on port 8081
  5. Auth-api container handles the request

Testing Your docker-compose.yml

Before deploying to the cloud, test locally:

Step 1: Create Environment File

Create a .env file in the root directory:

cat > .env <<EOF
DOMAIN=localhost
LETSENCRYPT_EMAIL=your-email@example.com
JWT_SECRET=test-secret-key-change-this-in-production
EOF

What's in .env?

  • DOMAIN - Your domain name (use localhost for local testing)
  • LETSENCRYPT_EMAIL - Email for SSL certificate notifications
  • JWT_SECRET - Secret key for JWT tokens (use a strong random string in production)
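
One easy way to generate a strong JWT_SECRET for production (any high-entropy random string works; openssl just happens to be installed almost everywhere):

openssl rand -base64 32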

Step 2: Start All Services

# Build and start all containers in the background
docker compose up -d

# Watch all logs in real-time
docker compose logs -f

# Or watch specific service logs
docker compose logs -f frontend
docker compose logs -f auth-api

What -d means: Detached mode - runs in the background so you can use your terminal.

Step 3: Verify Everything is Running

# Check status of all containers
docker compose ps

# You should see something like:
# NAME                    STATUS              PORTS
# traefik                 Up 2 minutes        0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp
# frontend                Up 2 minutes        
# auth-api                Up 2 minutes        
# todos-api               Up 2 minutes        
# users-api               Up 2 minutes        
# log-message-processor   Up 2 minutes        
# redis                   Up 2 minutes        0.0.0.0:6379->6379/tcp

Step 4: Test Each Service

# Test frontend (should return HTML)
curl http://localhost

# Test auth API (should return "Not Found" - that's expected!)
curl http://localhost/api/auth

# Test todos API (should return "Invalid Token" - also expected!)
curl http://localhost/api/todos

# Test users API
curl http://localhost/api/users

# Check Traefik dashboard (optional)
# Open http://localhost:8080 in your browser

Expected responses:

  • Frontend: HTML page (login screen)
  • /api/auth without path: "Not Found" (correct - needs specific endpoint)
  • /api/todos without auth: "Invalid Token" (correct - needs authentication)
  • /api/users without auth: "Missing or invalid Authorization header" (correct)

Step 5: Test with Browser

  1. Open http://localhost in your browser
  2. You should see the login page
  3. Try logging in (if you have test credentials)
  4. Check browser console (F12) for any errors

Step 6: Check Logs for Errors

# View logs for a specific service
docker compose logs frontend
docker compose logs auth-api
docker compose logs traefik

# View last 100 lines
docker compose logs --tail=100 traefik

# Follow logs in real-time
docker compose logs -f traefik

What to look for:

  • ✅ "Server started" or "Listening on port X" = Good!
  • ❌ "Connection refused" = Service dependency not ready
  • ❌ "Module not found" = Missing dependency in Dockerfile
  • ❌ "Port already in use" = Another service is using that port

Step 7: Stop Everything

# Stop all containers
docker compose down

# Stop and remove volumes (clean slate)
docker compose down -v

# Stop and remove images too (full cleanup)
docker compose down --rmi all -v

Common Issues and Solutions

Issue 1: "Port 80 already in use"

Problem: Another service (like Apache, Nginx, or another Docker container) is using port 80.

Solution:

# Find what's using port 80
sudo lsof -i :80
# or
sudo netstat -tulpn | grep :80

# Stop the conflicting service
sudo systemctl stop apache2  # or nginx, or whatever it is

# Or change Traefik ports in docker-compose.yml:
ports:
  - "8080:80"   # Use 8080 instead of 80
  - "8443:443"  # Use 8443 instead of 443

Issue 2: "Build failed" or "Module not found"

Problem: Dockerfile has issues or dependencies are missing.

Solution:

# Build a specific service to see detailed errors
docker compose build frontend

# Check the Dockerfile syntax
# Make sure COPY commands are in the right order
# Make sure RUN commands install dependencies before copying code

Issue 3: "Container keeps restarting"

Problem: The application is crashing on startup.

Solution:

# Check why it's restarting
docker compose logs <service-name>

# Common causes:
# - Missing environment variables
# - Database/Redis not ready (add depends_on)
# - Port conflict
# - Missing files or dependencies

Issue 4: "Cannot connect to auth-api" or "Connection refused"

Problem: Services can't find each other.

Solution:

  • ✅ Use service names (e.g., http://auth-api:8081), not localhost
  • ✅ Make sure all services are on the same network (app-network)
  • ✅ Check depends_on - services might be starting before dependencies are ready
  • ✅ Add health checks or wait scripts if needed

Issue 5: "SSL certificate error" or "Let's Encrypt failed"

Problem: Let's Encrypt can't verify your domain.

Solution:

  • For local testing: Use localhost and HTTP only (remove HTTPS redirect)
  • For production: Make sure DNS points to your server
  • Make sure ports 80 and 443 are open in firewall
  • Check Traefik logs: docker compose logs traefik

Quick Reference: docker-compose Commands

# Start services
docker compose up              # Start and show logs
docker compose up -d           # Start in background

# Stop services
docker compose stop            # Stop but don't remove
docker compose down           # Stop and remove containers

# View logs
docker compose logs            # All services
docker compose logs -f         # Follow (live updates)
docker compose logs <service>  # Specific service

# Rebuild
docker compose build           # Build all
docker compose build <service> # Build specific service
docker compose up --build      # Build and start

# Status
docker compose ps              # Show running containers
docker compose top             # Show running processes

# Execute commands
docker compose exec <service> <command>  # Run command in container
docker compose exec frontend sh           # Get shell in frontend container

Part 4: Infrastructure as Code with Terraform

Now comes the infrastructure part. This is where many people get intimidated, but it's actually simpler than it seems.

Why Infrastructure as Code?

Imagine you're building a house. You could:

  1. Manual approach: Tell the builder "put a window here, a door there" every time
  2. Blueprint approach: Draw a blueprint once, builder follows it every time

Infrastructure as Code is the blueprint approach. Benefits:

  • Reproducible: Same code = same infrastructure, every time
  • Version controlled: See what changed and when
  • Testable: Try changes without breaking production
  • Documented: The code IS the documentation

Understanding Terraform Basics

Terraform uses a language called HCL (HashiCorp Configuration Language). It's designed to be human-readable.

Basic structure:

resource "aws_instance" "todo_app" {
  ami           = "ami-12345"
  instance_type = "t3.medium"
}

This says: "Create an AWS EC2 instance resource, call it 'todo_app', with these properties."

Creating Your First Terraform Configuration

Let's build it step by step:

Step 1: Provider Configuration

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    # We'll configure this during init
  }
}

provider "aws" {
  region = var.aws_region
}

What's happening:

  • required_version - Ensures everyone uses compatible Terraform
  • required_providers - Tells Terraform which plugins to download
  • backend "s3" - Where to store state (we'll configure this later)
  • provider "aws" - Which cloud provider to use

Step 2: Data Sources (Getting Information)

Before creating resources, we often need to look things up:

data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"]  # Canonical (Ubuntu's publisher)

  filter {
    name   = "name"
    values = ["*ubuntu-jammy-22.04-amd64-server*"]
  }
}

What is an AMI? Amazon Machine Image - it's like a template for a virtual machine. This code finds the latest Ubuntu 22.04 image.

Why use data sources? AMI IDs change in different regions. Instead of hardcoding ami-12345, we let Terraform find the right one.

Step 3: Security Group (Firewall Rules)

resource "aws_security_group" "todo_app" {
  name        = "todo-app-sg-${var.environment}"
  description = "Security group for TODO application"

  ingress {
    description = "HTTP"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # Allow from anywhere
  }

  ingress {
    description = "HTTPS"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "SSH"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.ssh_cidr]  # Only from your IP (security!)
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"  # All protocols
    cidr_blocks = ["0.0.0.0/0"]  # Allow all outbound
  }

  tags = {
    Name        = "todo-app-sg-${var.environment}"
    Environment = var.environment
  }
}

What is a security group? It's AWS's firewall. It controls what traffic can reach your server.

Breaking it down:

  • ingress - Incoming traffic rules
  • egress - Outgoing traffic rules
  • cidr_blocks = ["0.0.0.0/0"] - From anywhere (0.0.0.0/0 means "everywhere")
  • var.ssh_cidr - A variable (we'll set this to your IP for security)

Security tip: In production, restrict SSH to your IP only! Use a service like whatismyip.com to find your IP, then set ssh_cidr = "YOUR_IP/32".
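
You can grab your public IP from the terminal too (checkip.amazonaws.com is an AWS-run endpoint that simply echoes your IP back):

MY_IP=$(curl -s https://checkip.amazonaws.com)
echo "ssh_cidr = \"${MY_IP}/32\""   # paste this value into your prod tfvars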

Step 4: EC2 Instance (Your Server)

resource "aws_instance" "todo_app" {
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = var.instance_type
  key_name               = var.key_pair_name
  vpc_security_group_ids = [aws_security_group.todo_app.id]

  user_data = <<-EOF
    #!/bin/bash
    apt-get update
    apt-get install -y python3 python3-pip
  EOF

  tags = {
    Name        = "todo-app-server-${var.environment}"
    Environment = var.environment
    Project     = "hngi13-stage6"
  }

  lifecycle {
    create_before_destroy = true
  }
}

What's happening:

  • ami - Which OS image to use (from our data source)
  • instance_type - Server size (t3.medium = 2 vCPU, 4GB RAM)
  • key_name - Which SSH key to install
  • vpc_security_group_ids - Which firewall rules to apply
  • user_data - Script that runs when server starts (bootstrap script)
  • lifecycle - Terraform behavior (create new before destroying old = zero downtime)

Step 5: Variables (Making It Flexible)

variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "us-east-1"
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.medium"
}

variable "key_pair_name" {
  description = "AWS Key Pair name"
  type        = string  # Required - no default
}

variable "environment" {
  description = "Environment name (dev, stg, prod)"
  type        = string
  default     = "dev"

  validation {
    condition     = contains(["dev", "stg", "prod"], var.environment)
    error_message = "Environment must be one of: dev, stg, prod"
  }
}

Why variables? Makes your code reusable. Same code works for dev, staging, and production - just change the variables!

Step 6: Outputs (Getting Information Back)

output "server_ip" {
  description = "Public IP of the server"
  value       = aws_instance.todo_app.public_ip
}

output "ansible_inventory_path" {
  description = "Path to generated Ansible inventory"
  value       = local_file.ansible_inventory.filename
}

What are outputs? After Terraform creates resources, you often need information about them (like the server IP). Outputs make that information available.
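
The ansible_inventory_path output refers to a local_file resource that writes an Ansible inventory from Terraform's results. That resource isn't shown in this guide, but a minimal sketch might look like this - it assumes you add the hashicorp/local provider to required_providers and reuse the server_user and ssh_key_path variables that appear in the .tfvars files below:

resource "local_file" "ansible_inventory" {
  filename = "${path.module}/../ansible/inventory.ini"

  # Render a one-host inventory pointing at the new EC2 instance
  content = <<-EOT
    [todo_app]
    ${aws_instance.todo_app.public_ip} ansible_user=${var.server_user} ansible_ssh_private_key_file=${var.ssh_key_path}
  EOT
}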

Environment-Specific Configuration

Create separate .tfvars files for each environment. This is crucial - you'll have three files, one for each environment:

terraform.dev.tfvars:

environment = "dev"
aws_region  = "us-east-1"
instance_type = "t3.small"  # Smaller for dev (saves money)
key_pair_name = "my-terraform-key"
ssh_key_path = "~/.ssh/my-terraform-key.pem"
ssh_cidr = "0.0.0.0/0"  # Less restrictive for dev
server_user = "ubuntu"
skip_ansible_provision = false  # Run Ansible automatically

terraform.stg.tfvars:

environment = "stg"
aws_region  = "us-east-1"
instance_type = "t3.small"  # Can be same size as dev for staging
key_pair_name = "my-terraform-key"
ssh_key_path = "~/.ssh/my-terraform-key.pem"
ssh_cidr = "0.0.0.0/0"  # Can be less restrictive than prod
server_user = "ubuntu"
skip_ansible_provision = false

terraform.prod.tfvars:

environment = "prod"
aws_region  = "us-east-1"
instance_type = "t3.medium"  # More power for production
key_pair_name = "my-terraform-key"
ssh_key_path = "~/.ssh/my-terraform-key.pem"
ssh_cidr = "YOUR_IP/32"  # Restrict SSH to your IP only!
server_user = "ubuntu"
skip_ansible_provision = false

Why three separate files? Different environments have different needs:

  • Dev: Smaller instance, less security (for quick testing)
  • Staging: Similar to dev, but closer to production setup (for pre-production testing)
  • Prod: Larger instance, maximum security (for real users)

File structure:

infra/terraform/
├── main.tf
├── variables.tf
├── outputs.tf
├── terraform.dev.tfvars    ← Development environment
├── terraform.stg.tfvars    ← Staging environment
└── terraform.prod.tfvars   ← Production environment

Remote State Configuration

Remember the S3 bucket we created? Now we use it. Important: Each environment needs its own state file path!

For Development:

terraform init \
  -backend-config="bucket=yourname-terraform-state" \
  -backend-config="key=terraform-state/dev/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=terraform-state-lock" \
  -backend-config="encrypt=true"

For Staging:

terraform init \
  -backend-config="bucket=yourname-terraform-state" \
  -backend-config="key=terraform-state/stg/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=terraform-state-lock" \
  -backend-config="encrypt=true"

For Production:

terraform init \
  -backend-config="bucket=yourname-terraform-state" \
  -backend-config="key=terraform-state/prod/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=terraform-state-lock" \
  -backend-config="encrypt=true"

What this does:

  • bucket - Where to store state (same bucket for all environments)
  • key - File path in bucket (different per environment!)
  • dynamodb_table - For locking (prevents conflicts when multiple people run Terraform)
  • encrypt=true - Encrypt state at rest (security)

Why separate keys per environment?

  • terraform-state/dev/terraform.tfstate → Development infrastructure
  • terraform-state/stg/terraform.tfstate → Staging infrastructure
  • terraform-state/prod/terraform.tfstate → Production infrastructure

These are completely separate files. This means:

  • ✅ Dev, staging, and prod infrastructure are isolated
  • ✅ You can destroy dev without affecting staging or prod
  • ✅ Each environment has its own state history
  • ✅ No risk of accidentally modifying the wrong environment
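
You can confirm the separation at any time by listing the state objects in the bucket:

aws s3 ls s3://yourname-terraform-state/terraform-state/ --recursive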

Your First Terraform Run

Let's deploy to development first (always start with dev!):

# 1. Initialize (downloads providers, sets up backend)
terraform init \
  -backend-config="bucket=yourname-terraform-state" \
  -backend-config="key=terraform-state/dev/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=terraform-state-lock" \
  -backend-config="encrypt=true"

# 2. Plan (see what will be created - SAFE, doesn't change anything)
terraform plan -var-file=terraform.dev.tfvars

# 3. Apply (actually create resources)
terraform apply -var-file=terraform.dev.tfvars

What happens:

  1. terraform init - Downloads AWS provider, configures backend for dev environment
  2. terraform plan - Shows you what will be created/changed/destroyed (dry run)
  3. terraform apply - Actually creates the resources

Pro tip: Always run plan first! It's like a dry run. Review the output carefully before applying.

For other environments, repeat the same steps but:

  • Use the appropriate -backend-config="key=terraform-state/ENV/terraform.tfstate"
  • Use the matching -var-file=terraform.ENV.tfvars

Example for staging:

terraform init \
  -backend-config="bucket=yourname-terraform-state" \
  -backend-config="key=terraform-state/stg/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=terraform-state-lock" \
  -backend-config="encrypt=true"

terraform plan -var-file=terraform.stg.tfvars
terraform apply -var-file=terraform.stg.tfvars

Understanding Drift Detection (Critical for Safety!)

What is drift? Imagine you have a blueprint for a house (Terraform code), but someone goes and changes the actual house (AWS infrastructure) without updating the blueprint. That's drift - your code and reality don't match anymore.

Real-world example:

  1. You deploy infrastructure with Terraform ✅
  2. Later, you manually add a tag to your EC2 instance in AWS Console 🏷️
  3. Terraform doesn't know about this change
  4. Next time you run Terraform, it sees the difference → DRIFT DETECTED!

Why drift is dangerous:

  • 🔴 Security risk: Someone might have changed something maliciously
  • 🔴 Data loss: Terraform might try to "fix" things and delete your changes
  • 🔴 Confusion: You don't know what changed or why
  • 🔴 Breaking changes: Manual changes might break your application

How drift detection works:

Think of it like a security guard checking your house:

  1. Terraform Plan = Security guard walks around and notes what's different
  2. Check Git History = Did you change the blueprint (Terraform files)?
    • If yes → Expected changes (you updated the code)
    • If no → DRIFT! (Someone changed infrastructure without updating code)
  3. Alert = Security guard calls you immediately (the drift email notification)
  4. Approval = You review and decide what to do (the GitHub issue)
  5. Action = Apply changes or investigate further

The detection logic:

# Step 1: Run terraform plan; -detailed-exitcode returns 0 (no changes),
# 1 (error), or 2 (changes pending)
terraform plan -detailed-exitcode -out=tfplan
PLAN_EXIT=$?

# Step 2: Check whether any Terraform files changed in this commit
TF_CHANGED=$(git diff HEAD~1 HEAD --name-only | grep -E '\.tf$|\.tfvars$' || true)

# Step 3: Determine the type of change
if [ -z "$TF_CHANGED" ] && [ "$PLAN_EXIT" -eq 2 ]; then
  echo "🚨 DRIFT DETECTED!"
  # Send email, create GitHub issue, wait for approval
else
  echo "✅ Expected changes (code was updated)"
  # Proceed automatically
fi

Setting Up Email Notifications for Drift

When drift is detected, you need to know immediately! That's where email notifications come in.

Step 1: Verify Your Email in AWS SES

AWS SES (Simple Email Service) is like a post office for your applications. First, you need to verify your email address:

  1. Go to AWS Console → SES (Simple Email Service)
  2. Click "Verified identities" → "Create identity"
  3. Choose "Email address"
  4. Enter your email (e.g., your-email@gmail.com)
  5. Click "Create identity"
  6. Check your email and click the verification link

Why verify? AWS prevents spam by requiring you to verify you own the email address.

Step 2: Create the Email Notification Script

Create infra/ci-cd/scripts/email-notification.sh:

#!/bin/bash
# Email Notification Script for Terraform Drift
# Sends email alert when infrastructure drift is detected

set -e

DRIFT_SUMMARY="${1:-}"

if [ -z "$DRIFT_SUMMARY" ]; then
  echo "Error: Drift summary not provided"
  exit 1
fi

# Email configuration from environment variables
EMAIL_TO="${EMAIL_TO:-}"
EMAIL_FROM="${EMAIL_FROM:-}"
AWS_REGION="${AWS_REGION:-us-east-1}"

# GitHub Actions variables for workflow link
GITHUB_SERVER_URL="${GITHUB_SERVER_URL:-https://github.com}"
GITHUB_REPOSITORY="${GITHUB_REPOSITORY:-}"
GITHUB_RUN_ID="${GITHUB_RUN_ID:-}"

# Check if email is configured
if [ -z "$EMAIL_TO" ] || [ -z "$EMAIL_FROM" ]; then
  echo "⚠️  Email not configured. Skipping email notification."
  exit 0
fi

# Build workflow run URL if GitHub variables are available
WORKFLOW_URL=""
if [ -n "$GITHUB_REPOSITORY" ] && [ -n "$GITHUB_RUN_ID" ]; then
  WORKFLOW_URL="${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}"
fi

# Create email body
SUBJECT="🚨 Terraform Drift Detected - Action Required"
BODY=$(cat <<EOF
Terraform infrastructure drift has been detected.

This means infrastructure was changed OUTSIDE of Terraform (e.g., manually in AWS Console).

Please review the changes and approve the deployment in GitHub Actions.

$(if [ -n "$WORKFLOW_URL" ]; then echo "🔗 View Workflow Run: $WORKFLOW_URL"; echo ""; fi)

Drift Summary:
$DRIFT_SUMMARY

---
This is an automated message from GitHub Actions.
EOF
)

# Send email via AWS SES
echo "📧 Sending drift alert email via AWS SES..."
aws ses send-email \
  --region "$AWS_REGION" \
  --from "$EMAIL_FROM" \
  --to "$EMAIL_TO" \
  --subject "$SUBJECT" \
  --text "$BODY" \
  || echo "⚠️  Failed to send email. Check AWS credentials and SES configuration."

echo "✅ Email notification sent!"

Make it executable:

chmod +x infra/ci-cd/scripts/email-notification.sh
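
As a usage example, a workflow step might pass the plan summary as the first argument. The exact invocation depends on how your pipeline captures the plan, but something along these lines works (EMAIL_TO, EMAIL_FROM, and AWS_REGION come from the GitHub Secrets described below):

export EMAIL_TO="you@example.com" EMAIL_FROM="verified-sender@example.com" AWS_REGION="us-east-1"
./infra/ci-cd/scripts/email-notification.sh "$(terraform show -no-color tfplan | head -n 60)"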

What this script does:

  1. Takes the drift summary as input
  2. Checks if email is configured
  3. Builds a nice email message with the drift details
  4. Sends it via AWS SES
  5. Includes a link to the GitHub workflow run

Step 3: Add GitHub Secrets

In your GitHub repository, go to Settings → Secrets and variables → Actions, and add:

  • EMAIL_TO - Your email address (where to send alerts)
  • EMAIL_FROM - Your verified SES email (must match the one you verified in AWS)
  • AWS_ACCESS_KEY_ID - Your AWS access key
  • AWS_SECRET_ACCESS_KEY - Your AWS secret key
  • AWS_REGION - Your AWS region (e.g., us-east-1)

Security tip: Never commit these values to your repository! Always use GitHub Secrets.

Understanding GitHub Issue Approval

When drift is detected, the workflow creates a GitHub issue and waits for your approval. This is like a safety checkpoint.

How it works:

  1. Drift detected → Workflow pauses
  2. GitHub issue created → Contains:
    • What changed
    • Why it's drift (no code changes)
    • Link to workflow run
    • Plan summary
  3. You review → Check the issue
  4. You approve → Comment "approve" or click approve button
  5. Workflow continues → Terraform applies the changes

Example GitHub Issue:

🚨 REAL DRIFT DETECTED - Infrastructure Changed Outside Terraform (dev)

⚠️ CRITICAL: Real Infrastructure Drift Detected

Infrastructure has been modified outside of Terraform. This is unexpected.

Environment: dev

What happened:
- Terraform code files were NOT modified
- But infrastructure plan shows changes
- This indicates manual changes or changes from another process

Action Required:
1. Review the plan below
2. Investigate what caused the drift
3. Approve if changes are intentional, or revert if unauthorized

Plan Summary:

  # aws_instance.todo_app will be updated in-place
  ~ resource "aws_instance" "todo_app" {
      ~ tags = {
          - "ManualTag" = "test" -> null
        }
    }


Workflow Run:
🔗 View Workflow Run: https://github.com/yourusername/repo/actions/runs/123456

Next Steps:
- Approve to apply these changes
- Or investigate and revert unauthorized changes

How to approve:

  1. Go to the GitHub issue (you'll get a notification)
  2. Review the changes carefully
  3. If changes are OK: Comment "approve" or click the approve button
  4. If changes are suspicious: Investigate first, then approve or revert

Why this matters:

  • ✅ Prevents accidental changes
  • ✅ Gives you time to investigate
  • ✅ Creates an audit trail (who approved what, when)
  • ✅ Protects production from unauthorized changes

Testing Drift Detection

Want to test if drift detection works? Here's how:

Step 1: Deploy infrastructure normally

terraform apply -var-file=terraform.dev.tfvars

Step 2: Manually change something in AWS Console

  1. Go to AWS Console → EC2 → Instances
  2. Find your instance
  3. Click "Tags" → "Manage tags"
  4. Add a new tag: TestTag = "drift-test"
  5. Save

[Screenshot: an inbound rule for port 8080 added manually via the AWS Console - any manual change like this will register as drift]

Step 3: Trigger the workflow

  1. Go to GitHub Actions
  2. Run the "Infrastructure Deployment" workflow
  3. Select "dev" environment
  4. Watch it detect drift!

Step 4: Check your email

  • You should receive an email alert
  • Check spam folder if you don't see it

(Screenshot: drift detection email alert)

Step 5: Approve in GitHub

  • A GitHub issue should be created
  • Review and approve
  • Watch Terraform apply the changes

(Screenshot: the approved GitHub drift issue)

Destroying Infrastructure (When You Need to Start Over)

Sometimes you need to tear everything down and start fresh. Here's how to do it safely.

⚠️ WARNING: Destroying infrastructure will DELETE EVERYTHING:

  • Your EC2 instance
  • All data on the server
  • Security groups
  • Everything created by Terraform

Make sure you:

  • ✅ Have backups if you need data
  • ✅ Are destroying the right environment (dev, not prod!)
  • ✅ Really want to delete everything

Method 1: Destroy via Command Line

# Step 1: Initialize Terraform (if not already done)
cd infra/terraform
terraform init \
  -backend-config="bucket=yourname-terraform-state" \
  -backend-config="key=terraform-state/dev/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=terraform-state-lock" \
  -backend-config="encrypt=true"

# Step 2: Plan the destruction (see what will be deleted)
terraform plan -destroy -var-file=terraform.dev.tfvars

# Step 3: Review the plan carefully!
# Make sure it's only deleting what you want

# Step 4: Destroy everything
terraform destroy -var-file=terraform.dev.tfvars

What happens:

  1. Terraform reads the state file
  2. Plans what needs to be destroyed
  3. Shows you the plan (review it!)
  4. Asks for confirmation (type yes)
  5. Deletes everything in reverse order
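
Before you confirm, it's worth checking exactly what Terraform is tracking. A quick sketch:

# List every resource in the current state
terraform state list

# Inspect one resource in detail (resource name from the examples above)
terraform state show aws_instance.todo_app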

Method 2: Destroy via GitHub Actions

If you have a destroy workflow set up:

  1. Go to GitHub Actions
  2. Find the "Destroy Infrastructure" workflow (if you have one)
  3. Click "Run workflow"
  4. Select the environment (be careful - don't destroy prod!)
  5. Confirm and run

(Screenshots: "Terraform Destroy" workflow and the destroy action run)

Example destroy workflow:

name: Destroy Infrastructure

on:
  workflow_dispatch:
    inputs:
      environment:
        description: 'Environment to destroy'
        required: true
        type: choice
        options: [dev, stg, prod]

jobs:
  destroy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        working-directory: infra/terraform
        run: |
          terraform init \
            -backend-config="bucket=${{ secrets.TERRAFORM_STATE_BUCKET }}" \
            -backend-config="key=terraform-state/${{ github.event.inputs.environment }}/terraform.tfstate" \
            -backend-config="region=${{ secrets.AWS_REGION }}" \
            -backend-config="dynamodb_table=${{ secrets.TERRAFORM_STATE_LOCK_TABLE }}" \
            -backend-config="encrypt=true"

      - name: Terraform Plan Destroy
        working-directory: infra/terraform
        run: |
          terraform plan -destroy \
            -var-file=terraform.${{ github.event.inputs.environment }}.tfvars \
            -var="key_pair_name=${{ secrets.TERRAFORM_KEY_PAIR_NAME }}" \
            -out=tfplan

      - name: Manual Approval
        id: manual-approval
        uses: trstringer/manual-approval@v1
        with:
          secret: ${{ github.TOKEN }}
          approvers: ${{ github.actor }}
          minimum-approvals: 1
          issue-title: "⚠️ DESTROY Infrastructure - ${{ github.event.inputs.environment }}"
          issue-body: |
            **⚠️ WARNING: Infrastructure Destruction Requested**

            This will **DELETE ALL INFRASTRUCTURE** for environment: **${{ github.event.inputs.environment }}**

            **This action cannot be undone!**

            Review the plan and approve only if you're sure.

      - name: Terraform Destroy
        if: steps.manual-approval.outcome == 'success'
        working-directory: infra/terraform
        run: terraform apply -auto-approve tfplan

Safety features:

  • ✅ Manual approval required (can't destroy by accident)
  • ✅ Shows what will be destroyed
  • ✅ Creates GitHub issue for review
  • ✅ Environment selection (prevents destroying wrong env)

Method 3: Destroy Specific Resources

Don't want to destroy everything? You can target specific resources:

# Destroy only the EC2 instance (keep security group)
terraform destroy -target=aws_instance.todo_app -var-file=terraform.dev.tfvars

# Destroy only the security group
terraform destroy -target=aws_security_group.todo_app -var-file=terraform.dev.tfvars

When to use this:

  • You want to recreate just one resource
  • Something is broken and you want to rebuild it
  • You're testing changes
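
A related option when you only want to rebuild one resource is Terraform's -replace flag, which destroys and recreates it in a single apply (a minimal sketch):

# Recreate just the EC2 instance without a separate destroy step
terraform apply -replace=aws_instance.todo_app -var-file=terraform.dev.tfvars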

After Destruction

After destroying, your state file still exists in S3, but it's empty (or has no resources). You can:

  1. Start fresh: Run terraform apply again to recreate everything
  2. Clean up state: Delete the state file from S3 (optional)
  3. Keep state: Leave it (Terraform will just create new resources)

Best practice: Keep the state file. It's useful for history and doesn't cost much.

Summary: The Complete Terraform Workflow

Here's the full picture of how everything works together:

┌─────────────────────────────────────────────────────────┐
│ 1. You make changes to Terraform code                   │
│    (or someone changes infrastructure manually)        │
└─────────────────┬───────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────┐
│ 2. GitHub Actions workflow runs                         │
│    - Checks out code                                    │
│    - Runs terraform plan                                │
└─────────────────┬───────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────┐
│ 3. Drift Detection                                      │
│    - Did Terraform files change?                        │
│    - Does plan show changes?                            │
│    - If no code changes + plan changes = DRIFT!         │
└─────────────────┬───────────────────────────────────────┘
                  │
        ┌─────────┴─────────┐
        │                   │
        ▼                   ▼
┌──────────────┐   ┌──────────────────┐
│ DRIFT        │   │ EXPECTED CHANGES │
│ DETECTED     │   │ (Code updated)   │
└──────┬───────┘   └────────┬─────────┘
       │                    │
       ▼                    ▼
┌──────────────────┐   ┌──────────────────┐
│ Send Email       │   │ Apply directly   │
│ Create Issue     │   │ (No approval)    │
│ Wait for Approval│   └──────────────────┘
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ You Review       │
│ & Approve        │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Apply Changes    │
│ (terraform apply)│
└──────────────────┘

Key takeaways:

  • ✅ Always run terraform plan first (see what will happen)
  • ✅ Drift detection protects you from unexpected changes
  • ✅ Email notifications keep you informed
  • ✅ Manual approval prevents accidents
  • ✅ Destroy carefully - it's permanent!

Part 5: Server Configuration with Ansible

Terraform created your server, but it's just a blank Ubuntu machine. Now we need to:

  1. Install Docker
  2. Clone your code
  3. Start the application

That's where Ansible comes in.

Understanding Ansible

Ansible is like having a robot assistant that can:

  • SSH into your servers
  • Run commands
  • Install software
  • Copy files
  • Start services

Why Ansible over SSH scripts?

  • Idempotent: Run it multiple times safely (won't break if run twice)
  • Declarative: You say "what" you want, not "how" to do it
  • Organized: Roles and playbooks keep things organized
  • Reusable: Write once, use for dev/stg/prod

Ansible Playbook Structure

---
- name: Configure TODO Application Server
  hosts: all
  become: yes  # Use sudo
  gather_facts: yes  # Collect info about the server

  vars:
    app_user: ubuntu
    app_dir: /opt/todo-app

  roles:
    - role: dependencies  # Install Docker, etc.
    - role: deploy        # Deploy the application

Breaking it down:

  • hosts: all - Run on all servers in inventory
  • become: yes - Use sudo (needed for installing packages)
  • gather_facts - Ansible learns about the server (OS, IP, etc.)
  • roles - Reusable collections of tasks

Creating the Dependencies Role

This role installs everything the server needs:

roles/dependencies/tasks/main.yml:

---
- name: Update apt cache
  apt:
    update_cache: yes
    cache_valid_time: 3600

- name: Install required packages
  apt:
    name:
      - git
      - curl
      - python3
      - python3-pip
    state: present

- name: Check if Docker is already installed
  command: docker --version
  register: docker_check
  changed_when: false
  failed_when: false
  ignore_errors: yes

# Note: installing docker-ce via apt assumes Docker's official apt
# repository has already been added (omitted here for brevity)
- name: Install Docker (only if not installed)
  apt:
    name:
      - docker-ce
      - docker-ce-cli
      - containerd.io
      - docker-compose-plugin
    state: present
  when: docker_check.rc != 0  # Only if Docker not found

- name: Add user to docker group
  user:
    name: "{{ app_user }}"
    groups: docker
    append: yes

- name: Start and enable Docker
  systemd:
    name: docker
    state: started
    enabled: yes

Key concepts:

  • register - Save command output to a variable (see the example after this list)
  • when - Conditional execution (only if condition is true)
  • changed_when: false - This task never "changes" anything (just checks)
  • state: present - Ensure package is installed (idempotent!)
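
To see register in action, you could add a small optional task that reports what the check found - purely illustrative, not required for the role to work:

# Optional: report what the Docker check registered
- name: Report Docker status
  debug:
    msg: "Docker check rc={{ docker_check.rc | default('n/a') }} - {{ 'already installed' if docker_check.rc | default(1) == 0 else 'will be installed' }}"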

Creating the Deploy Role

This role actually deploys your application:

roles/deploy/tasks/main.yml:

---
- name: Create application directory
  file:
    path: "{{ app_dir }}"
    state: directory
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    mode: '0755'

- name: Clone repository
  git:
    repo: "{{ repo_url }}"
    dest: "{{ app_dir }}"
    version: "{{ repo_branch | default('main') }}"
    update: yes
  register: git_pull_result
  changed_when: git_pull_result.changed

- name: Create .env file
  copy:
    content: |
      DOMAIN="{{ domain }}"
      LETSENCRYPT_EMAIL="{{ letsencrypt_email }}"
      JWT_SECRET="{{ jwt_secret }}"
      # ... other variables
    dest: "{{ app_dir }}/.env"
    owner: "{{ app_user }}"
    mode: '0600'
  register: env_file_result
  changed_when: env_file_result.changed

- name: Determine if rebuild is needed
  set_fact:
    needs_rebuild: "{{ git_pull_result.changed | default(false) or env_file_result.changed | default(false) }}"

- name: Build images if code/config changed
  shell: docker compose build
  args:
    chdir: "{{ app_dir }}"
  when: needs_rebuild | default(false)

- name: Start/update containers
  shell: docker compose up -d
  args:
    chdir: "{{ app_dir }}"

Making it idempotent:

  • Only rebuilds if code or config changed
  • docker compose up -d is idempotent (won't restart if nothing changed)
  • Safe to run multiple times
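
As an optional sanity check after the containers start, a task like this can verify the frontend responds. It's a hedged sketch - the URL and retry numbers are assumptions about your setup:

# Optional post-deploy smoke test
- name: Wait for the frontend to respond
  uri:
    url: "https://{{ domain }}"
    status_code: 200
    validate_certs: no   # the certificate may still be provisioning on first run
  register: smoke_test
  until: smoke_test.status == 200
  retries: 10
  delay: 15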

Environment-Specific Variables

Just like Terraform, Ansible needs separate configuration files for each environment. Create three files:

group_vars/dev/vars.yml:

---
domain: "dev.yourdomain.com"
letsencrypt_email: "dev-email@example.com"
jwt_secret: "dev-secret-key"
repo_url: "https://github.com/yourusername/path-to-codebase.git"
repo_branch: "dev"  # Use dev branch for development

group_vars/stg/vars.yml:

---
domain: "stg.yourdomain.com"
letsencrypt_email: "staging-email@example.com"
jwt_secret: "staging-secret-key"  # Different from dev!
repo_url: "https://github.com/yourusername/path-to-codebase.git"
repo_branch: "staging"  # Use staging branch for staging

group_vars/prod/vars.yml:

---
domain: "yourdomain.com"
letsencrypt_email: "prod-email@example.com"
jwt_secret: "super-secure-production-secret"  # Different per environment!
repo_url: "https://github.com/yourusername/path-to-codebase.git"
repo_branch: "main"  # Use main branch for production

Why three separate files? Each environment needs:

  • Different domains: dev.yourdomain.com, stg.yourdomain.com, yourdomain.com
  • Different secrets: If dev gets compromised, staging and prod are still safe
  • Different branches:
    • dev branch → development environment (experimental features)
    • staging branch → staging environment (testing before production)
    • main branch → production environment (stable, tested code)
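
One caveat: these files hold secrets in plain text. An optional hardening step (not used elsewhere in this guide) is to encrypt individual values with Ansible Vault:

# Encrypt a single value; paste the output into group_vars/prod/vars.yml
ansible-vault encrypt_string 'super-secure-production-secret' --name 'jwt_secret'

# Then supply the vault password when running the playbook
ansible-playbook -i inventory/prod.yml playbook.yml --ask-vault-pass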

File structure:

infra/ansible/
├── playbook.yml
├── inventory/
│   ├── dev.yml
│   ├── stg.yml
│   └── prod.yml
└── group_vars/
    ├── dev/
    │   └── vars.yml      ← Development variables
    ├── stg/
    │   └── vars.yml      ← Staging variables
    └── prod/
        └── vars.yml      ← Production variables

Branch strategy explained:

  • Development (dev branch): Where you experiment and develop new features
  • Staging (staging branch): Where you test features before they go to production
  • Production (main branch): The stable code that real users interact with

This way, you can test changes in dev/staging without affecting production!

Generating Inventory

Terraform automatically generates the Ansible inventory:

templates/inventory.tpl:

all:
  hosts:
    todo-app-server:
      ansible_host: ${server_ip}
      ansible_user: ${server_user}
      ansible_ssh_private_key_file: ${ssh_key_path}
      ansible_ssh_common_args: '-o StrictHostKeyChecking=no'

This gets generated as ansible/inventory/dev.yml (or stg.yml, prod.yml) with the actual server IP.
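
On the Terraform side, the rendering typically looks something like this sketch - the resource and variable names here are assumptions; only the template variables match the file above:

# Sketch: render the template and write the inventory file
# (requires the hashicorp/local provider)
resource "local_file" "ansible_inventory" {
  content = templatefile("${path.module}/templates/inventory.tpl", {
    server_ip    = aws_instance.todo_app.public_ip
    server_user  = "ubuntu"
    ssh_key_path = var.ssh_private_key_path
  })
  filename = "${path.module}/../ansible/inventory/${var.environment}.yml"
}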

Running Ansible

# From the ansible directory
cd infra/ansible

# Run the playbook
ansible-playbook -i inventory/dev.yml playbook.yml

# With verbose output (for debugging)
ansible-playbook -i inventory/dev.yml playbook.yml -vvv

What happens:

  1. Ansible connects to your server via SSH
  2. Runs the dependencies role (installs Docker)
  3. Runs the deploy role (clones code, starts containers)
  4. Your application is live!
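
Before a real run, a dry run shows what would change without touching the server:

# Dry run: preview changes only
ansible-playbook -i inventory/dev.yml playbook.yml --check --diff

# Re-run only the deployment tasks (the CI/CD workflow later uses --tags deploy too)
ansible-playbook -i inventory/dev.yml playbook.yml --tags deploy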

Part 6: CI/CD with GitHub Actions

Now we automate everything. Instead of running commands manually, GitHub Actions does it for us.

Understanding CI/CD

CI (Continuous Integration): Automatically test and build when code changes
CD (Continuous Deployment): Automatically deploy when tests pass

Why CI/CD?

  • Consistency: Same process every time
  • Speed: Deploy in minutes, not hours
  • Safety: Automated tests catch bugs before production
  • History: See what was deployed when

Setting Up GitHub Secrets

Before workflows can run, they need credentials:

  1. Go to your GitHub repo → Settings → Secrets and variables → Actions
  2. Add these secrets:
    • AWS_ACCESS_KEY_ID - From your IAM user
    • AWS_SECRET_ACCESS_KEY - From your IAM user
    • TERRAFORM_STATE_BUCKET - Your S3 bucket name
    • TERRAFORM_STATE_LOCK_TABLE - Your DynamoDB table name
    • TERRAFORM_KEY_PAIR_NAME - Your EC2 key pair name
    • SSH_PRIVATE_KEY - Contents of your .pem file
    • EMAIL_TO - Where to send drift alerts
    • EMAIL_FROM - Your verified SES email

Infrastructure Workflow

This workflow runs when infrastructure code changes:

name: Infrastructure Deployment

on:
  push:
    paths:
      - 'infra/terraform/**'
      - 'infra/ansible/**'
  workflow_dispatch:
    inputs:
      environment:
        description: 'Environment (dev, stg, prod)'
        required: true
        type: choice
        options: [dev, stg, prod]

jobs:
  terraform-plan:
    name: Terraform Plan & Drift Detection
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        working-directory: infra/terraform
        run: terraform init -backend-config=...

      - name: Terraform Plan
        working-directory: infra/terraform
        run: terraform plan -out=tfplan

      - name: Check for Drift
        id: drift-check
        run: |
          # Detect if this is drift (infrastructure changed outside Terraform)
          # vs expected changes (Terraform code changed)

      - name: Send Drift Email
        if: steps.drift-check.outputs.change_type == 'drift'
        run: ./infra/ci-cd/scripts/email-notification.sh "$(cat drift_summary.txt)"

      - name: Manual Approval
        if: steps.drift-check.outputs.change_type == 'drift'
        id: manual-approval
        uses: trstringer/manual-approval@v1

      - name: Terraform Apply
        if: steps.drift-check.outputs.change_type != 'drift' || steps.manual-approval.outcome == 'success'
        working-directory: infra/terraform
        run: terraform apply -auto-approve tfplan

Drift Detection in CI/CD (Quick Reference)

Note: For a detailed explanation of drift detection, email setup, and GitHub approval, see the "Understanding Drift Detection" section earlier in this guide.

Quick summary:

  • Drift = Infrastructure changed outside Terraform
  • Detected automatically in CI/CD
  • Email sent + GitHub issue created
  • Manual approval required before applying

Application Deployment Workflow

Separate workflow for application code changes:

name: Application Deployment

on:
  push:
    paths:
      - 'frontend/**'
      - 'auth-api/**'
      - 'todos-api/**'
      - 'docker-compose.yml'
  workflow_dispatch:
    inputs:
      environment:
        description: 'Environment'
        type: choice
        options: [dev, stg, prod]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Get Server IP
        run: |
          # Find server by tag
          INSTANCE_ID=$(aws ec2 describe-instances ...)
          SERVER_IP=$(aws ec2 describe-instances ...)

      - name: Deploy with Ansible
        run: |
          ansible-playbook -i inventory/${ENV}.yml playbook.yml --tags deploy

Why separate workflows? Infrastructure changes are rare and need careful review. Application changes are frequent and should deploy quickly.
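
The elided "Get Server IP" step usually boils down to a tag-based lookup like this - a sketch, and the tag key/value are assumptions about how your instances are tagged:

# Look up the running instance's public IP by tag
SERVER_IP=$(aws ec2 describe-instances \
  --filters "Name=tag:Environment,Values=${ENV}" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PublicIpAddress" \
  --output text)
echo "Deploying to $SERVER_IP"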

Infrastructure Destruction Workflow

⚠️ CRITICAL: This workflow DESTROYS EVERYTHING. Use with extreme caution!

The destroy workflow is separate from the deployment workflow for safety. It has multiple confirmation steps to prevent accidental destruction.

How the Destroy Workflow Works

Step 1: Manual Trigger Only

  • Only runs when you manually trigger it (no automatic triggers)
  • Requires you to select the environment
  • Requires you to type "DESTROY" to confirm

Step 2: Validation

  • Checks that you typed "DESTROY" correctly (case-sensitive)
  • Prevents typos from accidentally destroying infrastructure

Step 3: State File Handling

  • Tries to download state from artifacts (most recent)
  • Falls back to S3 remote backend if artifacts missing
  • Imports resources if state is completely missing

Step 4: Destroy Plan

  • Shows you exactly what will be destroyed
  • Review this carefully before proceeding

Step 5: Destruction

  • Deletes all resources in the correct order
  • Handles dependencies (e.g., detaches volumes before deleting)

Step 6: Verification

  • Checks that everything was destroyed
  • Cleans up orphaned resources
  • Provides a summary

The Complete Destroy Workflow

Here's what the actual workflow looks like:

name: Infrastructure Destruction

on:
  workflow_dispatch:  # Manual trigger only - safe!
    inputs:
      environment:
        description: 'Environment to destroy (dev, stg, prod)'
        required: true
        type: choice
        options: [dev, stg, prod]
        default: 'dev'
      confirm_destroy:
        description: 'Type "DESTROY" to confirm (case-sensitive)'
        required: true
        type: string

jobs:
  validate-destroy:
    name: Validate Destruction Request
    runs-on: ubuntu-latest
    steps:
      - name: Validate confirmation
        run: |
          if [ "${{ github.event.inputs.confirm_destroy }}" != "DESTROY" ]; then
            echo "❌ Invalid confirmation. You must type 'DESTROY' to proceed."
            exit 1
          fi
          echo "✅ Destruction confirmed. Proceeding..."

  terraform-destroy:
    name: Destroy Infrastructure
    needs: validate-destroy
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init (with S3 backend)
        working-directory: infra/terraform
        run: |
          terraform init \
            -backend-config="bucket=${{ secrets.TERRAFORM_STATE_BUCKET }}" \
            -backend-config="key=terraform-state/${{ github.event.inputs.environment }}/terraform.tfstate" \
            -backend-config="region=${{ secrets.AWS_REGION }}" \
            -backend-config="dynamodb_table=${{ secrets.TERRAFORM_STATE_LOCK_TABLE }}" \
            -backend-config="encrypt=true"

      - name: Terraform Plan Destroy
        working-directory: infra/terraform
        run: |
          terraform plan -destroy \
            -var-file=terraform.${{ github.event.inputs.environment }}.tfvars \
            -var="key_pair_name=${{ secrets.TERRAFORM_KEY_PAIR_NAME }}" \
            -out=destroy.tfplan
          echo ""
          echo "⚠️ DESTRUCTION PLAN SUMMARY:"
          terraform show -no-color destroy.tfplan | head -100

      - name: Terraform Destroy
        working-directory: infra/terraform
        run: |
          echo "🔥 Starting infrastructure destruction..."
          terraform destroy -auto-approve \
            -var-file=terraform.${{ github.event.inputs.environment }}.tfvars \
            -var="key_pair_name=${{ secrets.TERRAFORM_KEY_PAIR_NAME }}"

      - name: Verify Destruction
        working-directory: infra/terraform
        run: |
          echo "🔍 Verifying all resources are destroyed..."
          # Check for orphaned resources and clean them up

How to Use the Destroy Workflow

Step 1: Go to GitHub Actions

  1. Open your repository on GitHub
  2. Click the "Actions" tab
  3. Find "Infrastructure Destruction" in the workflow list

Step 2: Run the Workflow

  1. Click "Run workflow" button (top right)
  2. Select the environment you want to destroy:
    • ⚠️ Be very careful - make sure you select the right one!
    • Dev is usually safe to destroy
    • Staging should be destroyed carefully
    • NEVER destroy production unless absolutely necessary!
  3. In the "Type DESTROY to confirm" field, type exactly: DESTROY
    • Must be all caps
    • Must be exactly "DESTROY" (no extra spaces)
  4. Click "Run workflow"

Step 3: Watch It Run

  1. The workflow will start with a validation job
    • Checks that you typed "DESTROY" correctly
    • If wrong, workflow fails immediately (safe!)
  2. Then the terraform-destroy job runs:
    • Initializes Terraform with the correct backend
    • Creates a destroy plan (shows what will be deleted)
    • Review the plan carefully!
    • Destroys all resources
    • Verifies everything is gone

Step 4: Review the Results

  • Check the workflow logs
  • Verify in AWS Console that resources are gone
  • Check that costs are now $0.00

Safety Features

The destroy workflow has multiple safety features:

  1. Manual trigger only - Can't be triggered automatically
  2. Confirmation required - Must type "DESTROY" exactly
  3. Environment selection - Prevents destroying wrong environment
  4. Plan before destroy - Shows you what will be deleted
  5. Validation job - Double-checks confirmation before proceeding
  6. State file handling - Works with remote state (S3)
  7. Verification - Checks that everything was destroyed

What Gets Destroyed

When you run the destroy workflow, it deletes:

  • EC2 instance - Your server and everything on it
  • Security groups - Firewall rules
  • EBS volumes - Any attached storage (if using EBS for state)
  • All containers - Docker containers running on the instance
  • All data - Everything on the server is permanently lost

What stays:

  • S3 bucket - Your Terraform state bucket (not deleted)
  • DynamoDB table - State locking table (not deleted)
  • GitHub repository - Your code (not deleted)

When to Use the Destroy Workflow

Good reasons to destroy:

  • ✅ You're done with the project and want to stop costs
  • ✅ You want to start completely fresh
  • ✅ You're testing and need to clean up
  • ✅ You're moving to a different AWS account

Bad reasons to destroy:

  • ❌ Just to restart services (use Ansible instead)
  • ❌ To fix a small issue (fix the issue, don't destroy)
  • ❌ Because something isn't working (debug first)
  • ❌ In production without a backup plan

After Destruction

What happens:

  1. All infrastructure is deleted
  2. State file in S3 is updated (shows no resources)
  3. You stop paying for AWS resources
  4. All data is permanently lost

To recreate:

  1. Run the Infrastructure Deployment workflow again
  2. It will create everything from scratch
  3. You'll need to redeploy your application

Important notes:

  • State file history is preserved in S3
  • You can see what was destroyed in the workflow logs
  • GitHub Actions artifacts are kept for 90 days
  • You can manually delete artifacts if needed

Destroy Workflow vs Manual Destroy

Use the workflow when:

  • ✅ You want safety features (confirmation, validation)
  • ✅ You want to destroy from anywhere (don't need local setup)
  • ✅ You want an audit trail (GitHub Actions logs)
  • ✅ You're working with a team (everyone can see what happened)

Use manual destroy when:

  • ✅ You need to destroy specific resources only
  • ✅ You're debugging and need more control
  • ✅ You don't have GitHub Actions set up

Example: Destroying Development Environment

Let's walk through destroying a dev environment:

  1. Go to Actions → Infrastructure Destruction
  2. Click "Run workflow"
  3. Select environment: dev
  4. Type confirmation: DESTROY
  5. Click "Run workflow"

What you'll see:

✅ validate-destroy: Validation passed
✅ terraform-destroy: 
   - Terraform Init: Success
   - Terraform Plan Destroy: Shows what will be deleted
   - Terraform Destroy: Deleting resources...
   - Verify Destruction: All resources destroyed

After completion:

  • Check AWS Console → EC2 → No instances
  • Check AWS Console → Security Groups → No groups
  • Check AWS Billing → Costs should be $0.00

Troubleshooting Destroy Workflow

Issue: "Invalid confirmation"

  • Problem: You didn't type "DESTROY" exactly
  • Solution: Type exactly DESTROY (all caps, no spaces)

Issue: "State file not found"

  • Problem: State file is missing or in wrong location
  • Solution: Workflow will try to import resources automatically

Issue: "Resources still exist after destroy"

  • Problem: Some resources might be stuck
  • Solution: Check the verification step - it will try to clean up orphaned resources

Issue: "Can't destroy because of dependencies"

  • Problem: Resources have dependencies (e.g., volume attached)
  • Solution: Workflow handles this automatically (detaches volumes first)

Best Practices for Destruction

  1. Always destroy dev first - Test the workflow in dev before using in staging/prod
  2. Review the plan - Check what will be destroyed before confirming
  3. Backup important data - If you need any data, back it up first
  4. Destroy during off-hours - If others are using the environment
  5. Document why - Add a comment in the workflow run explaining why you destroyed
  6. Verify after - Check AWS Console to confirm everything is gone
  7. Clean up artifacts - Delete GitHub Actions artifacts if you want

Part 7: Single Command Deployment

The ultimate goal: one command that does everything.

How It Works

When you run:

terraform apply -var-file=terraform.dev.tfvars -auto-approve

Here's what happens behind the scenes:

  1. Terraform provisions infrastructure

    • Creates security group
    • Launches EC2 instance
    • Waits for instance to be ready
  2. Terraform generates Ansible inventory

    • Creates ansible/inventory/dev.yml with server IP
    • Ready for Ansible to use
  3. Terraform triggers Ansible (via null_resource)

    • Waits for SSH to be available
    • Runs Ansible playbook
    • Installs Docker
    • Clones repository
    • Starts containers
  4. Traefik gets SSL certificate

    • Contacts Let's Encrypt
    • Verifies domain ownership
    • Gets certificate
    • Enables HTTPS
  5. Application is live!

    • Frontend accessible at https://yourdomain.com
    • APIs at https://yourdomain.com/api/*

The Magic: null_resource

resource "null_resource" "ansible_provision" {
  triggers = {
    instance_id = aws_instance.todo_app.id
  }

  provisioner "local-exec" {
    command = <<-EOT
      # Wait for SSH
      until ssh ... 'echo "ready"'; do sleep 10; done

      # Run Ansible
      cd ../ansible
      ansible-playbook -i inventory/${var.environment}.yml playbook.yml
    EOT
  }
}

What is null_resource? It's a Terraform resource that doesn't create anything in AWS. It just runs a command. Perfect for triggering Ansible!

Why the wait? EC2 instances take 30-60 seconds to boot. We wait for SSH to be ready before running Ansible.
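
The elided wait loop can be expanded roughly like this - the key path, user, and SERVER_IP variable are assumptions:

# Keep trying SSH until the instance accepts connections
until ssh -o StrictHostKeyChecking=no -o ConnectTimeout=5 \
      -i ~/.ssh/your-key.pem ubuntu@"$SERVER_IP" 'echo ready' 2>/dev/null; do
  echo "Waiting for SSH to come up..."
  sleep 10
done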

Testing the Single Command

# Make sure you're in the terraform directory
cd infra/terraform

# Initialize (one-time setup)
terraform init -backend-config=...

# The magic command
terraform apply -var-file=terraform.dev.tfvars -auto-approve

# Watch it work!
# You'll see:
# 1. Security group created
# 2. EC2 instance launching
# 3. Waiting for SSH...
# 4. Running Ansible...
# 5. Application deployed!

Pro tip: The first run takes 5-10 minutes. Subsequent runs are faster (only changes what's needed).


Part 8: Multi-Environment Setup

Real applications need multiple environments. Here's how to set it up properly.

Why Multiple Environments?

  • Dev: Where you experiment (break things safely)
  • Staging: Mirror of production (test before going live)
  • Production: The real thing (users depend on it)

Environment Isolation

Each environment is completely separate:

  • Different EC2 instances
  • Different security groups
  • Different state files in S3
  • Different domains
  • Different secrets

Why isolation matters: If dev gets hacked, staging and prod are still safe. If you break dev, staging and prod keep running. This is why we have three separate environments!

Setting Up Per-Environment Configuration

Terraform:

  • terraform.dev.tfvars - Dev configuration
  • terraform.stg.tfvars - Staging configuration
  • terraform.prod.tfvars - Production configuration

Ansible:

  • group_vars/dev/vars.yml - Dev variables
  • group_vars/stg/vars.yml - Staging variables
  • group_vars/prod/vars.yml - Production variables

State files:

  • terraform-state/dev/terraform.tfstate
  • terraform-state/stg/terraform.tfstate
  • terraform-state/prod/terraform.tfstate

Deploying to Different Environments

Via GitHub Actions:

  1. Go to Actions → Infrastructure Deployment
  2. Click "Run workflow"
  3. Select environment (dev/stg/prod)
  4. Click "Run workflow"

Via command line:

# Dev
terraform apply -var-file=terraform.dev.tfvars -auto-approve

# Staging
terraform apply -var-file=terraform.stg.tfvars -auto-approve

# Production
terraform apply -var-file=terraform.prod.tfvars -auto-approve

Important: Always test in dev first! Never deploy to prod without testing.


Part 9: Common Issues and Solutions

Every project has issues. Here are the ones you'll likely encounter and how to fix them.

Issue 1: Let's Encrypt Certificate Errors

Symptom: Timeout during connect (likely firewall problem)

Causes:

  1. DNS not pointing to your server
  2. Security group blocking ports 80/443
  3. Firewall on the server blocking ports

Solutions:

# 1. Verify DNS
dig yourdomain.com
# Should show your server IP

# 2. Check security group
# AWS Console → EC2 → Security Groups
# Verify ports 80 and 443 allow 0.0.0.0/0

# 3. Check server firewall
sudo ufw status
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp

# 4. Switch to HTTP challenge (more reliable)
# In docker-compose.yml:
- "--certificatesresolvers.letsencrypt.acme.httpchallenge=true"
- "--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web"

Issue 2: Terraform State Lock

Symptom: Error acquiring the state lock

Cause: Another Terraform run is in progress, or a previous run crashed

Solution:

# Check what's locking
aws dynamodb scan --table-name terraform-state-lock

# If you're sure no one else is running Terraform:
terraform force-unlock <LOCK_ID>

# Be careful! Only do this if you're certain.

Issue 3: Ansible Connection Failed

Symptom: SSH connection failed or Permission denied

Causes:

  1. Security group doesn't allow SSH from your IP
  2. Wrong key pair
  3. Server not ready yet

Solutions:

# 1. Test SSH manually
ssh -i ~/.ssh/your-key.pem ubuntu@<server-ip>

# 2. Check security group
# Make sure it allows port 22 from your IP

# 3. Verify key pair name matches
# AWS Console → EC2 → Key Pairs
# Should match what's in terraform.tfvars

# 4. Wait longer (server might still be booting)
# EC2 instances take 1-2 minutes to be ready

Issue 4: Drift Detection Not Working

Symptom: Changes made manually but drift not detected

Check:

  1. Did Terraform files change? (That's "expected", not drift)
  2. Is state in S3? (Local state won't work properly)
  3. Check drift detection logic in workflow

Test drift:

  1. Manually add a tag to your EC2 instance in AWS Console
  2. Run the infrastructure workflow
  3. Should detect drift and send email

Issue 5: Containers Keep Restarting

Symptom: docker ps shows containers restarting constantly

Debug:

# Check logs
docker logs <container-name>

# Check all containers
docker compose logs

# Common causes:
# - Configuration error in .env
# - Port conflict
# - Missing environment variables
# - Application crash on startup
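
One optional way to make restart loops easier to spot is a healthcheck in docker-compose.yml. The service name, port, and path below are assumptions about your setup - adjust them to your app, and make sure the check command exists inside the image:

# Hypothetical healthcheck for one service
services:
  todos-api:
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:8082/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 20s

With a healthcheck in place, docker ps shows (healthy) or (unhealthy) next to the container status, which narrows down which service is actually failing.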

Part 10: Best Practices and Security

Now that everything works, let's make it production-ready.

Security Best Practices

  1. Never commit secrets

    • Use GitHub Secrets
    • Use environment variables
    • Add .env to .gitignore (see the sketch after this list)
  2. Restrict SSH access

    • In production, set ssh_cidr to your IP only
    • Use YOUR_IP/32 format (e.g., 1.2.3.4/32)
  3. Use different secrets per environment

    • Dev JWT secret ≠ Staging JWT secret ≠ Prod JWT secret
    • If dev is compromised, staging and prod are still safe
  4. Enable MFA

    • On AWS account
    • On GitHub account
    • Extra layer of protection
  5. Regular updates

    • Keep Docker images updated
    • Keep system packages updated
    • Security patches are important!
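
A minimal .gitignore sketch that covers the usual offenders (adjust to your repo layout):

# Secrets and local configuration
.env
*.pem

# Terraform state and local working files (state lives in S3, never in git)
*.tfstate
*.tfstate.*
.terraform/
crash.log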

Infrastructure Best Practices

  1. Always use remote state

    • S3 + DynamoDB
    • Never commit state files
    • Enable versioning on S3 bucket
  2. Separate state per environment

    • Different S3 keys
    • Complete isolation
    • Can't accidentally affect prod from dev
  3. Use version constraints

    • In Terraform: version = "~> 5.0"
    • Prevents unexpected breaking changes (see the example after this list)
  4. Tag everything

    • Makes it easy to find resources
    • Helps with cost tracking
    • Required for organization
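
For reference, version constraints live in the terraform block - a minimal sketch:

terraform {
  required_version = ">= 1.5.0"   # assumed minimum; match the version you use locally

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"   # allow any 5.x, block a breaking 6.0 upgrade
    }
  }
}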

Deployment Best Practices

  1. Test in dev first

    • Always deploy to dev → staging → prod (in that order!)
    • Catch issues early before they reach production
    • Dev is for breaking things
  2. Review drift alerts

    • Don't ignore them!
    • Investigate unexpected changes
    • Could be security issue
  3. Use idempotent deployments

    • Safe to run multiple times
    • Ansible should be idempotent
    • Terraform is idempotent by design
  4. Monitor your infrastructure

    • Set up CloudWatch alarms
    • Monitor costs
    • Watch for unusual activity

Cost Optimization

  1. Right-size instances

    • Dev: t3.small (saves money)
    • Prod: t3.medium (enough power)
    • Don't over-provision
  2. Stop dev when not in use

    • Dev doesn't need to run 24/7
    • Stop instances when not testing (see the commands after this list)
    • Saves ~70% of costs
  3. Clean up unused resources

    • Delete old instances
    • Remove unused security groups
    • Regular cleanup prevents waste
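
Stopping and starting the dev instance is a one-liner each way (the instance ID is a placeholder). Keep in mind the public IP usually changes after a stop/start unless you attach an Elastic IP:

# Stop the dev instance when you're done for the day
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# Start it again when you need it
aws ec2 start-instances --instance-ids i-0123456789abcdef0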

Part 11: Going Further

You've built a solid foundation. Here's where to go next.

Monitoring and Observability

Add CloudWatch:

  • Monitor CPU, memory, disk
  • Set up alarms
  • Track costs

Add Application Monitoring:

  • Prometheus + Grafana
  • ELK stack for logs
  • APM tools (New Relic, Datadog)

Scaling

Horizontal Scaling:

  • Add load balancer
  • Multiple instances
  • Auto-scaling groups

Vertical Scaling:

  • Larger instance types
  • More CPU/RAM
  • Better performance

Backup and Disaster Recovery

Backup Strategy:

  • Database backups
  • State file backups (S3 versioning)
  • Configuration backups

Disaster Recovery:

  • Multi-region deployment
  • Automated failover
  • Recovery procedures

Advanced Topics

  • Kubernetes: Container orchestration at scale
  • Terraform Modules: Reusable infrastructure code
  • Ansible Roles: Shareable configuration
  • GitOps: Git as source of truth
  • Infrastructure Testing: Test your infrastructure code

Conclusion: What You've Accomplished

Let's take a moment to appreciate what you've built:

  • ✅ A microservices application running in containers
  • ✅ Automated infrastructure with Terraform
  • ✅ Automated deployment with Ansible
  • ✅ CI/CD pipelines that detect problems
  • ✅ Multi-environment setup (dev/staging/prod)
  • ✅ Secure HTTPS with automatic certificates
  • ✅ Single-command deployment that just works
  • ✅ Production-ready practices and security

This isn't just a tutorial project - this is real infrastructure that follows industry best practices. You can use this as a foundation for actual production applications.

Key Takeaways

  1. Infrastructure as Code saves time and prevents mistakes
  2. Automation is your friend - manual processes break
  3. Security isn't optional - build it in from the start
  4. Testing in dev/staging prevents production disasters
  5. Documentation (this blog post!) helps you and others

Next Steps

  1. Deploy your own project using this as a template
  2. Experiment - break things in dev, learn from it
  3. Share - help others learn what you've learned
  4. Iterate - improve based on real-world experience

Resources


Thank you for reading! If this helped you, please share it with others who might benefit. And if you have questions or run into issues, don't hesitate to reach out.

Happy deploying! 🚀


This guide was written as part of the HNG Internship Stage 6 DevOps task. The complete implementation is available on GitHub.
