Mart Young

From Zero to Production: A Complete Guide to Deploying Microservices with Terraform, Ansible and CI/CD

How I built a production-ready DevOps pipeline for a microservices TODO application - and how you can too, even if you're just starting out.


Introduction: Why This Matters

If you're reading this, you've probably heard terms like "DevOps," "Infrastructure as Code," and "CI/CD" thrown around, but maybe you're not entirely sure what they mean or how they fit together. That's exactly where I was when I started.

This guide isn't just about completing a task - it's about understanding the why behind each decision, learning from common mistakes, and building something you can be proud of. By the end, you'll have deployed a real application to the cloud with automated infrastructure, proper security, and a professional workflow.

What you'll build:

  • A microservices application with 5 different services (Vue.js, Go, Node.js, Java, Python)
  • Automated cloud infrastructure using Terraform
  • Server configuration and deployment with Ansible
  • CI/CD pipelines that detect when things go wrong
  • Multi-environment setup (dev, staging, production)
  • Secure HTTPS with automatic SSL certificates
  • A single command that deploys everything

What you'll learn:

  • How containerization actually works (beyond just "docker run")
  • Why infrastructure as code matters (and how it saves you from disasters)
  • How to think about security in a cloud environment
  • The importance of automation and what happens when you skip it

Let's dive in.


Part 1: Understanding What We're Building

Before we start writing code, let's understand what we're actually building. This isn't just a TODO app - it's a microservices architecture, which means instead of one big application, we have multiple small services that work together.

The Architecture

Think of it like a restaurant:

  • Frontend (Vue.js) - The dining room where customers interact
  • Auth API (Go) - The host who checks if you have a reservation (authentication)
  • Todos API (Node.js) - The waiter who takes your order (manages your todos)
  • Users API (Java) - The manager who knows all the customers (user management)
  • Log Processor (Python) - The kitchen staff who process orders (background processing)
  • Redis - The order board where everyone can see what's happening (message queue)

Each service runs in its own container, which is like giving each part of the restaurant its own kitchen. If the waiter (Todos API) has a problem, it doesn't crash the whole restaurant.

Why Containerization?

You might be thinking: "Why not just run everything on one server?" Great question! Here's why containers matter:

  1. Isolation: If one service crashes, others keep running
  2. Consistency: "It works on my machine" becomes "it works everywhere"
  3. Scalability: Need more power? Spin up more containers
  4. Portability: Move from AWS to Azure? Just change where you run containers

The Challenge

The real challenge isn't just getting containers to run - it's:

  • Making sure they can talk to each other
  • Securing them with HTTPS
  • Automating deployment so you don't manually SSH into servers
  • Detecting when someone changes things manually (drift detection)
  • Managing multiple environments without chaos

That's what makes this a real-world project.


Part 2: Setting Up Your Development Environment

Before we write any code, let's make sure you have everything you need. Don't worry if some of these are new - I'll explain what each one does.

Required Accounts

GitHub Account (Free)

  • This is where your code lives and where CI/CD runs
  • Think of it as your code's home and your automation's brain
  • Sign up at github.com if you don't have one

AWS Account (Free tier available)

  • This is where your servers will run
  • AWS has a free tier that's perfect for learning
  • You'll need a credit card; the free tier covers a lot, but the t3.small/t3.medium instances used later in this guide are not free-tier eligible, so expect a small bill (or substitute a free-tier size such as t2.micro while learning)
  • Sign up at aws.amazon.com

Domain Name (Optional but recommended - ~$10-15/year)

  • This is your website's address (like yourname.com)
  • You can use services like Namecheap, GoDaddy, or Cloudflare
  • Why you need it: Let's Encrypt (free SSL) requires a real domain
  • Alternative: You can test with localhost but won't get real SSL

Installing Required Tools

Docker & Docker Compose

# On Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker.io docker-compose-plugin

# Verify installation
docker --version
docker compose version
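
If you'd rather not prefix every command with sudo, a common post-install step is to add your user to the docker group (log out and back in for it to take effect):

# Allow your user to run docker without sudo
sudo usermod -aG docker $USER

# Start a shell with the new group applied, then test
newgrp docker
docker run hello-world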

What is Docker? Think of it as a shipping container for software. Just like shipping containers standardize how goods are transported, Docker standardizes how applications run.

Terraform (Version 1.5.0 or higher)

# Download from hashicorp.com or use package manager
wget https://releases.hashicorp.com/terraform/1.5.0/terraform_1.5.0_linux_amd64.zip
unzip terraform_1.5.0_linux_amd64.zip
sudo mv terraform /usr/local/bin/

# Verify
terraform version

What is Terraform? It's like a blueprint for your cloud infrastructure. Instead of clicking buttons in AWS console (which you'll forget), you write code that describes what you want, and Terraform makes it happen.

Ansible

sudo apt-get install ansible

# Verify
ansible --version

What is Ansible? Think of it as a remote control for servers. Instead of SSHing into each server and typing commands, you write a "playbook" that tells Ansible what to do, and it does it on all your servers.

AWS CLI

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

# Configure with your credentials
aws configure

What is AWS CLI? It's a command-line interface to AWS. Instead of using the web console, you can control AWS from your terminal.

Setting Up AWS

This is where many beginners get stuck, so let's go through it step by step.

Step 1: Create an IAM User

Why not use your root account? Security best practice - root account has unlimited power. If it gets compromised, your entire AWS account is at risk.

  1. Go to AWS Console → IAM → Users
  2. Click "Create user"
  3. Name it something like terraform-user
  4. Important: For Terraform you only need programmatic access (an Access Key ID and Secret Access Key); only check "Provide user access to the AWS Management Console" if you also want to log in to the console as this user
  5. Attach policies:
    • AmazonEC2FullAccess (for creating servers)
    • AmazonS3FullAccess (for storing Terraform state)
    • AmazonDynamoDBFullAccess (for state locking)
    • AmazonSESFullAccess (for email notifications)
  6. Save the Access Key ID and Secret Access Key - you'll need these!

Step 2: Create S3 Bucket for Terraform State

What is Terraform state? Terraform needs to remember what it created. This "memory" is stored in a state file. We put it in S3 so it's:

  • Backed up automatically
  • Accessible from anywhere
  • Versioned (can see history)

  1. Go to S3 → Create bucket
  2. Name it something like yourname-terraform-state
  3. Important settings:
    • Region: Choose one (remember which one!)
    • Block Public Access: Keep all enabled (security)
    • Versioning: Enable this (so you can recover if state gets corrupted)
  4. Click Create
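
If you prefer the terminal, a rough CLI equivalent of those console clicks looks like this (use your own bucket name - they're globally unique - and note that regions other than us-east-1 also need a --create-bucket-configuration LocationConstraint):

# Create the bucket
aws s3api create-bucket --bucket yourname-terraform-state --region us-east-1

# Enable versioning so old state files can be recovered
aws s3api put-bucket-versioning \
  --bucket yourname-terraform-state \
  --versioning-configuration Status=Enabled

# Keep all public access blocked
aws s3api put-public-access-block \
  --bucket yourname-terraform-state \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true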

Step 3: Create DynamoDB Table for State Locking

Why do we need locking? Imagine two people trying to deploy at the same time. Without locking, they might both try to create the same server, causing conflicts. DynamoDB prevents this.

  1. Go to DynamoDB → Create table
  2. Table name: terraform-state-lock
  3. Partition key: LockID (type: String)
  4. Table settings: Use default
  5. Capacity: On-demand (pay per request - perfect for this use case)
  6. Click Create
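
The CLI equivalent, if you'd rather script it (same table name and key as above):

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST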

Step 4: Create EC2 Key Pair

What is a key pair? It's like a password, but more secure. Instead of typing a password, you use a private key file to authenticate.

  1. Go to EC2 → Key Pairs → Create key pair
  2. Name: my-terraform-key (or whatever you prefer)
  3. Key pair type: RSA
  4. Private key file format: .pem
  5. Click Create
  6. IMPORTANT: The .pem file downloads automatically. Save it somewhere safe! You'll need it to SSH into your servers.
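
Once the .pem file is downloaded, lock down its permissions or SSH will refuse to use it. A quick sketch, assuming you keep the key in ~/.ssh/ (adjust the paths and server IP to your setup):

# Move the key somewhere safe and restrict permissions (SSH requires this)
mv ~/Downloads/my-terraform-key.pem ~/.ssh/
chmod 400 ~/.ssh/my-terraform-key.pem

# Later, once Terraform has created the server, connect like this
ssh -i ~/.ssh/my-terraform-key.pem ubuntu@<your-server-ip>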

Step 5: Verify AWS CLI Works

# Test your credentials
aws sts get-caller-identity

# Should show your user ARN

If this works, you're all set! If not, check your aws configure settings.


Part 3: Containerizing Your Application

Now that your environment is set up, let's containerize the application. This is where the magic happens.

Understanding Dockerfiles

A Dockerfile is like a recipe. It tells Docker:

  1. What base image to start with (like choosing an operating system)
  2. What files to copy
  3. What commands to run
  4. What port to expose
  5. What command to run when the container starts

Creating Dockerfiles for All Services

Now, you might be wondering: "How do I know what Dockerfile to create for each service?" Great question! Let me show you the pattern.

The rule of thumb: Each service folder needs its own Dockerfile. Look at your project structure:

DevOps-deployment/
├── frontend/          → Needs Dockerfile (Vue.js)
├── auth-api/          → Needs Dockerfile (Go) 
├── todos-api/         → Needs Dockerfile (Node.js)
├── users-api/         → Needs Dockerfile (Java)
└── log-message-processor/ → Needs Dockerfile (Python)

How to figure out what each service needs:

  1. Check what language/framework it uses (look for package.json, pom.xml, requirements.txt, go.mod)
  2. Find the entry point (usually server.js, main.go, main.py, or a compiled JAR)
  3. Determine the port it runs on (check the code or config files)
  4. Follow the pattern for that language

Frontend Dockerfile (Vue.js)

First, check the service:

cd frontend
ls -la
# You'll see: package.json, src/, public/
# This tells you: It's a Vue.js app that needs to be built

Vue.js apps are special - they compile to static HTML/CSS/JS files that need a web server. We use a "multi-stage build":

  1. Build stage: Use Node.js to compile the Vue app
  2. Runtime stage: Use nginx (lightweight web server) to serve the compiled files

# Step 1: Build the application
FROM node:18-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build

# Step 2: Serve it with nginx
FROM nginx:alpine
COPY --from=build /app/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/conf.d/default.conf
EXPOSE 80

Breaking it down:

  • FROM node:18-alpine AS build - Start with Node.js for building (the AS build names this stage)
  • COPY package*.json ./ - Copy dependency files first (Docker caching optimization!)
  • RUN npm install - Install dependencies
  • RUN npm run build - Compile Vue.js to static files (creates dist/ folder)
  • FROM nginx:alpine - Start a NEW stage with nginx (much smaller image)
  • COPY --from=build - Copy the built files from the build stage
  • EXPOSE 80 - nginx serves on port 80

Why two stages? The build stage has Node.js + all build tools (~500MB). The runtime stage only has nginx + static files (~20MB). This makes the final image 25x smaller!
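
The Dockerfile above copies an nginx.conf into the image but doesn't show it. Here's a minimal sketch of what that file might contain for a single-page app - an assumption, since your actual config may differ (for example, if nginx also proxies API paths instead of leaving routing to Traefik):

server {
    listen 80;
    root /usr/share/nginx/html;
    index index.html;

    # Serve the SPA: fall back to index.html for client-side routes
    location / {
        try_files $uri $uri/ /index.html;
    }
}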

Auth API Dockerfile (Go)

Check the service:

cd auth-api
ls -la
# You'll see: go.mod, main.go
# This tells you: It's Go, entry point is main.go

Go is special - it compiles to a single binary file. No runtime needed! We also use multi-stage build:

# Build stage
FROM golang:1.21-alpine AS build
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o auth-api

# Runtime stage
FROM alpine:latest
RUN apk --no-cache add ca-certificates
COPY --from=build /app/auth-api /auth-api
EXPOSE 8081
CMD ["/auth-api"]

Breaking it down:

  • FROM golang:1.21-alpine AS build - Go compiler for building
  • RUN go mod download - Download Go dependencies
  • RUN go build -o auth-api - Compile to a single binary file
  • FROM alpine:latest - Tiny Linux (only 5MB!)
  • COPY --from=build /app/auth-api - Copy the compiled binary
  • CMD ["/auth-api"] - Run the binary

Key differences from Vue.js:

  • Go compiles to a single binary (no runtime needed!)
  • Final image is super small (~10MB vs ~500MB for Node.js)
  • CGO_ENABLED=0 creates a static binary (no external dependencies)
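
You can see the size win for yourself after building (run this from the project root; the image tag is just a throwaway name):

docker build -t auth-api-test ./auth-api
docker images auth-api-test   # the SIZE column should be in the ~10MB range mentioned above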

Todos API Dockerfile (Node.js)

Check the service:

cd todos-api
ls -la
# You'll see: package.json, server.js, routes.js
# This tells you: It's Node.js, entry point is server.js

Check package.json for the start command:

{
  "scripts": {
    "start": "node server.js"
  }
}

Node.js API Dockerfile (simpler than Vue.js - no build step needed):

FROM node:18-alpine
WORKDIR /app

# Copy dependency files first (Docker caching optimization)
COPY package*.json ./

# Install dependencies (production only for smaller image)
RUN npm ci --only=production

# Copy application code
COPY . .

# Expose the port (check server.js to see which port)
EXPOSE 8082

# Start the application
CMD ["node", "server.js"]

Why this pattern?

  • npm ci --only=production - Faster, more reliable than npm install, and skips dev dependencies
  • Copy package*.json first - If dependencies don't change, Docker reuses the cached layer
  • node:18-alpine - Lightweight Node.js image

How to test it:

# Build the image
docker build -t todos-api ./todos-api

# Run it
docker run -p 8082:8082 todos-api

# Test it
curl http://localhost:8082

Users API Dockerfile (Java Spring Boot)

Check the service:

cd users-api
ls -la
# You'll see: pom.xml, src/
# This tells you: It's Java with Maven, needs to be compiled

Java services need two stages:

  1. Build stage - Compile the code
  2. Runtime stage - Run the compiled JAR

# Stage 1: Build
FROM maven:3.9-eclipse-temurin-17 AS build
WORKDIR /app

# Copy Maven config first (caching optimization)
COPY pom.xml .
# Download dependencies (cached if pom.xml doesn't change)
RUN mvn dependency:go-offline

# Copy source code
COPY src ./src

# Build the application
RUN mvn clean package -DskipTests

# Stage 2: Runtime
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app

# Install JAXB dependencies (needed for Java 17+)
RUN apk add --no-cache wget && \
    mkdir -p /app/lib && \
    wget -q -O /app/lib/jaxb-api.jar https://repo1.maven.org/maven2/javax/xml/bind/jaxb-api/2.3.1/jaxb-api-2.3.1.jar && \
    wget -q -O /app/lib/jaxb-runtime.jar https://repo1.maven.org/maven2/org/glassfish/jaxb/jaxb-runtime/2.3.1/jaxb-runtime-2.3.1.jar && \
    apk del wget

# Copy the built JAR from build stage
COPY --from=build /app/target/*.jar app.jar

EXPOSE 8083

# Run the Spring Boot application
ENTRYPOINT ["java", \
    "--add-opens", "java.base/java.lang=ALL-UNNAMED", \
    "--add-opens", "java.base/java.lang.reflect=ALL-UNNAMED", \
    "--add-opens", "java.base/java.util=ALL-UNNAMED", \
    "-cp", "app.jar:/app/lib/*", \
    "org.springframework.boot.loader.JarLauncher"]

Why this is complex:

  • Java needs compilation (Maven does this)
  • Spring Boot creates a "fat JAR" (includes everything)
  • Java 17+ removed some libraries (JAXB), so we add them back
  • The --add-opens flags are needed for Java 17+ module system

Don't worry if this looks complicated - Java Dockerfiles are the most complex. The pattern is always:

  1. Build stage: Install dependencies, compile
  2. Runtime stage: Copy compiled artifact, run it

Log Message Processor Dockerfile (Python)

Check the service:

cd log-message-processor
ls -la
# You'll see: requirements.txt, main.py
# This tells you: It's Python, entry point is main.py

Python Dockerfile:

FROM python:3.11-slim
WORKDIR /app

# Copy the Python dependency list first (Docker caching optimization)
COPY requirements.txt .

# Install build tools, install Python dependencies, then remove the build
# tools - all in ONE layer; Docker layers are additive, so purging in a
# separate RUN would not make the final image any smaller
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    g++ \
    python3-dev \
    && pip install --no-cache-dir -r requirements.txt \
    && apt-get purge -y gcc g++ python3-dev \
    && apt-get autoremove -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Copy application code
COPY . .

# Run the application
CMD ["python", "main.py"]

Why install then remove build dependencies?

  • Some Python packages need to compile C extensions
  • We install gcc and g++ to compile them, then purge them once the packages are installed (saves ~200MB)
  • The install, pip install, and purge all happen in a single RUN instruction on purpose: Docker image layers are additive, so removing the compilers in a separate RUN would leave them baked into an earlier layer and the image wouldn't actually shrink
  • The compiled packages still work without the compilers

Simpler alternative (if no C extensions needed):

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]

The Pattern: How to Create Any Dockerfile

Here's the mental model:

  1. Identify the language → Check for language-specific files

    • package.json → Node.js
    • pom.xml or build.gradle → Java
    • requirements.txt → Python
    • go.mod → Go
    • Cargo.toml → Rust
  2. Find the base image → Use official images

    • Node.js → node:18-alpine
    • Java → eclipse-temurin:17-jre-alpine (runtime) or maven:3.9-eclipse-temurin-17 (build)
    • Python → python:3.11-slim
    • Go → golang:1.21-alpine (build) or alpine:latest (runtime)
  3. Copy dependencies first → Docker caching optimization

    • Copy package.json / pom.xml / requirements.txt / go.mod
    • Install dependencies
    • Then copy application code
  4. Expose the port → Check the code for which port it uses

  5. Set the command → How to start the application
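
One more habit that pairs well with this pattern: add a .dockerignore next to each Dockerfile so COPY . . doesn't drag local build artifacts or secrets into the image. A minimal sketch (trim it per service):

# .dockerignore - keep local junk out of the build context
node_modules
dist
target
__pycache__
*.pyc
.git
.env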

Testing Each Dockerfile

Before adding to docker-compose, test each one:

# Test Frontend
cd frontend
docker build -t frontend-test .
docker run -p 8080:80 frontend-test
curl http://localhost:8080

# Test Auth API
cd ../auth-api
docker build -t auth-api-test .
docker run -p 8081:8081 auth-api-test
curl http://localhost:8081/health

# Test Todos API
cd ../todos-api
docker build -t todos-api-test .
docker run -p 8082:8082 todos-api-test
curl http://localhost:8082

# Test Users API
cd ../users-api
docker build -t users-api-test .
docker run -p 8083:8083 users-api-test
curl http://localhost:8083/health

# Test Log Processor
cd ../log-message-processor
docker build -t log-processor-test .
docker run log-processor-test
# (This might not have HTTP endpoint, check logs)

Common issues and fixes:

  1. "Module not found" or "Package not found"

    • Make sure you copied dependency files before installing
    • Check that requirements.txt / package.json is in the right place
  2. "Port already in use"

    • Another container is using that port
    • Use docker ps to see what's running
    • Stop it with docker stop <container-id>
  3. "Cannot connect to database" or "Connection refused"

    • Services need to be in the same Docker network
    • Use service names (e.g., redis) not localhost
    • Wait for dependencies to be ready (use depends_on with a healthcheck in docker-compose; see the sketch after this list)
  4. Image too large

    • Use multi-stage builds (build in one stage, copy artifacts to smaller runtime stage)
    • Use alpine or slim base images
    • Remove build dependencies after installation
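
Here's a sketch of that healthcheck fix from issue 3: give Redis a healthcheck and use the long form of depends_on so dependents wait until Redis actually answers (standard Compose syntax; the service names match the ones used in this guide):

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  todos-api:
    # ...rest of the service definition stays the same...
    depends_on:
      redis:
        condition: service_healthy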

Creating docker-compose.yml

Now we need to orchestrate all these containers. That's where Docker Compose comes in.

Think of docker-compose.yml as a conductor's score - it tells all the musicians (containers) when to play, how to play together, and in what order.

Let's build it piece by piece:

Step 1: The Reverse Proxy (Traefik)

Traefik is like a smart receptionist at a hotel:

  • It receives all incoming requests (guests)
  • It looks at the URL and decides which service should handle it (which room)
  • It automatically gets SSL certificates from Let's Encrypt (security badges)
  • It handles HTTPS redirects (escorts HTTP guests to HTTPS)

services:
  # Reverse Proxy - Routes traffic to the right service
  traefik:
    image: traefik:latest
    container_name: traefik
    command:
      - "--api.insecure=true"  # Enable dashboard (for debugging)
      - "--providers.docker=true"  # Watch Docker containers
      - "--providers.docker.exposedbydefault=false"  # Only expose containers with labels
      - "--entrypoints.web.address=:80"  # HTTP entry point
      - "--entrypoints.websecure.address=:443"  # HTTPS entry point
      - "--certificatesresolvers.letsencrypt.acme.httpchallenge=true"  # Use HTTP challenge
      - "--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web"  # Challenge on port 80
      - "--certificatesresolvers.letsencrypt.acme.email=${LETSENCRYPT_EMAIL:-your-email@example.com}"  # Email for cert
      - "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"  # Where to store certs
    ports:
      - "80:80"    # HTTP
      - "443:443"  # HTTPS
      - "8080:8080"  # Traefik dashboard (for debugging)
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro  # Let Traefik see other containers
      - ./letsencrypt:/letsencrypt  # Store SSL certificates
    networks:
      - app-network
    restart: unless-stopped  # Auto-restart if it crashes

Key concepts:

  • ports: "80:80" means "map host port 80 to container port 80"
  • volumes: /var/run/docker.sock lets Traefik discover other containers automatically
  • networks: app-network puts Traefik on the same network as other services

Step 2: The Frontend Service

The Vue.js frontend needs to be built and served:

  # Frontend - Vue.js application
  frontend:
    build:
      context: ./frontend  # Where the Dockerfile is
      dockerfile: Dockerfile
    container_name: frontend
    environment:
      - PORT=80  # Port the app runs on inside container
      - AUTH_API_ADDRESS=http://auth-api:8081  # Use service name, not localhost!
      - TODOS_API_ADDRESS=http://todos-api:8082
    labels:
      # Tell Traefik to route traffic to this service
      - "traefik.enable=true"
      # Route requests for your domain to this service
      - "traefik.http.routers.frontend.rule=Host(`${DOMAIN:-yourdomain.com}`)"
      # Use HTTPS
      - "traefik.http.routers.frontend.entrypoints=websecure"
      # Get SSL certificate automatically
      - "traefik.http.routers.frontend.tls.certresolver=letsencrypt"
      # Frontend runs on port 80 inside container
      - "traefik.http.services.frontend.loadbalancer.server.port=80"
      # Redirect HTTP to HTTPS
      - "traefik.http.routers.frontend-redirect.rule=Host(`${DOMAIN:-yourdomain.com}`)"
      - "traefik.http.routers.frontend-redirect.entrypoints=web"
      - "traefik.http.routers.frontend-redirect.middlewares=redirect-to-https"
      - "traefik.http.middlewares.redirect-to-https.redirectscheme.scheme=https"
    networks:
      - app-network
    depends_on:
      - auth-api
      - todos-api
      - users-api

Important points:

  • build: context: ./frontend tells Docker to build from the frontend folder
  • environment: sets variables the app can read
  • AUTH_API_ADDRESS=http://auth-api:8081 - Notice we use auth-api (the service name), not localhost!
  • depends_on: ensures these services start before frontend
  • Labels tell Traefik how to route traffic

Step 3: The Auth API (Go)

  # Auth API - Handles authentication
  auth-api:
    build:
      context: ./auth-api
      dockerfile: Dockerfile
    container_name: auth-api
    environment:
      - AUTH_API_PORT=8081
      - USERS_API_ADDRESS=http://users-api:8083
      - JWT_SECRET=${JWT_SECRET:-myfancysecret}  # Secret key for tokens
      - REDIS_URL=redis://redis:6379  # Redis connection
    labels:
      - "traefik.enable=true"
      # Route /api/auth requests to this service
      - "traefik.http.routers.auth.rule=Host(`${DOMAIN:-yourdomain.com}`) && PathPrefix(`/api/auth`)"
      - "traefik.http.routers.auth.entrypoints=websecure"
      - "traefik.http.routers.auth.tls.certresolver=letsencrypt"
      - "traefik.http.services.auth.loadbalancer.server.port=8081"
      # Also handle /login route (frontend calls this)
      - "traefik.http.routers.auth-login.rule=Host(`${DOMAIN:-yourdomain.com}`) && (Path(`/login`) || PathPrefix(`/login/`))"
      - "traefik.http.routers.auth-login.entrypoints=websecure"
      - "traefik.http.routers.auth-login.tls.certresolver=letsencrypt"
      - "traefik.http.routers.auth-login.service=auth"
    networks:
      - app-network
    depends_on:
      - redis  # Needs Redis for session storage

Routing explained:

  • PathPrefix(/api/auth) means any URL starting with /api/auth goes here
  • Example: https://your-domain.com/api/auth/login → auth-api
  • Path(/login) means exactly /login goes here

Step 4: The Todos API (Node.js)

  # Todos API - Manages todo items
  todos-api:
    build:
      context: ./todos-api
      dockerfile: Dockerfile
    container_name: todos-api
    environment:
      - PORT=8082
      - AUTH_API_URL=http://auth-api:8081  # To validate tokens
      - JWT_SECRET=${JWT_SECRET:-myfancysecret}
      - REDIS_URL=redis://redis:6379
    labels:
      - "traefik.enable=true"
      # Route /api/todos requests here
      - "traefik.http.routers.todos.rule=Host(`${DOMAIN:-yourdomain.com}`) && PathPrefix(`/api/todos`)"
      - "traefik.http.routers.todos.entrypoints=websecure"
      - "traefik.http.routers.todos.tls.certresolver=letsencrypt"
      - "traefik.http.services.todos.loadbalancer.server.port=8082"
    networks:
      - app-network
    depends_on:
      - redis
      - auth-api  # Needs auth-api to validate tokens

Step 5: The Users API (Java)

  # Users API - Manages user accounts
  users-api:
    build:
      context: ./users-api
      dockerfile: Dockerfile
    container_name: users-api
    environment:
      - SERVER_PORT=8083
      - JWT_SECRET=${JWT_SECRET:-myfancysecret}
      - REDIS_URL=redis://redis:6379
    labels:
      - "traefik.enable=true"
      # Route /api/users requests here
      - "traefik.http.routers.users.rule=Host(`${DOMAIN:-yourdomain.com}`) && PathPrefix(`/api/users`)"
      - "traefik.http.routers.users.entrypoints=websecure"
      - "traefik.http.routers.users.tls.certresolver=letsencrypt"
      - "traefik.http.services.users.loadbalancer.server.port=8083"
    networks:
      - app-network
    depends_on:
      - redis

Step 6: The Log Processor (Python)

This service doesn't need Traefik routing - it's a background worker:

  # Log Processor - Background worker that processes messages
  log-message-processor:
    build:
      context: ./log-message-processor
      dockerfile: Dockerfile
    container_name: log-message-processor
    environment:
      - REDIS_HOST=redis  # Use service name
      - REDIS_PORT=6379
      - REDIS_CHANNEL=log-messages
    networks:
      - app-network
    depends_on:
      - redis
    restart: unless-stopped  # Keep it running

Why no Traefik labels? This service doesn't serve HTTP requests - it just listens to Redis for messages.

Step 7: Supporting Services

  # Redis - Message queue and cache
  redis:
    image: redis:7-alpine  # Use pre-built image, no Dockerfile needed
    container_name: redis
    ports:
      - "6379:6379"  # Expose for debugging (optional)
    volumes:
      - redis-data:/data  # Persist data
    networks:
      - app-network
    restart: unless-stopped

  # Zipkin handler - Service for /zipkin endpoint
  zipkin-handler:
    image: nginx:alpine
    container_name: zipkin-handler
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.zipkin.rule=Host(`${DOMAIN:-yourdomain.com}`) && PathPrefix(`/zipkin`)"
      - "traefik.http.routers.zipkin.entrypoints=websecure"
      - "traefik.http.routers.zipkin.tls.certresolver=letsencrypt"
      - "traefik.http.services.zipkin.loadbalancer.server.port=80"
    networks:
      - app-network
    command: >
      sh -c "echo 'server {
        listen 80;
        location / {
          return 200 \"OK\";
          add_header Content-Type text/plain;
        }
      }' > /etc/nginx/conf.d/default.conf && nginx -g 'daemon off;'"

Step 8: Networks and Volumes

At the end of the file, define shared resources:

# Networks - How containers communicate
networks:
  app-network:
    driver: bridge  # Default network type

# Volumes - Persistent storage
volumes:
  redis-data:  # Named volume for Redis data

Why networks? Containers on the same network can talk to each other using service names (like auth-api instead of IP addresses).

Why volumes? Data in containers is lost when they're removed. Volumes persist data.

Complete docker-compose.yml Structure

Here's the mental model:

docker-compose.yml
├── services (all your containers)
│   ├── traefik (reverse proxy)
│   ├── frontend (Vue.js app)
│   ├── auth-api (Go service)
│   ├── todos-api (Node.js service)
│   ├── users-api (Java service)
│   ├── log-message-processor (Python worker)
│   ├── redis (database/queue)
│   └── zipkin-handler (dummy endpoint)
├── networks (how they connect)
│   └── app-network
└── volumes (persistent storage)
    └── redis-data

Understanding Traefik Labels (Deep Dive)

Labels are how you tell Traefik what to do. Let's break down a complex example:

labels:
  - "traefik.enable=true"  # Step 1: Enable Traefik for this service
  - "traefik.http.routers.auth.rule=Host(`example.com`) && PathPrefix(`/api/auth`)"  # Step 2: Define routing rule
  - "traefik.http.routers.auth.entrypoints=websecure"  # Step 3: Use HTTPS
  - "traefik.http.routers.auth.tls.certresolver=letsencrypt"  # Step 4: Get SSL cert
  - "traefik.http.services.auth.loadbalancer.server.port=8081"  # Step 5: Which port to forward to

Breaking it down:

  1. Router = A set of rules for routing traffic
  2. Rule = Conditions that must match (domain + path)
  3. Entrypoint = Which port/protocol (web = HTTP, websecure = HTTPS)
  4. Service = The actual container and port
  5. Middleware = Transformations (redirects, rewrites, etc.)

Example flow:

  1. User visits https://example.com/api/auth/login
  2. Traefik receives request on port 443 (websecure entrypoint)
  3. Traefik checks rules: "Does this match Host(example.com) && PathPrefix(/api/auth)?" → Yes!
  4. Traefik forwards to auth service on port 8081
  5. Auth-api container handles the request

Testing Your docker-compose.yml

Before deploying to the cloud, test locally:

Step 1: Create Environment File

Create a .env file in the root directory:

cat > .env <<EOF
DOMAIN=localhost
LETSENCRYPT_EMAIL=your-email@example.com
JWT_SECRET=test-secret-key-change-this-in-production
EOF

What's in .env?

  • DOMAIN - Your domain name (use localhost for local testing)
  • LETSENCRYPT_EMAIL - Email for SSL certificate notifications
  • JWT_SECRET - Secret key for JWT tokens (use a strong random string in production)
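
One easy way to generate a strong JWT_SECRET for production (any high-entropy random string works; openssl just happens to be installed almost everywhere):

openssl rand -base64 32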

Step 2: Start All Services

# Build and start all containers in the background
docker compose up -d

# Watch all logs in real-time
docker compose logs -f

# Or watch specific service logs
docker compose logs -f frontend
docker compose logs -f auth-api

What -d means: Detached mode - runs in the background so you can use your terminal.

Step 3: Verify Everything is Running

# Check status of all containers
docker compose ps

# You should see something like:
# NAME                    STATUS              PORTS
# traefik                 Up 2 minutes        0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp
# frontend                Up 2 minutes        
# auth-api                Up 2 minutes        
# todos-api               Up 2 minutes        
# users-api               Up 2 minutes        
# log-message-processor   Up 2 minutes        
# redis                   Up 2 minutes        0.0.0.0:6379->6379/tcp

Step 4: Test Each Service

# Test frontend (should return HTML)
curl http://localhost

# Test auth API (should return "Not Found" - that's expected!)
curl http://localhost/api/auth

# Test todos API (should return "Invalid Token" - also expected!)
curl http://localhost/api/todos

# Test users API
curl http://localhost/api/users

# Check Traefik dashboard (optional)
# Open http://localhost:8080 in your browser

Expected responses:

  • Frontend: HTML page (login screen)
  • /api/auth without path: "Not Found" (correct - needs specific endpoint)
  • /api/todos without auth: "Invalid Token" (correct - needs authentication)
  • /api/users without auth: "Missing or invalid Authorization header" (correct)

Step 5: Test with Browser

  1. Open http://localhost in your browser
  2. You should see the login page
  3. Try logging in (if you have test credentials)
  4. Check browser console (F12) for any errors

Step 6: Check Logs for Errors

# View logs for a specific service
docker compose logs frontend
docker compose logs auth-api
docker compose logs traefik

# View last 100 lines
docker compose logs --tail=100 traefik

# Follow logs in real-time
docker compose logs -f traefik

What to look for:

  • ✅ "Server started" or "Listening on port X" = Good!
  • ❌ "Connection refused" = Service dependency not ready
  • ❌ "Module not found" = Missing dependency in Dockerfile
  • ❌ "Port already in use" = Another service is using that port

Step 7: Stop Everything

# Stop all containers
docker compose down

# Stop and remove volumes (clean slate)
docker compose down -v

# Stop and remove images too (full cleanup)
docker compose down --rmi all -v

Common Issues and Solutions

Issue 1: "Port 80 already in use"

Problem: Another service (like Apache, Nginx, or another Docker container) is using port 80.

Solution:

# Find what's using port 80
sudo lsof -i :80
# or
sudo netstat -tulpn | grep :80

# Stop the conflicting service
sudo systemctl stop apache2  # or nginx, or whatever it is

# Or change Traefik ports in docker-compose.yml:
ports:
  - "8080:80"   # Use 8080 instead of 80
  - "8443:443"  # Use 8443 instead of 443

Issue 2: "Build failed" or "Module not found"

Problem: Dockerfile has issues or dependencies are missing.

Solution:

# Build a specific service to see detailed errors
docker compose build frontend

# Check the Dockerfile syntax
# Make sure COPY commands are in the right order
# Make sure RUN commands install dependencies before copying code

Issue 3: "Container keeps restarting"

Problem: The application is crashing on startup.

Solution:

# Check why it's restarting
docker compose logs <service-name>

# Common causes:
# - Missing environment variables
# - Database/Redis not ready (add depends_on)
# - Port conflict
# - Missing files or dependencies

Issue 4: "Cannot connect to auth-api" or "Connection refused"

Problem: Services can't find each other.

Solution:

  • ✅ Use service names (e.g., http://auth-api:8081), not localhost
  • ✅ Make sure all services are on the same network (app-network)
  • ✅ Check depends_on - services might be starting before dependencies are ready
  • ✅ Add health checks or wait scripts if needed

Issue 5: "SSL certificate error" or "Let's Encrypt failed"

Problem: Let's Encrypt can't verify your domain.

Solution:

  • For local testing: Use localhost and HTTP only (remove HTTPS redirect)
  • For production: Make sure DNS points to your server
  • Make sure ports 80 and 443 are open in firewall
  • Check Traefik logs: docker compose logs traefik

Quick Reference: docker-compose Commands

# Start services
docker compose up              # Start and show logs
docker compose up -d           # Start in background

# Stop services
docker compose stop            # Stop but don't remove
docker compose down           # Stop and remove containers

# View logs
docker compose logs            # All services
docker compose logs -f         # Follow (live updates)
docker compose logs <service>  # Specific service

# Rebuild
docker compose build           # Build all
docker compose build <service> # Build specific service
docker compose up --build      # Build and start

# Status
docker compose ps              # Show running containers
docker compose top             # Show running processes

# Execute commands
docker compose exec <service> <command>  # Run command in container
docker compose exec frontend sh           # Get shell in frontend container

Part 4: Infrastructure as Code with Terraform

Now comes the infrastructure part. This is where many people get intimidated, but it's actually simpler than it seems.

Why Infrastructure as Code?

Imagine you're building a house. You could:

  1. Manual approach: Tell the builder "put a window here, a door there" every time
  2. Blueprint approach: Draw a blueprint once, builder follows it every time

Infrastructure as Code is the blueprint approach. Benefits:

  • Reproducible: Same code = same infrastructure, every time
  • Version controlled: See what changed and when
  • Testable: Try changes without breaking production
  • Documented: The code IS the documentation

Understanding Terraform Basics

Terraform uses a language called HCL (HashiCorp Configuration Language). It's designed to be human-readable.

Basic structure:

resource "aws_instance" "todo_app" {
  ami           = "ami-12345"
  instance_type = "t3.medium"
}

This says: "Create an AWS EC2 instance resource, call it 'todo_app', with these properties."

Creating Your First Terraform Configuration

Let's build it step by step:

Step 1: Provider Configuration

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }

  backend "s3" {
    # We'll configure this during init
  }
}

provider "aws" {
  region = var.aws_region
}

What's happening:

  • required_version - Ensures everyone uses compatible Terraform
  • required_providers - Tells Terraform which plugins to download
  • backend "s3" - Where to store state (we'll configure this later)
  • provider "aws" - Which cloud provider to use

Step 2: Data Sources (Getting Information)

Before creating resources, we often need to look things up:

data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"]  # Canonical (Ubuntu's publisher)

  filter {
    name   = "name"
    values = ["*ubuntu-jammy-22.04-amd64-server*"]
  }
}

What is an AMI? Amazon Machine Image - it's like a template for a virtual machine. This code finds the latest Ubuntu 22.04 image.

Why use data sources? AMI IDs change in different regions. Instead of hardcoding ami-12345, we let Terraform find the right one.

Step 3: Security Group (Firewall Rules)

resource "aws_security_group" "todo_app" {
  name        = "todo-app-sg-${var.environment}"
  description = "Security group for TODO application"

  ingress {
    description = "HTTP"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # Allow from anywhere
  }

  ingress {
    description = "HTTPS"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    description = "SSH"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = [var.ssh_cidr]  # Only from your IP (security!)
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"  # All protocols
    cidr_blocks = ["0.0.0.0/0"]  # Allow all outbound
  }

  tags = {
    Name        = "todo-app-sg-${var.environment}"
    Environment = var.environment
  }
}

What is a security group? It's AWS's firewall. It controls what traffic can reach your server.

Breaking it down:

  • ingress - Incoming traffic rules
  • egress - Outgoing traffic rules
  • cidr_blocks = ["0.0.0.0/0"] - From anywhere (0.0.0.0/0 means "everywhere")
  • var.ssh_cidr - A variable (we'll set this to your IP for security)

Security tip: In production, restrict SSH to your IP only! Use a service like whatismyip.com to find your IP, then set ssh_cidr = "YOUR_IP/32".
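
You can grab your public IP from the terminal too (checkip.amazonaws.com is an AWS-run endpoint that simply echoes your IP back):

MY_IP=$(curl -s https://checkip.amazonaws.com)
echo "ssh_cidr = \"${MY_IP}/32\""   # paste this value into your prod tfvars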

Step 4: EC2 Instance (Your Server)

resource "aws_instance" "todo_app" {
  ami                    = data.aws_ami.ubuntu.id
  instance_type          = var.instance_type
  key_name               = var.key_pair_name
  vpc_security_group_ids = [aws_security_group.todo_app.id]

  user_data = <<-EOF
    #!/bin/bash
    apt-get update
    apt-get install -y python3 python3-pip
  EOF

  tags = {
    Name        = "todo-app-server-${var.environment}"
    Environment = var.environment
    Project     = "hngi13-stage6"
  }

  lifecycle {
    create_before_destroy = true
  }
}

What's happening:

  • ami - Which OS image to use (from our data source)
  • instance_type - Server size (t3.medium = 2 vCPU, 4GB RAM)
  • key_name - Which SSH key to install
  • vpc_security_group_ids - Which firewall rules to apply
  • user_data - Script that runs when server starts (bootstrap script)
  • lifecycle - Terraform behavior (create new before destroying old = zero downtime)

Step 5: Variables (Making It Flexible)

variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "us-east-1"
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t3.medium"
}

variable "key_pair_name" {
  description = "AWS Key Pair name"
  type        = string  # Required - no default
}

variable "environment" {
  description = "Environment name (dev, stg, prod)"
  type        = string
  default     = "dev"

  validation {
    condition     = contains(["dev", "stg", "prod"], var.environment)
    error_message = "Environment must be one of: dev, stg, prod"
  }
}

Why variables? Makes your code reusable. Same code works for dev, staging, and production - just change the variables!

Step 6: Outputs (Getting Information Back)

output "server_ip" {
  description = "Public IP of the server"
  value       = aws_instance.todo_app.public_ip
}

output "ansible_inventory_path" {
  description = "Path to generated Ansible inventory"
  value       = local_file.ansible_inventory.filename
}

What are outputs? After Terraform creates resources, you often need information about them (like the server IP). Outputs make that information available.
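
The ansible_inventory_path output refers to a local_file resource that writes an Ansible inventory from Terraform's results. That resource isn't shown in this guide, but a minimal sketch might look like this - it assumes you add the hashicorp/local provider to required_providers and reuse the server_user and ssh_key_path variables that appear in the .tfvars files below:

resource "local_file" "ansible_inventory" {
  filename = "${path.module}/../ansible/inventory.ini"

  # Render a one-host inventory pointing at the new EC2 instance
  content = <<-EOT
    [todo_app]
    ${aws_instance.todo_app.public_ip} ansible_user=${var.server_user} ansible_ssh_private_key_file=${var.ssh_key_path}
  EOT
}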

Environment-Specific Configuration

Create separate .tfvars files for each environment. This is crucial - you'll have three files, one for each environment:

terraform.dev.tfvars:

environment = "dev"
aws_region  = "us-east-1"
instance_type = "t3.small"  # Smaller for dev (saves money)
key_pair_name = "my-terraform-key"
ssh_key_path = "~/.ssh/my-terraform-key.pem"
ssh_cidr = "0.0.0.0/0"  # Less restrictive for dev
server_user = "ubuntu"
skip_ansible_provision = false  # Run Ansible automatically

terraform.stg.tfvars:

environment = "stg"
aws_region  = "us-east-1"
instance_type = "t3.small"  # Can be same size as dev for staging
key_pair_name = "my-terraform-key"
ssh_key_path = "~/.ssh/my-terraform-key.pem"
ssh_cidr = "0.0.0.0/0"  # Can be less restrictive than prod
server_user = "ubuntu"
skip_ansible_provision = false

terraform.prod.tfvars:

environment = "prod"
aws_region  = "us-east-1"
instance_type = "t3.medium"  # More power for production
key_pair_name = "my-terraform-key"
ssh_key_path = "~/.ssh/my-terraform-key.pem"
ssh_cidr = "YOUR_IP/32"  # Restrict SSH to your IP only!
server_user = "ubuntu"
skip_ansible_provision = false

Why three separate files? Different environments have different needs:

  • Dev: Smaller instance, less security (for quick testing)
  • Staging: Similar to dev, but closer to production setup (for pre-production testing)
  • Prod: Larger instance, maximum security (for real users)

File structure:

infra/terraform/
├── main.tf
├── variables.tf
├── outputs.tf
├── terraform.dev.tfvars    ← Development environment
├── terraform.stg.tfvars    ← Staging environment
└── terraform.prod.tfvars   ← Production environment

Remote State Configuration

Remember the S3 bucket we created? Now we use it. Important: Each environment needs its own state file path!

For Development:

terraform init \
  -backend-config="bucket=yourname-terraform-state" \
  -backend-config="key=terraform-state/dev/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=terraform-state-lock" \
  -backend-config="encrypt=true"

For Staging:

terraform init \
  -backend-config="bucket=yourname-terraform-state" \
  -backend-config="key=terraform-state/stg/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=terraform-state-lock" \
  -backend-config="encrypt=true"

For Production:

terraform init \
  -backend-config="bucket=yourname-terraform-state" \
  -backend-config="key=terraform-state/prod/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=terraform-state-lock" \
  -backend-config="encrypt=true"

What this does:

  • bucket - Where to store state (same bucket for all environments)
  • key - File path in bucket (different per environment!)
  • dynamodb_table - For locking (prevents conflicts when multiple people run Terraform)
  • encrypt=true - Encrypt state at rest (security)

Why separate keys per environment?

  • terraform-state/dev/terraform.tfstate → Development infrastructure
  • terraform-state/stg/terraform.tfstate → Staging infrastructure
  • terraform-state/prod/terraform.tfstate → Production infrastructure

These are completely separate files. This means:

  • ✅ Dev, staging, and prod infrastructure are isolated
  • ✅ You can destroy dev without affecting staging or prod
  • ✅ Each environment has its own state history
  • ✅ No risk of accidentally modifying the wrong environment
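
You can confirm the separation at any time by listing the state objects in the bucket:

aws s3 ls s3://yourname-terraform-state/terraform-state/ --recursive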

Your First Terraform Run

Let's deploy to development first (always start with dev!):

# 1. Initialize (downloads providers, sets up backend)
terraform init \
  -backend-config="bucket=yourname-terraform-state" \
  -backend-config="key=terraform-state/dev/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=terraform-state-lock" \
  -backend-config="encrypt=true"

# 2. Plan (see what will be created - SAFE, doesn't change anything)
terraform plan -var-file=terraform.dev.tfvars

# 3. Apply (actually create resources)
terraform apply -var-file=terraform.dev.tfvars

What happens:

  1. terraform init - Downloads AWS provider, configures backend for dev environment
  2. terraform plan - Shows you what will be created/changed/destroyed (dry run)
  3. terraform apply - Actually creates the resources

Pro tip: Always run plan first! It's like a dry run. Review the output carefully before applying.

For other environments, repeat the same steps but:

  • Use the appropriate -backend-config="key=terraform-state/ENV/terraform.tfstate"
  • Use the matching -var-file=terraform.ENV.tfvars

Example for staging:

terraform init \
  -backend-config="bucket=yourname-terraform-state" \
  -backend-config="key=terraform-state/stg/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=terraform-state-lock" \
  -backend-config="encrypt=true"

terraform plan -var-file=terraform.stg.tfvars
terraform apply -var-file=terraform.stg.tfvars

Understanding Drift Detection (Critical for Safety!)

What is drift? Imagine you have a blueprint for a house (Terraform code), but someone goes and changes the actual house (AWS infrastructure) without updating the blueprint. That's drift - your code and reality don't match anymore.

Real-world example:

  1. You deploy infrastructure with Terraform ✅
  2. Later, you manually add a tag to your EC2 instance in AWS Console 🏷️
  3. Terraform doesn't know about this change
  4. Next time you run Terraform, it sees the difference → DRIFT DETECTED!

Why drift is dangerous:

  • 🔴 Security risk: Someone might have changed something maliciously
  • 🔴 Data loss: Terraform might try to "fix" things and delete your changes
  • 🔴 Confusion: You don't know what changed or why
  • 🔴 Breaking changes: Manual changes might break your application

How drift detection works:

Think of it like a security guard checking your house:

  1. Terraform Plan = Security guard walks around and notes what's different
  2. Check Git History = Did you change the blueprint (Terraform files)?
    • If yes → Expected changes (you updated the code)
    • If no → DRIFT! (Someone changed infrastructure without updating code)
  3. Alert = Security guard calls you immediately (the drift email notification)
  4. Approval = You review and decide what to do (the GitHub issue)
  5. Action = Apply changes or investigate further

The detection logic:

# Step 1: Run terraform plan; -detailed-exitcode returns 0 (no changes),
# 1 (error), or 2 (changes pending)
terraform plan -detailed-exitcode -out=tfplan
PLAN_EXIT=$?

# Step 2: Check whether any Terraform files changed in this commit
TF_CHANGED=$(git diff HEAD~1 HEAD --name-only | grep -E '\.tf$|\.tfvars$' || true)

# Step 3: Determine the type of change
if [ -z "$TF_CHANGED" ] && [ "$PLAN_EXIT" -eq 2 ]; then
  echo "🚨 DRIFT DETECTED!"
  # Send email, create GitHub issue, wait for approval
else
  echo "✅ Expected changes (code was updated)"
  # Proceed automatically
fi

Setting Up Email Notifications for Drift

When drift is detected, you need to know immediately! That's where email notifications come in.

Step 1: Verify Your Email in AWS SES

AWS SES (Simple Email Service) is like a post office for your applications. First, you need to verify your email address:

  1. Go to AWS Console → SES (Simple Email Service)
  2. Click "Verified identities" → "Create identity"
  3. Choose "Email address"
  4. Enter your email (e.g., your-email@gmail.com)
  5. Click "Create identity"
  6. Check your email and click the verification link

Why verify? AWS prevents spam by requiring you to verify you own the email address.

Step 2: Create the Email Notification Script

Create infra/ci-cd/scripts/email-notification.sh:

#!/bin/bash
# Email Notification Script for Terraform Drift
# Sends email alert when infrastructure drift is detected

set -e

DRIFT_SUMMARY="${1:-}"

if [ -z "$DRIFT_SUMMARY" ]; then
  echo "Error: Drift summary not provided"
  exit 1
fi

# Email configuration from environment variables
EMAIL_TO="${EMAIL_TO:-}"
EMAIL_FROM="${EMAIL_FROM:-}"
AWS_REGION="${AWS_REGION:-us-east-1}"

# GitHub Actions variables for workflow link
GITHUB_SERVER_URL="${GITHUB_SERVER_URL:-https://github.com}"
GITHUB_REPOSITORY="${GITHUB_REPOSITORY:-}"
GITHUB_RUN_ID="${GITHUB_RUN_ID:-}"

# Check if email is configured
if [ -z "$EMAIL_TO" ] || [ -z "$EMAIL_FROM" ]; then
  echo "⚠️  Email not configured. Skipping email notification."
  exit 0
fi

# Build workflow run URL if GitHub variables are available
WORKFLOW_URL=""
if [ -n "$GITHUB_REPOSITORY" ] && [ -n "$GITHUB_RUN_ID" ]; then
  WORKFLOW_URL="${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}"
fi

# Create email body
SUBJECT="🚨 Terraform Drift Detected - Action Required"
BODY=$(cat <<EOF
Terraform infrastructure drift has been detected.

This means infrastructure was changed OUTSIDE of Terraform (e.g., manually in AWS Console).

Please review the changes and approve the deployment in GitHub Actions.

$(if [ -n "$WORKFLOW_URL" ]; then echo "🔗 View Workflow Run: $WORKFLOW_URL"; echo ""; fi)

Drift Summary:
$DRIFT_SUMMARY

---
This is an automated message from GitHub Actions.
EOF
)

# Send email via AWS SES
echo "📧 Sending drift alert email via AWS SES..."
aws ses send-email \
  --region "$AWS_REGION" \
  --from "$EMAIL_FROM" \
  --to "$EMAIL_TO" \
  --subject "$SUBJECT" \
  --text "$BODY" \
  || echo "⚠️  Failed to send email. Check AWS credentials and SES configuration."

echo "✅ Email notification sent!"

Make it executable:

chmod +x infra/ci-cd/scripts/email-notification.sh
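
As a usage example, a workflow step might pass the plan summary as the first argument. The exact invocation depends on how your pipeline captures the plan, but something along these lines works (EMAIL_TO, EMAIL_FROM, and AWS_REGION come from the GitHub Secrets described below):

export EMAIL_TO="you@example.com" EMAIL_FROM="verified-sender@example.com" AWS_REGION="us-east-1"
./infra/ci-cd/scripts/email-notification.sh "$(terraform show -no-color tfplan | head -n 60)"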

What this script does:

  1. Takes the drift summary as input
  2. Checks if email is configured
  3. Builds a nice email message with the drift details
  4. Sends it via AWS SES
  5. Includes a link to the GitHub workflow run

Step 3: Add GitHub Secrets

In your GitHub repository, go to Settings → Secrets and variables → Actions, and add:

  • EMAIL_TO - Your email address (where to send alerts)
  • EMAIL_FROM - Your verified SES email (must match the one you verified in AWS)
  • AWS_ACCESS_KEY_ID - Your AWS access key
  • AWS_SECRET_ACCESS_KEY - Your AWS secret key
  • AWS_REGION - Your AWS region (e.g., us-east-1)

Security tip: Never commit these values to your repository! Always use GitHub Secrets.

Understanding GitHub Issue Approval

When drift is detected, the workflow creates a GitHub issue and waits for your approval. This is like a safety checkpoint.

How it works:

  1. Drift detected → Workflow pauses
  2. GitHub issue created → Contains:
    • What changed
    • Why it's drift (no code changes)
    • Link to workflow run
    • Plan summary
  3. You review → Check the issue
  4. You approve → Comment "approve" or click approve button
  5. Workflow continues → Terraform applies the changes

Example GitHub Issue:

🚨 REAL DRIFT DETECTED - Infrastructure Changed Outside Terraform (dev)

⚠️ CRITICAL: Real Infrastructure Drift Detected

Infrastructure has been modified outside of Terraform. This is unexpected.

Environment: dev

What happened:
- Terraform code files were NOT modified
- But infrastructure plan shows changes
- This indicates manual changes or changes from another process

Action Required:
1. Review the plan below
2. Investigate what caused the drift
3. Approve if changes are intentional, or revert if unauthorized

Plan Summary:

  # aws_instance.todo_app will be updated in-place
  ~ resource "aws_instance" "todo_app" {
      ~ tags = {
          - "ManualTag" = "test" -> null
        }
    }


Workflow Run:
🔗 View Workflow Run: https://github.com/yourusername/repo/actions/runs/123456

Next Steps:
- Approve to apply these changes
- Or investigate and revert unauthorized changes

How to approve:

  1. Go to the GitHub issue (you'll get a notification)
  2. Review the changes carefully
  3. If changes are OK: Comment "approve" or click the approve button
  4. If changes are suspicious: Investigate first, then approve or revert

Why this matters:

  • ✅ Prevents accidental changes
  • ✅ Gives you time to investigate
  • ✅ Creates an audit trail (who approved what, when)
  • ✅ Protects production from unauthorized changes

Testing Drift Detection

Want to test if drift detection works? Here's how:

Step 1: Deploy infrastructure normally

terraform apply -var-file=terraform.dev.tfvars

Step 2: Manually change something in AWS Console

  1. Go to AWS Console → EC2 → Instances
  2. Find your instance
  3. Click "Tags" → "Manage tags"
  4. Add a new tag: TestTag = "drift-test"
  5. Save

[Screenshot: an inbound rule for port 8080 added manually via the AWS Console - any manual change like this will register as drift]

Step 3: Trigger the workflow

  1. Go to GitHub Actions
  2. Run the "Infrastructure Deployment" workflow
  3. Select "dev" environment
  4. Watch it detect drift!

Step 4: Check your email

  • You should receive an email alert
  • Check spam folder if you don't see it

(Screenshot: drift detection email alert)

Step 5: Approve in GitHub

  • A GitHub issue should be created
  • Review and approve
  • Watch Terraform apply the changes

(Screenshot: the approved GitHub drift issue)

Destroying Infrastructure (When You Need to Start Over)

Sometimes you need to tear everything down and start fresh. Here's how to do it safely.

⚠️ WARNING: Destroying infrastructure will DELETE EVERYTHING:

  • Your EC2 instance
  • All data on the server
  • Security groups
  • Everything created by Terraform

Make sure you:

  • ✅ Have backups if you need data
  • ✅ Are destroying the right environment (dev, not prod!)
  • ✅ Really want to delete everything

Method 1: Destroy via Command Line

# Step 1: Initialize Terraform (if not already done)
cd infra/terraform
terraform init \
  -backend-config="bucket=yourname-terraform-state" \
  -backend-config="key=terraform-state/dev/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=terraform-state-lock" \
  -backend-config="encrypt=true"

# Step 2: Plan the destruction (see what will be deleted)
terraform plan -destroy -var-file=terraform.dev.tfvars

# Step 3: Review the plan carefully!
# Make sure it's only deleting what you want

# Step 4: Destroy everything
terraform destroy -var-file=terraform.dev.tfvars

What happens:

  1. Terraform reads the state file
  2. Plans what needs to be destroyed
  3. Shows you the plan (review it!)
  4. Asks for confirmation (type yes)
  5. Deletes everything in reverse order
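
Before you confirm, it's worth checking exactly what Terraform is tracking. A quick sketch:

# List every resource in the current state
terraform state list

# Inspect one resource in detail (resource name from the examples above)
terraform state show aws_instance.todo_app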

Method 2: Destroy via GitHub Actions

If you have a destroy workflow set up:

  1. Go to GitHub Actions
  2. Find the "Destroy Infrastructure" workflow (if you have one)
  3. Click "Run workflow"
  4. Select the environment (be careful - don't destroy prod!)
  5. Confirm and run

(Screenshots: "Terraform Destroy" workflow and the destroy action run)

Example destroy workflow:

name: Destroy Infrastructure

on:
  workflow_dispatch:
    inputs:
      environment:
        description: 'Environment to destroy'
        required: true
        type: choice
        options: [dev, stg, prod]

jobs:
  destroy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        working-directory: infra/terraform
        run: |
          terraform init \
            -backend-config="bucket=${{ secrets.TERRAFORM_STATE_BUCKET }}" \
            -backend-config="key=terraform-state/${{ github.event.inputs.environment }}/terraform.tfstate" \
            -backend-config="region=${{ secrets.AWS_REGION }}" \
            -backend-config="dynamodb_table=${{ secrets.TERRAFORM_STATE_LOCK_TABLE }}" \
            -backend-config="encrypt=true"

      - name: Terraform Plan Destroy
        working-directory: infra/terraform
        run: |
          terraform plan -destroy \
            -var-file=terraform.${{ github.event.inputs.environment }}.tfvars \
            -var="key_pair_name=${{ secrets.TERRAFORM_KEY_PAIR_NAME }}" \
            -out=tfplan

      - name: Manual Approval
        id: manual-approval
        uses: trstringer/manual-approval@v1
        with:
          secret: ${{ github.TOKEN }}
          approvers: ${{ github.actor }}
          minimum-approvals: 1
          issue-title: "⚠️ DESTROY Infrastructure - ${{ github.event.inputs.environment }}"
          issue-body: |
            **⚠️ WARNING: Infrastructure Destruction Requested**

            This will **DELETE ALL INFRASTRUCTURE** for environment: **${{ github.event.inputs.environment }}**

            **This action cannot be undone!**

            Review the plan and approve only if you're sure.

      - name: Terraform Destroy
        if: steps.manual-approval.outcome == 'success'
        working-directory: infra/terraform
        run: terraform apply -auto-approve tfplan

Safety features:

  • ✅ Manual approval required (can't destroy by accident)
  • ✅ Shows what will be destroyed
  • ✅ Creates GitHub issue for review
  • ✅ Environment selection (prevents destroying wrong env)

Method 3: Destroy Specific Resources

Don't want to destroy everything? You can target specific resources:

# Destroy only the EC2 instance (keep security group)
terraform destroy -target=aws_instance.todo_app -var-file=terraform.dev.tfvars

# Destroy only the security group
terraform destroy -target=aws_security_group.todo_app -var-file=terraform.dev.tfvars

When to use this:

  • You want to recreate just one resource
  • Something is broken and you want to rebuild it
  • You're testing changes
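
A related option when you only want to rebuild one resource is Terraform's -replace flag, which destroys and recreates it in a single apply (a minimal sketch):

# Recreate just the EC2 instance without a separate destroy step
terraform apply -replace=aws_instance.todo_app -var-file=terraform.dev.tfvars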

After Destruction

After destroying, your state file still exists in S3, but it's empty (or has no resources). You can:

  1. Start fresh: Run terraform apply again to recreate everything
  2. Clean up state: Delete the state file from S3 (optional)
  3. Keep state: Leave it (Terraform will just create new resources)

Best practice: Keep the state file. It's useful for history and doesn't cost much.

Summary: The Complete Terraform Workflow

Here's the full picture of how everything works together:

┌─────────────────────────────────────────────────────────┐
│ 1. You make changes to Terraform code                   │
│    (or someone changes infrastructure manually)        │
└─────────────────┬───────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────┐
│ 2. GitHub Actions workflow runs                         │
│    - Checks out code                                    │
│    - Runs terraform plan                                │
└─────────────────┬───────────────────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────────────────┐
│ 3. Drift Detection                                      │
│    - Did Terraform files change?                        │
│    - Does plan show changes?                            │
│    - If no code changes + plan changes = DRIFT!         │
└─────────────────┬───────────────────────────────────────┘
                  │
        ┌─────────┴─────────┐
        │                   │
        ▼                   ▼
┌──────────────┐   ┌──────────────────┐
│ DRIFT        │   │ EXPECTED CHANGES │
│ DETECTED     │   │ (Code updated)   │
└──────┬───────┘   └────────┬─────────┘
       │                    │
       ▼                    ▼
┌──────────────────┐   ┌──────────────────┐
│ Send Email       │   │ Apply directly   │
│ Create Issue     │   │ (No approval)    │
│ Wait for Approval│   └──────────────────┘
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ You Review       │
│ & Approve        │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Apply Changes    │
│ (terraform apply)│
└──────────────────┘

Key takeaways:

  • ✅ Always run terraform plan first (see what will happen)
  • ✅ Drift detection protects you from unexpected changes
  • ✅ Email notifications keep you informed
  • ✅ Manual approval prevents accidents
  • ✅ Destroy carefully - it's permanent!

Part 5: Server Configuration with Ansible

Terraform created your server, but it's just a blank Ubuntu machine. Now we need to:

  1. Install Docker
  2. Clone your code
  3. Start the application

That's where Ansible comes in.

Understanding Ansible

Ansible is like having a robot assistant that can:

  • SSH into your servers
  • Run commands
  • Install software
  • Copy files
  • Start services

Why Ansible over SSH scripts?

  • Idempotent: Run it multiple times safely (won't break if run twice)
  • Declarative: You say "what" you want, not "how" to do it
  • Organized: Roles and playbooks keep things organized
  • Reusable: Write once, use for dev/stg/prod

Ansible Playbook Structure

---
- name: Configure TODO Application Server
  hosts: all
  become: yes  # Use sudo
  gather_facts: yes  # Collect info about the server

  vars:
    app_user: ubuntu
    app_dir: /opt/todo-app

  roles:
    - role: dependencies  # Install Docker, etc.
    - role: deploy        # Deploy the application

Breaking it down:

  • hosts: all - Run on all servers in inventory
  • become: yes - Use sudo (needed for installing packages)
  • gather_facts - Ansible learns about the server (OS, IP, etc.)
  • roles - Reusable collections of tasks

Creating the Dependencies Role

This role installs everything the server needs:

roles/dependencies/tasks/main.yml:

---
- name: Update apt cache
  apt:
    update_cache: yes
    cache_valid_time: 3600

- name: Install required packages
  apt:
    name:
      - git
      - curl
      - python3
      - python3-pip
    state: present

- name: Check if Docker is already installed
  command: docker --version
  register: docker_check
  changed_when: false
  failed_when: false
  ignore_errors: yes

# Note: installing docker-ce via apt assumes Docker's official apt
# repository has already been added (omitted here for brevity)
- name: Install Docker (only if not installed)
  apt:
    name:
      - docker-ce
      - docker-ce-cli
      - containerd.io
      - docker-compose-plugin
    state: present
  when: docker_check.rc != 0  # Only if Docker not found

- name: Add user to docker group
  user:
    name: "{{ app_user }}"
    groups: docker
    append: yes

- name: Start and enable Docker
  systemd:
    name: docker
    state: started
    enabled: yes

Key concepts:

  • register - Save command output to a variable (see the example after this list)
  • when - Conditional execution (only if condition is true)
  • changed_when: false - This task never "changes" anything (just checks)
  • state: present - Ensure package is installed (idempotent!)
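
To see register in action, you could add a small optional task that reports what the check found - purely illustrative, not required for the role to work:

# Optional: report what the Docker check registered
- name: Report Docker status
  debug:
    msg: "Docker check rc={{ docker_check.rc | default('n/a') }} - {{ 'already installed' if docker_check.rc | default(1) == 0 else 'will be installed' }}"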

Creating the Deploy Role

This role actually deploys your application:

roles/deploy/tasks/main.yml:

---
- name: Create application directory
  file:
    path: "{{ app_dir }}"
    state: directory
    owner: "{{ app_user }}"
    group: "{{ app_user }}"
    mode: '0755'

- name: Clone repository
  git:
    repo: "{{ repo_url }}"
    dest: "{{ app_dir }}"
    version: "{{ repo_branch | default('main') }}"
    update: yes
  register: git_pull_result
  changed_when: git_pull_result.changed

- name: Create .env file
  copy:
    content: |
      DOMAIN="{{ domain }}"
      LETSENCRYPT_EMAIL="{{ letsencrypt_email }}"
      JWT_SECRET="{{ jwt_secret }}"
      # ... other variables
    dest: "{{ app_dir }}/.env"
    owner: "{{ app_user }}"
    mode: '0600'
  register: env_file_result
  changed_when: env_file_result.changed

- name: Determine if rebuild is needed
  set_fact:
    needs_rebuild: "{{ git_pull_result.changed | default(false) or env_file_result.changed | default(false) }}"

- name: Build images if code/config changed
  shell: docker compose build
  args:
    chdir: "{{ app_dir }}"
  when: needs_rebuild | default(false)

- name: Start/update containers
  shell: docker compose up -d
  args:
    chdir: "{{ app_dir }}"

Making it idempotent:

  • Only rebuilds if code or config changed
  • docker compose up -d is idempotent (won't restart if nothing changed)
  • Safe to run multiple times
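
As an optional sanity check after the containers start, a task like this can verify the frontend responds. It's a hedged sketch - the URL and retry numbers are assumptions about your setup:

# Optional post-deploy smoke test
- name: Wait for the frontend to respond
  uri:
    url: "https://{{ domain }}"
    status_code: 200
    validate_certs: no   # the certificate may still be provisioning on first run
  register: smoke_test
  until: smoke_test.status == 200
  retries: 10
  delay: 15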

Environment-Specific Variables

Just like Terraform, Ansible needs separate configuration files for each environment. Create three files:

group_vars/dev/vars.yml:

---
domain: "dev.yourdomain.com"
letsencrypt_email: "dev-email@example.com"
jwt_secret: "dev-secret-key"
repo_url: "https://github.com/yourusername/path-to-codebase.git"
repo_branch: "dev"  # Use dev branch for development

group_vars/stg/vars.yml:

---
domain: "stg.yourdomain.com"
letsencrypt_email: "staging-email@example.com"
jwt_secret: "staging-secret-key"  # Different from dev!
repo_url: "https://github.com/yourusername/path-to-codebase.git"
repo_branch: "staging"  # Use staging branch for staging

group_vars/prod/vars.yml:

---
domain: "yourdomain.com"
letsencrypt_email: "prod-email@example.com"
jwt_secret: "super-secure-production-secret"  # Different per environment!
repo_url: "https://github.com/yourusername/path-to-codebase.git"
repo_branch: "main"  # Use main branch for production

Why three separate files? Each environment needs:

  • Different domains: dev.yourdomain.com, stg.yourdomain.com, yourdomain.com
  • Different secrets: If dev gets compromised, staging and prod are still safe
  • Different branches:
    • dev branch → development environment (experimental features)
    • staging branch → staging environment (testing before production)
    • main branch → production environment (stable, tested code)
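
One caveat: these files hold secrets in plain text. An optional hardening step (not used elsewhere in this guide) is to encrypt individual values with Ansible Vault:

# Encrypt a single value; paste the output into group_vars/prod/vars.yml
ansible-vault encrypt_string 'super-secure-production-secret' --name 'jwt_secret'

# Then supply the vault password when running the playbook
ansible-playbook -i inventory/prod.yml playbook.yml --ask-vault-pass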

File structure:

infra/ansible/
├── playbook.yml
├── inventory/
│   ├── dev.yml
│   ├── stg.yml
│   └── prod.yml
└── group_vars/
    ├── dev/
    │   └── vars.yml      ← Development variables
    ├── stg/
    │   └── vars.yml      ← Staging variables
    └── prod/
        └── vars.yml      ← Production variables

Branch strategy explained:

  • Development (dev branch): Where you experiment and develop new features
  • Staging (staging branch): Where you test features before they go to production
  • Production (main branch): The stable code that real users interact with

This way, you can test changes in dev/staging without affecting production!

Generating Inventory

Terraform automatically generates the Ansible inventory:

templates/inventory.tpl:

all:
  hosts:
    todo-app-server:
      ansible_host: ${server_ip}
      ansible_user: ${server_user}
      ansible_ssh_private_key_file: ${ssh_key_path}
      ansible_ssh_common_args: '-o StrictHostKeyChecking=no'

This gets generated as ansible/inventory/dev.yml (or stg.yml, prod.yml) with the actual server IP.
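
On the Terraform side, the rendering typically looks something like this sketch - the resource and variable names here are assumptions; only the template variables match the file above:

# Sketch: render the template and write the inventory file
# (requires the hashicorp/local provider)
resource "local_file" "ansible_inventory" {
  content = templatefile("${path.module}/templates/inventory.tpl", {
    server_ip    = aws_instance.todo_app.public_ip
    server_user  = "ubuntu"
    ssh_key_path = var.ssh_private_key_path
  })
  filename = "${path.module}/../ansible/inventory/${var.environment}.yml"
}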

Running Ansible

# From the ansible directory
cd infra/ansible

# Run the playbook
ansible-playbook -i inventory/dev.yml playbook.yml

# With verbose output (for debugging)
ansible-playbook -i inventory/dev.yml playbook.yml -vvv

What happens:

  1. Ansible connects to your server via SSH
  2. Runs the dependencies role (installs Docker)
  3. Runs the deploy role (clones code, starts containers)
  4. Your application is live!
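
Before a real run, a dry run shows what would change without touching the server:

# Dry run: preview changes only
ansible-playbook -i inventory/dev.yml playbook.yml --check --diff

# Re-run only the deployment tasks (the CI/CD workflow later uses --tags deploy too)
ansible-playbook -i inventory/dev.yml playbook.yml --tags deploy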

Part 6: CI/CD with GitHub Actions

Now we automate everything. Instead of running commands manually, GitHub Actions does it for us.

Understanding CI/CD

CI (Continuous Integration): Automatically test and build when code changes
CD (Continuous Deployment): Automatically deploy when tests pass

Why CI/CD?

  • Consistency: Same process every time
  • Speed: Deploy in minutes, not hours
  • Safety: Automated tests catch bugs before production
  • History: See what was deployed when

Setting Up GitHub Secrets

Before workflows can run, they need credentials:

  1. Go to your GitHub repo → Settings → Secrets and variables → Actions
  2. Add these secrets:
    • AWS_ACCESS_KEY_ID - From your IAM user
    • AWS_SECRET_ACCESS_KEY - From your IAM user
    • TERRAFORM_STATE_BUCKET - Your S3 bucket name
    • TERRAFORM_STATE_LOCK_TABLE - Your DynamoDB table name
    • TERRAFORM_KEY_PAIR_NAME - Your EC2 key pair name
    • SSH_PRIVATE_KEY - Contents of your .pem file
    • EMAIL_TO - Where to send drift alerts
    • EMAIL_FROM - Your verified SES email

Infrastructure Workflow

This workflow runs when infrastructure code changes:

name: Infrastructure Deployment

on:
  push:
    paths:
      - 'infra/terraform/**'
      - 'infra/ansible/**'
  workflow_dispatch:
    inputs:
      environment:
        description: 'Environment (dev, stg, prod)'
        required: true
        type: choice
        options: [dev, stg, prod]

jobs:
  terraform-plan:
    name: Terraform Plan & Drift Detection
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        working-directory: infra/terraform
        run: terraform init -backend-config=...

      - name: Terraform Plan
        working-directory: infra/terraform
        run: terraform plan -out=tfplan

      - name: Check for Drift
        id: drift-check
        run: |
          # Detect if this is drift (infrastructure changed outside Terraform)
          # vs expected changes (Terraform code changed)

      - name: Send Drift Email
        if: steps.drift-check.outputs.change_type == 'drift'
        run: ./infra/ci-cd/scripts/email-notification.sh "$(cat drift_summary.txt)"

      - name: Manual Approval
        if: steps.drift-check.outputs.change_type == 'drift'
        id: manual-approval
        uses: trstringer/manual-approval@v1

      - name: Terraform Apply
        if: steps.drift-check.outputs.change_type != 'drift' || steps.manual-approval.outcome == 'success'
        working-directory: infra/terraform
        run: terraform apply -auto-approve tfplan

Drift Detection in CI/CD (Quick Reference)

Note: For a detailed explanation of drift detection, email setup, and GitHub approval, see the "Understanding Drift Detection" section earlier in this guide.

Quick summary:

  • Drift = Infrastructure changed outside Terraform
  • Detected automatically in CI/CD
  • Email sent + GitHub issue created
  • Manual approval required before applying

Application Deployment Workflow

Separate workflow for application code changes:

name: Application Deployment

on:
  push:
    paths:
      - 'frontend/**'
      - 'auth-api/**'
      - 'todos-api/**'
      - 'docker-compose.yml'
  workflow_dispatch:
    inputs:
      environment:
        description: 'Environment'
        type: choice
        options: [dev, stg, prod]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Get Server IP
        run: |
          # Find server by tag
          INSTANCE_ID=$(aws ec2 describe-instances ...)
          SERVER_IP=$(aws ec2 describe-instances ...)

      - name: Deploy with Ansible
        run: |
          ansible-playbook -i inventory/${ENV}.yml playbook.yml --tags deploy

Why separate workflows? Infrastructure changes are rare and need careful review. Application changes are frequent and should deploy quickly.
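
The elided "Get Server IP" step usually boils down to a tag-based lookup like this - a sketch, and the tag key/value are assumptions about how your instances are tagged:

# Look up the running instance's public IP by tag
SERVER_IP=$(aws ec2 describe-instances \
  --filters "Name=tag:Environment,Values=${ENV}" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PublicIpAddress" \
  --output text)
echo "Deploying to $SERVER_IP"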

Infrastructure Destruction Workflow

⚠️ CRITICAL: This workflow DESTROYS EVERYTHING. Use with extreme caution!

The destroy workflow is separate from the deployment workflow for safety. It has multiple confirmation steps to prevent accidental destruction.

How the Destroy Workflow Works

Step 1: Manual Trigger Only

  • Only runs when you manually trigger it (no automatic triggers)
  • Requires you to select the environment
  • Requires you to type "DESTROY" to confirm

Step 2: Validation

  • Checks that you typed "DESTROY" correctly (case-sensitive)
  • Prevents typos from accidentally destroying infrastructure

Step 3: State File Handling

  • Tries to download state from artifacts (most recent)
  • Falls back to S3 remote backend if artifacts missing
  • Imports resources if state is completely missing

Step 4: Destroy Plan

  • Shows you exactly what will be destroyed
  • Review this carefully before proceeding

Step 5: Destruction

  • Deletes all resources in the correct order
  • Handles dependencies (e.g., detaches volumes before deleting)

Step 6: Verification

  • Checks that everything was destroyed
  • Cleans up orphaned resources
  • Provides a summary

The Complete Destroy Workflow

Here's what the actual workflow looks like:

name: Infrastructure Destruction

on:
  workflow_dispatch:  # Manual trigger only - safe!
    inputs:
      environment:
        description: 'Environment to destroy (dev, stg, prod)'
        required: true
        type: choice
        options: [dev, stg, prod]
        default: 'dev'
      confirm_destroy:
        description: 'Type "DESTROY" to confirm (case-sensitive)'
        required: true
        type: string

jobs:
  validate-destroy:
    name: Validate Destruction Request
    runs-on: ubuntu-latest
    steps:
      - name: Validate confirmation
        run: |
          if [ "${{ github.event.inputs.confirm_destroy }}" != "DESTROY" ]; then
            echo "❌ Invalid confirmation. You must type 'DESTROY' to proceed."
            exit 1
          fi
          echo "✅ Destruction confirmed. Proceeding..."

  terraform-destroy:
    name: Destroy Infrastructure
    needs: validate-destroy
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Init (with S3 backend)
        working-directory: infra/terraform
        run: |
          terraform init \
            -backend-config="bucket=${{ secrets.TERRAFORM_STATE_BUCKET }}" \
            -backend-config="key=terraform-state/${{ github.event.inputs.environment }}/terraform.tfstate" \
            -backend-config="region=${{ secrets.AWS_REGION }}" \
            -backend-config="dynamodb_table=${{ secrets.TERRAFORM_STATE_LOCK_TABLE }}" \
            -backend-config="encrypt=true"

      - name: Terraform Plan Destroy
        working-directory: infra/terraform
        run: |
          terraform plan -destroy \
            -var-file=terraform.${{ github.event.inputs.environment }}.tfvars \
            -var="key_pair_name=${{ secrets.TERRAFORM_KEY_PAIR_NAME }}" \
            -out=destroy.tfplan
          echo ""
          echo "⚠️ DESTRUCTION PLAN SUMMARY:"
          terraform show -no-color destroy.tfplan | head -100

      - name: Terraform Destroy
        working-directory: infra/terraform
        run: |
          echo "🔥 Starting infrastructure destruction..."
          terraform destroy -auto-approve \
            -var-file=terraform.${{ github.event.inputs.environment }}.tfvars \
            -var="key_pair_name=${{ secrets.TERRAFORM_KEY_PAIR_NAME }}"

      - name: Verify Destruction
        working-directory: infra/terraform
        run: |
          echo "🔍 Verifying all resources are destroyed..."
          # Check for orphaned resources and clean them up

How to Use the Destroy Workflow

Step 1: Go to GitHub Actions

  1. Open your repository on GitHub
  2. Click the "Actions" tab
  3. Find "Infrastructure Destruction" in the workflow list

Step 2: Run the Workflow

  1. Click "Run workflow" button (top right)
  2. Select the environment you want to destroy:
    • ⚠️ Be very careful - make sure you select the right one!
    • Dev is usually safe to destroy
    • Staging should be destroyed carefully
    • NEVER destroy production unless absolutely necessary!
  3. In the "Type DESTROY to confirm" field, type exactly: DESTROY
    • Must be all caps
    • Must be exactly "DESTROY" (no extra spaces)
  4. Click "Run workflow"

Step 3: Watch It Run

  1. The workflow will start with a validation job
    • Checks that you typed "DESTROY" correctly
    • If wrong, workflow fails immediately (safe!)
  2. Then the terraform-destroy job runs:
    • Initializes Terraform with the correct backend
    • Creates a destroy plan (shows what will be deleted)
    • Review the plan carefully!
    • Destroys all resources
    • Verifies everything is gone

Step 4: Review the Results

  • Check the workflow logs
  • Verify in AWS Console that resources are gone
  • Check that costs are now $0.00

Safety Features

The destroy workflow has multiple safety features:

  1. Manual trigger only - Can't be triggered automatically
  2. Confirmation required - Must type "DESTROY" exactly
  3. Environment selection - Prevents destroying wrong environment
  4. Plan before destroy - Shows you what will be deleted
  5. Validation job - Double-checks confirmation before proceeding
  6. State file handling - Works with remote state (S3)
  7. Verification - Checks that everything was destroyed

What Gets Destroyed

When you run the destroy workflow, it deletes:

  • EC2 instance - Your server and everything on it
  • Security groups - Firewall rules
  • EBS volumes - Any attached storage (if using EBS for state)
  • All containers - Docker containers running on the instance
  • All data - Everything on the server is permanently lost

What stays:

  • S3 bucket - Your Terraform state bucket (not deleted)
  • DynamoDB table - State locking table (not deleted)
  • GitHub repository - Your code (not deleted)

When to Use the Destroy Workflow

Good reasons to destroy:

  • ✅ You're done with the project and want to stop costs
  • ✅ You want to start completely fresh
  • ✅ You're testing and need to clean up
  • ✅ You're moving to a different AWS account

Bad reasons to destroy:

  • ❌ Just to restart services (use Ansible instead)
  • ❌ To fix a small issue (fix the issue, don't destroy)
  • ❌ Because something isn't working (debug first)
  • ❌ In production without a backup plan

After Destruction

What happens:

  1. All infrastructure is deleted
  2. State file in S3 is updated (shows no resources)
  3. You stop paying for AWS resources
  4. All data is permanently lost

To recreate:

  1. Run the Infrastructure Deployment workflow again
  2. It will create everything from scratch
  3. You'll need to redeploy your application

Important notes:

  • State file history is preserved in S3
  • You can see what was destroyed in the workflow logs
  • GitHub Actions artifacts are kept for 90 days
  • You can manually delete artifacts if needed

Destroy Workflow vs Manual Destroy

Use the workflow when:

  • ✅ You want safety features (confirmation, validation)
  • ✅ You want to destroy from anywhere (don't need local setup)
  • ✅ You want an audit trail (GitHub Actions logs)
  • ✅ You're working with a team (everyone can see what happened)

Use manual destroy when:

  • ✅ You need to destroy specific resources only
  • ✅ You're debugging and need more control
  • ✅ You don't have GitHub Actions set up

Example: Destroying Development Environment

Let's walk through destroying a dev environment:

  1. Go to Actions → Infrastructure Destruction
  2. Click "Run workflow"
  3. Select environment: dev
  4. Type confirmation: DESTROY
  5. Click "Run workflow"

What you'll see:

✅ validate-destroy: Validation passed
✅ terraform-destroy: 
   - Terraform Init: Success
   - Terraform Plan Destroy: Shows what will be deleted
   - Terraform Destroy: Deleting resources...
   - Verify Destruction: All resources destroyed

After completion:

  • Check AWS Console → EC2 → No instances
  • Check AWS Console → Security Groups → No groups
  • Check AWS Billing → Costs should be $0.00

Troubleshooting Destroy Workflow

Issue: "Invalid confirmation"

  • Problem: You didn't type "DESTROY" exactly
  • Solution: Type exactly DESTROY (all caps, no spaces)

Issue: "State file not found"

  • Problem: State file is missing or in wrong location
  • Solution: Workflow will try to import resources automatically

Issue: "Resources still exist after destroy"

  • Problem: Some resources might be stuck
  • Solution: Check the verification step - it will try to clean up orphaned resources

Issue: "Can't destroy because of dependencies"

  • Problem: Resources have dependencies (e.g., volume attached)
  • Solution: Workflow handles this automatically (detaches volumes first)

Best Practices for Destruction

  1. Always destroy dev first - Test the workflow in dev before using in staging/prod
  2. Review the plan - Check what will be destroyed before confirming
  3. Backup important data - If you need any data, back it up first
  4. Destroy during off-hours - If others are using the environment
  5. Document why - Add a comment in the workflow run explaining why you destroyed
  6. Verify after - Check AWS Console to confirm everything is gone
  7. Clean up artifacts - Delete GitHub Actions artifacts if you want

Part 7: Single Command Deployment

The ultimate goal: one command that does everything.

How It Works

When you run:

terraform apply -var-file=terraform.dev.tfvars -auto-approve

Here's what happens behind the scenes:

  1. Terraform provisions infrastructure

    • Creates security group
    • Launches EC2 instance
    • Waits for instance to be ready
  2. Terraform generates Ansible inventory

    • Creates ansible/inventory/dev.yml with server IP
    • Ready for Ansible to use
  3. Terraform triggers Ansible (via null_resource)

    • Waits for SSH to be available
    • Runs Ansible playbook
    • Installs Docker
    • Clones repository
    • Starts containers
  4. Traefik gets SSL certificate

    • Contacts Let's Encrypt
    • Verifies domain ownership
    • Gets certificate
    • Enables HTTPS
  5. Application is live!

    • Frontend accessible at https://yourdomain.com
    • APIs at https://yourdomain.com/api/*

The Magic: null_resource

resource "null_resource" "ansible_provision" {
  triggers = {
    instance_id = aws_instance.todo_app.id
  }

  provisioner "local-exec" {
    command = <<-EOT
      # Wait for SSH
      until ssh ... 'echo "ready"'; do sleep 10; done

      # Run Ansible
      cd ../ansible
      ansible-playbook -i inventory/${var.environment}.yml playbook.yml
    EOT
  }
}

What is null_resource? It's a Terraform resource that doesn't create anything in AWS. It just runs a command. Perfect for triggering Ansible!

Why the wait? EC2 instances take 30-60 seconds to boot. We wait for SSH to be ready before running Ansible.
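
The elided wait loop can be expanded roughly like this - the key path, user, and SERVER_IP variable are assumptions:

# Keep trying SSH until the instance accepts connections
until ssh -o StrictHostKeyChecking=no -o ConnectTimeout=5 \
      -i ~/.ssh/your-key.pem ubuntu@"$SERVER_IP" 'echo ready' 2>/dev/null; do
  echo "Waiting for SSH to come up..."
  sleep 10
done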

Testing the Single Command

# Make sure you're in the terraform directory
cd infra/terraform

# Initialize (one-time setup)
terraform init -backend-config=...

# The magic command
terraform apply -var-file=terraform.dev.tfvars -auto-approve

# Watch it work!
# You'll see:
# 1. Security group created
# 2. EC2 instance launching
# 3. Waiting for SSH...
# 4. Running Ansible...
# 5. Application deployed!

Pro tip: The first run takes 5-10 minutes. Subsequent runs are faster (only changes what's needed).


Part 8: Multi-Environment Setup

Real applications need multiple environments. Here's how to set it up properly.

Why Multiple Environments?

  • Dev: Where you experiment (break things safely)
  • Staging: Mirror of production (test before going live)
  • Production: The real thing (users depend on it)

Environment Isolation

Each environment is completely separate:

  • Different EC2 instances
  • Different security groups
  • Different state files in S3
  • Different domains
  • Different secrets

Why isolation matters: If dev gets hacked, staging and prod are still safe. If you break dev, staging and prod keep running. This is why we have three separate environments!

Setting Up Per-Environment Configuration

Terraform:

  • terraform.dev.tfvars - Dev configuration
  • terraform.stg.tfvars - Staging configuration
  • terraform.prod.tfvars - Production configuration

Ansible:

  • group_vars/dev/vars.yml - Dev variables
  • group_vars/stg/vars.yml - Staging variables
  • group_vars/prod/vars.yml - Production variables

State files:

  • terraform-state/dev/terraform.tfstate
  • terraform-state/stg/terraform.tfstate
  • terraform-state/prod/terraform.tfstate

Deploying to Different Environments

Via GitHub Actions:

  1. Go to Actions → Infrastructure Deployment
  2. Click "Run workflow"
  3. Select environment (dev/stg/prod)
  4. Click "Run workflow"

Via command line:

# Dev
terraform apply -var-file=terraform.dev.tfvars -auto-approve

# Staging
terraform apply -var-file=terraform.stg.tfvars -auto-approve

# Production
terraform apply -var-file=terraform.prod.tfvars -auto-approve

Important: Always test in dev first! Never deploy to prod without testing.


Part 9: Common Issues and Solutions

Every project has issues. Here are the ones you'll likely encounter and how to fix them.

Issue 1: Let's Encrypt Certificate Errors

Symptom: Timeout during connect (likely firewall problem)

Causes:

  1. DNS not pointing to your server
  2. Security group blocking ports 80/443
  3. Firewall on the server blocking ports

Solutions:

# 1. Verify DNS
dig yourdomain.com
# Should show your server IP

# 2. Check security group
# AWS Console → EC2 → Security Groups
# Verify ports 80 and 443 allow 0.0.0.0/0

# 3. Check server firewall
sudo ufw status
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp

# 4. Switch to HTTP challenge (more reliable)
# In docker-compose.yml:
- "--certificatesresolvers.letsencrypt.acme.httpchallenge=true"
- "--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web"

Issue 2: Terraform State Lock

Symptom: Error acquiring the state lock

Cause: Another Terraform run is in progress, or a previous run crashed

Solution:

# Check what's locking
aws dynamodb scan --table-name terraform-state-lock

# If you're sure no one else is running Terraform:
terraform force-unlock <LOCK_ID>

# Be careful! Only do this if you're certain.

Issue 3: Ansible Connection Failed

Symptom: SSH connection failed or Permission denied

Causes:

  1. Security group doesn't allow SSH from your IP
  2. Wrong key pair
  3. Server not ready yet

Solutions:

# 1. Test SSH manually
ssh -i ~/.ssh/your-key.pem ubuntu@<server-ip>

# 2. Check security group
# Make sure it allows port 22 from your IP

# 3. Verify key pair name matches
# AWS Console → EC2 → Key Pairs
# Should match what's in terraform.tfvars

# 4. Wait longer (server might still be booting)
# EC2 instances take 1-2 minutes to be ready

Issue 4: Drift Detection Not Working

Symptom: Changes made manually but drift not detected

Check:

  1. Did Terraform files change? (That's "expected", not drift)
  2. Is state in S3? (Local state won't work properly)
  3. Check drift detection logic in workflow

Test drift:

  1. Manually add a tag to your EC2 instance in AWS Console
  2. Run the infrastructure workflow
  3. Should detect drift and send email

Issue 5: Containers Keep Restarting

Symptom: docker ps shows containers restarting constantly

Debug:

# Check logs
docker logs <container-name>

# Check all containers
docker compose logs

# Common causes:
# - Configuration error in .env
# - Port conflict
# - Missing environment variables
# - Application crash on startup
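
One optional way to make restart loops easier to spot is a healthcheck in docker-compose.yml. The service name, port, and path below are assumptions about your setup - adjust them to your app, and make sure the check command exists inside the image:

# Hypothetical healthcheck for one service
services:
  todos-api:
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:8082/health || exit 1"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 20s

With a healthcheck in place, docker ps shows (healthy) or (unhealthy) next to the container status, which narrows down which service is actually failing.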

Part 10: Best Practices and Security

Now that everything works, let's make it production-ready.

Security Best Practices

  1. Never commit secrets

    • Use GitHub Secrets
    • Use environment variables
    • Add .env to .gitignore (see the sketch after this list)
  2. Restrict SSH access

    • In production, set ssh_cidr to your IP only
    • Use YOUR_IP/32 format (e.g., 1.2.3.4/32)
  3. Use different secrets per environment

    • Dev JWT secret ≠ Staging JWT secret ≠ Prod JWT secret
    • If dev is compromised, staging and prod are still safe
  4. Enable MFA

    • On AWS account
    • On GitHub account
    • Extra layer of protection
  5. Regular updates

    • Keep Docker images updated
    • Keep system packages updated
    • Security patches are important!
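
A minimal .gitignore sketch that covers the usual offenders (adjust to your repo layout):

# Secrets and local configuration
.env
*.pem

# Terraform state and local working files (state lives in S3, never in git)
*.tfstate
*.tfstate.*
.terraform/
crash.log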

Infrastructure Best Practices

  1. Always use remote state

    • S3 + DynamoDB
    • Never commit state files
    • Enable versioning on S3 bucket
  2. Separate state per environment

    • Different S3 keys
    • Complete isolation
    • Can't accidentally affect prod from dev
  3. Use version constraints

    • In Terraform: version = "~> 5.0"
    • Prevents unexpected breaking changes (see the example after this list)
  4. Tag everything

    • Makes it easy to find resources
    • Helps with cost tracking
    • Required for organization
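
For reference, version constraints live in the terraform block - a minimal sketch:

terraform {
  required_version = ">= 1.5.0"   # assumed minimum; match the version you use locally

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"   # allow any 5.x, block a breaking 6.0 upgrade
    }
  }
}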

Deployment Best Practices

  1. Test in dev first

    • Always deploy to dev → staging → prod (in that order!)
    • Catch issues early before they reach production
    • Dev is for breaking things
  2. Review drift alerts

    • Don't ignore them!
    • Investigate unexpected changes
    • Could be security issue
  3. Use idempotent deployments

    • Safe to run multiple times
    • Ansible should be idempotent
    • Terraform is idempotent by design
  4. Monitor your infrastructure

    • Set up CloudWatch alarms
    • Monitor costs
    • Watch for unusual activity

Cost Optimization

  1. Right-size instances

    • Dev: t3.small (saves money)
    • Prod: t3.medium (enough power)
    • Don't over-provision
  2. Stop dev when not in use

    • Dev doesn't need to run 24/7
    • Stop instances when not testing (see the commands after this list)
    • Saves ~70% of costs
  3. Clean up unused resources

    • Delete old instances
    • Remove unused security groups
    • Regular cleanup prevents waste
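
Stopping and starting the dev instance is a one-liner each way (the instance ID is a placeholder). Keep in mind the public IP usually changes after a stop/start unless you attach an Elastic IP:

# Stop the dev instance when you're done for the day
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# Start it again when you need it
aws ec2 start-instances --instance-ids i-0123456789abcdef0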

Part 11: Going Further

You've built a solid foundation. Here's where to go next.

Monitoring and Observability

Add CloudWatch:

  • Monitor CPU, memory, disk
  • Set up alarms
  • Track costs

Add Application Monitoring:

  • Prometheus + Grafana
  • ELK stack for logs
  • APM tools (New Relic, Datadog)

Scaling

Horizontal Scaling:

  • Add load balancer
  • Multiple instances
  • Auto-scaling groups

Vertical Scaling:

  • Larger instance types
  • More CPU/RAM
  • Better performance

Backup and Disaster Recovery

Backup Strategy:

  • Database backups
  • State file backups (S3 versioning)
  • Configuration backups

Disaster Recovery:

  • Multi-region deployment
  • Automated failover
  • Recovery procedures

Advanced Topics

  • Kubernetes: Container orchestration at scale
  • Terraform Modules: Reusable infrastructure code
  • Ansible Roles: Shareable configuration
  • GitOps: Git as source of truth
  • Infrastructure Testing: Test your infrastructure code

Conclusion: What You've Accomplished

Let's take a moment to appreciate what you've built:

  • ✅ A microservices application running in containers
  • ✅ Automated infrastructure with Terraform
  • ✅ Automated deployment with Ansible
  • ✅ CI/CD pipelines that detect problems
  • ✅ Multi-environment setup (dev/staging/prod)
  • ✅ Secure HTTPS with automatic certificates
  • ✅ Single-command deployment that just works
  • ✅ Production-ready practices and security

This isn't just a tutorial project - this is real infrastructure that follows industry best practices. You can use this as a foundation for actual production applications.

Key Takeaways

  1. Infrastructure as Code saves time and prevents mistakes
  2. Automation is your friend - manual processes break
  3. Security isn't optional - build it in from the start
  4. Testing in dev/staging prevents production disasters
  5. Documentation (this blog post!) helps you and others

Next Steps

  1. Deploy your own project using this as a template
  2. Experiment - break things in dev, learn from it
  3. Share - help others learn what you've learned
  4. Iterate - improve based on real-world experience

Resources


Thank you for reading! If this helped you, please share it with others who might benefit. And if you have questions or run into issues, don't hesitate to reach out.

Happy deploying! 🚀


This guide was written as part of the HNG Internship Stage 6 DevOps task. The complete implementation is available on GitHub.
