How I built a production-ready DevOps pipeline for a microservices TODO application - and how you can too, even if you're just starting out.
Introduction: Why This Matters
If you're reading this, you've probably heard terms like "DevOps," "Infrastructure as Code," and "CI/CD" thrown around, but maybe you're not entirely sure what they mean or how they fit together. That's exactly where I was when I started.
This guide isn't just about completing a task - it's about understanding the why behind each decision, learning from common mistakes, and building something you can be proud of. By the end, you'll have deployed a real application to the cloud with automated infrastructure, proper security, and a professional workflow.
What you'll build:
- A microservices application with 5 different services (Vue.js, Go, Node.js, Java, Python)
- Automated cloud infrastructure using Terraform
- Server configuration and deployment with Ansible
- CI/CD pipelines that detect when things go wrong
- Multi-environment setup (dev, staging, production)
- Secure HTTPS with automatic SSL certificates
- A single command that deploys everything
What you'll learn:
- How containerization actually works (beyond just "docker run")
- Why infrastructure as code matters (and how it saves you from disasters)
- How to think about security in a cloud environment
- The importance of automation and what happens when you skip it
Let's dive in.
Part 1: Understanding What We're Building
Before we start writing code, let's understand what we're actually building. This isn't just a TODO app - it's a microservices architecture, which means instead of one big application, we have multiple small services that work together.
The Architecture
Think of it like a restaurant:
- Frontend (Vue.js) - The dining room where customers interact
- Auth API (Go) - The host who checks if you have a reservation (authentication)
- Todos API (Node.js) - The waiter who takes your order (manages your todos)
- Users API (Java) - The manager who knows all the customers (user management)
- Log Processor (Python) - The kitchen staff who process orders (background processing)
- Redis - The order board where everyone can see what's happening (message queue)
Each service runs in its own container, which is like giving each part of the restaurant its own kitchen. If the waiter (Todos API) has a problem, it doesn't crash the whole restaurant.
Why Containerization?
You might be thinking: "Why not just run everything on one server?" Great question! Here's why containers matter:
- Isolation: If one service crashes, others keep running
- Consistency: "It works on my machine" becomes "it works everywhere"
- Scalability: Need more power? Spin up more containers
- Portability: Move from AWS to Azure? Just change where you run containers
The Challenge
The real challenge isn't just getting containers to run - it's:
- Making sure they can talk to each other
- Securing them with HTTPS
- Automating deployment so you don't manually SSH into servers
- Detecting when someone changes things manually (drift detection)
- Managing multiple environments without chaos
That's what makes this a real-world project.
Part 2: Setting Up Your Development Environment
Before we write any code, let's make sure you have everything you need. Don't worry if some of these are new - I'll explain what each one does.
Required Accounts
GitHub Account (Free)
- This is where your code lives and where CI/CD runs
- Think of it as your code's home and your automation's brain
- Sign up at github.com if you don't have one
AWS Account (Free tier available)
- This is where your servers will run
- AWS has a free tier that's perfect for learning
- You'll need a credit card, but we'll stay within free limits
- Sign up at aws.amazon.com
Domain Name (Optional but recommended - ~$10-15/year)
- This is your website's address (like yourname.com)
- You can use services like Namecheap, GoDaddy, or Cloudflare
- Why you need it: Let's Encrypt (free SSL) requires a real domain
- Alternative: You can test with localhost, but you won't get real SSL
Installing Required Tools
Docker & Docker Compose
# On Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker.io docker-compose-plugin
# Verify installation
docker --version
docker compose version
What is Docker? Think of it as a shipping container for software. Just like shipping containers standardize how goods are transported, Docker standardizes how applications run.
Terraform (Version 1.5.0 or higher)
# Download from hashicorp.com or use package manager
wget https://releases.hashicorp.com/terraform/1.5.0/terraform_1.5.0_linux_amd64.zip
unzip terraform_1.5.0_linux_amd64.zip
sudo mv terraform /usr/local/bin/
# Verify
terraform version
What is Terraform? It's like a blueprint for your cloud infrastructure. Instead of clicking buttons in AWS console (which you'll forget), you write code that describes what you want, and Terraform makes it happen.
Ansible
sudo apt-get install ansible
# Verify
ansible --version
What is Ansible? Think of it as a remote control for servers. Instead of SSHing into each server and typing commands, you write a "playbook" that tells Ansible what to do, and it does it on all your servers.
AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# Configure with your credentials
aws configure
What is AWS CLI? It's a command-line interface to AWS. Instead of using the web console, you can control AWS from your terminal.
Setting Up AWS
This is where many beginners get stuck, so let's go through it step by step.
Step 1: Create an IAM User
Why not use your root account? Security best practice - root account has unlimited power. If it gets compromised, your entire AWS account is at risk.
- Go to AWS Console → IAM → Users
- Click "Create user"
- Name it something like terraform-user
- Important: Check "Provide user access to the AWS Management Console" only if you want console access; otherwise programmatic access for the CLI/API is enough
- Attach policies:
  - AmazonEC2FullAccess (for creating servers)
  - AmazonS3FullAccess (for storing Terraform state)
  - AmazonDynamoDBFullAccess (for state locking)
  - AmazonSESFullAccess (for email notifications)
- Save the Access Key ID and Secret Access Key - you'll need these!
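If you're more comfortable in the terminal, roughly the same setup can be done with the AWS CLI (run this as an admin identity; the user name terraform-user is just an example):
# Create the IAM user
aws iam create-user --user-name terraform-user
# Attach the four policies listed above
for policy in AmazonEC2FullAccess AmazonS3FullAccess AmazonDynamoDBFullAccess AmazonSESFullAccess; do
  aws iam attach-user-policy \
    --user-name terraform-user \
    --policy-arn "arn:aws:iam::aws:policy/${policy}"
done
# Create an access key pair (copy the output somewhere safe!)
aws iam create-access-key --user-name terraform-user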
Step 2: Create S3 Bucket for Terraform State
What is Terraform state? Terraform needs to remember what it created. This "memory" is stored in a state file. We put it in S3 so it's:
- Backed up automatically
- Accessible from anywhere
- Versioned (can see history)
- Go to S3 → Create bucket
- Name it something like yourname-terraform-state
- Important settings:
  - Region: Choose one (remember which one!)
  - Block Public Access: Keep all enabled (security)
  - Versioning: Enable this (so you can recover if state gets corrupted)
- Click Create
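The console steps above are all you need, but here's a rough CLI equivalent if you prefer scripting it (the bucket name and region are placeholders; outside us-east-1 you also need --create-bucket-configuration LocationConstraint=<region>):
# Create the bucket
aws s3api create-bucket --bucket yourname-terraform-state --region us-east-1
# Enable versioning (keeps state file history)
aws s3api put-bucket-versioning \
  --bucket yourname-terraform-state \
  --versioning-configuration Status=Enabled
# Block all public access
aws s3api put-public-access-block \
  --bucket yourname-terraform-state \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true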
Step 3: Create DynamoDB Table for State Locking
Why do we need locking? Imagine two people trying to deploy at the same time. Without locking, they might both try to create the same server, causing conflicts. DynamoDB prevents this.
- Go to DynamoDB → Create table
- Table name: terraform-state-lock
- Partition key: LockID (type: String)
- Table settings: Use default
- Capacity: On-demand (pay per request - perfect for this use case)
- Click Create
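Or, as a single CLI command (the table and key names match what the backend configuration expects later):
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST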
Step 4: Create EC2 Key Pair
What is a key pair? It's like a password, but more secure. Instead of typing a password, you use a private key file to authenticate.
- Go to EC2 → Key Pairs → Create key pair
- Name: my-terraform-key (or whatever you prefer)
- Key pair type: RSA
- Private key file format: .pem
- Click Create
- IMPORTANT: The .pem file downloads automatically. Save it somewhere safe! You'll need it to SSH into your servers.
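If you'd rather create the key pair from the CLI, something like this should work (the key name and path are examples - keep whatever you choose consistent with your .tfvars later):
# Create the key pair and save the private key locally
aws ec2 create-key-pair \
  --key-name my-terraform-key \
  --key-type rsa \
  --key-format pem \
  --query 'KeyMaterial' \
  --output text > ~/.ssh/my-terraform-key.pem
# Lock down permissions so SSH will accept the key
chmod 400 ~/.ssh/my-terraform-key.pem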
Step 5: Verify AWS CLI Works
# Test your credentials
aws sts get-caller-identity
# Should show your user ARN
If this works, you're all set! If not, check your aws configure settings.
Part 3: Containerizing Your Application
Now that your environment is set up, let's containerize the application. This is where the magic happens.
Understanding Dockerfiles
A Dockerfile is like a recipe. It tells Docker:
- What base image to start with (like choosing an operating system)
- What files to copy
- What commands to run
- What port to expose
- What command to run when the container starts
Creating Dockerfiles for All Services
Now, you might be wondering: "How do I know what Dockerfile to create for each service?" Great question! Let me show you the pattern.
The rule of thumb: Each service folder needs its own Dockerfile. Look at your project structure:
DevOps-deployment/
├── frontend/ → Needs Dockerfile (Vue.js)
├── auth-api/ → Needs Dockerfile (Go)
├── todos-api/ → Needs Dockerfile (Node.js)
├── users-api/ → Needs Dockerfile (Java)
└── log-message-processor/ → Needs Dockerfile (Python)
How to figure out what each service needs:
- Check what language/framework it uses (look for package.json, pom.xml, requirements.txt, go.mod)
- Find the entry point (usually server.js, main.go, main.py, or a compiled JAR)
- Determine the port it runs on (check the code or config files)
- Follow the pattern for that language
Frontend Dockerfile (Vue.js)
First, check the service:
cd frontend
ls -la
# You'll see: package.json, src/, public/
# This tells you: It's a Vue.js app that needs to be built
Vue.js apps are special - they compile to static HTML/CSS/JS files that need a web server. We use a "multi-stage build":
- Build stage: Use Node.js to compile the Vue app
- Runtime stage: Use nginx (lightweight web server) to serve the compiled files
# Step 1: Build the application
FROM node:18-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build
# Step 2: Serve it with nginx
FROM nginx:alpine
COPY --from=build /app/dist /usr/share/nginx/html
COPY nginx.conf /etc/nginx/conf.d/default.conf
EXPOSE 80
Breaking it down:
- FROM node:18-alpine AS build - Start with Node.js for building (the AS build names this stage)
- COPY package*.json ./ - Copy dependency files first (Docker caching optimization!)
- RUN npm install - Install dependencies
- RUN npm run build - Compile Vue.js to static files (creates the dist/ folder)
- FROM nginx:alpine - Start a NEW stage with nginx (much smaller image)
- COPY --from=build - Copy the built files from the build stage
- EXPOSE 80 - nginx serves on port 80
Why two stages? The build stage has Node.js + all build tools (~500MB). The runtime stage only has nginx + static files (~20MB). This makes the final image 25x smaller!
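You can check the size difference yourself once the image builds (the image tag here is just a local test name):
# Build the frontend image and print its final size
docker build -t frontend-test ./frontend
docker images frontend-test --format "{{.Repository}}: {{.Size}}"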
Auth API Dockerfile (Go)
Check the service:
cd auth-api
ls -la
# You'll see: go.mod, main.go
# This tells you: It's Go, entry point is main.go
Go is special - it compiles to a single binary file. No runtime needed! We also use multi-stage build:
# Build stage
FROM golang:1.21-alpine AS build
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o auth-api
# Runtime stage
FROM alpine:latest
RUN apk --no-cache add ca-certificates
COPY --from=build /app/auth-api /auth-api
EXPOSE 8081
CMD ["/auth-api"]
Breaking it down:
- FROM golang:1.21-alpine AS build - Go compiler for building
- RUN go mod download - Download Go dependencies
- RUN go build -o auth-api - Compile to a single binary file
- FROM alpine:latest - Tiny Linux (only 5MB!)
- COPY --from=build /app/auth-api - Copy the compiled binary
- CMD ["/auth-api"] - Run the binary
Key differences from Vue.js:
- Go compiles to a single binary (no runtime needed!)
- Final image is super small (~10MB vs ~500MB for Node.js)
- CGO_ENABLED=0 creates a static binary (no external dependencies)
Todos API Dockerfile (Node.js)
Check the service:
cd todos-api
ls -la
# You'll see: package.json, server.js, routes.js
# This tells you: It's Node.js, entry point is server.js
Check package.json for the start command:
{
"scripts": {
"start": "node server.js"
}
}
Node.js API Dockerfile (simpler than Vue.js - no build step needed):
FROM node:18-alpine
WORKDIR /app
# Copy dependency files first (Docker caching optimization)
COPY package*.json ./
# Install dependencies (production only for smaller image)
RUN npm ci --only=production
# Copy application code
COPY . .
# Expose the port (check server.js to see which port)
EXPOSE 8082
# Start the application
CMD ["node", "server.js"]
Why this pattern?
- npm ci --only=production - Faster and more reliable than npm install, and skips dev dependencies
- Copy package*.json first - If dependencies don't change, Docker reuses the cached layer
- node:18-alpine - Lightweight Node.js image
How to test it:
# Build the image
docker build -t todos-api ./todos-api
# Run it
docker run -p 8082:8082 todos-api
# Test it
curl http://localhost:8082
Users API Dockerfile (Java Spring Boot)
Check the service:
cd users-api
ls -la
# You'll see: pom.xml, src/
# This tells you: It's Java with Maven, needs to be compiled
Java services need two stages:
- Build stage - Compile the code
- Runtime stage - Run the compiled JAR
# Stage 1: Build
FROM maven:3.9-eclipse-temurin-17 AS build
WORKDIR /app
# Copy Maven config first (caching optimization)
COPY pom.xml .
# Download dependencies (cached if pom.xml doesn't change)
RUN mvn dependency:go-offline
# Copy source code
COPY src ./src
# Build the application
RUN mvn clean package -DskipTests
# Stage 2: Runtime
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app
# Install JAXB dependencies (needed for Java 17+)
RUN apk add --no-cache wget && \
mkdir -p /app/lib && \
wget -q -O /app/lib/jaxb-api.jar https://repo1.maven.org/maven2/javax/xml/bind/jaxb-api/2.3.1/jaxb-api-2.3.1.jar && \
wget -q -O /app/lib/jaxb-runtime.jar https://repo1.maven.org/maven2/org/glassfish/jaxb/jaxb-runtime/2.3.1/jaxb-runtime-2.3.1.jar && \
apk del wget
# Copy the built JAR from build stage
COPY --from=build /app/target/*.jar app.jar
EXPOSE 8083
# Run the Spring Boot application
ENTRYPOINT ["java", \
"--add-opens", "java.base/java.lang=ALL-UNNAMED", \
"--add-opens", "java.base/java.lang.reflect=ALL-UNNAMED", \
"--add-opens", "java.base/java.util=ALL-UNNAMED", \
"-cp", "app.jar:/app/lib/*", \
"org.springframework.boot.loader.JarLauncher"]
Why this is complex:
- Java needs compilation (Maven does this)
- Spring Boot creates a "fat JAR" (includes everything)
- Java 17+ removed some libraries (JAXB), so we add them back
- The --add-opens flags are needed for the Java 17+ module system
Don't worry if this looks complicated - Java Dockerfiles are the most complex. The pattern is always:
- Build stage: Install dependencies, compile
- Runtime stage: Copy compiled artifact, run it
Log Message Processor Dockerfile (Python)
Check the service:
cd log-message-processor
ls -la
# You'll see: requirements.txt, main.py
# This tells you: It's Python, entry point is main.py
Python Dockerfile:
FROM python:3.11-slim
WORKDIR /app
# Install build dependencies (needed to compile some Python packages)
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc \
g++ \
python3-dev \
&& rm -rf /var/lib/apt/lists/*
# Copy and install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Remove build dependencies (they're not needed at runtime)
RUN apt-get purge -y gcc g++ python3-dev && \
apt-get autoremove -y && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Copy application code
COPY . .
# Run the application
CMD ["python", "main.py"]
Why install then remove build dependencies?
- Some Python packages need to compile C extensions
- We install gcc and g++ to compile them
- After installation, we remove them (note: the ~200MB saving only shows up if the install and removal happen in a single RUN layer, or if you use a multi-stage build like the other services)
- The compiled packages still work without the compilers
Simpler alternative (if no C extensions needed):
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "main.py"]
The Pattern: How to Create Any Dockerfile
Here's the mental model:
- Identify the language → Check for language-specific files
  - package.json → Node.js
  - pom.xml or build.gradle → Java
  - requirements.txt → Python
  - go.mod → Go
  - Cargo.toml → Rust
- Find the base image → Use official images
  - Node.js → node:18-alpine
  - Java → eclipse-temurin:17-jre-alpine (runtime) or maven:3.9-eclipse-temurin-17 (build)
  - Python → python:3.11-slim
  - Go → golang:1.21-alpine (build) or alpine:latest (runtime)
- Copy dependencies first → Docker caching optimization
  - Copy package.json / pom.xml / requirements.txt / go.mod
  - Install dependencies
  - Then copy application code
- Expose the port → Check the code for which port it uses
- Set the command → How to start the application
Testing Each Dockerfile
Before adding to docker-compose, test each one:
# Test Frontend
cd frontend
docker build -t frontend-test .
docker run -p 8080:80 frontend-test
curl http://localhost:8080
# Test Auth API
cd ../auth-api
docker build -t auth-api-test .
docker run -p 8081:8081 auth-api-test
curl http://localhost:8081/health
# Test Todos API
cd ../todos-api
docker build -t todos-api-test .
docker run -p 8082:8082 todos-api-test
curl http://localhost:8082
# Test Users API
cd ../users-api
docker build -t users-api-test .
docker run -p 8083:8083 users-api-test
curl http://localhost:8083/health
# Test Log Processor
cd ../log-message-processor
docker build -t log-processor-test .
docker run log-processor-test
# (This might not have HTTP endpoint, check logs)
Common issues and fixes:
- "Module not found" or "Package not found"
  - Make sure you copied dependency files before installing
  - Check that requirements.txt / package.json is in the right place
- "Port already in use"
  - Another container is using that port
  - Use docker ps to see what's running
  - Stop it with docker stop <container-id>
- "Cannot connect to database" or "Connection refused"
  - Services need to be in the same Docker network
  - Use service names (e.g., redis), not localhost
  - Wait for dependencies to be ready (use depends_on in docker-compose)
- Image too large
  - Use multi-stage builds (build in one stage, copy artifacts to a smaller runtime stage)
  - Use alpine or slim base images
  - Remove build dependencies after installation
Creating docker-compose.yml
Now we need to orchestrate all these containers. That's where Docker Compose comes in.
Think of docker-compose.yml as a conductor's score - it tells all the musicians (containers) when to play, how to play together, and in what order.
Let's build it piece by piece:
Step 1: The Reverse Proxy (Traefik)
Traefik is like a smart receptionist at a hotel:
- It receives all incoming requests (guests)
- It looks at the URL and decides which service should handle it (which room)
- It automatically gets SSL certificates from Let's Encrypt (security badges)
- It handles HTTPS redirects (escorts HTTP guests to HTTPS)
services:
# Reverse Proxy - Routes traffic to the right service
traefik:
image: traefik:latest
container_name: traefik
command:
- "--api.insecure=true" # Enable dashboard (for debugging)
- "--providers.docker=true" # Watch Docker containers
- "--providers.docker.exposedbydefault=false" # Only expose containers with labels
- "--entrypoints.web.address=:80" # HTTP entry point
- "--entrypoints.websecure.address=:443" # HTTPS entry point
- "--certificatesresolvers.letsencrypt.acme.httpchallenge=true" # Use HTTP challenge
- "--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web" # Challenge on port 80
- "--certificatesresolvers.letsencrypt.acme.email=${LETSENCRYPT_EMAIL:-your-email@example.com}" # Email for cert
- "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json" # Where to store certs
ports:
- "80:80" # HTTP
- "443:443" # HTTPS
- "8080:8080" # Traefik dashboard (for debugging)
volumes:
- /var/run/docker.sock:/var/run/docker.sock:ro # Let Traefik see other containers
- ./letsencrypt:/letsencrypt # Store SSL certificates
networks:
- app-network
restart: unless-stopped # Auto-restart if it crashes
Key concepts:
- ports: "80:80" means "map host port 80 to container port 80"
- volumes: /var/run/docker.sock lets Traefik discover other containers automatically
- networks: app-network puts Traefik on the same network as the other services
Step 2: The Frontend Service
The Vue.js frontend needs to be built and served:
# Frontend - Vue.js application
frontend:
build:
context: ./frontend # Where the Dockerfile is
dockerfile: Dockerfile
container_name: frontend
environment:
- PORT=80 # Port the app runs on inside container
- AUTH_API_ADDRESS=http://auth-api:8081 # Use service name, not localhost!
- TODOS_API_ADDRESS=http://todos-api:8082
labels:
# Tell Traefik to route traffic to this service
- "traefik.enable=true"
# Route requests for your domain to this service
- "traefik.http.routers.frontend.rule=Host(`${DOMAIN:-yourdomain.com}`)"
# Use HTTPS
- "traefik.http.routers.frontend.entrypoints=websecure"
# Get SSL certificate automatically
- "traefik.http.routers.frontend.tls.certresolver=letsencrypt"
# Frontend runs on port 80 inside container
- "traefik.http.services.frontend.loadbalancer.server.port=80"
# Redirect HTTP to HTTPS
- "traefik.http.routers.frontend-redirect.rule=Host(`${DOMAIN:-yourdomain.com}`)"
- "traefik.http.routers.frontend-redirect.entrypoints=web"
- "traefik.http.routers.frontend-redirect.middlewares=redirect-to-https"
- "traefik.http.middlewares.redirect-to-https.redirectscheme.scheme=https"
networks:
- app-network
depends_on:
- auth-api
- todos-api
- users-api
Important points:
- build: context: ./frontend tells Docker to build from the frontend folder
- environment: sets variables the app can read
- AUTH_API_ADDRESS=http://auth-api:8081 - Notice we use auth-api (the service name), not localhost!
- depends_on: ensures these services start before the frontend
- Labels tell Traefik how to route traffic
Step 3: The Auth API (Go)
# Auth API - Handles authentication
auth-api:
build:
context: ./auth-api
dockerfile: Dockerfile
container_name: auth-api
environment:
- AUTH_API_PORT=8081
- USERS_API_ADDRESS=http://users-api:8083
- JWT_SECRET=${JWT_SECRET:-myfancysecret} # Secret key for tokens
- REDIS_URL=redis://redis:6379 # Redis connection
labels:
- "traefik.enable=true"
# Route /api/auth requests to this service
- "traefik.http.routers.auth.rule=Host(`${DOMAIN:-yourdomain.com}`) && PathPrefix(`/api/auth`)"
- "traefik.http.routers.auth.entrypoints=websecure"
- "traefik.http.routers.auth.tls.certresolver=letsencrypt"
- "traefik.http.services.auth.loadbalancer.server.port=8081"
# Also handle /login route (frontend calls this)
- "traefik.http.routers.auth-login.rule=Host(`${DOMAIN:-yourdomian.com}`) && (Path(`/login`) || PathPrefix(`/login/`))"
- "traefik.http.routers.auth-login.entrypoints=websecure"
- "traefik.http.routers.auth-login.tls.certresolver=letsencrypt"
- "traefik.http.routers.auth-login.service=auth"
networks:
- app-network
depends_on:
- redis # Needs Redis for session storage
Routing explained:
- PathPrefix(/api/auth) means any URL starting with /api/auth goes here
- Example: https://your-domain.com/api/auth/login → auth-api
- Path(/login) means exactly /login goes here
Step 4: The Todos API (Node.js)
# Todos API - Manages todo items
todos-api:
build:
context: ./todos-api
dockerfile: Dockerfile
container_name: todos-api
environment:
- PORT=8082
- AUTH_API_URL=http://auth-api:8081 # To validate tokens
- JWT_SECRET=${JWT_SECRET:-myfancysecret}
- REDIS_URL=redis://redis:6379
labels:
- "traefik.enable=true"
# Route /api/todos requests here
- "traefik.http.routers.todos.rule=Host(`${DOMAIN:-yourdomain.com}`) && PathPrefix(`/api/todos`)"
- "traefik.http.routers.todos.entrypoints=websecure"
- "traefik.http.routers.todos.tls.certresolver=letsencrypt"
- "traefik.http.services.todos.loadbalancer.server.port=8082"
networks:
- app-network
depends_on:
- redis
- auth-api # Needs auth-api to validate tokens
Step 5: The Users API (Java)
# Users API - Manages user accounts
users-api:
build:
context: ./users-api
dockerfile: Dockerfile
container_name: users-api
environment:
- SERVER_PORT=8083
- JWT_SECRET=${JWT_SECRET:-myfancysecret}
- REDIS_URL=redis://redis:6379
labels:
- "traefik.enable=true"
# Route /api/users requests here
- "traefik.http.routers.users.rule=Host(`${DOMAIN:-yourdomian.com}`) && PathPrefix(`/api/users`)"
- "traefik.http.routers.users.entrypoints=websecure"
- "traefik.http.routers.users.tls.certresolver=letsencrypt"
- "traefik.http.services.users.loadbalancer.server.port=8083"
networks:
- app-network
depends_on:
- redis
Step 6: The Log Processor (Python)
This service doesn't need Traefik routing - it's a background worker:
# Log Processor - Background worker that processes messages
log-message-processor:
build:
context: ./log-message-processor
dockerfile: Dockerfile
container_name: log-message-processor
environment:
- REDIS_HOST=redis # Use service name
- REDIS_PORT=6379
- REDIS_CHANNEL=log-messages
networks:
- app-network
depends_on:
- redis
restart: unless-stopped # Keep it running
Why no Traefik labels? This service doesn't serve HTTP requests - it just listens to Redis for messages.
Step 7: Supporting Services
# Redis - Message queue and cache
redis:
image: redis:7-alpine # Use pre-built image, no Dockerfile needed
container_name: redis
ports:
- "6379:6379" # Expose for debugging (optional)
volumes:
- redis-data:/data # Persist data
networks:
- app-network
restart: unless-stopped
# Zipkin handler - Service for /zipkin endpoint
zipkin-handler:
image: nginx:alpine
container_name: zipkin-handler
labels:
- "traefik.enable=true"
- "traefik.http.routers.zipkin.rule=Host(`${DOMAIN:-yourdomain.com}`) && PathPrefix(`/zipkin`)"
- "traefik.http.routers.zipkin.entrypoints=websecure"
- "traefik.http.routers.zipkin.tls.certresolver=letsencrypt"
- "traefik.http.services.zipkin.loadbalancer.server.port=80"
networks:
- app-network
command: >
sh -c "echo 'server {
listen 80;
location / {
return 200 \"OK\";
add_header Content-Type text/plain;
}
}' > /etc/nginx/conf.d/default.conf && nginx -g 'daemon off;'"
Step 8: Networks and Volumes
At the end of the file, define shared resources:
# Networks - How containers communicate
networks:
app-network:
driver: bridge # Default network type
# Volumes - Persistent storage
volumes:
redis-data: # Named volume for Redis data
Why networks? Containers on the same network can talk to each other using service names (like auth-api instead of IP addresses).
Why volumes? Data in containers is lost when they're removed. Volumes persist data.
Complete docker-compose.yml Structure
Here's the mental model:
docker-compose.yml
├── services (all your containers)
│ ├── traefik (reverse proxy)
│ ├── frontend (Vue.js app)
│ ├── auth-api (Go service)
│ ├── todos-api (Node.js service)
│ ├── users-api (Java service)
│ ├── log-message-processor (Python worker)
│ ├── redis (database/queue)
│ └── zipkin-handler (dummy endpoint)
├── networks (how they connect)
│ └── app-network
└── volumes (persistent storage)
└── redis-data
Understanding Traefik Labels (Deep Dive)
Labels are how you tell Traefik what to do. Let's break down a complex example:
labels:
- "traefik.enable=true" # Step 1: Enable Traefik for this service
- "traefik.http.routers.auth.rule=Host(`example.com`) && PathPrefix(`/api/auth`)" # Step 2: Define routing rule
- "traefik.http.routers.auth.entrypoints=websecure" # Step 3: Use HTTPS
- "traefik.http.routers.auth.tls.certresolver=letsencrypt" # Step 4: Get SSL cert
- "traefik.http.services.auth.loadbalancer.server.port=8081" # Step 5: Which port to forward to
Breaking it down:
- Router = A set of rules for routing traffic
- Rule = Conditions that must match (domain + path)
- Entrypoint = Which port/protocol (web = HTTP, websecure = HTTPS)
- Service = The actual container and port
- Middleware = Transformations (redirects, rewrites, etc.)
Example flow:
- User visits https://example.com/api/auth/login
- Traefik receives the request on port 443 (websecure entrypoint)
- Traefik checks rules: "Does this match Host(example.com) && PathPrefix(/api/auth)?" → Yes!
- Traefik forwards to the auth service on port 8081
- The auth-api container handles the request
Testing Your docker-compose.yml
Before deploying to the cloud, test locally:
Step 1: Create Environment File
Create a .env file in the root directory:
cat > .env <<EOF
DOMAIN=localhost
LETSENCRYPT_EMAIL=your-email@example.com
JWT_SECRET=test-secret-key-change-this-in-production
EOF
What's in .env?
- DOMAIN - Your domain name (use localhost for local testing)
- LETSENCRYPT_EMAIL - Email for SSL certificate notifications
- JWT_SECRET - Secret key for JWT tokens (use a strong random string in production)
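For production, don't invent the JWT secret by hand - generate a random one, for example with openssl:
# 64 hex characters, suitable for JWT_SECRET
openssl rand -hex 32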
Step 2: Start All Services
# Build and start all containers in the background
docker compose up -d
# Watch all logs in real-time
docker compose logs -f
# Or watch specific service logs
docker compose logs -f frontend
docker compose logs -f auth-api
What -d means: Detached mode - runs in the background so you can use your terminal.
Step 3: Verify Everything is Running
# Check status of all containers
docker compose ps
# You should see something like:
# NAME STATUS PORTS
# traefik Up 2 minutes 0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp
# frontend Up 2 minutes
# auth-api Up 2 minutes
# todos-api Up 2 minutes
# users-api Up 2 minutes
# log-message-processor Up 2 minutes
# redis Up 2 minutes 0.0.0.0:6379->6379/tcp
Step 4: Test Each Service
# Test frontend (should return HTML)
curl http://localhost
# Test auth API (should return "Not Found" - that's expected!)
curl http://localhost/api/auth
# Test todos API (should return "Invalid Token" - also expected!)
curl http://localhost/api/todos
# Test users API
curl http://localhost/api/users
# Check Traefik dashboard (optional)
# Open http://localhost:8080 in your browser
Expected responses:
- Frontend: HTML page (login screen)
- /api/auth without a specific path: "Not Found" (correct - needs a specific endpoint)
- /api/todos without auth: "Invalid Token" (correct - needs authentication)
- /api/users without auth: "Missing or invalid Authorization header" (correct)
Step 5: Test with Browser
- Open http://localhost in your browser
- You should see the login page
- Try logging in (if you have test credentials)
- Check browser console (F12) for any errors
Step 6: Check Logs for Errors
# View logs for a specific service
docker compose logs frontend
docker compose logs auth-api
docker compose logs traefik
# View last 100 lines
docker compose logs --tail=100 traefik
# Follow logs in real-time
docker compose logs -f traefik
What to look for:
- ✅ "Server started" or "Listening on port X" = Good!
- ❌ "Connection refused" = Service dependency not ready
- ❌ "Module not found" = Missing dependency in Dockerfile
- ❌ "Port already in use" = Another service is using that port
Step 7: Stop Everything
# Stop all containers
docker compose down
# Stop and remove volumes (clean slate)
docker compose down -v
# Stop and remove images too (full cleanup)
docker compose down --rmi all -v
Common Issues and Solutions
Issue 1: "Port 80 already in use"
Problem: Another service (like Apache, Nginx, or another Docker container) is using port 80.
Solution:
# Find what's using port 80
sudo lsof -i :80
# or
sudo netstat -tulpn | grep :80
# Stop the conflicting service
sudo systemctl stop apache2 # or nginx, or whatever it is
# Or change Traefik ports in docker-compose.yml:
ports:
- "8080:80" # Use 8080 instead of 80
- "8443:443" # Use 8443 instead of 443
Issue 2: "Build failed" or "Module not found"
Problem: Dockerfile has issues or dependencies are missing.
Solution:
# Build a specific service to see detailed errors
docker compose build frontend
# Check the Dockerfile syntax
# Make sure COPY commands are in the right order
# Make sure RUN commands install dependencies before copying code
Issue 3: "Container keeps restarting"
Problem: The application is crashing on startup.
Solution:
# Check why it's restarting
docker compose logs <service-name>
# Common causes:
# - Missing environment variables
# - Database/Redis not ready (add depends_on)
# - Port conflict
# - Missing files or dependencies
Issue 4: "Cannot connect to auth-api" or "Connection refused"
Problem: Services can't find each other.
Solution:
- ✅ Use service names (e.g., http://auth-api:8081), not localhost
- ✅ Make sure all services are on the same network (app-network)
- ✅ Check depends_on - services might be starting before their dependencies are ready
- ✅ Add health checks or wait scripts if needed
Issue 5: "SSL certificate error" or "Let's Encrypt failed"
Problem: Let's Encrypt can't verify your domain.
Solution:
- For local testing: Use localhost and HTTP only (remove the HTTPS redirect)
- For production: Make sure DNS points to your server
- Make sure ports 80 and 443 are open in the firewall
- Check Traefik logs: docker compose logs traefik
Quick Reference: docker-compose Commands
# Start services
docker compose up # Start and show logs
docker compose up -d # Start in background
# Stop services
docker compose stop # Stop but don't remove
docker compose down # Stop and remove containers
# View logs
docker compose logs # All services
docker compose logs -f # Follow (live updates)
docker compose logs <service> # Specific service
# Rebuild
docker compose build # Build all
docker compose build <service> # Build specific service
docker compose up --build # Build and start
# Status
docker compose ps # Show running containers
docker compose top # Show running processes
# Execute commands
docker compose exec <service> <command> # Run command in container
docker compose exec frontend sh # Get shell in frontend container
Part 4: Infrastructure as Code with Terraform
Now comes the infrastructure part. This is where many people get intimidated, but it's actually simpler than it seems.
Why Infrastructure as Code?
Imagine you're building a house. You could:
- Manual approach: Tell the builder "put a window here, a door there" every time
- Blueprint approach: Draw a blueprint once, builder follows it every time
Infrastructure as Code is the blueprint approach. Benefits:
- Reproducible: Same code = same infrastructure, every time
- Version controlled: See what changed and when
- Testable: Try changes without breaking production
- Documented: The code IS the documentation
Understanding Terraform Basics
Terraform uses a language called HCL (HashiCorp Configuration Language). It's designed to be human-readable.
Basic structure:
resource "aws_instance" "todo_app" {
ami = "ami-12345"
instance_type = "t3.medium"
}
This says: "Create an AWS EC2 instance resource, call it 'todo_app', with these properties."
Creating Your First Terraform Configuration
Let's build it step by step:
Step 1: Provider Configuration
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
backend "s3" {
# We'll configure this during init
}
}
provider "aws" {
region = var.aws_region
}
What's happening:
- required_version - Ensures everyone uses a compatible Terraform version
- required_providers - Tells Terraform which plugins to download
- backend "s3" - Where to store state (we'll configure this later)
- provider "aws" - Which cloud provider to use
Step 2: Data Sources (Getting Information)
Before creating resources, we often need to look things up:
data "aws_ami" "ubuntu" {
most_recent = true
owners = ["099720109477"] # Canonical (Ubuntu's publisher)
filter {
name = "name"
values = ["*ubuntu-jammy-22.04-amd64-server*"]
}
}
What is an AMI? Amazon Machine Image - it's like a template for a virtual machine. This code finds the latest Ubuntu 22.04 image.
Why use data sources? AMI IDs change in different regions. Instead of hardcoding ami-12345, we let Terraform find the right one.
Step 3: Security Group (Firewall Rules)
resource "aws_security_group" "todo_app" {
name = "todo-app-sg-${var.environment}"
description = "Security group for TODO application"
ingress {
description = "HTTP"
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"] # Allow from anywhere
}
ingress {
description = "HTTPS"
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
ingress {
description = "SSH"
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = [var.ssh_cidr] # Only from your IP (security!)
}
egress {
from_port = 0
to_port = 0
protocol = "-1" # All protocols
cidr_blocks = ["0.0.0.0/0"] # Allow all outbound
}
tags = {
Name = "todo-app-sg-${var.environment}"
Environment = var.environment
}
}
What is a security group? It's AWS's firewall. It controls what traffic can reach your server.
Breaking it down:
- ingress - Incoming traffic rules
- egress - Outgoing traffic rules
- cidr_blocks = ["0.0.0.0/0"] - From anywhere (0.0.0.0/0 means "everywhere")
- var.ssh_cidr - A variable (we'll set this to your IP for security)
Security tip: In production, restrict SSH to your IP only! Use a service like whatismyip.com to find your IP, then set ssh_cidr = "YOUR_IP/32".
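A quick way to get that value from your terminal (checkip.amazonaws.com is AWS's own IP-echo endpoint):
# Print your current public IP in the /32 CIDR form expected by ssh_cidr
echo "$(curl -s https://checkip.amazonaws.com)/32"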
Step 4: EC2 Instance (Your Server)
resource "aws_instance" "todo_app" {
ami = data.aws_ami.ubuntu.id
instance_type = var.instance_type
key_name = var.key_pair_name
vpc_security_group_ids = [aws_security_group.todo_app.id]
user_data = <<-EOF
#!/bin/bash
apt-get update
apt-get install -y python3 python3-pip
EOF
tags = {
Name = "todo-app-server-${var.environment}"
Environment = var.environment
Project = "hngi13-stage6"
}
lifecycle {
create_before_destroy = true
}
}
What's happening:
- ami - Which OS image to use (from our data source)
- instance_type - Server size (t3.medium = 2 vCPU, 4GB RAM)
- key_name - Which SSH key to install
- vpc_security_group_ids - Which firewall rules to apply
- user_data - Script that runs when the server starts (bootstrap script)
- lifecycle - Terraform behavior (create new before destroying old = zero downtime)
Step 5: Variables (Making It Flexible)
variable "aws_region" {
description = "AWS region"
type = string
default = "us-east-1"
}
variable "instance_type" {
description = "EC2 instance type"
type = string
default = "t3.medium"
}
variable "key_pair_name" {
description = "AWS Key Pair name"
type = string # Required - no default
}
variable "environment" {
description = "Environment name (dev, stg, prod)"
type = string
default = "dev"
validation {
condition = contains(["dev", "stg", "prod"], var.environment)
error_message = "Environment must be one of: dev, stg, prod"
}
}
Why variables? Makes your code reusable. Same code works for dev, staging, and production - just change the variables!
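You can also override any single variable at plan or apply time, which is handy for quick experiments without editing a .tfvars file:
# Same code, different instance size for this run only
terraform plan -var-file=terraform.dev.tfvars -var="instance_type=t3.small"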
Step 6: Outputs (Getting Information Back)
output "server_ip" {
description = "Public IP of the server"
value = aws_instance.todo_app.public_ip
}
output "ansible_inventory_path" {
description = "Path to generated Ansible inventory"
value = local_file.ansible_inventory.filename
}
What are outputs? After Terraform creates resources, you often need information about them (like the server IP). Outputs make that information available.
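Once terraform apply has run, you can read the outputs back from the CLI at any time:
# Show all outputs for the current state
terraform output
# Just the raw IP, useful in scripts (e.g. for SSH)
terraform output -raw server_ip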
Environment-Specific Configuration
Create separate .tfvars files for each environment. This is crucial - you'll have three files, one for each environment:
terraform.dev.tfvars:
environment = "dev"
aws_region = "us-east-1"
instance_type = "t3.small" # Smaller for dev (saves money)
key_pair_name = "my-terraform-key"
ssh_key_path = "~/.ssh/my-terraform-key.pem"
ssh_cidr = "0.0.0.0/0" # Less restrictive for dev
server_user = "ubuntu"
skip_ansible_provision = false # Run Ansible automatically
terraform.stg.tfvars:
environment = "stg"
aws_region = "us-east-1"
instance_type = "t3.small" # Can be same size as dev for staging
key_pair_name = "my-terraform-key"
ssh_key_path = "~/.ssh/my-terraform-key.pem"
ssh_cidr = "0.0.0.0/0" # Can be less restrictive than prod
server_user = "ubuntu"
skip_ansible_provision = false
terraform.prod.tfvars:
environment = "prod"
aws_region = "us-east-1"
instance_type = "t3.medium" # More power for production
key_pair_name = "my-terraform-key"
ssh_key_path = "~/.ssh/my-terraform-key.pem"
ssh_cidr = "YOUR_IP/32" # Restrict SSH to your IP only!
server_user = "ubuntu"
skip_ansible_provision = false
Why three separate files? Different environments have different needs:
- Dev: Smaller instance, less security (for quick testing)
- Staging: Similar to dev, but closer to production setup (for pre-production testing)
- Prod: Larger instance, maximum security (for real users)
File structure:
infra/terraform/
├── main.tf
├── variables.tf
├── outputs.tf
├── terraform.dev.tfvars ← Development environment
├── terraform.stg.tfvars ← Staging environment
└── terraform.prod.tfvars ← Production environment
Remote State Configuration
Remember the S3 bucket we created? Now we use it. Important: Each environment needs its own state file path!
For Development:
terraform init \
-backend-config="bucket=yourname-terraform-state" \
-backend-config="key=terraform-state/dev/terraform.tfstate" \
-backend-config="region=us-east-1" \
-backend-config="dynamodb_table=terraform-state-lock" \
-backend-config="encrypt=true"
For Staging:
terraform init \
-backend-config="bucket=yourname-terraform-state" \
-backend-config="key=terraform-state/stg/terraform.tfstate" \
-backend-config="region=us-east-1" \
-backend-config="dynamodb_table=terraform-state-lock" \
-backend-config="encrypt=true"
For Production:
terraform init \
-backend-config="bucket=yourname-terraform-state" \
-backend-config="key=terraform-state/prod/terraform.tfstate" \
-backend-config="region=us-east-1" \
-backend-config="dynamodb_table=terraform-state-lock" \
-backend-config="encrypt=true"
What this does:
- bucket - Where to store state (same bucket for all environments)
- key - File path in the bucket (different per environment!)
- dynamodb_table - For locking (prevents conflicts when multiple people run Terraform)
- encrypt=true - Encrypt state at rest (security)
Why separate keys per environment?
- terraform-state/dev/terraform.tfstate → Development infrastructure
- terraform-state/stg/terraform.tfstate → Staging infrastructure
- terraform-state/prod/terraform.tfstate → Production infrastructure
These are completely separate files. This means:
- ✅ Dev, staging, and prod infrastructure are isolated
- ✅ You can destroy dev without affecting staging or prod
- ✅ Each environment has its own state history
- ✅ No risk of accidentally modifying the wrong environment
Your First Terraform Run
Let's deploy to development first (always start with dev!):
# 1. Initialize (downloads providers, sets up backend)
terraform init \
-backend-config="bucket=yourname-terraform-state" \
-backend-config="key=terraform-state/dev/terraform.tfstate" \
-backend-config="region=us-east-1" \
-backend-config="dynamodb_table=terraform-state-lock" \
-backend-config="encrypt=true"
# 2. Plan (see what will be created - SAFE, doesn't change anything)
terraform plan -var-file=terraform.dev.tfvars
# 3. Apply (actually create resources)
terraform apply -var-file=terraform.dev.tfvars
What happens:
- terraform init - Downloads the AWS provider, configures the backend for the dev environment
- terraform plan - Shows you what will be created/changed/destroyed (dry run)
- terraform apply - Actually creates the resources
Pro tip: Always run plan first! It's like a dry run. Review the output carefully before applying.
For other environments, repeat the same steps but:
- Use the appropriate -backend-config="key=terraform-state/ENV/terraform.tfstate"
- Use the matching -var-file=terraform.ENV.tfvars
Example for staging:
terraform init \
-backend-config="bucket=yourname-terraform-state" \
-backend-config="key=terraform-state/stg/terraform.tfstate" \
-backend-config="region=us-east-1" \
-backend-config="dynamodb_table=terraform-state-lock" \
-backend-config="encrypt=true"
terraform plan -var-file=terraform.stg.tfvars
terraform apply -var-file=terraform.stg.tfvars
Understanding Drift Detection (Critical for Safety!)
What is drift? Imagine you have a blueprint for a house (Terraform code), but someone goes and changes the actual house (AWS infrastructure) without updating the blueprint. That's drift - your code and reality don't match anymore.
Real-world example:
- You deploy infrastructure with Terraform ✅
- Later, you manually add a tag to your EC2 instance in AWS Console 🏷️
- Terraform doesn't know about this change
- Next time you run Terraform, it sees the difference → DRIFT DETECTED!
Why drift is dangerous:
- 🔴 Security risk: Someone might have changed something maliciously
- 🔴 Data loss: Terraform might try to "fix" things and delete your changes
- 🔴 Confusion: You don't know what changed or why
- 🔴 Breaking changes: Manual changes might break your application
How drift detection works:
Think of it like a security guard checking your house:
- Terraform Plan = Security guard walks around and notes what's different
- Check Git History = Did you change the blueprint (Terraform files)?
- ✅ If yes → Expected changes (you updated the code)
- ❌ If no → DRIFT! (Someone changed infrastructure without updating code)
- Alert = Security guard calls you immediately
- Approval = You review and decide what to do
- Action = Apply changes or investigate further
The detection logic:
# Step 1: Run terraform plan to see what's different
# -detailed-exitcode returns 0 (no changes) or 2 (changes present)
terraform plan -detailed-exitcode -out=tfplan
PLAN_EXIT=$?
# Step 2: Check if Terraform files changed in this commit
TF_CHANGED=$(git diff HEAD~1 HEAD --name-only | grep -E '\.tf$|\.tfvars$' || true)
# Step 3: Determine the type of change
if [ -z "$TF_CHANGED" ] && [ "$PLAN_EXIT" -eq 2 ]; then
echo "🚨 DRIFT DETECTED!"
# Send email, create GitHub issue, wait for approval
else
echo "✅ Expected changes (code was updated)"
# Proceed automatically
fi
Setting Up Email Notifications for Drift
When drift is detected, you need to know immediately! That's where email notifications come in.
Step 1: Verify Your Email in AWS SES
AWS SES (Simple Email Service) is like a post office for your applications. First, you need to verify your email address:
- Go to AWS Console → SES (Simple Email Service)
- Click "Verified identities" → "Create identity"
- Choose "Email address"
- Enter your email (e.g., your-email@gmail.com)
- Click "Create identity"
- Check your email and click the verification link
Why verify? AWS prevents spam by requiring you to verify you own the email address.
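You can also trigger the verification email from the CLI if you prefer (the address is a placeholder):
# Sends a verification link to the address; click it to finish verification
aws ses verify-email-identity \
  --email-address your-email@example.com \
  --region us-east-1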
Step 2: Create the Email Notification Script
Create infra/ci-cd/scripts/email-notification.sh:
#!/bin/bash
# Email Notification Script for Terraform Drift
# Sends email alert when infrastructure drift is detected
set -e
DRIFT_SUMMARY="${1:-}"
if [ -z "$DRIFT_SUMMARY" ]; then
echo "Error: Drift summary not provided"
exit 1
fi
# Email configuration from environment variables
EMAIL_TO="${EMAIL_TO:-}"
EMAIL_FROM="${EMAIL_FROM:-}"
AWS_REGION="${AWS_REGION:-us-east-1}"
# GitHub Actions variables for workflow link
GITHUB_SERVER_URL="${GITHUB_SERVER_URL:-https://github.com}"
GITHUB_REPOSITORY="${GITHUB_REPOSITORY:-}"
GITHUB_RUN_ID="${GITHUB_RUN_ID:-}"
# Check if email is configured
if [ -z "$EMAIL_TO" ] || [ -z "$EMAIL_FROM" ]; then
echo "⚠️ Email not configured. Skipping email notification."
exit 0
fi
# Build workflow run URL if GitHub variables are available
WORKFLOW_URL=""
if [ -n "$GITHUB_REPOSITORY" ] && [ -n "$GITHUB_RUN_ID" ]; then
WORKFLOW_URL="${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}"
fi
# Create email body
SUBJECT="🚨 Terraform Drift Detected - Action Required"
BODY=$(cat <<EOF
Terraform infrastructure drift has been detected.
This means infrastructure was changed OUTSIDE of Terraform (e.g., manually in AWS Console).
Please review the changes and approve the deployment in GitHub Actions.
$(if [ -n "$WORKFLOW_URL" ]; then echo "🔗 View Workflow Run: $WORKFLOW_URL"; echo ""; fi)
Drift Summary:
$DRIFT_SUMMARY
---
This is an automated message from GitHub Actions.
EOF
)
# Send email via AWS SES
echo "📧 Sending drift alert email via AWS SES..."
aws ses send-email \
--region "$AWS_REGION" \
--from "$EMAIL_FROM" \
--to "$EMAIL_TO" \
--subject "$SUBJECT" \
--text "$BODY" \
|| echo "⚠️ Failed to send email. Check AWS credentials and SES configuration."
echo "✅ Email notification sent!"
Make it executable:
chmod +x infra/ci-cd/scripts/email-notification.sh
What this script does:
- Takes the drift summary as input
- Checks if email is configured
- Builds a nice email message with the drift details
- Sends it via AWS SES
- Includes a link to the GitHub workflow run
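Before wiring it into CI, you can test the script locally by exporting the variables it expects and passing a made-up drift summary (all values here are placeholders, and your AWS credentials must already be configured):
EMAIL_TO="you@example.com" \
EMAIL_FROM="verified-sender@example.com" \
AWS_REGION="us-east-1" \
./infra/ci-cd/scripts/email-notification.sh "aws_instance.todo_app: 1 resource changed outside Terraform"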
Step 3: Add GitHub Secrets
In your GitHub repository, go to Settings → Secrets and variables → Actions, and add:
- EMAIL_TO - Your email address (where to send alerts)
- EMAIL_FROM - Your verified SES email (must match the one you verified in AWS)
- AWS_ACCESS_KEY_ID - Your AWS access key
- AWS_SECRET_ACCESS_KEY - Your AWS secret key
- AWS_REGION - Your AWS region (e.g., us-east-1)
Security tip: Never commit these values to your repository! Always use GitHub Secrets.
Understanding GitHub Issue Approval
When drift is detected, the workflow creates a GitHub issue and waits for your approval. This is like a safety checkpoint.
How it works:
- Drift detected → Workflow pauses
- GitHub issue created → Contains:
  - What changed
  - Why it's drift (no code changes)
  - Link to workflow run
  - Plan summary
- You review → Check the issue
- You approve → Comment "approve" or click approve button
- Workflow continues → Terraform applies the changes
Example GitHub Issue:
🚨 REAL DRIFT DETECTED - Infrastructure Changed Outside Terraform (dev)
⚠️ CRITICAL: Real Infrastructure Drift Detected
Infrastructure has been modified outside of Terraform. This is unexpected.
Environment: dev
What happened:
- Terraform code files were NOT modified
- But infrastructure plan shows changes
- This indicates manual changes or changes from another process
Action Required:
1. Review the plan below
2. Investigate what caused the drift
3. Approve if changes are intentional, or revert if unauthorized
Plan Summary:
aws_instance.todo_app will be updated in-place
~ resource "aws_instance" "todo_app" {
~ tags = {
- "ManualTag" = "test" -> null
}
}
Workflow Run:
🔗 View Workflow Run: https://github.com/yourusername/repo/actions/runs/123456
Next Steps:
- Approve to apply these changes
- Or investigate and revert unauthorized changes
How to approve:
- Go to the GitHub issue (you'll get a notification)
- Review the changes carefully
- If changes are OK: Comment "approve" or click the approve button
- If changes are suspicious: Investigate first, then approve or revert
Why this matters:
- ✅ Prevents accidental changes
- ✅ Gives you time to investigate
- ✅ Creates an audit trail (who approved what, when)
- ✅ Protects production from unauthorized changes
Testing Drift Detection
Want to test if drift detection works? Here's how:
Step 1: Deploy infrastructure normally
terraform apply -var-file=terraform.dev.tfvars
Step 2: Manually change something in AWS Console
- Go to AWS Console → EC2 → Instances
- Find your instance
- Click "Tags" → "Manage tags"
- Add a new tag: TestTag = "drift-test"
- Save
(Screenshot: a similar manual change - port 8080 added via the AWS console - that Terraform would also flag as drift)
Step 3: Trigger the workflow
- Go to GitHub Actions
- Run the "Infrastructure Deployment" workflow
- Select "dev" environment
- Watch it detect drift!
Step 4: Check your email
- You should receive an email alert
- Check spam folder if you don't see it
Step 5: Approve in GitHub
- A GitHub issue should be created
- Review and approve
- Watch Terraform apply the changes
Destroying Infrastructure (When You Need to Start Over)
Sometimes you need to tear everything down and start fresh. Here's how to do it safely.
⚠️ WARNING: Destroying infrastructure will DELETE EVERYTHING:
- Your EC2 instance
- All data on the server
- Security groups
- Everything created by Terraform
Make sure you:
- ✅ Have backups if you need data
- ✅ Are destroying the right environment (dev, not prod!)
- ✅ Really want to delete everything
Method 1: Destroy via Command Line
# Step 1: Initialize Terraform (if not already done)
cd infra/terraform
terraform init \
-backend-config="bucket=yourname-terraform-state" \
-backend-config="key=terraform-state/dev/terraform.tfstate" \
-backend-config="region=us-east-1" \
-backend-config="dynamodb_table=terraform-state-lock" \
-backend-config="encrypt=true"
# Step 2: Plan the destruction (see what will be deleted)
terraform plan -destroy -var-file=terraform.dev.tfvars
# Step 3: Review the plan carefully!
# Make sure it's only deleting what you want
# Step 4: Destroy everything
terraform destroy -var-file=terraform.dev.tfvars
What happens:
- Terraform reads the state file
- Plans what needs to be destroyed
- Shows you the plan (review it!)
- Asks for confirmation (type yes)
- Deletes everything in reverse order
Method 2: Destroy via GitHub Actions
If you have a destroy workflow set up:
- Go to GitHub Actions
- Find "Destroy Infrastructure" workflow (if you have one)
- Click "Run workflow"
- Select environment (be careful - don't destroy prod!)
- Confirm and run
Example destroy workflow:
name: Destroy Infrastructure
on:
workflow_dispatch:
inputs:
environment:
description: 'Environment to destroy'
required: true
type: choice
options: [dev, stg, prod]
jobs:
destroy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ secrets.AWS_REGION }}
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
- name: Terraform Init
working-directory: infra/terraform
run: |
terraform init \
-backend-config="bucket=${{ secrets.TERRAFORM_STATE_BUCKET }}" \
-backend-config="key=terraform-state/${{ github.event.inputs.environment }}/terraform.tfstate" \
-backend-config="region=${{ secrets.AWS_REGION }}" \
-backend-config="dynamodb_table=${{ secrets.TERRAFORM_STATE_LOCK_TABLE }}" \
-backend-config="encrypt=true"
- name: Terraform Plan Destroy
working-directory: infra/terraform
run: |
terraform plan -destroy \
-var-file=terraform.${{ github.event.inputs.environment }}.tfvars \
-var="key_pair_name=${{ secrets.TERRAFORM_KEY_PAIR_NAME }}" \
-out=tfplan
- name: Manual Approval
  id: manual-approval
  uses: trstringer/manual-approval@v1
with:
secret: ${{ github.TOKEN }}
approvers: ${{ github.actor }}
minimum-approvals: 1
issue-title: "⚠️ DESTROY Infrastructure - ${{ github.event.inputs.environment }}"
issue-body: |
**⚠️ WARNING: Infrastructure Destruction Requested**
This will **DELETE ALL INFRASTRUCTURE** for environment: **${{ github.event.inputs.environment }}**
**This action cannot be undone!**
Review the plan and approve only if you're sure.
- name: Terraform Destroy
if: steps.manual-approval.outcome == 'success'
working-directory: infra/terraform
run: terraform apply -auto-approve tfplan
Safety features:
- ✅ Manual approval required (can't destroy by accident)
- ✅ Shows what will be destroyed
- ✅ Creates GitHub issue for review
- ✅ Environment selection (prevents destroying wrong env)
Method 3: Destroy Specific Resources
Don't want to destroy everything? You can target specific resources:
# Destroy only the EC2 instance (keep security group)
terraform destroy -target=aws_instance.todo_app -var-file=terraform.dev.tfvars
# Destroy only the security group
terraform destroy -target=aws_security_group.todo_app -var-file=terraform.dev.tfvars
When to use this:
- You want to recreate just one resource
- Something is broken and you want to rebuild it
- You're testing changes
After Destruction
After destroying, your state file still exists in S3, but it's empty (or has no resources). You can:
- Start fresh: Run terraform apply again to recreate everything
- Clean up state: Delete the state file from S3 (optional)
- Keep state: Leave it (Terraform will just create new resources)
Best practice: Keep the state file. It's useful for history and doesn't cost much.
Summary: The Complete Terraform Workflow
Here's the full picture of how everything works together:
┌─────────────────────────────────────────────────────────┐
│ 1. You make changes to Terraform code │
│ (or someone changes infrastructure manually) │
└─────────────────┬───────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 2. GitHub Actions workflow runs │
│ - Checks out code │
│ - Runs terraform plan │
└─────────────────┬───────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ 3. Drift Detection │
│ - Did Terraform files change? │
│ - Does plan show changes? │
│ - If no code changes + plan changes = DRIFT! │
└─────────────────┬───────────────────────────────────────┘
│
┌─────────┴─────────┐
│ │
▼ ▼
┌──────────────┐ ┌──────────────────┐
│ DRIFT │ │ EXPECTED CHANGES │
│ DETECTED │ │ (Code updated) │
└──────┬───────┘ └────────┬───────────┘
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ Send Email │ │ Apply directly │
│ Create Issue │ │ (No approval) │
│ Wait for Approval│ └──────────────────┘
└────────┬─────────┘
│
▼
┌──────────────────┐
│ You Review │
│ & Approve │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Apply Changes │
│ (terraform apply) │
└──────────────────┘
Key takeaways:
- ✅ Always run terraform plan first (see what will happen)
- ✅ Email notifications keep you informed
- ✅ Manual approval prevents accidents
- ✅ Destroy carefully - it's permanent!
Part 5: Server Configuration with Ansible
Terraform created your server, but it's just a blank Ubuntu machine. Now we need to:
- Install Docker
- Clone your code
- Start the application
That's where Ansible comes in.
Understanding Ansible
Ansible is like having a robot assistant that can:
- SSH into your servers
- Run commands
- Install software
- Copy files
- Start services
Why Ansible over SSH scripts?
- Idempotent: Run it multiple times safely (won't break if run twice)
- Declarative: You say "what" you want, not "how" to do it
- Organized: Roles and playbooks keep things organized
- Reusable: Write once, use for dev/stg/prod
Ansible Playbook Structure
---
- name: Configure TODO Application Server
hosts: all
become: yes # Use sudo
gather_facts: yes # Collect info about the server
vars:
app_user: ubuntu
app_dir: /opt/todo-app
roles:
- role: dependencies # Install Docker, etc.
- role: deploy # Deploy the application
Breaking it down:
- hosts: all - Run on all servers in the inventory
- become: yes - Use sudo (needed for installing packages)
- gather_facts - Ansible learns about the server (OS, IP, etc.)
- roles - Reusable collections of tasks
Creating the Dependencies Role
This role installs everything the server needs:
roles/dependencies/tasks/main.yml:
---
- name: Update apt cache
apt:
update_cache: yes
cache_valid_time: 3600
- name: Install required packages
apt:
name:
- git
- curl
- python3
- python3-pip
state: present
- name: Check if Docker is already installed
command: docker --version
register: docker_check
changed_when: false
failed_when: false
ignore_errors: yes
- name: Install Docker (only if not installed)
apt:
name:
- docker-ce
- docker-ce-cli
- containerd.io
- docker-compose-plugin
state: present
when: docker_check.rc != 0 # Only if Docker not found
- name: Add user to docker group
user:
name: "{{ app_user }}"
groups: docker
append: yes
- name: Start and enable Docker
systemd:
name: docker
state: started
enabled: yes
Key concepts:
- register - Save command output to a variable
- when - Conditional execution (only if the condition is true)
- changed_when: false - This task never "changes" anything (it just checks)
- state: present - Ensure the package is installed (idempotent!)
Creating the Deploy Role
This role actually deploys your application:
roles/deploy/tasks/main.yml:
---
- name: Create application directory
file:
path: "{{ app_dir }}"
state: directory
owner: "{{ app_user }}"
group: "{{ app_user }}"
mode: '0755'
- name: Clone repository
git:
repo: "{{ repo_url }}"
dest: "{{ app_dir }}"
version: "{{ repo_branch | default('main') }}"
update: yes
register: git_pull_result
changed_when: git_pull_result.changed
- name: Create .env file
copy:
content: |
DOMAIN="{{ domain }}"
LETSENCRYPT_EMAIL="{{ letsencrypt_email }}"
JWT_SECRET="{{ jwt_secret }}"
# ... other variables
dest: "{{ app_dir }}/.env"
owner: "{{ app_user }}"
mode: '0600'
register: env_file_result
changed_when: env_file_result.changed
- name: Determine if rebuild is needed
set_fact:
needs_rebuild: "{{ git_pull_result.changed | default(false) or env_file_result.changed | default(false) }}"
- name: Build images if code/config changed
shell: docker compose build
args:
chdir: "{{ app_dir }}"
when: needs_rebuild | default(false)
- name: Start/update containers
shell: docker compose up -d
args:
chdir: "{{ app_dir }}"
Making it idempotent:
- Only rebuilds if code or config changed
- docker compose up -d is idempotent (won't restart containers if nothing changed)
- Safe to run multiple times
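A simple way to sanity-check that idempotence: run the playbook twice and confirm the second run reports changed=0 for the deploy tasks. This is just a sketch; the inventory path matches the layout used later in this guide.
# First run: installs and deploys (expect several "changed" tasks)
ansible-playbook -i inventory/dev.yml playbook.yml
# Second run right after: nothing changed on the server, so expect changed=0
ansible-playbook -i inventory/dev.yml playbook.yml
# Optional dry run to preview what would change without touching the server
ansible-playbook -i inventory/dev.yml playbook.yml --check --diff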
Environment-Specific Variables
Just like Terraform, Ansible needs separate configuration files for each environment. Create three files:
group_vars/dev/vars.yml:
---
domain: "dev.yourdomain.com"
letsencrypt_email: "dev-email@example.com"
jwt_secret: "dev-secret-key"
repo_url: "https://github.com/yourusername/path-to-codebase.git"
repo_branch: "dev" # Use dev branch for development
group_vars/stg/vars.yml:
---
domain: "stg.yourdomain.com"
letsencrypt_email: "staging-email@example.com"
jwt_secret: "staging-secret-key" # Different from dev!
repo_url: "https://github.com/yourusername/path-to-codebase.git"
repo_branch: "staging" # Use staging branch for staging
group_vars/prod/vars.yml:
---
domain: "yourdomain.com"
letsencrypt_email: "prod-email@example.com"
jwt_secret: "super-secure-production-secret" # Different per environment!
repo_url: "https://github.com/yourusername/path-to-codebase.git"
repo_branch: "main" # Use main branch for production
Why three separate files? Each environment needs:
- Different domains: dev.yourdomain.com, stg.yourdomain.com, yourdomain.com
- Different secrets: If dev gets compromised, staging and prod are still safe
- Different branches:
  - dev branch → development environment (experimental features)
  - staging branch → staging environment (testing before production)
  - main branch → production environment (stable, tested code)
File structure:
infra/ansible/
├── playbook.yml
├── inventory/
│ ├── dev.yml
│ ├── stg.yml
│ └── prod.yml
└── group_vars/
├── dev/
│ └── vars.yml ← Development variables
├── stg/
│ └── vars.yml ← Staging variables
└── prod/
└── vars.yml ← Production variables
Branch strategy explained:
- Development (dev branch): Where you experiment and develop new features
- Staging (staging branch): Where you test features before they go to production
- Production (main branch): The stable code that real users interact with
This way, you can test changes in dev/staging without affecting production!
Generating Inventory
Terraform automatically generates the Ansible inventory:
templates/inventory.tpl:
all:
hosts:
todo-app-server:
ansible_host: ${server_ip}
ansible_user: ${server_user}
ansible_ssh_private_key_file: ${ssh_key_path}
ansible_ssh_common_args: '-o StrictHostKeyChecking=no'
This gets generated as ansible/inventory/dev.yml (or stg.yml, prod.yml) with the actual server IP.
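Here is roughly what that generation can look like on the Terraform side. This is a sketch, not the project's exact code: the resource name aws_instance.todo_app matches earlier examples, while var.ssh_private_key_path and var.environment are assumed variable names.
# Render the template with real values and write it where Ansible expects it.
resource "local_file" "ansible_inventory" {
  content = templatefile("${path.module}/templates/inventory.tpl", {
    server_ip    = aws_instance.todo_app.public_ip
    server_user  = "ubuntu"
    ssh_key_path = var.ssh_private_key_path
  })
  filename = "${path.module}/../ansible/inventory/${var.environment}.yml"
}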
Running Ansible
# From the ansible directory
cd infra/ansible
# Run the playbook
ansible-playbook -i inventory/dev.yml playbook.yml
# With verbose output (for debugging)
ansible-playbook -i inventory/dev.yml playbook.yml -vvv
What happens:
- Ansible connects to your server via SSH
- Runs the dependencies role (installs Docker)
- Runs the deploy role (clones code, starts containers)
- Your application is live!
Part 6: CI/CD with GitHub Actions
Now we automate everything. Instead of running commands manually, GitHub Actions does it for us.
Understanding CI/CD
CI (Continuous Integration): Automatically test and build when code changes
CD (Continuous Deployment): Automatically deploy when tests pass
Why CI/CD?
- Consistency: Same process every time
- Speed: Deploy in minutes, not hours
- Safety: Automated tests catch bugs before production
- History: See what was deployed when
Setting Up GitHub Secrets
Before workflows can run, they need credentials:
- Go to your GitHub repo → Settings → Secrets and variables → Actions
- Add these secrets:
  - AWS_ACCESS_KEY_ID - From your IAM user
  - AWS_SECRET_ACCESS_KEY - From your IAM user
  - AWS_REGION - The AWS region you deploy to (the workflows reference it)
  - TERRAFORM_STATE_BUCKET - Your S3 bucket name
  - TERRAFORM_STATE_LOCK_TABLE - Your DynamoDB table name
  - TERRAFORM_KEY_PAIR_NAME - Your EC2 key pair name
  - SSH_PRIVATE_KEY - Contents of your .pem file
  - EMAIL_TO - Where to send drift alerts
  - EMAIL_FROM - Your verified SES email
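If you'd rather not click through the UI, the GitHub CLI can set these from your terminal. This assumes gh is installed and authenticated, and the values shown are placeholders:
# Set repository secrets with the GitHub CLI (run from inside the repo)
gh secret set AWS_ACCESS_KEY_ID --body "YOUR_ACCESS_KEY_ID"
gh secret set AWS_SECRET_ACCESS_KEY --body "YOUR_SECRET_ACCESS_KEY"
# Read a multi-line value (like the .pem file) from a file via stdin
gh secret set SSH_PRIVATE_KEY < ~/.ssh/your-key.pem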
Infrastructure Workflow
This workflow runs when infrastructure code changes:
name: Infrastructure Deployment
on:
push:
paths:
- 'infra/terraform/**'
- 'infra/ansible/**'
workflow_dispatch:
inputs:
environment:
description: 'Environment (dev, stg, prod)'
required: true
type: choice
options: [dev, stg, prod]
jobs:
terraform-plan:
name: Terraform Plan & Drift Detection
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Configure AWS
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
- name: Terraform Init
working-directory: infra/terraform
run: terraform init -backend-config=...
- name: Terraform Plan
run: terraform plan -out=tfplan
- name: Check for Drift
id: drift-check
run: |
# Detect if this is drift (infrastructure changed outside Terraform)
# vs expected changes (Terraform code changed)
- name: Send Drift Email
if: steps.drift-check.outputs.change_type == 'drift'
run: ./infra/ci-cd/scripts/email-notification.sh "$(cat drift_summary.txt)"
- name: Manual Approval
if: steps.drift-check.outputs.change_type == 'drift'
uses: trstringer/manual-approval@v1
- name: Terraform Apply
if: steps.manual-approval.outcome == 'success'
run: terraform apply -auto-approve tfplan
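The "Check for Drift" step above is only a placeholder. One way to implement it, shown as a sketch rather than the repo's exact script: compare whether Terraform code changed in the push against whether the plan reports changes, using terraform plan -detailed-exitcode. It assumes an ENVIRONMENT variable is set by the workflow.
#!/usr/bin/env bash
# Classify pending changes as "drift" (no code change) or "expected" (code changed).
set -u

# Did this push touch Terraform code? (assumes the commit has a parent)
if git diff --name-only HEAD~1 HEAD | grep -q '^infra/terraform/'; then
  CODE_CHANGED=true
else
  CODE_CHANGED=false
fi

# Exit codes from -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present
terraform plan -detailed-exitcode \
  -var-file="terraform.${ENVIRONMENT}.tfvars" -out=tfplan
PLAN_EXIT=$?

if [ "$PLAN_EXIT" -eq 1 ]; then
  echo "Terraform plan failed"
  exit 1
elif [ "$PLAN_EXIT" -eq 2 ] && [ "$CODE_CHANGED" = false ]; then
  echo "change_type=drift" >> "$GITHUB_OUTPUT"      # plan changes without code changes
elif [ "$PLAN_EXIT" -eq 2 ]; then
  echo "change_type=expected" >> "$GITHUB_OUTPUT"   # code changed, so changes are expected
else
  echo "change_type=none" >> "$GITHUB_OUTPUT"
fi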
Drift Detection in CI/CD (Quick Reference)
Note: For a detailed explanation of drift detection, email setup, and GitHub approval, see the "Understanding Drift Detection" section earlier in this guide.
Quick summary:
- Drift = Infrastructure changed outside Terraform
- Detected automatically in CI/CD
- Email sent + GitHub issue created
- Manual approval required before applying
Application Deployment Workflow
Separate workflow for application code changes:
name: Application Deployment
on:
push:
paths:
- 'frontend/**'
- 'auth-api/**'
- 'todos-api/**'
- 'docker-compose.yml'
workflow_dispatch:
inputs:
environment:
description: 'Environment'
type: choice
options: [dev, stg, prod]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Get Server IP
run: |
# Find server by tag
INSTANCE_ID=$(aws ec2 describe-instances ...)
SERVER_IP=$(aws ec2 describe-instances ...)
- name: Deploy with Ansible
run: |
ansible-playbook -i inventory/${ENV}.yml playbook.yml --tags deploy
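The describe-instances calls above are truncated. A hedged expansion of that lookup, assuming instances carry a Name tag like todo-app-<env> (adapt the filter to whatever tags your Terraform code actually sets):
# Look up the public IP of the running instance for this environment by tag.
SERVER_IP=$(aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=todo-app-${ENV}" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PublicIpAddress" \
  --output text)
echo "Deploying to ${SERVER_IP}"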
Why separate workflows? Infrastructure changes are rare and need careful review. Application changes are frequent and should deploy quickly.
Infrastructure Destruction Workflow
⚠️ CRITICAL: This workflow DESTROYS EVERYTHING. Use with extreme caution!
The destroy workflow is separate from the deployment workflow for safety. It has multiple confirmation steps to prevent accidental destruction.
How the Destroy Workflow Works
Step 1: Manual Trigger Only
- Only runs when you manually trigger it (no automatic triggers)
- Requires you to select the environment
- Requires you to type "DESTROY" to confirm
Step 2: Validation
- Checks that you typed "DESTROY" correctly (case-sensitive)
- Prevents typos from accidentally destroying infrastructure
Step 3: State File Handling
- Tries to download state from artifacts (most recent)
- Falls back to S3 remote backend if artifacts missing
- Imports resources if state is completely missing
Step 4: Destroy Plan
- Shows you exactly what will be destroyed
- Review this carefully before proceeding
Step 5: Destruction
- Deletes all resources in the correct order
- Handles dependencies (e.g., detaches volumes before deleting)
Step 6: Verification
- Checks that everything was destroyed
- Cleans up orphaned resources
- Provides a summary
The Complete Destroy Workflow
Here's what the actual workflow looks like:
name: Infrastructure Destruction
on:
workflow_dispatch: # Manual trigger only - safe!
inputs:
environment:
description: 'Environment to destroy (dev, stg, prod)'
required: true
type: choice
options: [dev, stg, prod]
default: 'dev'
confirm_destroy:
description: 'Type "DESTROY" to confirm (case-sensitive)'
required: true
type: string
jobs:
validate-destroy:
name: Validate Destruction Request
runs-on: ubuntu-latest
steps:
- name: Validate confirmation
run: |
if [ "${{ github.event.inputs.confirm_destroy }}" != "DESTROY" ]; then
echo "❌ Invalid confirmation. You must type 'DESTROY' to proceed."
exit 1
fi
echo "✅ Destruction confirmed. Proceeding..."
terraform-destroy:
name: Destroy Infrastructure
needs: validate-destroy
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Configure AWS Credentials
uses: aws-actions/configure-aws-credentials@v4
with:
aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
aws-region: ${{ secrets.AWS_REGION }}
- name: Setup Terraform
uses: hashicorp/setup-terraform@v3
- name: Terraform Init (with S3 backend)
working-directory: infra/terraform
run: |
terraform init \
-backend-config="bucket=${{ secrets.TERRAFORM_STATE_BUCKET }}" \
-backend-config="key=terraform-state/${{ github.event.inputs.environment }}/terraform.tfstate" \
-backend-config="region=${{ secrets.AWS_REGION }}" \
-backend-config="dynamodb_table=${{ secrets.TERRAFORM_STATE_LOCK_TABLE }}" \
-backend-config="encrypt=true"
- name: Terraform Plan Destroy
working-directory: infra/terraform
run: |
terraform plan -destroy \
-var-file=terraform.${{ github.event.inputs.environment }}.tfvars \
-var="key_pair_name=${{ secrets.TERRAFORM_KEY_PAIR_NAME }}" \
-out=destroy.tfplan
echo ""
echo "⚠️ DESTRUCTION PLAN SUMMARY:"
terraform show -no-color destroy.tfplan | head -100
- name: Terraform Destroy
working-directory: infra/terraform
run: |
echo "🔥 Starting infrastructure destruction..."
terraform destroy -auto-approve \
-var-file=terraform.${{ github.event.inputs.environment }}.tfvars \
-var="key_pair_name=${{ secrets.TERRAFORM_KEY_PAIR_NAME }}"
- name: Verify Destruction
working-directory: infra/terraform
run: |
echo "🔍 Verifying all resources are destroyed..."
# Check for orphaned resources and clean them up
How to Use the Destroy Workflow
Step 1: Go to GitHub Actions
- Open your repository on GitHub
- Click the "Actions" tab
- Find "Infrastructure Destruction" in the workflow list
Step 2: Run the Workflow
- Click "Run workflow" button (top right)
- Select the environment you want to destroy:
- ⚠️ Be very careful - make sure you select the right one!
- Dev is usually safe to destroy
- Staging should be destroyed carefully
- NEVER destroy production unless absolutely necessary!
- In the "Type DESTROY to confirm" field, type exactly:
DESTROY- Must be all caps
- Must be exactly "DESTROY" (no extra spaces)
- Click "Run workflow"
Step 3: Watch It Run
- The workflow will start with a validation job
- Checks that you typed "DESTROY" correctly
- If wrong, workflow fails immediately (safe!)
- Then the terraform-destroy job runs:
- Initializes Terraform with the correct backend
- Creates a destroy plan (shows what will be deleted)
- Review the plan carefully!
- Destroys all resources
- Verifies everything is gone
Step 4: Review the Results
- Check the workflow logs
- Verify in AWS Console that resources are gone
- Check that costs are now $0.00
Safety Features
The destroy workflow has multiple safety features:
- Manual trigger only - Can't be triggered automatically
- Confirmation required - Must type "DESTROY" exactly
- Environment selection - Prevents destroying wrong environment
- Plan before destroy - Shows you what will be deleted
- Validation job - Double-checks confirmation before proceeding
- State file handling - Works with remote state (S3)
- Verification - Checks that everything was destroyed
What Gets Destroyed
When you run the destroy workflow, it deletes:
- ✅ EC2 instance - Your server and everything on it
- ✅ Security groups - Firewall rules
- ✅ EBS volumes - Any attached storage (if using EBS for state)
- ✅ All containers - Docker containers running on the instance
- ✅ All data - Everything on the server is permanently lost
What stays:
- ✅ S3 bucket - Your Terraform state bucket (not deleted)
- ✅ DynamoDB table - State locking table (not deleted)
- ✅ GitHub repository - Your code (not deleted)
When to Use the Destroy Workflow
Good reasons to destroy:
- ✅ You're done with the project and want to stop costs
- ✅ You want to start completely fresh
- ✅ You're testing and need to clean up
- ✅ You're moving to a different AWS account
Bad reasons to destroy:
- ❌ Just to restart services (use Ansible instead)
- ❌ To fix a small issue (fix the issue, don't destroy)
- ❌ Because something isn't working (debug first)
- ❌ In production without a backup plan
After Destruction
What happens:
- All infrastructure is deleted
- State file in S3 is updated (shows no resources)
- You stop paying for AWS resources
- All data is permanently lost
To recreate:
- Run the Infrastructure Deployment workflow again
- It will create everything from scratch
- You'll need to redeploy your application
Important notes:
- State file history is preserved in S3
- You can see what was destroyed in the workflow logs
- GitHub Actions artifacts are kept for 90 days
- You can manually delete artifacts if needed
Destroy Workflow vs Manual Destroy
Use the workflow when:
- ✅ You want safety features (confirmation, validation)
- ✅ You want to destroy from anywhere (don't need local setup)
- ✅ You want an audit trail (GitHub Actions logs)
- ✅ You're working with a team (everyone can see what happened)
Use manual destroy when:
- ✅ You need to destroy specific resources only
- ✅ You're debugging and need more control
- ✅ You don't have GitHub Actions set up
Example: Destroying Development Environment
Let's walk through destroying a dev environment:
- Go to Actions → Infrastructure Destruction
- Click "Run workflow"
- Select environment: dev
- Type confirmation: DESTROY
- Click "Run workflow"
What you'll see:
✅ validate-destroy: Validation passed
✅ terraform-destroy:
- Terraform Init: Success
- Terraform Plan Destroy: Shows what will be deleted
- Terraform Destroy: Deleting resources...
- Verify Destruction: All resources destroyed
After completion:
- Check AWS Console → EC2 → No instances
- Check AWS Console → Security Groups → No groups
- Check AWS Billing → Costs should be $0.00
Troubleshooting Destroy Workflow
Issue: "Invalid confirmation"
- Problem: You didn't type "DESTROY" exactly
- Solution: Type exactly DESTROY (all caps, no spaces)
Issue: "State file not found"
- Problem: State file is missing or in wrong location
- Solution: Workflow will try to import resources automatically
Issue: "Resources still exist after destroy"
- Problem: Some resources might be stuck
- Solution: Check the verification step - it will try to clean up orphaned resources
Issue: "Can't destroy because of dependencies"
- Problem: Resources have dependencies (e.g., volume attached)
- Solution: Workflow handles this automatically (detaches volumes first)
Best Practices for Destruction
- Always destroy dev first - Test the workflow in dev before using in staging/prod
- Review the plan - Check what will be destroyed before confirming
- Backup important data - If you need any data, back it up first
- Destroy during off-hours - If others are using the environment
- Document why - Add a comment in the workflow run explaining why you destroyed
- Verify after - Check AWS Console to confirm everything is gone
- Clean up artifacts - Delete GitHub Actions artifacts if you want
Part 7: Single Command Deployment
The ultimate goal: one command that does everything.
How It Works
When you run:
terraform apply -var-file=terraform.dev.tfvars -auto-approve
Here's what happens behind the scenes:
1. Terraform provisions infrastructure
   - Creates the security group
   - Launches the EC2 instance
   - Waits for the instance to be ready
2. Terraform generates the Ansible inventory
   - Creates ansible/inventory/dev.yml with the actual server IP
   - Ready for Ansible to use
3. Terraform triggers Ansible (via null_resource)
   - Waits for SSH to be available
   - Runs the Ansible playbook
   - Installs Docker
   - Clones the repository
   - Starts the containers
4. Traefik gets an SSL certificate
   - Contacts Let's Encrypt
   - Verifies domain ownership
   - Gets the certificate
   - Enables HTTPS
5. Application is live!
   - Frontend accessible at https://yourdomain.com
   - APIs at https://yourdomain.com/api/*
The Magic: null_resource
resource "null_resource" "ansible_provision" {
triggers = {
instance_id = aws_instance.todo_app.id
}
provisioner "local-exec" {
command = <<-EOT
# Wait for SSH
until ssh ... 'echo "ready"'; do sleep 10; done
# Run Ansible
cd ../ansible
ansible-playbook -i inventory/${var.environment}.yml playbook.yml
EOT
}
}
What is null_resource? It's a Terraform resource that doesn't create anything in AWS. It just runs a command. Perfect for triggering Ansible!
Why the wait? EC2 instances take 30-60 seconds to boot. We wait for SSH to be ready before running Ansible.
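The until ssh ... line is abbreviated in the snippet above. A fuller sketch of that wait loop, where the key path, user, and timeout are assumptions:
# Keep trying SSH until the instance answers, then hand off to Ansible.
until ssh -o StrictHostKeyChecking=no \
          -o ConnectTimeout=5 \
          -i ~/.ssh/your-key.pem \
          ubuntu@${SERVER_IP} 'echo ready' 2>/dev/null; do
  echo "Waiting for SSH to come up..."
  sleep 10
done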
Testing the Single Command
# Make sure you're in the terraform directory
cd infra/terraform
# Initialize (one-time setup)
terraform init -backend-config=...
# The magic command
terraform apply -var-file=terraform.dev.tfvars -auto-approve
# Watch it work!
# You'll see:
# 1. Security group created
# 2. EC2 instance launching
# 3. Waiting for SSH...
# 4. Running Ansible...
# 5. Application deployed!
Pro tip: The first run takes 5-10 minutes. Subsequent runs are faster (only changes what's needed).
Part 8: Multi-Environment Setup
Real applications need multiple environments. Here's how to set it up properly.
Why Multiple Environments?
- Dev: Where you experiment (break things safely)
- Staging: Mirror of production (test before going live)
- Production: The real thing (users depend on it)
Environment Isolation
Each environment is completely separate:
- Different EC2 instances
- Different security groups
- Different state files in S3
- Different domains
- Different secrets
Why isolation matters: If dev gets hacked, staging and prod are still safe. If you break dev, staging and prod keep running. This is why we have three separate environments!
Setting Up Per-Environment Configuration
Terraform:
- terraform.dev.tfvars - Dev configuration
- terraform.stg.tfvars - Staging configuration
- terraform.prod.tfvars - Production configuration
Ansible:
- group_vars/dev/vars.yml - Dev variables
- group_vars/stg/vars.yml - Staging variables
- group_vars/prod/vars.yml - Production variables
State files:
- terraform-state/dev/terraform.tfstate
- terraform-state/stg/terraform.tfstate
- terraform-state/prod/terraform.tfstate
Deploying to Different Environments
Via GitHub Actions:
- Go to Actions → Infrastructure Deployment
- Click "Run workflow"
- Select environment (dev/stg/prod)
- Click "Run workflow"
Via command line:
# Dev
terraform apply -var-file=terraform.dev.tfvars -auto-approve
# Staging
terraform apply -var-file=terraform.stg.tfvars -auto-approve
# Production
terraform apply -var-file=terraform.prod.tfvars -auto-approve
Important: Always test in dev first! Never deploy to prod without testing.
Part 9: Common Issues and Solutions
Every project has issues. Here are the ones you'll likely encounter and how to fix them.
Issue 1: Let's Encrypt Certificate Errors
Symptom: Timeout during connect (likely firewall problem)
Causes:
- DNS not pointing to your server
- Security group blocking ports 80/443
- Firewall on the server blocking ports
Solutions:
# 1. Verify DNS
dig yourdomain.com
# Should show your server IP
# 2. Check security group
# AWS Console → EC2 → Security Groups
# Verify ports 80 and 443 allow 0.0.0.0/0
# 3. Check server firewall
sudo ufw status
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
# 4. Switch to HTTP challenge (more reliable)
# In docker-compose.yml:
- "--certificatesresolvers.letsencrypt.acme.httpchallenge=true"
- "--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web"
Issue 2: Terraform State Lock
Symptom: Error acquiring the state lock
Cause: Another Terraform run is in progress, or a previous run crashed
Solution:
# Check what's locking
aws dynamodb scan --table-name terraform-state-lock
# If you're sure no one else is running Terraform:
terraform force-unlock <LOCK_ID>
# Be careful! Only do this if you're certain.
Issue 3: Ansible Connection Failed
Symptom: SSH connection failed or Permission denied
Causes:
- Security group doesn't allow SSH from your IP
- Wrong key pair
- Server not ready yet
Solutions:
# 1. Test SSH manually
ssh -i ~/.ssh/your-key.pem ubuntu@<server-ip>
# 2. Check security group
# Make sure it allows port 22 from your IP
# 3. Verify key pair name matches
# AWS Console → EC2 → Key Pairs
# Should match what's in terraform.tfvars
# 4. Wait longer (server might still be booting)
# EC2 instances take 1-2 minutes to be ready
Issue 4: Drift Detection Not Working
Symptom: Changes made manually but drift not detected
Check:
- Did Terraform files change? (That's "expected", not drift)
- Is state in S3? (Local state won't work properly)
- Check drift detection logic in workflow
Test drift:
- Manually add a tag to your EC2 instance in AWS Console
- Run the infrastructure workflow
- Should detect drift and send email
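If you'd rather create that test drift from the command line instead of the console, something like this works (the instance ID is a placeholder):
# Add a harmless tag outside Terraform to simulate drift, then run the workflow.
aws ec2 create-tags \
  --resources i-0123456789abcdef0 \
  --tags Key=DriftTest,Value=manual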
Issue 5: Containers Keep Restarting
Symptom: docker ps shows containers restarting constantly
Debug:
# Check logs
docker logs <container-name>
# Check all containers
docker compose logs
# Common causes:
# - Configuration error in .env
# - Port conflict
# - Missing environment variables
# - Application crash on startup
Part 10: Best Practices and Security
Now that everything works, let's make it production-ready.
Security Best Practices
1. Never commit secrets
   - Use GitHub Secrets
   - Use environment variables
   - Add .env to .gitignore
2. Restrict SSH access
   - In production, set ssh_cidr to your IP only
   - Use the YOUR_IP/32 format (e.g., 1.2.3.4/32)
3. Use different secrets per environment
   - Dev JWT secret ≠ Staging JWT secret ≠ Prod JWT secret
   - If dev is compromised, staging and prod are still safe
4. Enable MFA
   - On your AWS account
   - On your GitHub account
   - Extra layer of protection
5. Regular updates
   - Keep Docker images updated
   - Keep system packages updated
   - Security patches are important!
Infrastructure Best Practices
1. Always use remote state
   - S3 + DynamoDB
   - Never commit state files
   - Enable versioning on the S3 bucket
2. Separate state per environment
   - Different S3 keys
   - Complete isolation
   - Can't accidentally affect prod from dev
3. Use version constraints
   - In Terraform: version = "~> 5.0" (see the sketch after this list)
   - Prevents unexpected breaking changes
4. Tag everything
   - Makes it easy to find resources
   - Helps with cost tracking
   - Required for organization
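For reference, a version-constraint block typically looks like this; the versions shown are illustrative, not a specific recommendation for this project:
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # any 5.x release, but never a 6.0 breaking change
    }
  }
}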
Deployment Best Practices
1. Test in dev first
   - Always deploy to dev → staging → prod (in that order!)
   - Catch issues early, before they reach production
   - Dev is for breaking things
2. Review drift alerts
   - Don't ignore them!
   - Investigate unexpected changes
   - They could be a security issue
3. Use idempotent deployments
   - Safe to run multiple times
   - Ansible should be idempotent
   - Terraform is idempotent by design
4. Monitor your infrastructure
   - Set up CloudWatch alarms
   - Monitor costs
   - Watch for unusual activity
Cost Optimization
1. Right-size instances
   - Dev: t3.small (saves money)
   - Prod: t3.medium (enough power)
   - Don't over-provision
2. Stop dev when not in use
   - Dev doesn't need to run 24/7
   - Stop instances when you're not testing (see the sketch after this list)
   - Saves roughly 70% of dev costs
3. Clean up unused resources
   - Delete old instances
   - Remove unused security groups
   - Regular cleanup prevents waste
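Stopping the dev box can even be a one-liner you run at the end of the day. This is a sketch that assumes the same Name-tag convention as the earlier examples:
# Find the running dev instance by tag and stop it (start it again when needed).
INSTANCE_ID=$(aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=todo-app-dev" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId" --output text)

aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
# aws ec2 start-instances --instance-ids "$INSTANCE_ID"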
Part 11: Going Further
You've built a solid foundation. Here's where to go next.
Monitoring and Observability
Add CloudWatch:
- Monitor CPU, memory, disk
- Set up alarms
- Track costs
Add Application Monitoring:
- Prometheus + Grafana
- ELK stack for logs
- APM tools (New Relic, Datadog)
Scaling
Horizontal Scaling:
- Add load balancer
- Multiple instances
- Auto-scaling groups
Vertical Scaling:
- Larger instance types
- More CPU/RAM
- Better performance
Backup and Disaster Recovery
Backup Strategy:
- Database backups
- State file backups (S3 versioning)
- Configuration backups
Disaster Recovery:
- Multi-region deployment
- Automated failover
- Recovery procedures
Advanced Topics
- Kubernetes: Container orchestration at scale
- Terraform Modules: Reusable infrastructure code
- Ansible Roles: Shareable configuration
- GitOps: Git as source of truth
- Infrastructure Testing: Test your infrastructure code
Conclusion: What You've Accomplished
Let's take a moment to appreciate what you've built:
✅ A microservices application running in containers
✅ Automated infrastructure with Terraform
✅ Automated deployment with Ansible
✅ CI/CD pipelines that detect problems
✅ Multi-environment setup (dev/staging/prod)
✅ Secure HTTPS with automatic certificates
✅ Single-command deployment that just works
✅ Production-ready practices and security
This isn't just a tutorial project - this is real infrastructure that follows industry best practices. You can use this as a foundation for actual production applications.
Key Takeaways
- Infrastructure as Code saves time and prevents mistakes
- Automation is your friend - manual processes break
- Security isn't optional - build it in from the start
- Testing in dev/staging prevents production disasters
- Documentation (this blog post!) helps you and others
Next Steps
- Deploy your own project using this as a template
- Experiment - break things in dev, learn from it
- Share - help others learn what you've learned
- Iterate - improve based on real-world experience
Resources
- Terraform Documentation
- Ansible Documentation
- Traefik Documentation
- GitHub Actions Documentation
- AWS Well-Architected Framework
Thank you for reading! If this helped you, please share it with others who might benefit. And if you have questions or run into issues, don't hesitate to reach out.
Happy deploying! 🚀
This guide was written as part of the HNG Internship Stage 6 DevOps task. The complete implementation is available on GitHub.





