In the modern DevOps landscape, manual infrastructure management and application deployment are rapidly becoming obsolete. This comprehensive guide walks you through building a complete, production-ready CI/CD pipeline for a microservices application, covering infrastructure provisioning, automated deployments, drift detection, and continuous delivery—all using industry-standard DevOps tools and best practices.
[TABLE OF CONTENTS]
- Project Goals and Overview
- System Architecture and Design
- Infrastructure as Code with Terraform
- CI/CD Pipeline Implementation with GitHub Actions
- Configuration Management with Ansible
- Container Orchestration with Docker Compose
- Security Implementation and Best Practices
- Observability and Distributed Tracing
- Lessons Learned and Key Takeaways
- Challenges Encountered and Solutions
- Future Improvements and Roadmap
[PROJECT GOALS AND OVERVIEW]
The primary objective of this project was to create a fully automated deployment pipeline for a multi-service TODO application with complete infrastructure automation. The solution needed to address several key requirements:
Core Requirements:
- Complete automation from code commit to production deployment
- Infrastructure provisioning using declarative configuration
- Automated configuration management for consistent server setup
- Zero-downtime deployments with SSL/TLS termination
- Drift detection to maintain infrastructure consistency
- Distributed tracing for debugging microservices interactions
- Security-first approach with encrypted secrets and minimal attack surface
Technology Stack Selected:
- Infrastructure as Code: Terraform for AWS resource provisioning
- Configuration Management: Ansible for server configuration and application deployment
- CI/CD Orchestration: GitHub Actions for workflow automation
- Containerization: Docker and Docker Compose for service isolation
- Reverse Proxy: Traefik for routing, load balancing, and automatic SSL
- Observability: Zipkin for distributed request tracing
- Message Queue: Redis for asynchronous log processing
The end goal was a system where infrastructure changes and application updates could be deployed with a single git push, with built-in safety mechanisms including drift detection, email notifications, and manual approval gates for production environments.
[SYSTEM ARCHITECTURE AND DESIGN]
The application architecture follows a microservices pattern: seven application services, each serving a specific purpose, fronted by Traefik as the edge proxy. This polyglot architecture demonstrates real-world complexity, where different services are written in different programming languages based on their specific requirements.
Architecture Diagram
Figure 1: Complete microservices architecture showing all 7 services, their technologies, and data flow between components
Service Responsibilities
1. Frontend Service (Vue.js)
- Single-page application providing the complete user interface
- Communicates with backend APIs via RESTful endpoints
- Implements distributed tracing via Zipkin client
- Served as static assets with client-side routing
2. Auth API (Go)
- Handles user authentication and authorization
- Generates and validates JWT tokens for session management
- Communicates with Users API to validate credentials
- Written in Go for performance and concurrency
- Port: 8081 (internal)
3. Todos API (Node.js)
- Provides full CRUD operations for user TODO items
- Publishes create/delete events to Redis message queue
- Validates JWT tokens for authenticated requests
- Asynchronous, event-driven architecture
- Port: 8082 (internal)
4. Users API (Spring Boot / Java)
- Manages user profiles and account information
- Provides user lookup for authentication service
- Simplified implementation (read-only operations)
- Leverages Spring Boot ecosystem
- Port: 8083 (internal)
5. Log Message Processor (Python)
- Consumes messages from Redis queue
- Processes TODO creation and deletion events
- Logs events to stdout for monitoring/aggregation
- Demonstrates asynchronous processing pattern
- Queue-based, no exposed ports
6. Redis
- In-memory data store used as message queue
- Pub/sub pattern for event broadcasting
- Minimal configuration, Alpine-based image
- Port: 6379 (internal only)
7. Zipkin
- Distributed tracing system for microservices
- Collects timing data from all services
- Provides visualization of request flows
- Helps identify performance bottlenecks
- Port: 9411 (exposed via Traefik)
8. Traefik
- Modern reverse proxy and load balancer
- Automatic service discovery via Docker labels
- Let's Encrypt integration for automatic SSL certificates
- Path-based and host-based routing
- HTTP to HTTPS automatic redirection
- Ports: 80 (HTTP), 443 (HTTPS), 8080 (Dashboard)
Network Architecture
Figure 2: Docker networking showing isolated app-network with Traefik as the only external gateway
All services communicate via a dedicated Docker bridge network named app-network. This provides:
- Network isolation from the host system
- Service-to-service communication using container names (DNS resolution)
- No exposed ports except through Traefik
- Encrypted traffic between external clients and Traefik
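A quick way to sanity-check this wiring on the server is to ask Docker directly. The commands below are an illustrative sketch that assumes the stack is already running and the network is named `app-network`:

```bash
# List networks and confirm the dedicated bridge exists
docker network ls --filter name=app-network

# Show which containers are attached to it
docker network inspect app-network --format '{{range .Containers}}{{.Name}} {{end}}'

# Resolve a service by container name from inside the network
# (uses a throwaway Alpine container; assumes image pulls are allowed)
docker run --rm --network app-network alpine nslookup redis
```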
[INFRASTRUCTURE AS CODE WITH TERRAFORM]
Terraform was chosen for infrastructure provisioning because it provides declarative configuration, state management, and a mature AWS provider with extensive resource coverage.
AWS Resources Provisioned
The Terraform configuration provisions the following AWS resources:
1. EC2 Instance
- Instance Type: Configurable via variable (default: t2.medium recommended)
- AMI: Latest Ubuntu 22.04 LTS (Jammy Jellyfish)
- Automatically tagged for easy identification and billing
- Uses data source to always fetch the latest Ubuntu AMI
```hcl
# Data source ensures we always use the latest Ubuntu AMI
data "aws_ami" "ubuntu" {
most_recent = true
owners = ["099720109477"] # Canonical's AWS account
filter {
name = "name"
values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
}
filter {
name = "virtualization-type"
values = ["hvm"]
}
}
# EC2 Instance resource definition
resource "aws_instance" "todo_app" {
ami = data.aws_ami.ubuntu.id
instance_type = var.instance_type
key_name = aws_key_pair.deployer.key_name
vpc_security_group_ids = [aws_security_group.todo_app.id]
tags = {
Name = "todo-app-server-v2"
Environment = "production"
Project = "hngi13-stage6"
}
}
```
2. Security Group
- Ingress: SSH (22), HTTP (80), HTTPS (443)
- Egress: All traffic allowed (for package downloads, API calls, etc.)
- SSH access restricted to specific CIDR block for security
```hcl
resource "aws_security_group" "todo_app" {
name = "todo-app-sg"
description = "Security group for TODO application"
# HTTP access for initial Let's Encrypt challenges
ingress {
description = "HTTP"
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
# HTTPS for production traffic
ingress {
description = "HTTPS"
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
# SSH restricted to specific CIDR for security
ingress {
description = "SSH"
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = [var.ssh_cidr] # Only allow from specific IP range
}
# Allow all outbound traffic
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Name = "todo-app-sg"
}
}
```
3. SSH Key Pair
- Public key uploaded to AWS for instance access
- Private key stored securely in GitHub Secrets
- Used by both Terraform and Ansible for authentication
```hcl
resource "aws_key_pair" "deployer" {
key_name = var.key_name
public_key = file(var.public_key_path)
}
```
4. Remote State Configuration (S3 + DynamoDB)
- S3 bucket stores Terraform state file with encryption
- DynamoDB table provides state locking to prevent concurrent modifications
- Configured in separate backend.tf file
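The bucket and lock table must exist before `terraform init` can configure the backend. A minimal one-time bootstrap sketch, assuming the same `my-terraform-state-bucket` and `terraform-state-lock` names used in the backend configuration shown later:

```bash
# One-time bootstrap of the remote-state backend (names must match backend.tf)
aws s3api create-bucket --bucket my-terraform-state-bucket --region us-east-1

# Enable versioning so previous state files can be recovered
aws s3api put-bucket-versioning \
  --bucket my-terraform-state-bucket \
  --versioning-configuration Status=Enabled

# DynamoDB table used for state locking (Terraform expects a "LockID" hash key)
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1
```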
Dynamic Ansible Inventory Generation
One of the most elegant aspects of this setup is the automatic generation of Ansible inventory files. Since the EC2 instance's public IP address is only known after Terraform creates it, we need a mechanism to pass this information to Ansible.
```hcl
# Template file for inventory generation
resource "local_file" "ansible_inventory" {
content = templatefile("${path.module}/inventory.tftpl", {
host = aws_instance.todo_app.public_ip
user = var.server_user
key = var.private_key_path
})
filename = "${path.module}/../ansible/inventory/hosts.yml"
}
```
The inventory.tftpl template file looks like this:
```ini
[web]
${host} ansible_user=${user} ansible_ssh_private_key_file=${key}
```
After Terraform applies, this becomes a fully functional Ansible inventory file with the actual IP address populated.
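Before wiring this into CI, it is worth confirming locally that the generated file is a valid inventory. A small sketch, assuming the paths used in this project:

```bash
cd infra/terraform
terraform apply -auto-approve            # writes ../ansible/inventory/hosts.yml

# Show the inventory exactly as Ansible parses it
ansible-inventory -i ../ansible/inventory/hosts.yml --list

# Ad-hoc connectivity test against the [web] group
ansible web -i ../ansible/inventory/hosts.yml -m ping
```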
Integrated Terraform-Ansible Provisioning
To create a truly seamless deployment experience, Terraform automatically triggers Ansible configuration after infrastructure creation. This is achieved using a null_resource with a local-exec provisioner:
```hcl
resource "null_resource" "ansible_provision" {
# Trigger re-provisioning when instance or inventory changes
triggers = {
instance_id = aws_instance.todo_app.id
inventory = local_file.ansible_inventory.content
}
provisioner "local-exec" {
command = <<-EOT
echo "Waiting for SSH to be available..."
# Wait up to 5 minutes for SSH to become available
for i in {1..30}; do
nc -z -w 5 ${aws_instance.todo_app.public_ip} 22 && break
echo "Waiting for port 22... (attempt $i/30)"
sleep 10
done
echo "Running Ansible playbook..."
# Disable host key checking for automated deployments
export ANSIBLE_HOST_KEY_CHECKING=False
export ANSIBLE_CONFIG=${path.module}/../ansible/ansible.cfg
ansible-playbook \
-i ${path.module}/../ansible/inventory/hosts.yml \
${path.module}/../ansible/playbook.yml \
--extra-vars "domain_name=${var.domain_name} email=${var.email}"
EOT
}
depends_on = [
aws_instance.todo_app,
local_file.ansible_inventory
]
}
```
This approach provides several benefits:
- Infrastructure and configuration are provisioned in a single Terraform apply
- No manual intervention required between infrastructure and configuration steps
- SSH availability check prevents Ansible from failing on a booting instance
- Extra variables (domain, email) are passed from Terraform to Ansible seamlessly
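In practice, the whole provision-and-configure flow then collapses into a single command. The invocation below is a sketch; the variable names match those referenced by the pipeline, but the domain and email values are placeholders:

```bash
cd infra/terraform
terraform init

# One command: provisions AWS resources, waits for SSH, then runs Ansible
terraform apply \
  -var "domain_name=example.com" \
  -var "email=admin@example.com" \
  -var "public_key_path=./deployer_key.pub" \
  -var "private_key_path=./deployer_key"
```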
[CI/CD PIPELINE IMPLEMENTATION WITH GITHUB ACTIONS]
Figure 3: Complete CI/CD pipeline showing infrastructure and application deployment workflows with drift detection and manual approval gates
GitHub Actions provides the orchestration layer for our CI/CD pipeline. Two separate workflows handle infrastructure changes and application deployments respectively.
Infrastructure Pipeline (infra.yml)
This workflow implements a sophisticated drift detection and approval mechanism:
```yaml
name: Infrastructure Pipeline
on:
push:
paths:
- "infra/terraform/"
- "infra/ansible/"
jobs:
terraform-plan:
runs-on: ubuntu-latest
outputs:
drift_detected: ${{ steps.plan.outputs.exitcode == 2 }}
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: "us-east-1"
steps:
- uses: actions/checkout@v3
- name: Setup Terraform
uses: hashicorp/setup-terraform@v2
with:
terraform_wrapper: false # Allows capturing raw output
- name: Terraform Init
run: terraform init
working-directory: infra/terraform
- name: Create SSH Keys for Plan
run: |
echo "${{ secrets.SSH_PUBLIC_KEY }}" > infra/terraform/deployer_key.pub
echo "${{ secrets.SSH_PRIVATE_KEY }}" > infra/terraform/deployer_key
chmod 600 infra/terraform/deployer_key
- name: Terraform Plan
id: plan
run: |
exit_code=0
terraform plan -detailed-exitcode -out=tfplan || exit_code=$?
echo "exitcode=$exit_code" >> $GITHUB_OUTPUT
if [ $exit_code -eq 2 ]; then
echo "Infrastructure drift detected!"
elif [ $exit_code -eq 1 ]; then
echo "Terraform plan failed with errors"
exit 1
else
echo "No infrastructure changes detected"
fi
working-directory: infra/terraform
env:
TF_VAR_public_key_path: "${{ github.workspace }}/infra/terraform/deployer_key.pub"
TF_VAR_private_key_path: "${{ github.workspace }}/infra/terraform/deployer_key"
TF_VAR_domain_name: ${{ secrets.DOMAIN_NAME }}
TF_VAR_email: ${{ secrets.ACME_EMAIL }}
- name: Upload Terraform Plan
uses: actions/upload-artifact@v4
with:
name: tfplan
path: infra/terraform/tfplan
- name: Send Email on Drift
if: steps.plan.outputs.exitcode == 2
uses: dawidd6/action-send-mail@v3
with:
server_address: smtp.gmail.com
server_port: 465
username: ${{ secrets.MAIL_USERNAME }}
password: ${{ secrets.MAIL_PASSWORD }}
subject: "Infrastructure Drift Detected: Manual Review Required"
html_body: |
<h3>Terraform Drift Detected</h3>
<p>Infrastructure changes have been detected for <b>${{ github.repository }}</b>.</p>
<p>Please review the Terraform plan and approve the deployment to apply changes.</p>
<p>
<a href="${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
style="background-color: #2ea44f; color: white; padding: 10px 20px;
text-decoration: none; border-radius: 5px;">
View Plan & Approve Deployment
</a>
</p>
to: ${{ secrets.MAIL_TO }}
from: GitHub Actions CI/CD
terraform-apply:
needs: terraform-plan
if: needs.terraform-plan.outputs.drift_detected == 'true'
runs-on: ubuntu-latest
environment: production # Requires manual approval
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: "us-east-1"
steps:
- uses: actions/checkout@v3
- name: Setup Terraform
uses: hashicorp/setup-terraform@v2
- name: Create SSH Keys
run: |
echo "${{ secrets.SSH_PUBLIC_KEY }}" > infra/terraform/deployer_key.pub
echo "${{ secrets.SSH_PRIVATE_KEY }}" > infra/terraform/deployer_key
chmod 600 infra/terraform/deployer_key
- name: Verify SSH Key Format
run: |
chmod 600 infra/terraform/deployer_key
ssh-keygen -l -f infra/terraform/deployer_key || \
echo "::error::SSH Private Key is invalid! Check your GitHub Secret."
- name: Terraform Init
run: terraform init
working-directory: infra/terraform
- name: Download Terraform Plan
uses: actions/download-artifact@v4
with:
name: tfplan
path: infra/terraform
- name: Terraform Apply
run: terraform apply -auto-approve tfplan
working-directory: infra/terraform
env:
TF_VAR_public_key_path: "${{ github.workspace }}/infra/terraform/deployer_key.pub"
TF_VAR_private_key_path: "${{ github.workspace }}/infra/terraform/deployer_key"
TF_VAR_domain_name: ${{ secrets.DOMAIN_NAME }}
TF_VAR_email: ${{ secrets.ACME_EMAIL }}
```
Pipeline Features Explained:
1. Trigger Conditions:
   - The workflow is triggered on `push` events to the `infra/terraform/**` or `infra/ansible/**` paths. This ensures that any change to infrastructure code or Ansible playbooks automatically initiates a plan.
2. Terraform Plan Job (`terraform-plan`):
   - `runs-on: ubuntu-latest`: Executes on a fresh Ubuntu runner.
   - AWS Credentials: `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are injected from GitHub Secrets, giving the workflow permission to interact with AWS.
   - `actions/checkout@v3`: Checks out the repository content.
   - `hashicorp/setup-terraform@v2`: Installs the specified Terraform version. `terraform_wrapper: false` is crucial here to allow capturing the raw exit code from `terraform plan`.
   - Terraform Init: Initializes the Terraform working directory, downloading providers and setting up the S3 backend for state management.
   - Create SSH Keys for Plan: Dynamically creates the `deployer_key.pub` and `deployer_key` files from GitHub Secrets. These are needed for Terraform to pass the public key to AWS and for the `null_resource` to use the private key for Ansible. `chmod 600` is applied to the private key for security.
   - Terraform Plan (`id: plan`): Executes `terraform plan -detailed-exitcode -out=tfplan`. The `-detailed-exitcode` option is key for drift detection:
     - `0`: No changes; infrastructure matches state.
     - `1`: An error occurred.
     - `2`: Changes detected; infrastructure differs from state.
     The exit code is captured and exposed as the `drift_detected` output for subsequent jobs. Terraform variables (`TF_VAR_public_key_path`, `TF_VAR_private_key_path`, `TF_VAR_domain_name`, `TF_VAR_email`) are passed as environment variables, which in turn are populated from GitHub Secrets.
   - Upload Terraform Plan: The generated `tfplan` file (which contains the proposed changes) is uploaded as an artifact. This allows the `terraform-apply` job to use the exact same plan, preventing "plan drift" between the plan and apply stages.
   - Send Email on Drift: This step runs with `if: steps.plan.outputs.exitcode == 2`, meaning it only executes when drift is detected. It uses `dawidd6/action-send-mail@v3` to notify a configured address (`MAIL_TO` from secrets), and the email includes a direct link to the GitHub Actions run, prompting a manual review.
3. Terraform Apply Job (`terraform-apply`):
   - `needs: terraform-plan`: This job depends on the `terraform-plan` job completing successfully.
   - `if: needs.terraform-plan.outputs.drift_detected == 'true'`: This job only runs if drift was detected in the planning phase. If there is no drift, there is nothing to apply.
   - `environment: production`: A critical safety feature. GitHub Environments allow protection rules such as requiring manual approval before the workflow can proceed, acting as a "human in the loop" for production infrastructure changes.
   - AWS Credentials: Same as the plan job.
   - `actions/checkout@v3`: Checks out the repository.
   - Setup Terraform: Installs Terraform.
   - Create SSH Keys: Recreates the SSH key files, as runners are ephemeral.
   - Verify SSH Key Format: A defensive step to ensure the private key from secrets is valid before attempting to use it, catching misconfigurations early.
   - Terraform Init: Initializes Terraform.
   - Download Terraform Plan: Downloads the `tfplan` artifact generated by the `terraform-plan` job, ensuring the `apply` operation is based on the exact plan that was reviewed.
   - Terraform Apply: Executes `terraform apply -auto-approve tfplan`. The `-auto-approve` flag is acceptable because the manual approval on the `production` environment already serves as the explicit approval, and passing the `tfplan` file directly guarantees that only the planned changes are applied. Terraform variables are passed as in the plan step.
This infrastructure pipeline provides a robust, secure, and auditable process for managing infrastructure changes, incorporating drift detection, email notifications, and manual approval for critical environments.
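The same drift check the pipeline relies on can be reproduced locally, which is handy when debugging why a run flagged (or failed to flag) drift. A minimal sketch:

```bash
cd infra/terraform

terraform plan -detailed-exitcode -out=tfplan
status=$?   # 0 = no changes, 1 = error, 2 = drift/changes detected

if [ "$status" -eq 2 ]; then
  echo "Drift detected - review tfplan before applying"
elif [ "$status" -eq 1 ]; then
  echo "Plan failed" >&2
  exit 1
else
  echo "Infrastructure matches state"
fi
```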
Deployment Pipeline (deploy.yml)
The application deployment pipeline handles code changes and responds to infrastructure updates:
```yaml
name: Application Deployment
on:
workflow_run:
workflows: ["Infrastructure Pipeline"]
types: [completed]
push:
paths:
- "frontend/"
- "auth-api/"
- "todos-api/"
- "users-api/"
- "log-message-processor/**"
- "docker-compose.yml"
jobs:
deploy:
runs-on: ubuntu-latest
if: ${{ github.event.workflow_run.conclusion == 'success' || github.event_name == 'push' }}
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: "us-east-1"
steps:
- uses: actions/checkout@v3
- name: Get Server IP Dynamically
id: get-ip
run: |
IP=$(aws ec2 describe-instances \
--filters "Name=tag:Name,Values=todo-app-server-v2" \
"Name=instance-state-name,Values=running" \
--query "Reservations[*].Instances[*].PublicIpAddress" \
--output text)
echo "SERVER_IP=$IP" >> $GITHUB_ENV
echo "Deploying to instance at IP: $IP"
- name: Deploy via Ansible
uses: dawidd6/action-ansible-playbook@v2
with:
playbook: infra/ansible/playbook.yml
directory: ./
key: ${{ secrets.SSH_PRIVATE_KEY }}
inventory: |
[web]
${{ env.SERVER_IP }} ansible_user=ubuntu
options: |
--extra-vars "domain_name=${{ secrets.DOMAIN_NAME }} email=${{ secrets.ACME_EMAIL }}"
env:
ANSIBLE_CONFIG: infra/ansible/ansible.cfg
```
Key Aspects:
Trigger Conditions:
- `workflow_run`: This pipeline is triggered when the Infrastructure Pipeline completes. This ensures that if infrastructure changes (e.g., a new EC2 instance is provisioned), the application deployment automatically follows. The `types: [completed]` filter makes it run after the workflow finishes.
- `push`: It also triggers on direct pushes to specific application code directories (`frontend/**`, `auth-api/**`, etc.) or the `docker-compose.yml` file. This allows rapid iteration on application code without requiring an infrastructure change.
- `if` condition: `if: ${{ github.event.workflow_run.conclusion == 'success' || github.event_name == 'push' }}` ensures the deployment only proceeds if the infrastructure pipeline was successful (when triggered by `workflow_run`) or if it is a direct code push.
Dynamic IP Resolution:
- Get Server IP Dynamically: This step uses the AWS CLI to find the public IP address of the EC2 instance.
  - It filters instances by the `Name` tag (`todo-app-server-v2`) and ensures the instance is `running`.
  - The JMESPath expression (`--query "Reservations[*].Instances[*].PublicIpAddress"`) extracts the IP address.
  - The IP is stored in a GitHub Actions environment variable (`SERVER_IP`) for use in subsequent steps. This is crucial because the EC2 instance's IP might change if it is stopped and started, or if a new instance replaces an old one.
Inline Inventory:
- Deploy via Ansible: This step uses the `dawidd6/action-ansible-playbook@v2` action to run the Ansible playbook.
  - `playbook: infra/ansible/playbook.yml`: Specifies the main Ansible playbook.
  - `key: ${{ secrets.SSH_PRIVATE_KEY }}`: The SSH private key is securely passed from GitHub Secrets, allowing Ansible to connect to the EC2 instance.
  - `inventory: | ...`: Instead of a static inventory file, an inline inventory is generated using the dynamically fetched `SERVER_IP`. This makes the deployment resilient to IP changes; `ansible_user=ubuntu` specifies the SSH user.
  - `options: | --extra-vars ...`: Additional variables like `domain_name` and `email` are passed to Ansible from GitHub Secrets, ensuring consistency across the pipeline.
  - `ANSIBLE_CONFIG: infra/ansible/ansible.cfg`: Points Ansible to a custom configuration file if needed.
This application deployment pipeline is designed for efficiency and reliability, automatically reacting to both infrastructure and code changes, and dynamically adapting to the current state of the infrastructure.
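After a deployment run, the same dynamic lookup can be reused for a quick manual smoke test. This is a sketch, assuming the AWS CLI is configured locally, `DOMAIN_NAME` is exported, and the private key is available as `./deployer_key`:

```bash
# Resolve the current server IP exactly as the pipeline does
IP=$(aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=todo-app-server-v2" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[*].Instances[*].PublicIpAddress" \
  --output text)

# Confirm the stack is up and the frontend answers over HTTPS
ssh -i ./deployer_key "ubuntu@$IP" "cd /opt/todo-app && docker compose ps"
curl -skI "https://$DOMAIN_NAME/" | head -n 1
```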
[CONFIGURATION MANAGEMENT WITH ANSIBLE]
Ansible handles all server configuration and application deployment tasks. The playbook is structured using roles for modularity and reusability.
Playbook Structure
The main Ansible playbook (infra/ansible/playbook.yml) orchestrates the execution of different roles:
```yaml
- hosts: web
become: yes # Run tasks with sudo privileges
vars:
project_root: /opt/todo-app # Base directory for the application
repo_url: https://github.com/PrimoCrypt/DevOps-Stage-6.git # Application repository URL
# jwt_secret is passed via --extra-vars from GitHub Actions
# domain_name and email are also passed via --extra-vars
roles:
- dependencies # Installs Docker, Docker Compose, Git, configures firewall
- deploy # Clones repo, creates .env, runs docker-compose
```
This structure clearly separates concerns: the `dependencies` role sets up the server environment, and the `deploy` role handles the application-specific deployment.
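Before handing the playbook to CI, it can be exercised from a workstation in check mode. A sketch, assuming the generated inventory exists and using placeholder values for the extra variables:

```bash
ansible-playbook \
  -i infra/ansible/inventory/hosts.yml \
  infra/ansible/playbook.yml \
  --check --diff \
  --extra-vars "domain_name=example.com email=admin@example.com jwt_secret=dummy"
```

Check mode reports what would change without applying it, which surfaces syntax and connectivity problems early.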
Dependencies Role
The dependencies role (infra/ansible/roles/dependencies/tasks/main.yml) ensures all prerequisite software is installed and configured on the EC2 instance. This makes the instance ready to host Dockerized applications.
Tasks performed:
- System package updates: Ensures the system is up to date (`apt update && apt upgrade`).
- Docker Engine installation: Installs the latest stable Docker CE (Community Edition) by adding Docker's official GPG key and repository.
- Docker Compose installation: Downloads and installs the latest Docker Compose binary (v2.x) to `/usr/local/bin`.
- Git installation: Installs Git for cloning the application repository.
- UFW firewall configuration: Configures the Uncomplicated Firewall (UFW) to allow SSH (port 22), HTTP (port 80), and HTTPS (port 443) traffic, then enables the firewall.
- Docker service enablement and startup: Ensures the Docker daemon starts automatically on boot and is currently running.
- User permissions for the Docker socket: Adds the `ubuntu` user to the `docker` group, allowing it to run Docker commands without `sudo`. This requires a reboot or re-login to take effect, which is handled implicitly by subsequent SSH connections.
Example task (abbreviated):
```yaml
- name: Install Docker dependencies
  ansible.builtin.apt:
    name:
      - apt-transport-https
      - ca-certificates
      - curl
      - gnupg
      - lsb-release
    state: present
    update_cache: yes

- name: Add Docker GPG key
  ansible.builtin.apt_key:
    url: https://download.docker.com/linux/ubuntu/gpg
    state: present

- name: Add Docker APT repository
  ansible.builtin.apt_repository:
    repo: "deb [arch={{ 'amd64' if ansible_architecture == 'x86_64' else ansible_architecture }}] https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable"
    state: present

- name: Install Docker Engine
  ansible.builtin.apt:
    name:
      - docker-ce
      - docker-ce-cli
      - containerd.io
      - docker-buildx-plugin   # For buildx support
      - docker-compose-plugin  # For docker compose v2
    state: present

# Legacy standalone Docker Compose binary (v1-style install). Usually not needed
# when docker-compose-plugin is installed; kept for older systems or specific needs.
- name: Install Docker Compose binary
  ansible.builtin.get_url:
    url: https://github.com/docker/compose/releases/download/v2.20.0/docker-compose-linux-x86_64
    dest: /usr/local/bin/docker-compose
    mode: "0755"
  when: false  # Disable this if using docker-compose-plugin

- name: Ensure Docker service is running and enabled
  ansible.builtin.systemd:
    name: docker
    state: started
    enabled: yes

- name: Add 'ubuntu' user to the 'docker' group
  ansible.builtin.user:
    name: ubuntu
    groups: docker
    append: yes

- name: Configure UFW to allow SSH, HTTP, HTTPS
  community.general.ufw:
    rule: allow
    port: "{{ item }}"
    proto: tcp
  loop:
    - "22"
    - "80"
    - "443"

- name: Enable UFW
  community.general.ufw:
    state: enabled
```
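Once the role has run, the result can be verified over SSH with a few standard commands; this is a quick illustrative checklist rather than part of the playbook:

```bash
docker --version
docker compose version          # v2 plugin installed alongside the engine
git --version
sudo ufw status verbose         # expect 22, 80, 443 allowed
id ubuntu                       # expect 'docker' among the groups
sudo systemctl is-enabled docker && sudo systemctl is-active docker
```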
Deploy Role
The deploy role (infra/ansible/roles/deploy/tasks/main.yml) handles the actual application deployment and lifecycle management.
Workflow:
- Clone or update the application repository: Uses the `git` module to clone the repository if it doesn't exist, or pull the latest changes if it does. This ensures the server always has the most recent application code.
- Generate the `.env` file: Creates an `.env` file in the application's root directory using Ansible's `template` module. This file contains environment variables required by the Docker Compose services, such as `DOMAIN_NAME`, `ACME_EMAIL`, and `JWT_SECRET`.
- Stop existing containers (if running): The `docker_compose` module handles this implicitly when `state: present` and `pull: yes` are used, as it recreates containers whose images have changed.
- Pull latest Docker images: `pull: yes` in `docker_compose` ensures that the latest images for all services are downloaded from Docker Hub or a private registry.
- Start containers with `docker-compose`: The `community.docker.docker_compose` module orchestrates the startup of all services defined in `docker-compose.yml`.
- Verify service health: While not explicitly shown in the playbook snippet, a production setup would include tasks to wait for services to become healthy (e.g., using the `wait_for` module or container health checks).
Environment file generation:
```yaml
- name: Ensure project root directory exists
  ansible.builtin.file:
    path: "{{ project_root }}"
    state: directory
    mode: "0755"

- name: Clone or update application repository
  ansible.builtin.git:
    repo: "{{ repo_url }}"
    dest: "{{ project_root }}"
    version: master  # Or a specific branch/tag
    update: yes
    force: yes       # Force update in case of local changes

- name: Create .env file from template
  ansible.builtin.template:
    src: env.j2  # Template located in infra/ansible/roles/deploy/templates/
    dest: "{{ project_root }}/.env"
    mode: "0600"  # Secure permissions for sensitive environment variables

- name: Start application services with Docker Compose
  community.docker.docker_compose:
    project_src: "{{ project_root }}"
    state: present  # Ensures services are running
    pull: yes       # Pulls latest images before starting
    build: yes      # Builds images if necessary (e.g., local Dockerfiles)
```
The env.j2 template (infra/ansible/roles/deploy/templates/env.j2) injects runtime configuration:
```
DOMAIN_NAME={{ domain_name }}
ACME_EMAIL={{ email }}
JWT_SECRET={{ jwt_secret }}
```
The jwt_secret variable would typically be passed as an --extra-var from GitHub Actions, similar to domain_name and email, ensuring it's never hardcoded in the repository.
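A hedged sketch of how that might look when running the playbook by hand: the secret lives only in the environment of the invoking shell, never in the repository (the `openssl rand` line is just one way to mint a throwaway value):

```bash
# JWT_SECRET is assumed to be set in the environment, never committed to the repo
export JWT_SECRET="$(openssl rand -hex 32)"

ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbook.yml \
  --extra-vars "domain_name=example.com email=admin@example.com jwt_secret=${JWT_SECRET}"
```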
[CONTAINER ORCHESTRATION WITH DOCKER COMPOSE]
Docker Compose orchestrates all eight services (the seven application services plus Traefik) with a single configuration file (docker-compose.yml). This file defines the services, their dependencies, network configurations, and how they interact with Traefik.
Complete Docker Compose Configuration
```yaml
version: "3.8"
services:
traefik:
image: traefik:v3.6
command:
- "--api.insecure=true" # Enable Traefik dashboard (for debugging, disable in prod)
- "--providers.docker=true" # Enable Docker provider
- "--providers.docker.exposedbydefault=false" # Only expose services with traefik.enable=true
- "--entrypoints.web.address=:80" # HTTP entrypoint
- "--entrypoints.websecure.address=:443" # HTTPS entrypoint
- "--certificatesresolvers.myresolver.acme.tlschallenge=true" # Use TLS challenge for Let's Encrypt
- "--certificatesresolvers.myresolver.acme.email=${ACME_EMAIL}" # Email for Let's Encrypt notifications
- "--certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json" # Storage for certificates
ports:
- "80:80" # Expose HTTP
- "443:443" # Expose HTTPS
- "8080:8080" # Expose Traefik dashboard (for debugging, disable in prod)
volumes:
- "./letsencrypt:/letsencrypt" # Persistent storage for Let's Encrypt certificates
- "/var/run/docker.sock:/var/run/docker.sock:ro" # Mount Docker socket for service discovery
networks:
- app-network
restart: unless-stopped # Always restart unless explicitly stopped
frontend:
build: ./frontend # Build from local Dockerfile
image: frontend:latest # Tag the built image
container_name: frontend
labels:
- "traefik.enable=true"
- "traefik.http.routers.frontend.rule=Host(${DOMAIN_NAME})" # Route based on domain name
- "traefik.http.routers.frontend.entrypoints=websecure" # Use HTTPS entrypoint
- "traefik.http.routers.frontend.tls.certresolver=myresolver" # Use Let's Encrypt resolver
- "traefik.http.routers.frontend-http.rule=Host(${DOMAIN_NAME})" # HTTP router for redirect
- "traefik.http.routers.frontend-http.entrypoints=web" # Use HTTP entrypoint
- "traefik.http.routers.frontend-http.middlewares=https-redirect" # Apply HTTPS redirect middleware
- "traefik.http.middlewares.https-redirect.redirectscheme.scheme=https" # Define HTTPS redirect
networks:
- app-network
restart: unless-stopped
auth-api:
build: ./auth-api
image: auth-api:latest
container_name: auth-api
environment:
- USERS_API_ADDRESS=http://users-api:8083 # Internal service communication
- JWT_SECRET=${JWT_SECRET} # Injected from .env
- AUTH_API_PORT=8081
- ZIPKIN_URL=http://zipkin:9411/api/v2/spans # Zipkin endpoint
labels:
- "traefik.enable=true"
- "traefik.http.routers.auth-api.rule=Host(${DOMAIN_NAME}) && PathPrefix(/api/auth)" # Route by path prefix
- "traefik.http.routers.auth-api.entrypoints=websecure"
- "traefik.http.routers.auth-api.tls.certresolver=myresolver"
- "traefik.http.middlewares.auth-strip.stripprefix.prefixes=/api/auth" # Strip path prefix before forwarding
- "traefik.http.routers.auth-api.middlewares=auth-strip"
networks:
- app-network
restart: unless-stopped
todos-api:
build: ./todos-api
image: todos-api:latest
container_name: todos-api
environment:
- REDIS_HOST=redis
- REDIS_PORT=6379
- REDIS_CHANNEL=log_channel
- TODO_API_PORT=8082
- JWT_SECRET=${JWT_SECRET}
- ZIPKIN_URL=http://zipkin:9411/api/v2/spans
labels:
- "traefik.enable=true"
- "traefik.http.routers.todos-api.rule=Host(${DOMAIN_NAME}) && PathPrefix(/api/todos)"
- "traefik.http.routers.todos-api.entrypoints=websecure"
- "traefik.http.routers.todos-api.tls.certresolver=myresolver"
- "traefik.http.middlewares.todos-strip.stripprefix.prefixes=/api"
- "traefik.http.routers.todos-api.middlewares=todos-strip"
networks:
- app-network
depends_on:
- redis # Ensure Redis starts before Todos API
restart: unless-stopped
users-api:
build: ./users-api
image: users-api:latest
container_name: users-api
environment:
- SERVER_PORT=8083
- JWT_SECRET=${JWT_SECRET}
- SPRING_ZIPKIN_BASE_URL=http://zipkin:9411/ # Spring Boot specific Zipkin config
labels:
- "traefik.enable=true"
- "traefik.http.routers.users-api.rule=Host(${DOMAIN_NAME}) && PathPrefix(/api/users)"
- "traefik.http.routers.users-api.entrypoints=websecure"
- "traefik.http.routers.users-api.tls.certresolver=myresolver"
- "traefik.http.middlewares.users-strip.stripprefix.prefixes=/api"
- "traefik.http.routers.users-api.middlewares=users-strip"
networks:
- app-network
restart: unless-stopped
log-message-processor:
build: ./log-message-processor
image: log-message-processor:latest
container_name: log-message-processor
environment:
- REDIS_HOST=redis
- REDIS_PORT=6379
- REDIS_CHANNEL=log_channel
- ZIPKIN_URL=http://zipkin:9411/api/v2/spans
networks:
- app-network
depends_on:
- redis
restart: unless-stopped
redis:
image: redis:alpine # Lightweight Redis image
container_name: redis
networks:
- app-network
restart: unless-stopped
zipkin:
image: openzipkin/zipkin # Official Zipkin image
container_name: zipkin
ports:
- "9411:9411" # Expose Zipkin UI internally (Traefik handles external access)
networks:
- app-network
labels:
- "traefik.enable=true"
- "traefik.http.routers.zipkin.rule=Host(${DOMAIN_NAME}) && PathPrefix(/api/zipkin)"
- "traefik.http.routers.zipkin.entrypoints=websecure"
- "traefik.http.routers.zipkin.tls.certresolver=myresolver"
- "traefik.http.middlewares.zipkin-strip.stripprefix.prefixes=/api/zipkin"
- "traefik.http.routers.zipkin.middlewares=zipkin-strip"
restart: unless-stopped
networks:
app-network:
driver: bridge # Custom bridge network for inter-service communication
```
Traefik Configuration Explained
Figure 4: Traefik request routing showing path-based routing, SSL termination, and Let's Encrypt integration
Traefik leverages Docker labels for dynamic service discovery and routing configuration. This eliminates the need for manual configuration file updates when services are added, removed, or updated, making it far more agile than traditional reverse proxies like Nginx.
Label breakdown for frontend service:
```yaml
labels:
  # 1. Enable Traefik for this container
  - "traefik.enable=true"
  # 2. HTTPS router configuration for the main domain
  - "traefik.http.routers.frontend.rule=Host(`example.com`)"                # Matches requests for the specified domain
  - "traefik.http.routers.frontend.entrypoints=websecure"                   # Listens on the HTTPS entrypoint (port 443)
  - "traefik.http.routers.frontend.tls.certresolver=myresolver"             # Uses the Let's Encrypt certificate resolver
  # 3. HTTP router for automatic redirection to HTTPS
  - "traefik.http.routers.frontend-http.rule=Host(`example.com`)"           # Matches requests for the domain on HTTP
  - "traefik.http.routers.frontend-http.entrypoints=web"                    # Listens on the HTTP entrypoint (port 80)
  - "traefik.http.routers.frontend-http.middlewares=https-redirect"         # Applies the 'https-redirect' middleware
  # 4. Middleware definition for the HTTPS redirect
  - "traefik.http.middlewares.https-redirect.redirectscheme.scheme=https"   # Configures the middleware to redirect to HTTPS
```
Benefits of this approach:
- No configuration file reloads required: Traefik automatically detects changes to Docker labels and updates its routing table in real-time.
- Services can be added/removed without Traefik downtime: This enables true zero-downtime deployments and dynamic scaling.
- SSL certificates automatically provisioned and renewed: Let's Encrypt integration handles the entire lifecycle of SSL certificates.
- Path-based routing allows multiple services on one domain: Different microservices can be exposed under different URL paths on the same domain (e.g., `/api/auth`, `/api/todos`).
- Middleware support for transformations: Traefik middlewares can perform path stripping, authentication, rate limiting, and more before forwarding requests to the backend service.
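The redirect, certificate, and path-routing behaviour can all be observed from the outside with curl and openssl. The checks below are illustrative and assume the application is served at `example.com`:

```bash
# HTTP should answer with a redirect issued by the https-redirect middleware
curl -sI "http://example.com/" | head -n 3

# HTTPS requests are terminated by Traefik and routed by path prefix
curl -skI "https://example.com/" | head -n 1                                 # frontend
curl -sk "https://example.com/api/todos" -o /dev/null -w "%{http_code}\n"    # todos-api (likely 401 without a JWT)

# Inspect the certificate Traefik obtained from Let's Encrypt
openssl s_client -connect example.com:443 -servername example.com </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -dates
```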
Service Routing Table
This table summarizes how external requests are routed to internal services via Traefik:
| URL Path | Service | Backend Port | Traefik Middleware |
|---|---|---|---|
| `https://domain.com/` | frontend | 80 (internal) | N/A |
| `https://domain.com/api/auth/*` | auth-api | 8081 | `auth-strip` (strips `/api/auth`) |
| `https://domain.com/api/todos/*` | todos-api | 8082 | `todos-strip` (strips `/api`) |
| `https://domain.com/api/users/*` | users-api | 8083 | `users-strip` (strips `/api`) |
| `https://domain.com/api/zipkin/*` | zipkin | 9411 | `zipkin-strip` (strips `/api/zipkin`) |
| `http://domain.com/*` | N/A | N/A | `https-redirect` (redirects to HTTPS) |
[SECURITY IMPLEMENTATION AND BEST PRACTICES]
Security was a primary consideration throughout this project, implemented at multiple layers from infrastructure to application.
Secrets Management Strategy
All sensitive information is stored and managed securely, never committed directly to the repository.
GitHub Secrets Used:
- `AWS_ACCESS_KEY_ID`: AWS programmatic access key for GitHub Actions.
- `AWS_SECRET_ACCESS_KEY`: AWS secret key corresponding to the access key.
- `SSH_PUBLIC_KEY`: The public part of the SSH key pair used for EC2 instance creation.
- `SSH_PRIVATE_KEY`: The private part of the SSH key pair used by Terraform and Ansible for SSH connections.
- `DOMAIN_NAME`: The production domain name for the application (e.g., yourdomain.com).
- `ACME_EMAIL`: Email address for Let's Encrypt certificate notifications.
- `MAIL_USERNAME`: SMTP username for sending drift detection emails.
- `MAIL_PASSWORD`: SMTP password for sending drift detection emails.
- `MAIL_TO`: Recipient email address for drift detection alerts.
- `JWT_SECRET`: Secret key used for signing and verifying JSON Web Tokens across microservices.
Secret Rotation Strategy:
- Dedicated Keys: SSH keys and AWS IAM credentials are generated specifically for this CI/CD pipeline, limiting their scope.
- Least Privilege: The AWS IAM user associated with `AWS_ACCESS_KEY_ID` has only the minimum permissions required to provision and manage the specified resources.
- Regular Rotation: Rotating all secrets (AWS keys, SSH keys, JWT secrets) every 90 days is recommended to minimize the impact of a potential compromise.
- Environment-Specific Secrets: For multi-environment setups, separate secrets would be maintained for `dev`, `staging`, and `prod` to further isolate environments.
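Rotation is easiest to keep honest when it is scriptable. A sketch using the GitHub CLI (assuming `gh` is authenticated against the repository and the new values are already available in environment variables or local files):

```bash
gh secret set AWS_ACCESS_KEY_ID     --body "$NEW_ACCESS_KEY_ID"
gh secret set AWS_SECRET_ACCESS_KEY --body "$NEW_SECRET_ACCESS_KEY"
gh secret set JWT_SECRET            --body "$(openssl rand -hex 32)"
gh secret set SSH_PRIVATE_KEY       < deployer_key       # freshly generated key pair
gh secret set SSH_PUBLIC_KEY        < deployer_key.pub
```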
Network Security
AWS Security Group Rules:
The EC2 instance's security group is configured with strict inbound rules:
```
Inbound Rules:
- SSH (Port 22): Allowed only from a specific CIDR block (e.g., your office IP, VPN IP). This prevents unauthorized SSH access.
- HTTP (Port 80): Allowed from anywhere (0.0.0.0/0). This is necessary for Traefik to handle initial Let's Encrypt challenges and HTTP to HTTPS redirection.
- HTTPS (Port 443): Allowed from anywhere (0.0.0.0/0). This is for the main application traffic.
Outbound Rules:
- All Traffic: Allowed to anywhere (0.0.0.0/0). This is necessary for the instance to download packages, pull Docker images, and make API calls to AWS services.
```
Docker Network Isolation:
- Dedicated Bridge Network: All Docker services run within a custom bridge network (`app-network`). This isolates them from the host's network and from other Docker networks.
- Internal Communication: Services communicate with each other using their container names (e.g., `http://redis:6379`), which are resolved by Docker's internal DNS.
- Minimal Port Exposure: Only Traefik exposes ports (80, 443, 8080) to the host machine. All other services are only accessible internally within the `app-network`, significantly reducing the attack surface.
- Encrypted External Traffic: All external traffic to the application is forced over HTTPS, encrypted by Traefik.
app-network, significantly reducing the attack surface. - Encrypted External Traffic: All external traffic to the application is forced over HTTPS, encrypted by Traefik.
SSL/TLS Implementation
Let's Encrypt via Traefik:
- Automatic Certificate Provisioning: Traefik is configured to automatically obtain and renew SSL certificates from Let's Encrypt using the TLS-ALPN-01 challenge.
- Persistent Storage: Certificates are stored in a persistent volume (`./letsencrypt:/letsencrypt`), ensuring they survive container restarts.
- No Wildcard Certificates: For enhanced security, specific certificates are obtained for the primary domain, rather than using wildcard certificates, which have a broader attack surface.
- HTTP to HTTPS Redirection: Traefik automatically redirects all HTTP traffic to HTTPS, ensuring all communication is encrypted.
Application Security Measures
- JWT Token Authentication:
- The `auth-api` generates and validates JSON Web Tokens (JWTs) for user sessions.
- Tokens have configurable expiration times.
- A shared `JWT_SECRET` (injected via `.env`) is used across services for token validation, ensuring only authorized services can verify tokens.
- All API endpoints requiring authentication enforce the presence and validity of JWTs in the `Authorization` header.
- Input Validation:
- Each API service is responsible for validating incoming request payloads to prevent common vulnerabilities like injection attacks (though no SQL DB is used here, the principle applies).
- Frontend input is also validated client-side and server-side.
- CORS Configuration:
- The frontend and backend APIs are served from the same domain (different paths), eliminating the need for complex Cross-Origin Resource Sharing (CORS) configurations and potential misconfigurations.
- Firewall Configuration (UFW):
- The Uncomplicated Firewall (UFW) is configured on the EC2 instance to provide an additional layer of host-level network security:
```bash
ufw default deny incoming    # Deny all incoming traffic by default
ufw default allow outgoing   # Allow all outgoing traffic
ufw allow 22/tcp             # Allow SSH
ufw allow 80/tcp             # Allow HTTP
ufw allow 443/tcp            # Allow HTTPS
ufw enable                   # Enable the firewall
```
  - This ensures that only explicitly allowed ports are open, even if security group rules were to be misconfigured.
- The Uncomplicated Firewall (UFW) is configured on the EC2 instance to provide an additional layer of host-level network security:
[OBSERVABILITY AND DISTRIBUTED TRACING]
Observability is crucial for understanding the behavior of microservices in production. Zipkin is integrated to provide distributed tracing, allowing us to visualize and analyze request flows across all services.
Zipkin Integration Example (Frontend)
Each service is instrumented to send trace data to the Zipkin collector. Here's an example from the Vue.js frontend:
```javascript
// frontend/src/zipkin.js
import { Tracer, ExplicitContext, BatchRecorder } from "zipkin";
import { HttpLogger } from "zipkin-transport-http";
const tracer = new Tracer({
ctxImpl: new ExplicitContext(), // Manages the current span context
recorder: new BatchRecorder({
// Buffers spans and sends them in batches
logger: new HttpLogger({
// Sends spans over HTTP
endpoint: `${process.env.VUE_APP_API_URL}/api/zipkin`, // Zipkin collector endpoint via Traefik
jsonEncoder: JSON.stringify, // Encodes spans as JSON
}),
}),
localServiceName: "frontend", // Name of this service in traces
supportsJoin: false, // Frontend typically starts new traces
});
export default tracer;
```
Similar instrumentation is applied to the Go, Node.js, Java, and Python services, ensuring that every request's journey through the microservices architecture is captured.
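Collected traces can also be pulled straight from Zipkin's HTTP API, which is useful for scripted checks. A sketch using the Traefik route defined earlier (assumes `jq` is installed and the application is served at `example.com`):

```bash
# List the services Zipkin knows about (path is stripped to /api/v2/... upstream)
curl -s "https://example.com/api/zipkin/api/v2/services" | jq .

# Fetch recent traces for the frontend service
curl -s "https://example.com/api/zipkin/api/v2/traces?serviceName=frontend&limit=10" | jq '.[0]'
```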
What Zipkin Tracks:
- Request Duration: Measures the time taken for each operation within a service and across services.
- Service Dependencies: Visualizes the call graph, showing which services call which others.
- Error Rates and Failure Points: Helps identify where errors occur in a distributed transaction.
- Latency Breakdown: Pinpoints bottlenecks by showing the time spent in different components (e.g., network, database, internal processing).
- Asynchronous Message Processing: Traces can follow messages through queues (like Redis in this case) to track the full lifecycle of an event.
Use Cases:
- Identifying Slow Endpoints: Quickly pinpoint which API calls or internal service interactions are contributing to high latency.
- Debugging Timeout Issues: Understand where a request is getting stuck or timing out across multiple services.
- Understanding Service Communication Patterns: Gain insights into how services interact, which can be invaluable for refactoring or optimizing.
- Capacity Planning: Analyze traffic patterns and service performance to inform scaling decisions.
- Root Cause Analysis for Production Incidents: When an issue occurs, traces provide a detailed timeline of events, helping to quickly identify the root cause.
[LESSONS LEARNED AND KEY TAKEAWAYS]
The journey of building this CI/CD pipeline provided several invaluable lessons and reinforced core DevOps principles.
1. Automation ROI is Exponential
The initial investment in setting up the pipeline was significant, approximately 40 hours of focused development and debugging. However, the return on investment (ROI) was almost immediate and continues to grow:
- Deployment Time Reduction: Manual deployments, which previously took 2+ hours (including SSH, Git pulls, Docker builds, and manual checks), were reduced to less than 5 minutes for a full application update. Infrastructure provisioning went from hours to minutes.
- Error Rate Reduction: Manual errors, a common source of production issues, were virtually eliminated. The pipeline ensures consistent, repeatable deployments.
- Confidence Boost: The ability to deploy changes rapidly and reliably instilled a high degree of confidence in the development team, encouraging more frequent, smaller releases.
Conclusion: The upfront time investment in automation pays dividends immediately. After just a few deployments, the time saved exceeded the initial development time, proving that automation is not a luxury but a necessity for efficient software delivery.
2. Drift Detection is Non-Negotiable
During the development phase, it was tempting to make "quick fixes" directly in the AWS console for testing purposes. This inevitably led to discrepancies between the Terraform state and the actual infrastructure. The drift detection pipeline (using terraform plan -detailed-exitcode) consistently caught these manual changes.
Lesson: Enforce an "infrastructure as code or it doesn't exist" policy from day one. Any change to infrastructure must go through the code repository and the CI/CD pipeline. This prevents configuration drift, ensures auditability, and maintains a single source of truth for infrastructure.
3. Infrastructure as Code Provides Documentation
The Terraform configuration files, along with the Ansible playbooks, serve as living, executable documentation of the entire infrastructure and its configuration.
- Clarity: The HCL files clearly define every AWS resource and its properties.
- Auditability: Every change to the infrastructure is tracked in Git, complete with commit messages and pull request reviews.
- Understanding: Comments within the code explain why certain decisions were made, not just what was configured, which is invaluable for new team members or for revisiting the setup months later.
4. Docker Compose Complexity Sweet Spot
For projects with a moderate number of services (e.g., less than 20), Docker Compose provides the perfect balance between simplicity and functionality. It offers container orchestration capabilities without the steep learning curve and operational overhead of more complex systems.
Alternatives considered:
- Kubernetes: While powerful, Kubernetes would have been massive overkill for a single-server deployment. Its complexity (YAML sprawl, cluster management, networking) would have significantly slowed down development without providing proportional benefits for this scale.
- Docker Swarm: Considered, but its uncertain future and less vibrant ecosystem made it a less attractive choice.
- Nomad: A strong contender for lightweight orchestration, but with less ecosystem support and community resources compared to Docker Compose for this specific use case.
5. Traefik is a Game-Changer
Traefik proved to be an exceptionally powerful and developer-friendly reverse proxy. Its Docker-native approach, which uses container labels for dynamic configuration, eliminated the configuration management complexity often associated with Nginx or HAProxy.
- Automatic SSL: The seamless integration with Let's Encrypt for automatic SSL certificate provisioning and renewal was a major time-saver and security enhancer.
- Dynamic Routing: The ability to add or remove services and have Traefik automatically update its routing rules without restarts was crucial for zero-downtime deployments.
6. GitHub Actions for Team Workflows
GitHub Actions, while perhaps not as feature-rich or flexible as some enterprise-grade CI/CD platforms (like GitLab CI or Jenkins), offers unparalleled integration with GitHub repositories.
- Ease of Use: Its YAML-based syntax is relatively easy to learn.
- Tight Integration: Direct access to GitHub events, secrets, and environments simplifies pipeline development.
- Community Actions: A vast marketplace of pre-built actions accelerates workflow creation.
For smaller teams or projects already hosted on GitHub, it provides a highly effective and convenient CI/CD solution without the need for managing a separate CI/CD server.
[CHALLENGES ENCOUNTERED AND SOLUTIONS]
Building a robust CI/CD pipeline often involves overcoming several technical hurdles. Here are some key challenges faced during this project and their respective solutions.
Challenge 1: SSH Key Management in CI/CD
Problem: GitHub Actions runners are ephemeral, meaning they are provisioned fresh for each job. For Terraform to provision an EC2 instance with an SSH public key, and for Ansible to connect to that instance using the corresponding private key, these keys needed to be available as files on the runner during workflow execution. Storing them directly in the repository is a security anti-pattern.
Solution Implemented:
The public and private SSH keys were stored as encrypted GitHub Secrets (SSH_PUBLIC_KEY and SSH_PRIVATE_KEY). During the GitHub Actions workflow, these secrets were dynamically written to temporary files on the runner's filesystem.
```yaml
- name: Create SSH Keys for Plan
run: |
echo "${{ secrets.SSH_PUBLIC_KEY }}" > infra/terraform/deployer_key.pub
echo "${{ secrets.SSH_PRIVATE_KEY }}" > infra/terraform/deployer_key
chmod 600 infra/terraform/deployer_key # Set secure permissions for the private key
```
Key Insight: It's crucial to set appropriate file permissions (chmod 600) for the private key to prevent unauthorized access and ensure SSH clients accept it. Additionally, a defensive step was added to verify the key format:
```yaml
- name: Verify SSH Key Format
run: |
chmod 600 infra/terraform/deployer_key
ssh-keygen -l -f infra/terraform/deployer_key || \
echo "::error::SSH Private Key is invalid! Check your GitHub Secret."
```
This check helps catch issues early if the secret was incorrectly pasted or corrupted.
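For completeness, a sketch of how a dedicated key pair for the pipeline might be generated and validated locally, mirroring the workflow's verification step:

```bash
# Generate a dedicated key pair for the pipeline (no passphrase)
ssh-keygen -t rsa -b 4096 -f deployer_key -N "" -C "todo-app-ci"

# Same sanity check the workflow performs before using the key
chmod 600 deployer_key
ssh-keygen -l -f deployer_key || echo "SSH private key is invalid"
```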
Challenge 2: Terraform and Ansible Integration
Problem: Ansible needs the public IP address of the EC2 instance to connect and configure it. However, this IP address is only known after Terraform has successfully created the instance. This presents a classic "chicken-and-egg" problem in automation.
Solutions Evaluated:
- ❌ Manual Intervention: Run Terraform, then manually copy the IP to an Ansible inventory file, then run Ansible. (Completely defeats the purpose of automation).
- ❌ Terraform Provisioners with `remote-exec`: Terraform's `remote-exec` provisioners are generally considered brittle for complex configuration management; they lack the idempotency and rich module ecosystem of Ansible.
- ✅ Dynamic Ansible Inventory Generation: Use Terraform's `local_file` resource to dynamically generate the Ansible inventory file after the EC2 instance's IP is known.
Final Solution:
A local_file resource in Terraform was used to create a hosts.yml file in the Ansible inventory directory. This file uses a templatefile function to inject the aws_instance.todo_app.public_ip into the inventory.
```hcl
resource "local_file" "ansible_inventory" {
content = templatefile("${path.module}/inventory.tftpl", {
host = aws_instance.todo_app.public_ip
user = var.server_user
key = var.private_key_path
})
filename = "${path.module}/../ansible/inventory/hosts.yml"
}
```
Then, a null_resource with a local-exec provisioner was used to trigger the Ansible playbook, referencing this dynamically generated inventory file. This pattern ensures that Ansible always targets the correct, newly provisioned instance.
Challenge 3: Environment Variable Propagation
Problem: Several critical values (e.g., DOMAIN_NAME, ACME_EMAIL, JWT_SECRET) were needed at different stages of the pipeline and by different tools (Terraform, Ansible, Docker Compose, application containers). Maintaining consistency and securely passing these values was a challenge.
Solution: A "single source of truth" approach was adopted, with GitHub Secrets serving as the central repository for all sensitive and configuration values. These values were then propagated down the pipeline:

Figure 5: Secrets and environment variables flow from GitHub Secrets through the entire deployment pipeline
Data Flow:
GitHub Secrets
→ GitHub Actions (environment variables)
→ Terraform (via TF_VAR_ prefix)
→ Ansible (via --extra-vars)
→ Docker Compose (via .env file generated by Ansible)
→ Application Containers (via Docker Compose environment variables)
This ensures that values are consistent, securely managed, and injected at the appropriate stage without being hardcoded.
Challenge 4: Terraform State Locking
Problem: In a team environment, or even with multiple CI/CD jobs, concurrent terraform apply operations on the same state file can lead to state corruption, data loss, or inconsistent infrastructure.
Solution: Terraform's S3 backend was configured with DynamoDB for state locking.
```hcl
terraform {
backend "s3" {
bucket = "my-terraform-state-bucket" # Dedicated S3 bucket for state files
key = "todo-app/terraform.tfstate" # Path to the state file within the bucket
region = "us-east-1"
dynamodb_table = "terraform-state-lock" # DynamoDB table for locking
encrypt = true # Encrypt state file at rest
}
}
```
When a terraform apply is initiated, Terraform attempts to acquire a lock in the DynamoDB table. If successful, it proceeds; otherwise, it waits or fails, preventing concurrent modifications and ensuring state integrity.
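When a CI job is cancelled mid-apply, the lock can occasionally be left behind. A sketch of how to inspect and, if genuinely stale, release it (the lock ID placeholder must be replaced with the value Terraform reports in its error message):

```bash
# Inspect current lock entries
aws dynamodb scan --table-name terraform-state-lock --output table

# Release a stuck lock only after confirming no apply is actually running
terraform force-unlock <LOCK_ID>
```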
Challenge 5: Docker Build Context in GitHub Actions
Problem: Building Docker images within GitHub Actions can be slow if the entire repository is sent as the build context, especially for large repositories with many unrelated files (e.g., .git directories, node_modules, documentation).
Solution: Two primary optimizations were applied:
- `.dockerignore` files: Each service's Dockerfile directory included a `.dockerignore` file specifying patterns for files and directories to exclude from the Docker build context, for example:
  ```
  # Example .dockerignore
  .git
  node_modules
  *.md
  tests/
  ```
  This significantly reduces the amount of data sent to the Docker daemon, speeding up the build process.
- Multi-stage builds: Dockerfiles were structured using multi-stage builds to separate build-time dependencies from runtime dependencies. This results in smaller, more secure final images.
These optimizations collectively reduced Docker build times from approximately 8 minutes to less than 2 minutes, accelerating the deployment pipeline.
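The effect of these changes is easy to measure locally; the sketch below times a build with BuildKit enabled so the reported context size can be compared before and after adding the `.dockerignore`:

```bash
cd todos-api                            # any of the service directories works
du -sh .                                # rough size of the build context directory
export DOCKER_BUILDKIT=1
time docker build -t todos-api:test .   # BuildKit logs how much context it transfers
```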
[FUTURE IMPROVEMENTS AND ROADMAP]
While the current CI/CD pipeline is production-ready, there are always opportunities for enhancement and scaling. This roadmap outlines potential future improvements.
Phase 1: Multi-Environment Support (Q1 2025)
Objective: To provide isolated and consistent environments for development, staging, and production, enabling safer testing and deployment workflows.
Implementation Plan:
- Terraform Workspaces or Separate State Files: Utilize Terraform workspaces (terraform workspace new dev) or maintain separate Terraform state files for each environment.
- Environment-Specific Variable Files: Create terraform.tfvars files (e.g., dev.tfvars, staging.tfvars, prod.tfvars) to manage environment-specific configurations such as instance types, domain names, and resource tags; a sketch of such a file follows this list.
- Separate GitHub Actions Environments: Configure distinct GitHub Environments (e.g., dev, staging, production) with different protection rules (e.g., manual approval for production, no approval for dev).
- Subdomain Routing: Implement subdomain-based routing (e.g., dev.example.com, staging.example.com, app.example.com) to access different environments.
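As a minimal sketch of what such an environment-specific variable file might contain (variable names and values here are illustrative, not the project's actual configuration):
```hcl
# dev.tfvars -- illustrative values only
environment   = "dev"
instance_type = "t3.small"
domain_name   = "dev.example.com"

default_tags = {
  Project     = "todo-app"
  Environment = "dev"
}
```
The pipeline for each environment would then select its workspace (terraform workspace select dev) and run terraform apply -var-file=dev.tfvars, while the matching GitHub Environment enforces the approval rules.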
Phase 2: Auto-Scaling (Q2 2025)
Objective: To automatically adjust compute capacity based on demand, ensuring application availability and cost efficiency.
Components:
- AWS Auto Scaling Groups (ASG): Replace the single EC2 instance with an ASG to manage a fleet of instances.
- Application Load Balancer (ALB): Introduce an ALB in front of the ASG to distribute incoming traffic and replace the single-instance Traefik as the primary entry point. Traefik would then run on each instance behind the ALB.
- CloudWatch Alarms: Configure CloudWatch alarms to trigger scaling policies based on metrics like CPU utilization, request count per target, or custom metrics.
- Shared Persistent Storage: For Traefik certificates and other shared data, consider using Amazon EFS or an S3 bucket mounted via FUSE, ensuring state is synchronized across instances.
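A rough Terraform sketch of the scaling pieces described above (resource arguments are simplified, and the referenced variables and aws_lb_target_group are assumed to be defined elsewhere):
```hcl
# Rough sketch -- names and values are hypothetical, not the project's config.

resource "aws_launch_template" "app" {
  name_prefix   = "todo-app-"
  image_id      = var.ami_id
  instance_type = var.instance_type
  user_data     = filebase64("${path.module}/user_data.sh") # bootstrap Docker/Traefik
}

resource "aws_autoscaling_group" "app" {
  min_size            = 2
  max_size            = 6
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = [aws_lb_target_group.app.arn] # assumes ALB target group defined elsewhere

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}

resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60 # scale out/in to keep average CPU around 60%
  }
}
```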
Phase 3: Database Persistence (Q2 2025)
Objective: To move from in-memory data stores to durable, managed database services, ensuring data integrity and persistence.
Services to Add:
- Amazon RDS (PostgreSQL): For relational data storage, replacing any in-memory databases.
- Amazon ElastiCache (Redis): For distributed caching and message queuing, providing a managed, highly available Redis instance.
- Database Migration Management: Integrate tools like Flyway or Liquibase into the CI/CD pipeline to manage database schema changes automatically.
- Automated Backups and Point-in-Time Recovery: Configure RDS and ElastiCache for automated backups and enable point-in-time recovery.
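A rough sketch of the managed data stores in Terraform (sizes, versions, and names are illustrative, and networking details such as subnet groups are omitted):
```hcl
# Rough sketch -- parameters are illustrative, not a tuned production setup.

resource "aws_db_instance" "todos" {
  identifier              = "todo-app-postgres"
  engine                  = "postgres"
  engine_version          = "16.3"
  instance_class          = "db.t3.micro"
  allocated_storage       = 20
  db_name                 = "todos"
  username                = var.db_username
  password                = var.db_password # injected via TF_VAR_, never hardcoded
  storage_encrypted       = true
  backup_retention_period = 7               # enables automated backups and PITR
  skip_final_snapshot     = false
}

resource "aws_elasticache_cluster" "queue" {
  cluster_id      = "todo-app-redis"
  engine          = "redis"
  node_type       = "cache.t3.micro"
  num_cache_nodes = 1
  port            = 6379
}
```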
Phase 4: Comprehensive Monitoring (Q3 2025)
Objective: To implement a robust monitoring and alerting solution for proactive issue detection and performance analysis.
Stack:
- Prometheus: For collecting time-series metrics from application services, Docker containers, and the host system.
- Grafana: For creating interactive dashboards to visualize metrics and gain insights into system health and performance.
- AlertManager: For intelligent routing and deduplication of alerts generated by Prometheus.
- CloudWatch Integration: Integrate with AWS CloudWatch for monitoring AWS service health and infrastructure metrics.
Key Metrics to Track:
- Request latency (p50, p95, p99 percentiles)
- Error rates by service and endpoint
- Container resource utilization (CPU, memory, disk I/O)
- Network traffic and connection counts
- Application-specific business metrics
Phase 5: Blue-Green Deployments (Q3 2025)
Objective: To achieve zero-downtime deployments with instant rollback capabilities, minimizing user impact during updates.
Implementation:
- Two Identical Environments: Maintain two identical production environments (e.g., "Blue" and "Green").
- Traffic Switching: Use the Application Load Balancer to switch traffic instantly from the old (Blue) environment to the new (Green) environment after successful deployment and health checks.
- Automated Health Checks: Implement comprehensive health checks for the new environment before traffic is shifted.
- One-Click Rollback: In case of issues in the Green environment, traffic can be instantly switched back to the stable Blue environment.
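One way to express the traffic switch in Terraform is an ALB listener with weighted forwarding between blue and green target groups; the sketch below assumes those target groups, the load balancer, and the certificate are defined elsewhere:
```hcl
# Sketch of the ALB-based traffic switch only; blue/green target groups and
# their Auto Scaling Groups are assumed to exist.

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.app.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = var.certificate_arn

  default_action {
    type = "forward"

    forward {
      target_group {
        arn    = aws_lb_target_group.blue.arn
        weight = var.blue_weight  # e.g., 100 before cutover, 0 after
      }
      target_group {
        arn    = aws_lb_target_group.green.arn
        weight = var.green_weight # e.g., 0 before cutover, 100 after
      }
    }
  }
}
```
Flipping the weight variables back to their previous values restores the Blue environment, which is the one-click rollback described above.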
Phase 6: Security Hardening (Q4 2025)
Additional Measures:
- AWS WAF (Web Application Firewall): Deploy WAF in front of the ALB to protect against common web exploits (e.g., SQL injection, cross-site scripting).
- Amazon GuardDuty: Enable GuardDuty for intelligent threat detection and continuous monitoring of AWS accounts for malicious activity.
- Secrets Rotation Automation: Implement automated rotation of all secrets (AWS credentials, SSH keys, database passwords) using AWS Secrets Manager or similar tools.
- Encrypted Volume Storage: Ensure all EBS volumes attached to EC2 instances are encrypted at rest.
- Regular Penetration Testing: Schedule periodic penetration tests and security audits to identify and remediate vulnerabilities.
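Two of these measures map directly onto small Terraform resources; a minimal sketch (WAF rules and secrets rotation involve more moving parts and are omitted here):
```hcl
# Minimal sketch of account/region-level hardening.

# Encrypt all newly created EBS volumes in this region by default.
resource "aws_ebs_encryption_by_default" "this" {
  enabled = true
}

# Enable GuardDuty threat detection for the account.
resource "aws_guardduty_detector" "this" {
  enable = true
}
```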
[CONCLUSION AND FINAL THOUGHTS]
Building this comprehensive CI/CD pipeline demonstrated that modern DevOps is not merely about adopting trendy tools—it's about creating reliable, repeatable, and auditable processes that empower development teams to move quickly while maintaining production stability.
The combination of:
- Terraform for declarative infrastructure provisioning,
- GitHub Actions for robust CI/CD orchestration,
- Ansible for idempotent configuration management,
- Docker for consistent containerization, and
- Traefik for dynamic routing and automatic SSL,

creates a powerful, production-ready deployment platform that can scale from prototype to production. The entire technology stack can be deployed with a single git push, yet includes sophisticated safety mechanisms like drift detection, manual approval gates, and automated rollback capabilities.
Key Success Factors:
- Declarative Infrastructure: Terraform's declarative approach makes infrastructure changes reviewable, testable, and version-controlled.
- Immutable Deployments: Containers ensure consistent behavior across environments, reducing "it works on my machine" issues.
- Automated Testing: CI/CD pipelines catch issues early, preventing them from reaching production.
- Observability: Distributed tracing provides critical visibility into complex microservices interactions, aiding in debugging and performance optimization.
- Security by Default: Encrypted secrets, least-privilege IAM roles, automated SSL, and robust firewall rules establish a strong security posture.
This architecture serves as a template for modern application deployment, demonstrating that enterprise-grade automation is accessible to small teams and individual developers. The patterns established here scale effectively from single-server deployments to multi-region, highly available infrastructures.
[ADDITIONAL RESOURCES AND REFERENCES]
Project Repository:
Official Documentation:
- Terraform AWS Provider: https://registry.terraform.io/providers/hashicorp/aws/latest/docs
- Ansible Best Practices: https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html
- Traefik v3 Documentation: https://doc.traefik.io/traefik/
- GitHub Actions Workflows: https://docs.github.com/en/actions
- Docker Compose Reference: https://docs.docker.com/compose/compose-file/
- Zipkin Architecture: https://zipkin.io/pages/architecture.html
Recommended Learning Resources:
- "Terraform: Up & Running" by Yevgeniy Brikman
- "Ansible for DevOps" by Jeff Geerling
- "The DevOps Handbook" by Gene Kim
- HashiCorp Learn (free interactive tutorials)
Questions or feedback? I'd love to discuss DevOps automation strategies, infrastructure as code patterns, or troubleshooting deployment pipelines. Drop your questions in the comments or reach out on Twitter/LinkedIn.
Found this helpful? Consider starring the project repository and sharing this guide with your team!
Tags: #DevOps #Terraform #Ansible #CICD #AWS #Docker #InfrastructureAsCode #GitHubActions #Microservices #Traefik #Automation #CloudComputing #ContainerOrchestration #SRE



