In the modern DevOps landscape, manual infrastructure management and application deployment are rapidly becoming obsolete. This comprehensive guide walks you through building a complete, production-ready CI/CD pipeline for a microservices application, covering infrastructure provisioning, automated deployments, drift detection, and continuous delivery—all using industry-standard DevOps tools and best practices.
[TABLE OF CONTENTS]
- Project Goals and Overview
- System Architecture and Design
- Infrastructure as Code with Terraform
- CI/CD Pipeline Implementation with GitHub Actions
- Configuration Management with Ansible
- Container Orchestration with Docker Compose
- Security Implementation and Best Practices
- Observability and Distributed Tracing
- Lessons Learned and Key Takeaways
- Challenges Encountered and Solutions
- Future Improvements and Roadmap
[PROJECT GOALS AND OVERVIEW]
The primary objective of this project was to create a fully automated deployment pipeline for a multi-service TODO application with complete infrastructure automation. The solution needed to address several key requirements:
Core Requirements:
- Complete automation from code commit to production deployment
- Infrastructure provisioning using declarative configuration
- Automated configuration management for consistent server setup
- Zero-downtime deployments with SSL/TLS termination
- Drift detection to maintain infrastructure consistency
- Distributed tracing for debugging microservices interactions
- Security-first approach with encrypted secrets and minimal attack surface
Technology Stack Selected:
- Infrastructure as Code: Terraform for AWS resource provisioning
- Configuration Management: Ansible for server configuration and application deployment
- CI/CD Orchestration: GitHub Actions for workflow automation
- Containerization: Docker and Docker Compose for service isolation
- Reverse Proxy: Traefik for routing, load balancing, and automatic SSL
- Observability: Zipkin for distributed request tracing
- Message Queue: Redis for asynchronous log processing
The end goal was a system where infrastructure changes and application updates could be deployed with a single git push, with built-in safety mechanisms including drift detection, email notifications, and manual approval gates for production environments.
[SYSTEM ARCHITECTURE AND DESIGN]
The application architecture follows a microservices pattern: seven application services, each serving a specific purpose, fronted by Traefik as the edge proxy. This polyglot architecture demonstrates real-world complexity, where different services are written in different programming languages based on their specific requirements.
Architecture Diagram
Figure 1: Complete microservices architecture showing all 7 services, their technologies, and data flow between components
Service Responsibilities
1. Frontend Service (Vue.js)
- Single-page application providing the complete user interface
- Communicates with backend APIs via RESTful endpoints
- Implements distributed tracing via Zipkin client
- Served as static assets with client-side routing
2. Auth API (Go)
- Handles user authentication and authorization
- Generates and validates JWT tokens for session management
- Communicates with Users API to validate credentials
- Written in Go for performance and concurrency
- Port: 8081 (internal)
3. Todos API (Node.js)
- Provides full CRUD operations for user TODO items
- Publishes create/delete events to Redis message queue
- Validates JWT tokens for authenticated requests
- Asynchronous, event-driven architecture
- Port: 8082 (internal)
4. Users API (Spring Boot / Java)
- Manages user profiles and account information
- Provides user lookup for authentication service
- Simplified implementation (read-only operations)
- Leverages Spring Boot ecosystem
- Port: 8083 (internal)
5. Log Message Processor (Python)
- Consumes messages from Redis queue
- Processes TODO creation and deletion events
- Logs events to stdout for monitoring/aggregation
- Demonstrates asynchronous processing pattern
- Queue-based, no exposed ports
6. Redis
- In-memory data store used as message queue
- Pub/sub pattern for event broadcasting
- Minimal configuration, Alpine-based image
- Port: 6379 (internal only)
7. Zipkin
- Distributed tracing system for microservices
- Collects timing data from all services
- Provides visualization of request flows
- Helps identify performance bottlenecks
- Port: 9411 (exposed via Traefik)
8. Traefik
- Modern reverse proxy and load balancer
- Automatic service discovery via Docker labels
- Let's Encrypt integration for automatic SSL certificates
- Path-based and host-based routing
- HTTP to HTTPS automatic redirection
- Ports: 80 (HTTP), 443 (HTTPS), 8080 (Dashboard)
Network Architecture
Figure 2: Docker networking showing isolated app-network with Traefik as the only external gateway
All services communicate via a dedicated Docker bridge network named app-network. This provides:
- Network isolation from the host system
- Service-to-service communication using container names (DNS resolution)
- No exposed ports except through Traefik
- Encrypted traffic between external clients and Traefik
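A quick way to sanity-check this wiring on the server is to ask Docker directly. The commands below are an illustrative sketch that assumes the stack is already running and the network is named `app-network`:

```bash
# List networks and confirm the dedicated bridge exists
docker network ls --filter name=app-network

# Show which containers are attached to it
docker network inspect app-network --format '{{range .Containers}}{{.Name}} {{end}}'

# Resolve a service by container name from inside the network
# (uses a throwaway Alpine container; assumes image pulls are allowed)
docker run --rm --network app-network alpine nslookup redis
```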
[INFRASTRUCTURE AS CODE WITH TERRAFORM]
Terraform was chosen for infrastructure provisioning because it provides declarative configuration, state management, and a mature AWS provider with extensive resource coverage.
AWS Resources Provisioned
The Terraform configuration provisions the following AWS resources:
1. EC2 Instance
- Instance Type: Configurable via variable (default: t2.medium recommended)
- AMI: Latest Ubuntu 22.04 LTS (Jammy Jellyfish)
- Automatically tagged for easy identification and billing
- Uses data source to always fetch the latest Ubuntu AMI
```hcl
# Data source ensures we always use the latest Ubuntu AMI
data "aws_ami" "ubuntu" {
most_recent = true
owners = ["099720109477"] # Canonical's AWS account
filter {
name = "name"
values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
}
filter {
name = "virtualization-type"
values = ["hvm"]
}
}
# EC2 Instance resource definition
resource "aws_instance" "todo_app" {
ami = data.aws_ami.ubuntu.id
instance_type = var.instance_type
key_name = aws_key_pair.deployer.key_name
vpc_security_group_ids = [aws_security_group.todo_app.id]
tags = {
Name = "todo-app-server-v2"
Environment = "production"
Project = "hngi13-stage6"
}
}
```
2. Security Group
- Ingress: SSH (22), HTTP (80), HTTPS (443)
- Egress: All traffic allowed (for package downloads, API calls, etc.)
- SSH access restricted to specific CIDR block for security
```hcl
resource "aws_security_group" "todo_app" {
name = "todo-app-sg"
description = "Security group for TODO application"
# HTTP access for initial Let's Encrypt challenges
ingress {
description = "HTTP"
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
# HTTPS for production traffic
ingress {
description = "HTTPS"
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
# SSH restricted to specific CIDR for security
ingress {
description = "SSH"
from_port = 22
to_port = 22
protocol = "tcp"
cidr_blocks = [var.ssh_cidr] # Only allow from specific IP range
}
# Allow all outbound traffic
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Name = "todo-app-sg"
}
}
```
3. SSH Key Pair
- Public key uploaded to AWS for instance access
- Private key stored securely in GitHub Secrets
- Used by both Terraform and Ansible for authentication
```hcl
resource "aws_key_pair" "deployer" {
key_name = var.key_name
public_key = file(var.public_key_path)
}
```
4. Remote State Configuration (S3 + DynamoDB)
- S3 bucket stores Terraform state file with encryption
- DynamoDB table provides state locking to prevent concurrent modifications
- Configured in separate backend.tf file
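The bucket and lock table must exist before `terraform init` can configure the backend. A minimal one-time bootstrap sketch, assuming the same `my-terraform-state-bucket` and `terraform-state-lock` names used in the backend configuration shown later:

```bash
# One-time bootstrap of the remote-state backend (names must match backend.tf)
aws s3api create-bucket --bucket my-terraform-state-bucket --region us-east-1

# Enable versioning so previous state files can be recovered
aws s3api put-bucket-versioning \
  --bucket my-terraform-state-bucket \
  --versioning-configuration Status=Enabled

# DynamoDB table used for state locking (Terraform expects a "LockID" hash key)
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region us-east-1
```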
Dynamic Ansible Inventory Generation
One of the most elegant aspects of this setup is the automatic generation of Ansible inventory files. Since the EC2 instance's public IP address is only known after Terraform creates it, we need a mechanism to pass this information to Ansible.
```hcl
# Template file for inventory generation
resource "local_file" "ansible_inventory" {
content = templatefile("${path.module}/inventory.tftpl", {
host = aws_instance.todo_app.public_ip
user = var.server_user
key = var.private_key_path
})
filename = "${path.module}/../ansible/inventory/hosts.yml"
}
```
The inventory.tftpl template file looks like this:
```ini
[web]
${host} ansible_user=${user} ansible_ssh_private_key_file=${key}
```
After Terraform applies, this becomes a fully functional Ansible inventory file with the actual IP address populated.
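Before wiring this into CI, it is worth confirming locally that the generated file is a valid inventory. A small sketch, assuming the paths used in this project:

```bash
cd infra/terraform
terraform apply -auto-approve            # writes ../ansible/inventory/hosts.yml

# Show the inventory exactly as Ansible parses it
ansible-inventory -i ../ansible/inventory/hosts.yml --list

# Ad-hoc connectivity test against the [web] group
ansible web -i ../ansible/inventory/hosts.yml -m ping
```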
Integrated Terraform-Ansible Provisioning
To create a truly seamless deployment experience, Terraform automatically triggers Ansible configuration after infrastructure creation. This is achieved using a null_resource with a local-exec provisioner:
```hcl
resource "null_resource" "ansible_provision" {
# Trigger re-provisioning when instance or inventory changes
triggers = {
instance_id = aws_instance.todo_app.id
inventory = local_file.ansible_inventory.content
}
provisioner "local-exec" {
command = <<-EOT
echo "Waiting for SSH to be available..."
# Wait up to 5 minutes for SSH to become available
for i in {1..30}; do
nc -z -w 5 ${aws_instance.todo_app.public_ip} 22 && break
echo "Waiting for port 22... (attempt $i/30)"
sleep 10
done
echo "Running Ansible playbook..."
# Disable host key checking for automated deployments
export ANSIBLE_HOST_KEY_CHECKING=False
export ANSIBLE_CONFIG=${path.module}/../ansible/ansible.cfg
ansible-playbook \
-i ${path.module}/../ansible/inventory/hosts.yml \
${path.module}/../ansible/playbook.yml \
--extra-vars "domain_name=${var.domain_name} email=${var.email}"
EOT
}
depends_on = [
aws_instance.todo_app,
local_file.ansible_inventory
]
}
```
This approach provides several benefits:
- Infrastructure and configuration are provisioned in a single Terraform apply
- No manual intervention required between infrastructure and configuration steps
- SSH availability check prevents Ansible from failing on a booting instance
- Extra variables (domain, email) are passed from Terraform to Ansible seamlessly
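In practice, the whole provision-and-configure flow then collapses into a single command. The invocation below is a sketch; the variable names match those referenced by the pipeline, but the domain and email values are placeholders:

```bash
cd infra/terraform
terraform init

# One command: provisions AWS resources, waits for SSH, then runs Ansible
terraform apply \
  -var "domain_name=example.com" \
  -var "email=admin@example.com" \
  -var "public_key_path=./deployer_key.pub" \
  -var "private_key_path=./deployer_key"
```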
[CI/CD PIPELINE IMPLEMENTATION WITH GITHUB ACTIONS]
Figure 3: Complete CI/CD pipeline showing infrastructure and application deployment workflows with drift detection and manual approval gates
GitHub Actions provides the orchestration layer for our CI/CD pipeline. Two separate workflows handle infrastructure changes and application deployments respectively.
Infrastructure Pipeline (infra.yml)
This workflow implements a sophisticated drift detection and approval mechanism:
```yaml
name: Infrastructure Pipeline
on:
push:
paths:
- "infra/terraform/"
- "infra/ansible/"
jobs:
terraform-plan:
runs-on: ubuntu-latest
outputs:
drift_detected: ${{ steps.plan.outputs.exitcode == 2 }}
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: "us-east-1"
steps:
- uses: actions/checkout@v3
- name: Setup Terraform
uses: hashicorp/setup-terraform@v2
with:
terraform_wrapper: false # Allows capturing raw output
- name: Terraform Init
run: terraform init
working-directory: infra/terraform
- name: Create SSH Keys for Plan
run: |
echo "${{ secrets.SSH_PUBLIC_KEY }}" > infra/terraform/deployer_key.pub
echo "${{ secrets.SSH_PRIVATE_KEY }}" > infra/terraform/deployer_key
chmod 600 infra/terraform/deployer_key
- name: Terraform Plan
id: plan
run: |
exit_code=0
terraform plan -detailed-exitcode -out=tfplan || exit_code=$?
echo "exitcode=$exit_code" >> $GITHUB_OUTPUT
if [ $exit_code -eq 2 ]; then
echo "Infrastructure drift detected!"
elif [ $exit_code -eq 1 ]; then
echo "Terraform plan failed with errors"
exit 1
else
echo "No infrastructure changes detected"
fi
working-directory: infra/terraform
env:
TF_VAR_public_key_path: "${{ github.workspace }}/infra/terraform/deployer_key.pub"
TF_VAR_private_key_path: "${{ github.workspace }}/infra/terraform/deployer_key"
TF_VAR_domain_name: ${{ secrets.DOMAIN_NAME }}
TF_VAR_email: ${{ secrets.ACME_EMAIL }}
- name: Upload Terraform Plan
uses: actions/upload-artifact@v4
with:
name: tfplan
path: infra/terraform/tfplan
- name: Send Email on Drift
if: steps.plan.outputs.exitcode == 2
uses: dawidd6/action-send-mail@v3
with:
server_address: smtp.gmail.com
server_port: 465
username: ${{ secrets.MAIL_USERNAME }}
password: ${{ secrets.MAIL_PASSWORD }}
subject: "Infrastructure Drift Detected: Manual Review Required"
html_body: |
<h3>Terraform Drift Detected</h3>
<p>Infrastructure changes have been detected for <b>${{ github.repository }}</b>.</p>
<p>Please review the Terraform plan and approve the deployment to apply changes.</p>
<p>
<a href="${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
style="background-color: #2ea44f; color: white; padding: 10px 20px;
text-decoration: none; border-radius: 5px;">
View Plan & Approve Deployment
</a>
</p>
to: ${{ secrets.MAIL_TO }}
from: GitHub Actions CI/CD
terraform-apply:
needs: terraform-plan
if: needs.terraform-plan.outputs.drift_detected == 'true'
runs-on: ubuntu-latest
environment: production # Requires manual approval
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: "us-east-1"
steps:
- uses: actions/checkout@v3
- name: Setup Terraform
uses: hashicorp/setup-terraform@v2
- name: Create SSH Keys
run: |
echo "${{ secrets.SSH_PUBLIC_KEY }}" > infra/terraform/deployer_key.pub
echo "${{ secrets.SSH_PRIVATE_KEY }}" > infra/terraform/deployer_key
chmod 600 infra/terraform/deployer_key
- name: Verify SSH Key Format
run: |
chmod 600 infra/terraform/deployer_key
ssh-keygen -l -f infra/terraform/deployer_key || \
echo "::error::SSH Private Key is invalid! Check your GitHub Secret."
- name: Terraform Init
run: terraform init
working-directory: infra/terraform
- name: Download Terraform Plan
uses: actions/download-artifact@v4
with:
name: tfplan
path: infra/terraform
- name: Terraform Apply
run: terraform apply -auto-approve tfplan
working-directory: infra/terraform
env:
TF_VAR_public_key_path: "${{ github.workspace }}/infra/terraform/deployer_key.pub"
TF_VAR_private_key_path: "${{ github.workspace }}/infra/terraform/deployer_key"
TF_VAR_domain_name: ${{ secrets.DOMAIN_NAME }}
TF_VAR_email: ${{ secrets.ACME_EMAIL }}
```
Pipeline Features Explained:
1. Trigger Conditions:
   - The workflow is triggered on `push` events to the `infra/terraform/**` or `infra/ansible/**` paths. This ensures that any change to infrastructure code or Ansible playbooks automatically initiates a plan.
2. Terraform Plan Job (`terraform-plan`):
   - `runs-on: ubuntu-latest`: Executes on a fresh Ubuntu runner.
   - AWS Credentials: `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are injected from GitHub Secrets, giving the workflow permission to interact with AWS.
   - `actions/checkout@v3`: Checks out the repository content.
   - `hashicorp/setup-terraform@v2`: Installs the specified Terraform version. `terraform_wrapper: false` is crucial here to allow capturing the raw exit code from `terraform plan`.
   - Terraform Init: Initializes the Terraform working directory, downloading providers and setting up the S3 backend for state management.
   - Create SSH Keys for Plan: Dynamically creates the `deployer_key.pub` and `deployer_key` files from GitHub Secrets. These are needed for Terraform to pass the public key to AWS and for the `null_resource` to use the private key for Ansible. `chmod 600` is applied to the private key for security.
   - Terraform Plan (`id: plan`): Executes `terraform plan -detailed-exitcode -out=tfplan`. The `-detailed-exitcode` option is key for drift detection:
     - `0`: No changes; infrastructure matches state.
     - `1`: An error occurred.
     - `2`: Changes detected; infrastructure differs from state.
     The exit code is captured and exposed as the `drift_detected` output for subsequent jobs. Terraform variables (`TF_VAR_public_key_path`, `TF_VAR_private_key_path`, `TF_VAR_domain_name`, `TF_VAR_email`) are passed as environment variables, which in turn are populated from GitHub Secrets.
   - Upload Terraform Plan: The generated `tfplan` file (which contains the proposed changes) is uploaded as an artifact. This allows the `terraform-apply` job to use the exact same plan, preventing "plan drift" between the plan and apply stages.
   - Send Email on Drift: This step runs with `if: steps.plan.outputs.exitcode == 2`, meaning it only executes when drift is detected. It uses `dawidd6/action-send-mail@v3` to notify a configured address (`MAIL_TO` from secrets), and the email includes a direct link to the GitHub Actions run, prompting a manual review.
3. Terraform Apply Job (`terraform-apply`):
   - `needs: terraform-plan`: This job depends on the `terraform-plan` job completing successfully.
   - `if: needs.terraform-plan.outputs.drift_detected == 'true'`: This job only runs if drift was detected in the planning phase. If there is no drift, there is nothing to apply.
   - `environment: production`: A critical safety feature. GitHub Environments allow protection rules such as requiring manual approval before the workflow can proceed, acting as a "human in the loop" for production infrastructure changes.
   - AWS Credentials: Same as the plan job.
   - `actions/checkout@v3`: Checks out the repository.
   - Setup Terraform: Installs Terraform.
   - Create SSH Keys: Recreates the SSH key files, as runners are ephemeral.
   - Verify SSH Key Format: A defensive step to ensure the private key from secrets is valid before attempting to use it, catching misconfigurations early.
   - Terraform Init: Initializes Terraform.
   - Download Terraform Plan: Downloads the `tfplan` artifact generated by the `terraform-plan` job, ensuring the `apply` operation is based on the exact plan that was reviewed.
   - Terraform Apply: Executes `terraform apply -auto-approve tfplan`. The `-auto-approve` flag is acceptable because the manual approval on the `production` environment already serves as the explicit approval, and passing the `tfplan` file directly guarantees that only the planned changes are applied. Terraform variables are passed as in the plan step.
This infrastructure pipeline provides a robust, secure, and auditable process for managing infrastructure changes, incorporating drift detection, email notifications, and manual approval for critical environments.
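The same drift check the pipeline relies on can be reproduced locally, which is handy when debugging why a run flagged (or failed to flag) drift. A minimal sketch:

```bash
cd infra/terraform

terraform plan -detailed-exitcode -out=tfplan
status=$?   # 0 = no changes, 1 = error, 2 = drift/changes detected

if [ "$status" -eq 2 ]; then
  echo "Drift detected - review tfplan before applying"
elif [ "$status" -eq 1 ]; then
  echo "Plan failed" >&2
  exit 1
else
  echo "Infrastructure matches state"
fi
```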
Deployment Pipeline (deploy.yml)
The application deployment pipeline handles code changes and responds to infrastructure updates:
```yaml
name: Application Deployment
on:
workflow_run:
workflows: ["Infrastructure Pipeline"]
types: [completed]
push:
paths:
- "frontend/"
- "auth-api/"
- "todos-api/"
- "users-api/"
- "log-message-processor/**"
- "docker-compose.yml"
jobs:
deploy:
runs-on: ubuntu-latest
if: ${{ github.event.workflow_run.conclusion == 'success' || github.event_name == 'push' }}
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
AWS_DEFAULT_REGION: "us-east-1"
steps:
- uses: actions/checkout@v3
- name: Get Server IP Dynamically
id: get-ip
run: |
IP=$(aws ec2 describe-instances \
--filters "Name=tag:Name,Values=todo-app-server-v2" \
"Name=instance-state-name,Values=running" \
--query "Reservations[*].Instances[*].PublicIpAddress" \
--output text)
echo "SERVER_IP=$IP" >> $GITHUB_ENV
echo "Deploying to instance at IP: $IP"
- name: Deploy via Ansible
uses: dawidd6/action-ansible-playbook@v2
with:
playbook: infra/ansible/playbook.yml
directory: ./
key: ${{ secrets.SSH_PRIVATE_KEY }}
inventory: |
[web]
${{ env.SERVER_IP }} ansible_user=ubuntu
options: |
--extra-vars "domain_name=${{ secrets.DOMAIN_NAME }} email=${{ secrets.ACME_EMAIL }}"
env:
ANSIBLE_CONFIG: infra/ansible/ansible.cfg
```
Key Aspects:
Trigger Conditions:
- `workflow_run`: This pipeline is triggered when the Infrastructure Pipeline completes. This ensures that if infrastructure changes (e.g., a new EC2 instance is provisioned), the application deployment automatically follows. The `types: [completed]` filter makes it run after the workflow finishes.
- `push`: It also triggers on direct pushes to specific application code directories (`frontend/**`, `auth-api/**`, etc.) or the `docker-compose.yml` file. This allows rapid iteration on application code without requiring an infrastructure change.
- `if` condition: `if: ${{ github.event.workflow_run.conclusion == 'success' || github.event_name == 'push' }}` ensures the deployment only proceeds if the infrastructure pipeline was successful (when triggered by `workflow_run`) or if it is a direct code push.
Dynamic IP Resolution:
- Get Server IP Dynamically: This step uses the AWS CLI to find the public IP address of the EC2 instance.
  - It filters instances by the `Name` tag (`todo-app-server-v2`) and ensures the instance is `running`.
  - The JMESPath expression (`--query "Reservations[*].Instances[*].PublicIpAddress"`) extracts the IP address.
  - The IP is stored in a GitHub Actions environment variable (`SERVER_IP`) for use in subsequent steps. This is crucial because the EC2 instance's IP might change if it is stopped and started, or if a new instance replaces an old one.
Inline Inventory:
- Deploy via Ansible: This step uses the `dawidd6/action-ansible-playbook@v2` action to run the Ansible playbook.
  - `playbook: infra/ansible/playbook.yml`: Specifies the main Ansible playbook.
  - `key: ${{ secrets.SSH_PRIVATE_KEY }}`: The SSH private key is securely passed from GitHub Secrets, allowing Ansible to connect to the EC2 instance.
  - `inventory: | ...`: Instead of a static inventory file, an inline inventory is generated using the dynamically fetched `SERVER_IP`. This makes the deployment resilient to IP changes; `ansible_user=ubuntu` specifies the SSH user.
  - `options: | --extra-vars ...`: Additional variables like `domain_name` and `email` are passed to Ansible from GitHub Secrets, ensuring consistency across the pipeline.
  - `ANSIBLE_CONFIG: infra/ansible/ansible.cfg`: Points Ansible to a custom configuration file if needed.
This application deployment pipeline is designed for efficiency and reliability, automatically reacting to both infrastructure and code changes, and dynamically adapting to the current state of the infrastructure.
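After a deployment run, the same dynamic lookup can be reused for a quick manual smoke test. This is a sketch, assuming the AWS CLI is configured locally, `DOMAIN_NAME` is exported, and the private key is available as `./deployer_key`:

```bash
# Resolve the current server IP exactly as the pipeline does
IP=$(aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=todo-app-server-v2" \
            "Name=instance-state-name,Values=running" \
  --query "Reservations[*].Instances[*].PublicIpAddress" \
  --output text)

# Confirm the stack is up and the frontend answers over HTTPS
ssh -i ./deployer_key "ubuntu@$IP" "cd /opt/todo-app && docker compose ps"
curl -skI "https://$DOMAIN_NAME/" | head -n 1
```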
[CONFIGURATION MANAGEMENT WITH ANSIBLE]
Ansible handles all server configuration and application deployment tasks. The playbook is structured using roles for modularity and reusability.
Playbook Structure
The main Ansible playbook (infra/ansible/playbook.yml) orchestrates the execution of different roles:
```yaml
- hosts: web
become: yes # Run tasks with sudo privileges
vars:
project_root: /opt/todo-app # Base directory for the application
repo_url: https://github.com/PrimoCrypt/DevOps-Stage-6.git # Application repository URL
# jwt_secret is passed via --extra-vars from GitHub Actions
# domain_name and email are also passed via --extra-vars
roles:
- dependencies # Installs Docker, Docker Compose, Git, configures firewall
- deploy # Clones repo, creates .env, runs docker-compose
```
This structure clearly separates concerns: the `dependencies` role sets up the server environment, and the `deploy` role handles the application-specific deployment.
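Before handing the playbook to CI, it can be exercised from a workstation in check mode. A sketch, assuming the generated inventory exists and using placeholder values for the extra variables:

```bash
ansible-playbook \
  -i infra/ansible/inventory/hosts.yml \
  infra/ansible/playbook.yml \
  --check --diff \
  --extra-vars "domain_name=example.com email=admin@example.com jwt_secret=dummy"
```

Check mode reports what would change without applying it, which surfaces syntax and connectivity problems early.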
Dependencies Role
The dependencies role (infra/ansible/roles/dependencies/tasks/main.yml) ensures all prerequisite software is installed and configured on the EC2 instance. This makes the instance ready to host Dockerized applications.
Tasks performed:
- System package updates: Ensures the system is up to date (`apt update && apt upgrade`).
- Docker Engine installation: Installs the latest stable Docker CE (Community Edition) by adding Docker's official GPG key and repository.
- Docker Compose installation: Downloads and installs the latest Docker Compose binary (v2.x) to `/usr/local/bin`.
- Git installation: Installs Git for cloning the application repository.
- UFW firewall configuration: Configures the Uncomplicated Firewall (UFW) to allow SSH (port 22), HTTP (port 80), and HTTPS (port 443) traffic, then enables the firewall.
- Docker service enablement and startup: Ensures the Docker daemon starts automatically on boot and is currently running.
- User permissions for the Docker socket: Adds the `ubuntu` user to the `docker` group, allowing it to run Docker commands without `sudo`. This requires a reboot or re-login to take effect, which is handled implicitly by subsequent SSH connections.
Example task (abbreviated):
```yaml
- name: Install Docker dependencies
  ansible.builtin.apt:
    name:
      - apt-transport-https
      - ca-certificates
      - curl
      - gnupg
      - lsb-release
    state: present
    update_cache: yes

- name: Add Docker GPG key
  ansible.builtin.apt_key:
    url: https://download.docker.com/linux/ubuntu/gpg
    state: present

- name: Add Docker APT repository
  ansible.builtin.apt_repository:
    repo: "deb [arch={{ 'amd64' if ansible_architecture == 'x86_64' else ansible_architecture }}] https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable"
    state: present

- name: Install Docker Engine
  ansible.builtin.apt:
    name:
      - docker-ce
      - docker-ce-cli
      - containerd.io
      - docker-buildx-plugin   # For buildx support
      - docker-compose-plugin  # For docker compose v2
    state: present

# Legacy standalone Docker Compose binary (v1-style install). Usually not needed
# when docker-compose-plugin is installed; kept for older systems or specific needs.
- name: Install Docker Compose binary
  ansible.builtin.get_url:
    url: https://github.com/docker/compose/releases/download/v2.20.0/docker-compose-linux-x86_64
    dest: /usr/local/bin/docker-compose
    mode: "0755"
  when: false  # Disable this if using docker-compose-plugin

- name: Ensure Docker service is running and enabled
  ansible.builtin.systemd:
    name: docker
    state: started
    enabled: yes

- name: Add 'ubuntu' user to the 'docker' group
  ansible.builtin.user:
    name: ubuntu
    groups: docker
    append: yes

- name: Configure UFW to allow SSH, HTTP, HTTPS
  community.general.ufw:
    rule: allow
    port: "{{ item }}"
    proto: tcp
  loop:
    - "22"
    - "80"
    - "443"

- name: Enable UFW
  community.general.ufw:
    state: enabled
```
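Once the role has run, the result can be verified over SSH with a few standard commands; this is a quick illustrative checklist rather than part of the playbook:

```bash
docker --version
docker compose version          # v2 plugin installed alongside the engine
git --version
sudo ufw status verbose         # expect 22, 80, 443 allowed
id ubuntu                       # expect 'docker' among the groups
sudo systemctl is-enabled docker && sudo systemctl is-active docker
```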
Deploy Role
The deploy role (infra/ansible/roles/deploy/tasks/main.yml) handles the actual application deployment and lifecycle management.
Workflow:
- Clone or update the application repository: Uses the `git` module to clone the repository if it doesn't exist, or pull the latest changes if it does. This ensures the server always has the most recent application code.
- Generate the `.env` file: Creates an `.env` file in the application's root directory using Ansible's `template` module. This file contains environment variables required by the Docker Compose services, such as `DOMAIN_NAME`, `ACME_EMAIL`, and `JWT_SECRET`.
- Stop existing containers (if running): The `docker_compose` module handles this implicitly when `state: present` and `pull: yes` are used, as it recreates containers whose images have changed.
- Pull latest Docker images: `pull: yes` in `docker_compose` ensures that the latest images for all services are downloaded from Docker Hub or a private registry.
- Start containers with `docker-compose`: The `community.docker.docker_compose` module orchestrates the startup of all services defined in `docker-compose.yml`.
- Verify service health: While not explicitly shown in the playbook snippet, a production setup would include tasks to wait for services to become healthy (e.g., using the `wait_for` module or container health checks).
Environment file generation:
```yaml
- name: Ensure project root directory exists
  ansible.builtin.file:
    path: "{{ project_root }}"
    state: directory
    mode: "0755"

- name: Clone or update application repository
  ansible.builtin.git:
    repo: "{{ repo_url }}"
    dest: "{{ project_root }}"
    version: master  # Or a specific branch/tag
    update: yes
    force: yes       # Force update in case of local changes

- name: Create .env file from template
  ansible.builtin.template:
    src: env.j2  # Template located in infra/ansible/roles/deploy/templates/
    dest: "{{ project_root }}/.env"
    mode: "0600"  # Secure permissions for sensitive environment variables

- name: Start application services with Docker Compose
  community.docker.docker_compose:
    project_src: "{{ project_root }}"
    state: present  # Ensures services are running
    pull: yes       # Pulls latest images before starting
    build: yes      # Builds images if necessary (e.g., local Dockerfiles)
```
The env.j2 template (infra/ansible/roles/deploy/templates/env.j2) injects runtime configuration:
```
DOMAIN_NAME={{ domain_name }}
ACME_EMAIL={{ email }}
JWT_SECRET={{ jwt_secret }}
```
The jwt_secret variable would typically be passed as an --extra-var from GitHub Actions, similar to domain_name and email, ensuring it's never hardcoded in the repository.
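A hedged sketch of how that might look when running the playbook by hand: the secret lives only in the environment of the invoking shell, never in the repository (the `openssl rand` line is just one way to mint a throwaway value):

```bash
# JWT_SECRET is assumed to be set in the environment, never committed to the repo
export JWT_SECRET="$(openssl rand -hex 32)"

ansible-playbook -i infra/ansible/inventory/hosts.yml infra/ansible/playbook.yml \
  --extra-vars "domain_name=example.com email=admin@example.com jwt_secret=${JWT_SECRET}"
```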
[CONTAINER ORCHESTRATION WITH DOCKER COMPOSE]
Docker Compose orchestrates all eight services (the seven application services plus Traefik) with a single configuration file (docker-compose.yml). This file defines the services, their dependencies, network configurations, and how they interact with Traefik.
Complete Docker Compose Configuration
```yaml
version: "3.8"
services:
traefik:
image: traefik:v3.6
command:
- "--api.insecure=true" # Enable Traefik dashboard (for debugging, disable in prod)
- "--providers.docker=true" # Enable Docker provider
- "--providers.docker.exposedbydefault=false" # Only expose services with traefik.enable=true
- "--entrypoints.web.address=:80" # HTTP entrypoint
- "--entrypoints.websecure.address=:443" # HTTPS entrypoint
- "--certificatesresolvers.myresolver.acme.tlschallenge=true" # Use TLS challenge for Let's Encrypt
- "--certificatesresolvers.myresolver.acme.email=${ACME_EMAIL}" # Email for Let's Encrypt notifications
- "--certificatesresolvers.myresolver.acme.storage=/letsencrypt/acme.json" # Storage for certificates
ports:
- "80:80" # Expose HTTP
- "443:443" # Expose HTTPS
- "8080:8080" # Expose Traefik dashboard (for debugging, disable in prod)
volumes:
- "./letsencrypt:/letsencrypt" # Persistent storage for Let's Encrypt certificates
- "/var/run/docker.sock:/var/run/docker.sock:ro" # Mount Docker socket for service discovery
networks:
- app-network
restart: unless-stopped # Always restart unless explicitly stopped
frontend:
build: ./frontend # Build from local Dockerfile
image: frontend:latest # Tag the built image
container_name: frontend
labels:
- "traefik.enable=true"
- "traefik.http.routers.frontend.rule=Host(${DOMAIN_NAME})" # Route based on domain name
- "traefik.http.routers.frontend.entrypoints=websecure" # Use HTTPS entrypoint
- "traefik.http.routers.frontend.tls.certresolver=myresolver" # Use Let's Encrypt resolver
- "traefik.http.routers.frontend-http.rule=Host(${DOMAIN_NAME})" # HTTP router for redirect
- "traefik.http.routers.frontend-http.entrypoints=web" # Use HTTP entrypoint
- "traefik.http.routers.frontend-http.middlewares=https-redirect" # Apply HTTPS redirect middleware
- "traefik.http.middlewares.https-redirect.redirectscheme.scheme=https" # Define HTTPS redirect
networks:
- app-network
restart: unless-stopped
auth-api:
build: ./auth-api
image: auth-api:latest
container_name: auth-api
environment:
- USERS_API_ADDRESS=http://users-api:8083 # Internal service communication
- JWT_SECRET=${JWT_SECRET} # Injected from .env
- AUTH_API_PORT=8081
- ZIPKIN_URL=http://zipkin:9411/api/v2/spans # Zipkin endpoint
labels:
- "traefik.enable=true"
- "traefik.http.routers.auth-api.rule=Host(${DOMAIN_NAME}) && PathPrefix(/api/auth)" # Route by path prefix
- "traefik.http.routers.auth-api.entrypoints=websecure"
- "traefik.http.routers.auth-api.tls.certresolver=myresolver"
- "traefik.http.middlewares.auth-strip.stripprefix.prefixes=/api/auth" # Strip path prefix before forwarding
- "traefik.http.routers.auth-api.middlewares=auth-strip"
networks:
- app-network
restart: unless-stopped
todos-api:
build: ./todos-api
image: todos-api:latest
container_name: todos-api
environment:
- REDIS_HOST=redis
- REDIS_PORT=6379
- REDIS_CHANNEL=log_channel
- TODO_API_PORT=8082
- JWT_SECRET=${JWT_SECRET}
- ZIPKIN_URL=http://zipkin:9411/api/v2/spans
labels:
- "traefik.enable=true"
- "traefik.http.routers.todos-api.rule=Host(${DOMAIN_NAME}) && PathPrefix(/api/todos)"
- "traefik.http.routers.todos-api.entrypoints=websecure"
- "traefik.http.routers.todos-api.tls.certresolver=myresolver"
- "traefik.http.middlewares.todos-strip.stripprefix.prefixes=/api"
- "traefik.http.routers.todos-api.middlewares=todos-strip"
networks:
- app-network
depends_on:
- redis # Ensure Redis starts before Todos API
restart: unless-stopped
users-api:
build: ./users-api
image: users-api:latest
container_name: users-api
environment:
- SERVER_PORT=8083
- JWT_SECRET=${JWT_SECRET}
- SPRING_ZIPKIN_BASE_URL=http://zipkin:9411/ # Spring Boot specific Zipkin config
labels:
- "traefik.enable=true"
- "traefik.http.routers.users-api.rule=Host(${DOMAIN_NAME}) && PathPrefix(/api/users)"
- "traefik.http.routers.users-api.entrypoints=websecure"
- "traefik.http.routers.users-api.tls.certresolver=myresolver"
- "traefik.http.middlewares.users-strip.stripprefix.prefixes=/api"
- "traefik.http.routers.users-api.middlewares=users-strip"
networks:
- app-network
restart: unless-stopped
log-message-processor:
build: ./log-message-processor
image: log-message-processor:latest
container_name: log-message-processor
environment:
- REDIS_HOST=redis
- REDIS_PORT=6379
- REDIS_CHANNEL=log_channel
- ZIPKIN_URL=http://zipkin:9411/api/v2/spans
networks:
- app-network
depends_on:
- redis
restart: unless-stopped
redis:
image: redis:alpine # Lightweight Redis image
container_name: redis
networks:
- app-network
restart: unless-stopped
zipkin:
image: openzipkin/zipkin # Official Zipkin image
container_name: zipkin
ports:
- "9411:9411" # Expose Zipkin UI internally (Traefik handles external access)
networks:
- app-network
labels:
- "traefik.enable=true"
- "traefik.http.routers.zipkin.rule=Host(${DOMAIN_NAME}) && PathPrefix(/api/zipkin)"
- "traefik.http.routers.zipkin.entrypoints=websecure"
- "traefik.http.routers.zipkin.tls.certresolver=myresolver"
- "traefik.http.middlewares.zipkin-strip.stripprefix.prefixes=/api/zipkin"
- "traefik.http.routers.zipkin.middlewares=zipkin-strip"
restart: unless-stopped
networks:
app-network:
driver: bridge # Custom bridge network for inter-service communication
```
Traefik Configuration Explained
Figure 4: Traefik request routing showing path-based routing, SSL termination, and Let's Encrypt integration
Traefik leverages Docker labels for dynamic service discovery and routing configuration. This eliminates the need for manual configuration file updates when services are added, removed, or updated, making it far more agile than traditional reverse proxies like Nginx.
Label breakdown for frontend service:
```yaml
labels:
  # 1. Enable Traefik for this container
  - "traefik.enable=true"
  # 2. HTTPS router configuration for the main domain
  - "traefik.http.routers.frontend.rule=Host(`example.com`)"                # Matches requests for the specified domain
  - "traefik.http.routers.frontend.entrypoints=websecure"                   # Listens on the HTTPS entrypoint (port 443)
  - "traefik.http.routers.frontend.tls.certresolver=myresolver"             # Uses the Let's Encrypt certificate resolver
  # 3. HTTP router for automatic redirection to HTTPS
  - "traefik.http.routers.frontend-http.rule=Host(`example.com`)"           # Matches requests for the domain on HTTP
  - "traefik.http.routers.frontend-http.entrypoints=web"                    # Listens on the HTTP entrypoint (port 80)
  - "traefik.http.routers.frontend-http.middlewares=https-redirect"         # Applies the 'https-redirect' middleware
  # 4. Middleware definition for the HTTPS redirect
  - "traefik.http.middlewares.https-redirect.redirectscheme.scheme=https"   # Configures the middleware to redirect to HTTPS
```
Benefits of this approach:
- No configuration file reloads required: Traefik automatically detects changes to Docker labels and updates its routing table in real-time.
- Services can be added/removed without Traefik downtime: This enables true zero-downtime deployments and dynamic scaling.
- SSL certificates automatically provisioned and renewed: Let's Encrypt integration handles the entire lifecycle of SSL certificates.
- Path-based routing allows multiple services on one domain: Different microservices can be exposed under different URL paths on the same domain (e.g., `/api/auth`, `/api/todos`).
- Middleware support for transformations: Traefik middlewares can perform path stripping, authentication, rate limiting, and more before forwarding requests to the backend service.
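The redirect, certificate, and path-routing behaviour can all be observed from the outside with curl and openssl. The checks below are illustrative and assume the application is served at `example.com`:

```bash
# HTTP should answer with a redirect issued by the https-redirect middleware
curl -sI "http://example.com/" | head -n 3

# HTTPS requests are terminated by Traefik and routed by path prefix
curl -skI "https://example.com/" | head -n 1                                 # frontend
curl -sk "https://example.com/api/todos" -o /dev/null -w "%{http_code}\n"    # todos-api (likely 401 without a JWT)

# Inspect the certificate Traefik obtained from Let's Encrypt
openssl s_client -connect example.com:443 -servername example.com </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -dates
```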
Service Routing Table
This table summarizes how external requests are routed to internal services via Traefik:
| URL Path | Service | Backend Port | Traefik Middleware |
|---|---|---|---|
| `https://domain.com/` | frontend | 80 (internal) | N/A |
| `https://domain.com/api/auth/*` | auth-api | 8081 | `auth-strip` (strips `/api/auth`) |
| `https://domain.com/api/todos/*` | todos-api | 8082 | `todos-strip` (strips `/api`) |
| `https://domain.com/api/users/*` | users-api | 8083 | `users-strip` (strips `/api`) |
| `https://domain.com/api/zipkin/*` | zipkin | 9411 | `zipkin-strip` (strips `/api/zipkin`) |
| `http://domain.com/*` | N/A | N/A | `https-redirect` (redirects to HTTPS) |
[SECURITY IMPLEMENTATION AND BEST PRACTICES]
Security was a primary consideration throughout this project, implemented at multiple layers from infrastructure to application.
Secrets Management Strategy
All sensitive information is stored and managed securely, never committed directly to the repository.
GitHub Secrets Used:
- `AWS_ACCESS_KEY_ID`: AWS programmatic access key for GitHub Actions.
- `AWS_SECRET_ACCESS_KEY`: AWS secret key corresponding to the access key.
- `SSH_PUBLIC_KEY`: The public part of the SSH key pair used for EC2 instance creation.
- `SSH_PRIVATE_KEY`: The private part of the SSH key pair used by Terraform and Ansible for SSH connections.
- `DOMAIN_NAME`: The production domain name for the application (e.g., yourdomain.com).
- `ACME_EMAIL`: Email address for Let's Encrypt certificate notifications.
- `MAIL_USERNAME`: SMTP username for sending drift detection emails.
- `MAIL_PASSWORD`: SMTP password for sending drift detection emails.
- `MAIL_TO`: Recipient email address for drift detection alerts.
- `JWT_SECRET`: Secret key used for signing and verifying JSON Web Tokens across microservices.
Secret Rotation Strategy:
- Dedicated Keys: SSH keys and AWS IAM credentials are generated specifically for this CI/CD pipeline, limiting their scope.
- Least Privilege: The AWS IAM user associated with `AWS_ACCESS_KEY_ID` has only the minimum permissions required to provision and manage the specified resources.
- Regular Rotation: Rotating all secrets (AWS keys, SSH keys, JWT secrets) every 90 days is recommended to minimize the impact of a potential compromise.
- Environment-Specific Secrets: For multi-environment setups, separate secrets would be maintained for `dev`, `staging`, and `prod` to further isolate environments.
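Rotation is easiest to keep honest when it is scriptable. A sketch using the GitHub CLI (assuming `gh` is authenticated against the repository and the new values are already available in environment variables or local files):

```bash
gh secret set AWS_ACCESS_KEY_ID     --body "$NEW_ACCESS_KEY_ID"
gh secret set AWS_SECRET_ACCESS_KEY --body "$NEW_SECRET_ACCESS_KEY"
gh secret set JWT_SECRET            --body "$(openssl rand -hex 32)"
gh secret set SSH_PRIVATE_KEY       < deployer_key       # freshly generated key pair
gh secret set SSH_PUBLIC_KEY        < deployer_key.pub
```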
Network Security
AWS Security Group Rules:
The EC2 instance's security group is configured with strict inbound rules:
```
Inbound Rules:
- SSH (Port 22): Allowed only from a specific CIDR block (e.g., your office IP, VPN IP). This prevents unauthorized SSH access.
- HTTP (Port 80): Allowed from anywhere (0.0.0.0/0). This is necessary for Traefik to handle initial Let's Encrypt challenges and HTTP to HTTPS redirection.
- HTTPS (Port 443): Allowed from anywhere (0.0.0.0/0). This is for the main application traffic.
Outbound Rules:
- All Traffic: Allowed to anywhere (0.0.0.0/0). This is necessary for the instance to download packages, pull Docker images, and make API calls to AWS services.
```
Docker Network Isolation:
- Dedicated Bridge Network: All Docker services run within a custom bridge network (`app-network`). This isolates them from the host's network and from other Docker networks.
- Internal Communication: Services communicate with each other using their container names (e.g., `http://redis:6379`), which are resolved by Docker's internal DNS.
- Minimal Port Exposure: Only Traefik exposes ports (80, 443, 8080) to the host machine. All other services are only accessible internally within the `app-network`, significantly reducing the attack surface.
- Encrypted External Traffic: All external traffic to the application is forced over HTTPS, encrypted by Traefik.
app-network, significantly reducing the attack surface. - Encrypted External Traffic: All external traffic to the application is forced over HTTPS, encrypted by Traefik.
SSL/TLS Implementation
Let's Encrypt via Traefik:
- Automatic Certificate Provisioning: Traefik is configured to automatically obtain and renew SSL certificates from Let's Encrypt using the TLS-ALPN-01 challenge.
- Persistent Storage: Certificates are stored in a persistent volume (`./letsencrypt:/letsencrypt`), ensuring they survive container restarts.
- No Wildcard Certificates: For enhanced security, specific certificates are obtained for the primary domain, rather than using wildcard certificates, which have a broader attack surface.
- HTTP to HTTPS Redirection: Traefik automatically redirects all HTTP traffic to HTTPS, ensuring all communication is encrypted.
Application Security Measures
- JWT Token Authentication:
- The `auth-api` generates and validates JSON Web Tokens (JWTs) for user sessions.
- Tokens have configurable expiration times.
- A shared `JWT_SECRET` (injected via `.env`) is used across services for token validation, ensuring only authorized services can verify tokens.
- All API endpoints requiring authentication enforce the presence and validity of JWTs in the `Authorization` header.
- Input Validation:
- Each API service is responsible for validating incoming request payloads to prevent common vulnerabilities like injection attacks (though no SQL DB is used here, the principle applies).
- Frontend input is also validated client-side and server-side.
- CORS Configuration:
- The frontend and backend APIs are served from the same domain (different paths), eliminating the need for complex Cross-Origin Resource Sharing (CORS) configurations and potential misconfigurations.
- Firewall Configuration (UFW):
- The Uncomplicated Firewall (UFW) is configured on the EC2 instance to provide an additional layer of host-level network security:
```bash
ufw default deny incoming    # Deny all incoming traffic by default
ufw default allow outgoing   # Allow all outgoing traffic
ufw allow 22/tcp             # Allow SSH
ufw allow 80/tcp             # Allow HTTP
ufw allow 443/tcp            # Allow HTTPS
ufw enable                   # Enable the firewall
```
  - This ensures that only explicitly allowed ports are open, even if security group rules were to be misconfigured.
- The Uncomplicated Firewall (UFW) is configured on the EC2 instance to provide an additional layer of host-level network security:
[OBSERVABILITY AND DISTRIBUTED TRACING]
Observability is crucial for understanding the behavior of microservices in production. Zipkin is integrated to provide distributed tracing, allowing us to visualize and analyze request flows across all services.
Zipkin Integration Example (Frontend)
Each service is instrumented to send trace data to the Zipkin collector. Here's an example from the Vue.js frontend:
```javascript
// frontend/src/zipkin.js
import { Tracer, ExplicitContext, BatchRecorder } from "zipkin";
import { HttpLogger } from "zipkin-transport-http";
const tracer = new Tracer({
ctxImpl: new ExplicitContext(), // Manages the current span context
recorder: new BatchRecorder({
// Buffers spans and sends them in batches
logger: new HttpLogger({
// Sends spans over HTTP
endpoint: `${process.env.VUE_APP_API_URL}/api/zipkin`, // Zipkin collector endpoint via Traefik
jsonEncoder: JSON.stringify, // Encodes spans as JSON
}),
}),
localServiceName: "frontend", // Name of this service in traces
supportsJoin: false, // Frontend typically starts new traces
});
export default tracer;
```
Similar instrumentation is applied to the Go, Node.js, Java, and Python services, ensuring that every request's journey through the microservices architecture is captured.
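Collected traces can also be pulled straight from Zipkin's HTTP API, which is useful for scripted checks. A sketch using the Traefik route defined earlier (assumes `jq` is installed and the application is served at `example.com`):

```bash
# List the services Zipkin knows about (path is stripped to /api/v2/... upstream)
curl -s "https://example.com/api/zipkin/api/v2/services" | jq .

# Fetch recent traces for the frontend service
curl -s "https://example.com/api/zipkin/api/v2/traces?serviceName=frontend&limit=10" | jq '.[0]'
```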
What Zipkin Tracks:
- Request Duration: Measures the time taken for each operation within a service and across services.
- Service Dependencies: Visualizes the call graph, showing which services call which others.
- Error Rates and Failure Points: Helps identify where errors occur in a distributed transaction.
- Latency Breakdown: Pinpoints bottlenecks by showing the time spent in different components (e.g., network, database, internal processing).
- Asynchronous Message Processing: Traces can follow messages through queues (like Redis in this case) to track the full lifecycle of an event.
Use Cases:
- Identifying Slow Endpoints: Quickly pinpoint which API calls or internal service interactions are contributing to high latency.
- Debugging Timeout Issues: Understand where a request is getting stuck or timing out across multiple services.
- Understanding Service Communication Patterns: Gain insights into how services interact, which can be invaluable for refactoring or optimizing.
- Capacity Planning: Analyze traffic patterns and service performance to inform scaling decisions.
- Root Cause Analysis for Production Incidents: When an issue occurs, traces provide a detailed timeline of events, helping to quickly identify the root cause.
[LESSONS LEARNED AND KEY TAKEAWAYS]
The journey of building this CI/CD pipeline provided several invaluable lessons and reinforced core DevOps principles.
1. Automation ROI is Exponential
The initial investment in setting up the pipeline was significant, approximately 40 hours of focused development and debugging. However, the return on investment (ROI) was almost immediate and continues to grow:
- Deployment Time Reduction: Manual deployments, which previously took 2+ hours (including SSH, Git pulls, Docker builds, and manual checks), were reduced to less than 5 minutes for a full application update. Infrastructure provisioning went from hours to minutes.
- Error Rate Reduction: Manual errors, a common source of production issues, were virtually eliminated. The pipeline ensures consistent, repeatable deployments.
- Confidence Boost: The ability to deploy changes rapidly and reliably instilled a high degree of confidence in the development team, encouraging more frequent, smaller releases.
Conclusion: The upfront time investment in automation pays dividends immediately. After just a few deployments, the time saved exceeded the initial development time, proving that automation is not a luxury but a necessity for efficient software delivery.
2. Drift Detection is Non-Negotiable
During the development phase, it was tempting to make "quick fixes" directly in the AWS console for testing purposes. This inevitably led to discrepancies between the Terraform state and the actual infrastructure. The drift detection pipeline (using terraform plan -detailed-exitcode) consistently caught these manual changes.
Lesson: Enforce an "infrastructure as code or it doesn't exist" policy from day one. Any change to infrastructure must go through the code repository and the CI/CD pipeline. This prevents configuration drift, ensures auditability, and maintains a single source of truth for infrastructure.
3. Infrastructure as Code Provides Documentation
The Terraform configuration files, along with the Ansible playbooks, serve as living, executable documentation of the entire infrastructure and its configuration.
- Clarity: The HCL files clearly define every AWS resource and its properties.
- Auditability: Every change to the infrastructure is tracked in Git, complete with commit messages and pull request reviews.
- Understanding: Comments within the code explain why certain decisions were made, not just what was configured, which is invaluable for new team members or for revisiting the setup months later.
4. Docker Compose Complexity Sweet Spot
For projects with a moderate number of services (e.g., less than 20), Docker Compose provides the perfect balance between simplicity and functionality. It offers container orchestration capabilities without the steep learning curve and operational overhead of more complex systems.
Alternatives considered:
- Kubernetes: While powerful, Kubernetes would have been massive overkill for a single-server deployment. Its complexity (YAML sprawl, cluster management, networking) would have significantly slowed down development without providing proportional benefits for this scale.
- Docker Swarm: Considered, but its uncertain future and less vibrant ecosystem made it a less attractive choice.
- Nomad: A strong contender for lightweight orchestration, but with less ecosystem support and community resources compared to Docker Compose for this specific use case.
5. Traefik is a Game-Changer
Traefik proved to be an exceptionally powerful and developer-friendly reverse proxy. Its Docker-native approach, which uses container labels for dynamic configuration, eliminated the configuration management complexity often associated with Nginx or HAProxy.
- Automatic SSL: The seamless integration with Let's Encrypt for automatic SSL certificate provisioning and renewal was a major time-saver and security enhancer.
- Dynamic Routing: The ability to add or remove services and have Traefik automatically update its routing rules without restarts was crucial for zero-downtime deployments.
6. GitHub Actions for Team Workflows
GitHub Actions, while perhaps not as feature-rich or flexible as some enterprise-grade CI/CD platforms (like GitLab CI or Jenkins), offers unparalleled integration with GitHub repositories.
- Ease of Use: Its YAML-based syntax is relatively easy to learn.
- Tight Integration: Direct access to GitHub events, secrets, and environments simplifies pipeline development.
- Community Actions: A vast marketplace of pre-built actions accelerates workflow creation.
For smaller teams or projects already hosted on GitHub, it provides a highly effective and convenient CI/CD solution without the need for managing a separate CI/CD server.
[CHALLENGES ENCOUNTERED AND SOLUTIONS]
Building a robust CI/CD pipeline often involves overcoming several technical hurdles. Here are some key challenges faced during this project and their respective solutions.
Challenge 1: SSH Key Management in CI/CD
Problem: GitHub Actions runners are ephemeral, meaning they are provisioned fresh for each job. For Terraform to provision an EC2 instance with an SSH public key, and for Ansible to connect to that instance using the corresponding private key, these keys needed to be available as files on the runner during workflow execution. Storing them directly in the repository is a security anti-pattern.
Solution Implemented:
The public and private SSH keys were stored as encrypted GitHub Secrets (SSH_PUBLIC_KEY and SSH_PRIVATE_KEY). During the GitHub Actions workflow, these secrets were dynamically written to temporary files on the runner's filesystem.
```yaml
- name: Create SSH Keys for Plan
run: |
echo "${{ secrets.SSH_PUBLIC_KEY }}" > infra/terraform/deployer_key.pub
echo "${{ secrets.SSH_PRIVATE_KEY }}" > infra/terraform/deployer_key
chmod 600 infra/terraform/deployer_key # Set secure permissions for the private key
```
Key Insight: It's crucial to set appropriate file permissions (chmod 600) for the private key to prevent unauthorized access and ensure SSH clients accept it. Additionally, a defensive step was added to verify the key format:
```yaml
- name: Verify SSH Key Format
run: |
chmod 600 infra/terraform/deployer_key
ssh-keygen -l -f infra/terraform/deployer_key || \
echo "::error::SSH Private Key is invalid! Check your GitHub Secret."
```
This check helps catch issues early if the secret was incorrectly pasted or corrupted.
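For completeness, a sketch of how a dedicated key pair for the pipeline might be generated and validated locally, mirroring the workflow's verification step:

```bash
# Generate a dedicated key pair for the pipeline (no passphrase)
ssh-keygen -t rsa -b 4096 -f deployer_key -N "" -C "todo-app-ci"

# Same sanity check the workflow performs before using the key
chmod 600 deployer_key
ssh-keygen -l -f deployer_key || echo "SSH private key is invalid"
```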
Challenge 2: Terraform and Ansible Integration
Problem: Ansible needs the public IP address of the EC2 instance to connect and configure it. However, this IP address is only known after Terraform has successfully created the instance. This presents a classic "chicken-and-egg" problem in automation.
Solutions Evaluated:
- ❌ Manual Intervention: Run Terraform, then manually copy the IP to an Ansible inventory file, then run Ansible. (Completely defeats the purpose of automation).
- ❌ Terraform Provisioners with `remote-exec`: Terraform's `remote-exec` provisioners are generally considered brittle for complex configuration management; they lack the idempotency and rich module ecosystem of Ansible.
- ✅ Dynamic Ansible Inventory Generation: Use Terraform's `local_file` resource to dynamically generate the Ansible inventory file after the EC2 instance's IP is known.
Final Solution:
A local_file resource in Terraform was used to create a hosts.yml file in the Ansible inventory directory. This file uses a templatefile function to inject the aws_instance.todo_app.public_ip into the inventory.
```hcl
resource "local_file" "ansible_inventory" {
content = templatefile("${path.module}/inventory.tftpl", {
host = aws_instance.todo_app.public_ip
user = var.server_user
key = var.private_key_path
})
filename = "${path.module}/../ansible/inventory/hosts.yml"
}
```
Then, a null_resource with a local-exec provisioner was used to trigger the Ansible playbook, referencing this dynamically generated inventory file. This pattern ensures that Ansible always targets the correct, newly provisioned instance.
Challenge 3: Environment Variable Propagation
Problem: Several critical values (e.g., DOMAIN_NAME, ACME_EMAIL, JWT_SECRET) were needed at different stages of the pipeline and by different tools (Terraform, Ansible, Docker Compose, application containers). Maintaining consistency and securely passing these values was a challenge.
Solution: A "single source of truth" approach was adopted, with GitHub Secrets serving as the central repository for all sensitive and configuration values. These values were then propagated down the pipeline:

Figure 5: Secrets and environment variables flow from GitHub Secrets through the entire deployment pipeline
Data Flow:
GitHub Secrets
→ GitHub Actions (environment variables)
→ Terraform (via TF_VAR_ prefix)
→ Ansible (via --extra-vars)
→ Docker Compose (via .env file generated by Ansible)
→ Application Containers (via Docker Compose environment variables)
This ensures that values are consistent, securely managed, and injected at the appropriate stage without being hardcoded.
Challenge 4: Terraform State Locking
Problem: In a team environment, or even with multiple CI/CD jobs, concurrent terraform apply operations on the same state file can lead to state corruption, data loss, or inconsistent infrastructure.
Solution: Terraform's S3 backend was configured with DynamoDB for state locking.
```hcl
terraform {
backend "s3" {
bucket = "my-terraform-state-bucket" # Dedicated S3 bucket for state files
key = "todo-app/terraform.tfstate" # Path to the state file within the bucket
region = "us-east-1"
dynamodb_table = "terraform-state-lock" # DynamoDB table for locking
encrypt = true # Encrypt state file at rest
}
}
```
When a terraform apply is initiated, Terraform attempts to acquire a lock in the DynamoDB table. If successful, it proceeds; otherwise, it waits or fails, preventing concurrent modifications and ensuring state integrity.
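When a CI job is cancelled mid-apply, the lock can occasionally be left behind. A sketch of how to inspect and, if genuinely stale, release it (the lock ID placeholder must be replaced with the value Terraform reports in its error message):

```bash
# Inspect current lock entries
aws dynamodb scan --table-name terraform-state-lock --output table

# Release a stuck lock only after confirming no apply is actually running
terraform force-unlock <LOCK_ID>
```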
Challenge 5: Docker Build Context in GitHub Actions
Problem: Building Docker images within GitHub Actions can be slow if the entire repository is sent as the build context, especially for large repositories with many unrelated files (e.g., .git directories, node_modules, documentation).
Solution: Two primary optimizations were applied:
- `.dockerignore` files: Each service's Dockerfile directory included a `.dockerignore` file specifying patterns for files and directories to exclude from the Docker build context, for example:
  ```
  # Example .dockerignore
  .git
  node_modules
  *.md
  tests/
  ```
  This significantly reduces the amount of data sent to the Docker daemon, speeding up the build process.
- Multi-stage builds: Dockerfiles were structured using multi-stage builds to separate build-time dependencies from runtime dependencies. This results in smaller, more secure final images.
These optimizations collectively reduced Docker build times from approximately 8 minutes to less than 2 minutes, accelerating the deployment pipeline.
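The effect of these changes is easy to measure locally; the sketch below times a build with BuildKit enabled so the reported context size can be compared before and after adding the `.dockerignore`:

```bash
cd todos-api                            # any of the service directories works
du -sh .                                # rough size of the build context directory
export DOCKER_BUILDKIT=1
time docker build -t todos-api:test .   # BuildKit logs how much context it transfers
```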
[FUTURE IMPROVEMENTS AND ROADMAP]
While the current CI/CD pipeline is production-ready, there are always opportunities for enhancement and scaling. This roadmap outlines potential future improvements.
Phase 1: Multi-Environment Support (Q1 2025)
Objective: To provide isolated and consistent environments for development, staging, and production, enabling safer testing and deployment workflows.
Implementation Plan:
- Terraform Workspaces or Separate State Files: Utilize Terraform workspaces (terraform workspace new dev) or maintain separate Terraform state files for each environment.
- Environment-Specific Variable Files: Create terraform.tfvars files (e.g., dev.tfvars, staging.tfvars, prod.tfvars) to manage environment-specific configurations such as instance types, domain names, and resource tags; a sketch of such a file follows this list.
- Separate GitHub Actions Environments: Configure distinct GitHub Environments (e.g., dev, staging, production) with different protection rules (e.g., manual approval for production, no approval for dev).
- Subdomain Routing: Implement subdomain-based routing (e.g., dev.example.com, staging.example.com, app.example.com) to access different environments.
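As a minimal sketch of what such an environment-specific variable file might contain (variable names and values here are illustrative, not the project's actual configuration):
```hcl
# dev.tfvars -- illustrative values only
environment   = "dev"
instance_type = "t3.small"
domain_name   = "dev.example.com"

default_tags = {
  Project     = "todo-app"
  Environment = "dev"
}
```
The pipeline for each environment would then select its workspace (terraform workspace select dev) and run terraform apply -var-file=dev.tfvars, while the matching GitHub Environment enforces the approval rules.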
Phase 2: Auto-Scaling (Q2 2025)
Objective: To automatically adjust compute capacity based on demand, ensuring application availability and cost efficiency.
Components:
- AWS Auto Scaling Groups (ASG): Replace the single EC2 instance with an ASG to manage a fleet of instances.
- Application Load Balancer (ALB): Introduce an ALB in front of the ASG to distribute incoming traffic and replace the single-instance Traefik as the primary entry point. Traefik would then run on each instance behind the ALB.
- CloudWatch Alarms: Configure CloudWatch alarms to trigger scaling policies based on metrics like CPU utilization, request count per target, or custom metrics.
- Shared Persistent Storage: For Traefik certificates and other shared data, consider using Amazon EFS or an S3 bucket mounted via FUSE, ensuring state is synchronized across instances.
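A rough Terraform sketch of the scaling pieces described above (resource arguments are simplified, and the referenced variables and aws_lb_target_group are assumed to be defined elsewhere):
```hcl
# Rough sketch -- names and values are hypothetical, not the project's config.

resource "aws_launch_template" "app" {
  name_prefix   = "todo-app-"
  image_id      = var.ami_id
  instance_type = var.instance_type
  user_data     = filebase64("${path.module}/user_data.sh") # bootstrap Docker/Traefik
}

resource "aws_autoscaling_group" "app" {
  min_size            = 2
  max_size            = 6
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = [aws_lb_target_group.app.arn] # assumes ALB target group defined elsewhere

  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
}

resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "cpu-target-tracking"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 60 # scale out/in to keep average CPU around 60%
  }
}
```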
Phase 3: Database Persistence (Q2 2025)
Objective: To move from in-memory data stores to durable, managed database services, ensuring data integrity and persistence.
Services to Add:
- Amazon RDS (PostgreSQL): For relational data storage, replacing any in-memory databases.
- Amazon ElastiCache (Redis): For distributed caching and message queuing, providing a managed, highly available Redis instance.
- Database Migration Management: Integrate tools like Flyway or Liquibase into the CI/CD pipeline to manage database schema changes automatically.
- Automated Backups and Point-in-Time Recovery: Configure RDS and ElastiCache for automated backups and enable point-in-time recovery.
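A rough sketch of the managed data stores in Terraform (sizes, versions, and names are illustrative, and networking details such as subnet groups are omitted):
```hcl
# Rough sketch -- parameters are illustrative, not a tuned production setup.

resource "aws_db_instance" "todos" {
  identifier              = "todo-app-postgres"
  engine                  = "postgres"
  engine_version          = "16.3"
  instance_class          = "db.t3.micro"
  allocated_storage       = 20
  db_name                 = "todos"
  username                = var.db_username
  password                = var.db_password # injected via TF_VAR_, never hardcoded
  storage_encrypted       = true
  backup_retention_period = 7               # enables automated backups and PITR
  skip_final_snapshot     = false
}

resource "aws_elasticache_cluster" "queue" {
  cluster_id      = "todo-app-redis"
  engine          = "redis"
  node_type       = "cache.t3.micro"
  num_cache_nodes = 1
  port            = 6379
}
```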
Phase 4: Comprehensive Monitoring (Q3 2025)
Objective: To implement a robust monitoring and alerting solution for proactive issue detection and performance analysis.
Stack:
- Prometheus: For collecting time-series metrics from application services, Docker containers, and the host system.
- Grafana: For creating interactive dashboards to visualize metrics and gain insights into system health and performance.
- AlertManager: For intelligent routing and deduplication of alerts generated by Prometheus.
- CloudWatch Integration: Integrate with AWS CloudWatch for monitoring AWS service health and infrastructure metrics.
Key Metrics to Track:
- Request latency (p50, p95, p99 percentiles)
- Error rates by service and endpoint
- Container resource utilization (CPU, memory, disk I/O)
- Network traffic and connection counts
- Application-specific business metrics
Phase 5: Blue-Green Deployments (Q3 2025)
Objective: To achieve zero-downtime deployments with instant rollback capabilities, minimizing user impact during updates.
Implementation:
- Two Identical Environments: Maintain two identical production environments (e.g., "Blue" and "Green").
- Traffic Switching: Use the Application Load Balancer to switch traffic instantly from the old (Blue) environment to the new (Green) environment after successful deployment and health checks.
- Automated Health Checks: Implement comprehensive health checks for the new environment before traffic is shifted.
- One-Click Rollback: In case of issues in the Green environment, traffic can be instantly switched back to the stable Blue environment.
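One way to express the traffic switch in Terraform is an ALB listener with weighted forwarding between blue and green target groups; the sketch below assumes those target groups, the load balancer, and the certificate are defined elsewhere:
```hcl
# Sketch of the ALB-based traffic switch only; blue/green target groups and
# their Auto Scaling Groups are assumed to exist.

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.app.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = var.certificate_arn

  default_action {
    type = "forward"

    forward {
      target_group {
        arn    = aws_lb_target_group.blue.arn
        weight = var.blue_weight  # e.g., 100 before cutover, 0 after
      }
      target_group {
        arn    = aws_lb_target_group.green.arn
        weight = var.green_weight # e.g., 0 before cutover, 100 after
      }
    }
  }
}
```
Flipping the weight variables back to their previous values restores the Blue environment, which is the one-click rollback described above.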
Phase 6: Security Hardening (Q4 2025)
Additional Measures:
- AWS WAF (Web Application Firewall): Deploy WAF in front of the ALB to protect against common web exploits (e.g., SQL injection, cross-site scripting).
- Amazon GuardDuty: Enable GuardDuty for intelligent threat detection and continuous monitoring of AWS accounts for malicious activity.
- Secrets Rotation Automation: Implement automated rotation of all secrets (AWS credentials, SSH keys, database passwords) using AWS Secrets Manager or similar tools.
- Encrypted Volume Storage: Ensure all EBS volumes attached to EC2 instances are encrypted at rest.
- Regular Penetration Testing: Schedule periodic penetration tests and security audits to identify and remediate vulnerabilities.
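Two of these measures map directly onto small Terraform resources; a minimal sketch (WAF rules and secrets rotation involve more moving parts and are omitted here):
```hcl
# Minimal sketch of account/region-level hardening.

# Encrypt all newly created EBS volumes in this region by default.
resource "aws_ebs_encryption_by_default" "this" {
  enabled = true
}

# Enable GuardDuty threat detection for the account.
resource "aws_guardduty_detector" "this" {
  enable = true
}
```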
[CONCLUSION AND FINAL THOUGHTS]
Building this comprehensive CI/CD pipeline demonstrated that modern DevOps is not merely about adopting trendy tools—it's about creating reliable, repeatable, and auditable processes that empower development teams to move quickly while maintaining production stability.
The combination of:
- Terraform for declarative infrastructure provisioning,
- GitHub Actions for robust CI/CD orchestration,
- Ansible for idempotent configuration management,
- Docker for consistent containerization, and
- Traefik for dynamic routing and automatic SSL,

creates a powerful, production-ready deployment platform that can scale from prototype to production. The entire technology stack can be deployed with a single git push, yet includes sophisticated safety mechanisms like drift detection, manual approval gates, and automated rollback capabilities.
Key Success Factors:
- Declarative Infrastructure: Terraform's declarative approach makes infrastructure changes reviewable, testable, and version-controlled.
- Immutable Deployments: Containers ensure consistent behavior across environments, reducing "it works on my machine" issues.
- Automated Testing: CI/CD pipelines catch issues early, preventing them from reaching production.
- Observability: Distributed tracing provides critical visibility into complex microservices interactions, aiding in debugging and performance optimization.
- Security by Default: Encrypted secrets, least-privilege IAM roles, automated SSL, and robust firewall rules establish a strong security posture.
This architecture serves as a template for modern application deployment, demonstrating that enterprise-grade automation is accessible to small teams and individual developers. The patterns established here scale effectively from single-server deployments to multi-region, highly available infrastructures.
[ADDITIONAL RESOURCES AND REFERENCES]
Project Repository:
Official Documentation:
- Terraform AWS Provider: https://registry.terraform.io/providers/hashicorp/aws/latest/docs
- Ansible Best Practices: https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html
- Traefik v3 Documentation: https://doc.traefik.io/traefik/
- GitHub Actions Workflows: https://docs.github.com/en/actions
- Docker Compose Reference: https://docs.docker.com/compose/compose-file/
- Zipkin Architecture: https://zipkin.io/pages/architecture.html
Recommended Learning Resources:
- "Terraform: Up & Running" by Yevgeniy Brikman
- "Ansible for DevOps" by Jeff Geerling
- "The DevOps Handbook" by Gene Kim
- HashiCorp Learn (free interactive tutorials)
Questions or feedback? I'd love to discuss DevOps automation strategies, infrastructure as code patterns, or troubleshooting deployment pipelines. Drop your questions in the comments or reach out on Twitter/LinkedIn.
Found this helpful? Consider starring the project repository and sharing this guide with your team!
Tags: #DevOps #Terraform #Ansible #CICD #AWS #Docker #InfrastructureAsCode #GitHubActions #Microservices #Traefik #Automation #CloudComputing #ContainerOrchestration #SRE



