Filbert Nana Blessing

How I Built a Production-Grade DevOps Project From Scratch

A walkthrough of building a real CI/CD pipeline, AWS infrastructure, and containerised app — the way it's actually done in production.


Why I Built This

Most DevOps tutorials show you how to deploy a "Hello World" app to a single EC2 instance with hardcoded AWS keys. That's fine for learning the basics, but it doesn't reflect what production engineering actually looks like.

I wanted to build something I could point to and say — this is how I would do it at a real company. No shortcuts, no tutorial hand-holding.

The result: a fully automated pipeline that takes code from a git push to a live HTTPS endpoint on AWS, with security scanning, infrastructure as code, observability, and zero static credentials anywhere.

Live URL: https://tasks.therealblessing.com
GitHub: github.com/nanafilbert/cicd-aws-terraform-deploy


What I Built

A Node.js task manager API with a Kanban dashboard UI, deployed to AWS behind an Application Load Balancer with:

  • 8-stage GitHub Actions CI/CD pipeline
  • OIDC keyless AWS authentication — no static credentials
  • Modular Terraform — VPC, ALB, ASG, EC2, all in reusable modules
  • Multi-stage Docker build — tests run inside the build, broken images can't be pushed
  • Trivy CVE scanning — pipeline fails on HIGH/CRITICAL vulnerabilities
  • ACM SSL certificate with custom domain
  • Prometheus + Grafana observability stack locally
  • ASG instance refresh for zero-downtime deployments

The Architecture

```
GitHub Actions (OIDC) → Docker Hub → AWS
                                      │
                              ALB (HTTPS:443)
                              ACM Certificate
                              HTTP → HTTPS redirect
                                      │
                          Auto Scaling Group (EC2 t3.small)
                                      │
                              Docker Container
                              Node.js API :3000
```

The pipeline authenticates to AWS using OIDC short-lived tokens — no AWS_ACCESS_KEY_ID or AWS_SECRET_ACCESS_KEY anywhere. Every deploy triggers an ASG instance refresh that replaces EC2 instances with fresh ones pulling the latest image.


The CI/CD Pipeline

Eight stages, every one intentional:

1. Lint

ESLint checks code quality. Fails fast before any expensive steps.

2. Test

Jest runs 19 integration tests with coverage enforced at 80%. If tests fail, nothing gets built or deployed.
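A coverage gate like that is typically enforced in the Jest config itself, so the run fails if coverage drops below the bar. A minimal sketch (the repo's exact thresholds may differ):

```javascript
// jest.config.js: fail the test run if global coverage drops below 80%
module.exports = {
  collectCoverage: true,
  coverageThreshold: {
    global: {
      branches: 80,
      functions: 80,
      lines: 80,
      statements: 80,
    },
  },
};
```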

3. Security Scan

Trivy scans the filesystem for known CVEs in dependencies. Fails the pipeline on any HIGH or CRITICAL unfixed vulnerability. This caught real issues during development — Alpine CVEs and npm transitive dependency vulnerabilities that needed pinning.
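In GitHub Actions this gate is commonly wired up with the official Trivy action; a sketch along these lines (step name and version pin are illustrative, not copied from the repo):

```yaml
- name: Trivy filesystem scan
  uses: aquasecurity/trivy-action@master
  with:
    scan-type: fs
    scan-ref: .
    severity: HIGH,CRITICAL
    exit-code: "1"   # non-zero exit fails the job on findings
```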

4. Build & Push

Multi-stage Docker build. Three stages:

  • deps — installs only production dependencies
  • test — runs Jest inside the build process
  • production — minimal Alpine 3.21, non-root user, only what's needed to run

The test stage is critical. If your tests fail, the image doesn't get built. You physically cannot push a broken image.
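A multi-stage Dockerfile following that pattern might look roughly like this (stage names match the list above; base image tags and paths are illustrative):

```dockerfile
# Stage 1: production dependencies only
FROM node:22-alpine3.21 AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev

# Stage 2: run the test suite inside the build; a failing test aborts the build
FROM node:22-alpine3.21 AS test
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm test

# Stage 3: minimal runtime image, non-root user
FROM node:22-alpine3.21 AS production
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY src ./src
USER node
EXPOSE 3000
CMD ["node", "src/app.js"]
```

One caveat: with BuildKit, stages not on the final stage's dependency chain are skipped, so pipelines usually build the test stage explicitly (`docker build --target test .`) before building `production`.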

5. Terraform Plan

terraform plan runs and saves the plan as an artifact. This is what gets applied in the next stage — no variables re-injected, no drift between plan and apply.
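The plan-as-artifact handoff between the two jobs can be sketched like this (artifact name and flags are illustrative):

```yaml
# plan job
- run: terraform plan -input=false -out=tfplan
- uses: actions/upload-artifact@v4
  with:
    name: tfplan
    path: tfplan

# deploy job
- uses: actions/download-artifact@v4
  with:
    name: tfplan
- run: terraform apply -input=false tfplan
```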

6. Deploy

terraform apply using the saved plan. Followed immediately by an explicit ASG instance refresh:

```shell
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name "$ASG_NAME" \
  --preferences '{"MinHealthyPercentage": 50, "InstanceWarmup": 60}'
```

This is what actually gets new code onto the instances. Without triggering the refresh, the ASG would keep running the old image indefinitely.

7. Smoke Test

Polls /health/ready for up to 6 minutes after deploy. If the app never becomes healthy, the pipeline fails and you know immediately.

8. Summary

A pass/fail table written to the GitHub Actions job summary. Clean, visible at a glance.


The Terraform Setup

Three independent modules:

networking — VPC, public subnets across two AZs, internet gateway, route tables.

security — Security groups. The ALB accepts traffic from anywhere on 80 and 443. The app security group only accepts traffic from the ALB security group — EC2 instances are never directly reachable from the internet.

compute — ALB with HTTP redirect to HTTPS and an HTTPS listener with ACM certificate, launch template with IMDSv2 required, ASG with rolling instance refresh, IAM role for EC2 with SSM access.
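The ALB-only ingress rule described above boils down to a security group rule whose source is the ALB's security group rather than a CIDR block; roughly (resource names and the app port are illustrative):

```hcl
# App instances accept traffic only from the ALB's security group
resource "aws_security_group_rule" "app_from_alb" {
  type                     = "ingress"
  from_port                = 3000
  to_port                  = 3000
  protocol                 = "tcp"
  security_group_id        = aws_security_group.app.id
  source_security_group_id = aws_security_group.alb.id
}
```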

A bootstrap/ folder handles the one-time setup that must exist before the pipeline can run — the OIDC provider, IAM role, S3 state bucket, and DynamoDB lock table.

Remote state in S3 with DynamoDB locking means the pipeline and local Terraform commands never conflict.
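That remote-state setup corresponds to an S3 backend block along these lines (bucket, key, region, and table names are placeholders, not the real ones):

```hcl
terraform {
  backend "s3" {
    bucket         = "example-terraform-state" # placeholder
    key            = "app/terraform.tfstate"   # placeholder
    region         = "eu-west-1"               # placeholder
    dynamodb_table = "example-terraform-locks" # placeholder
    encrypt        = true
  }
}
```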


OIDC — The Right Way to Authenticate

This was the most important decision in the project.

The traditional approach is to create an IAM user, generate access keys, and store them as GitHub secrets. This works but creates long-lived credentials that can be leaked, rotated incorrectly, or forgotten.

OIDC works differently. GitHub Actions requests a short-lived token from GitHub's OIDC provider. AWS verifies that token against a trust policy and issues temporary credentials. The whole exchange happens in seconds and the credentials expire when the job ends.

```yaml
permissions:
  id-token: write
  contents: read
```

```yaml
- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
    aws-region: ${{ secrets.AWS_REGION }}
```

The trust policy on the IAM role restricts assumption to this specific GitHub repo only:

```json
"Condition": {
  "StringLike": {
    "token.actions.githubusercontent.com:sub":
      "repo:nanafilbert/cicd-aws-terraform-deploy:*"
  }
}
```

No static credentials. Nothing to rotate. Nothing to leak.


Bugs I Hit (And What They Taught Me)

The Docker permission bug — The production container was crashing with Cannot find module '/app/src/app.js' even though the file clearly existed in the image. Took a while to figure out: I had set chmod -R 550 on the app directory, which grants read and execute only to the owner and group. The container's non-root user was neither, so it had no access at all, and without execute permission on a directory a process can't traverse the path to the files inside it. Changing to 755 gave that user read and execute access, and it worked immediately. The lesson: file permission bugs are silent and confusing, so always verify what your non-root user can actually access.

The HSTS loop — After adding HTTPS, all API calls from the browser were being upgraded to HTTPS even when I explicitly typed http://. Helmet's default configuration sets a Strict-Transport-Security header, which tells browsers to remember to always use HTTPS for this origin. Even clearing the cache wasn't enough — had to explicitly clear the HSTS policy in Chrome's chrome://net-internals/#hsts and disable the header in Helmet for the HTTP-only ALB endpoint.
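Turning the header off in Helmet is a one-line configuration change; a sketch of the fix (assuming Helmet's object-style options):

```javascript
const helmet = require("helmet");

// Disable Strict-Transport-Security for an endpoint still served over HTTP;
// all other Helmet protections stay enabled.
app.use(helmet({ hsts: false }));
```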

The instance refresh gap — After every deploy the new Docker image was pushed to Docker Hub, but the EC2 instance kept running the old one. Terraform saw no infrastructure changes so it didn't replace anything. The fix was to explicitly trigger an ASG instance refresh in the pipeline after every apply. Without that step, automation is an illusion — you're just pushing images that never get deployed.

The Terraform state lock — A failed pipeline run left a lock on the state file. Subsequent runs couldn't acquire the lock and failed immediately. Learned that terraform force-unlock -force <lock-id> from the correct working directory resolves this, and added auto-unlock logic to the plan job for future failures.


Observability

The app exposes Prometheus metrics via prom-client:

```javascript
const promClient = require("prom-client");
const register = promClient.register; // default global registry
promClient.collectDefaultMetrics({ register });

app.get("/health/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.send(await register.metrics());
});
```

Locally, docker-compose up starts the full stack — app, nginx, Prometheus, and Grafana. Prometheus scrapes /health/metrics every 15 seconds. Grafana visualizes CPU usage, heap memory, event loop lag, and active handles in real time.
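On the Prometheus side, that scrape setup corresponds to a prometheus.yml job roughly like this (job name and target are illustrative; in docker-compose the target is the app's service name):

```yaml
scrape_configs:
  - job_name: "task-api"
    scrape_interval: 15s
    metrics_path: /health/metrics
    static_configs:
      - targets: ["app:3000"]
```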

Running a load test against the local API makes the graphs spike visibly — useful for demonstrating the observability story to anyone reviewing the project.


What I Would Add Next

  • RDS PostgreSQL — tasks currently live in memory and reset on deploy. A real database would make this production-ready in a deeper sense.
  • CloudWatch alarms — alert on unhealthy host count and high CPU before users notice.
  • WAF — Web Application Firewall in front of the ALB for rate limiting and bot protection at the infrastructure level.

Final Thoughts

The most valuable part of this project wasn't the technology — it was the debugging. Every bug I hit taught me something real: how file permissions work in containers, how browsers cache security policies, how Terraform state locking works, how ASG instance refresh interacts with deploy automation.

That's the difference between following a tutorial and building something yourself. The tutorial gives you the happy path. Building it yourself gives you everything else.

If you're building a DevOps portfolio, don't copy a tutorial. Pick a problem, build something real, and let it break. That's where the learning actually happens.


The full source code is at github.com/nanafilbert/cicd-aws-terraform-deploy and the live app is running at https://tasks.therealblessing.com.

