DEV Community

Samir Khanal

How I Built a Canary Deployment CI/CD Pipeline with GitHub Actions and AWS EC2

In this post, I'll walk you through how I built a production-ready pipeline using GitHub Actions, AWS EC2, Docker, Nginx, Trivy, and Terraform — with a canary deployment strategy at the core of it all.


The Problem With "Just Deploy It"

Most early-stage projects deploy by pushing directly to production. It works — until it doesn't. One bad commit and every single user hits the bug at the same time.

The alternative? Deploy smarter. Instead of flipping a switch for all your users at once, you test your new release on a small slice of real traffic first. That's the idea behind canary deployments.


What Is a Canary Deployment?

The name comes from the old mining practice of using canaries to detect dangerous gases — an early warning system. In software, it means deploying a new version of your app to a small percentage of users while the stable version continues serving the rest.

Here's the basic idea:

  • Your stable version keeps running normally
  • A new "canary" version is deployed alongside it
  • A small portion of traffic (say, 10%) gets routed to the canary
  • You monitor it — error rates, response times, health checks
  • If everything looks good, you gradually increase traffic: 25%, 50%, 100%
  • If something goes wrong, you roll back instantly

It's one of the safest ways to ship new features without gambling your entire user base on every release.


The Pipeline Overview

My goal was to build something that handles the full deployment lifecycle automatically — from the moment code is pushed to when it's live in production.

Here's what the pipeline does in order:

  1. Builds a Docker image from the latest code
  2. Scans the image for vulnerabilities using Trivy
  3. Pushes the image to Amazon ECR
  4. Deploys the application to EC2
  5. Starts the canary deployment alongside the stable version
  6. Monitors health at each traffic stage
  7. Shifts traffic gradually or rolls back if something fails

All of this is triggered automatically by a push to the repo. No manual steps, no SSH-ing into servers by hand.
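At a high level, the workflow skeleton looks something like this. This is a hedged sketch, not the repo's actual file: the job and step names, branch, and image tag are assumptions for illustration.

```yaml
name: canary-deploy

on:
  push:
    branches: [main]

jobs:
  build-scan-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image
        run: docker build -t "my-app:${GITHUB_SHA::7}" .
      # ...Trivy scan, ECR login, and docker push steps go here...

  deploy:
    needs: build-scan-push
    runs-on: ubuntu-latest
    steps:
      # ...connect to EC2, start the canary container,
      # shift traffic in stages, and run health checks...
      - name: Deploy canary
        run: echo "deploy step placeholder"
```

Splitting build and deploy into separate jobs means a failed build or scan blocks the deploy job entirely.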

Docker and Amazon ECR

Every deployment is packaged as a Docker image. This was a deliberate choice — containers give you consistent environments across dev, staging, and production, and rolling back is as simple as pulling the previous image.

Each time the pipeline runs, a new image is built, tagged, and pushed to Amazon ECR (Elastic Container Registry). The EC2 instance then pulls the latest image and starts the new container. It's clean, repeatable, and removes the classic "works on my machine" problem entirely.
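The tag-and-push step boils down to constructing the full ECR image URI and handing it to Docker. A minimal sketch, where the account ID, region, and repository name are placeholders rather than values from the actual project:

```shell
#!/bin/sh
# Build the fully qualified ECR image URI for a given tag.
image_uri() {
  account_id="$1"; region="$2"; repo="$3"; tag="$4"
  echo "${account_id}.dkr.ecr.${region}.amazonaws.com/${repo}:${tag}"
}

# The short git SHA makes a convenient, unique tag for each pipeline run:
uri="$(image_uri 123456789012 us-east-1 canary-app abc1234)"
echo "$uri"

# The real push then needs Docker and the AWS CLI:
#   aws ecr get-login-password --region us-east-1 \
#     | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
#   docker build -t "$uri" . && docker push "$uri"
```

Tagging by commit SHA also makes rollback trivial: the previous image is always addressable by the previous commit.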


Security Scanning With Trivy

Before any image gets deployed, Trivy runs a full vulnerability scan on it. This is integrated directly into the pipeline — not as an afterthought.

Trivy checks for:

  • High and critical CVEs in the base image and dependencies
  • Outdated packages with known exploits
  • Misconfigurations

If critical vulnerabilities are found, the pipeline stops and the image never gets deployed. This shift-left security practice catches issues early, before they ever reach production.
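In GitHub Actions, a scan-and-gate step like this can be written with the official Trivy action; the image reference below is a placeholder:

```yaml
- name: Scan image with Trivy
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: my-app:latest   # placeholder: the freshly built image
    severity: HIGH,CRITICAL
    exit-code: '1'             # non-zero exit fails the job on findings
    ignore-unfixed: true       # skip CVEs with no released fix yet
```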


The Canary Strategy in Practice

On the EC2 instance, two versions of the application run simultaneously:

  • The stable version on one port
  • The canary version on another port

Nginx sits in front of both and handles the traffic split. Initially, it's configured to send 90% of requests to the stable version and 10% to the canary. Then, as health checks pass, the split shifts:

Stage     Stable   Canary
Start     90%      10%
Stage 2   75%      25%
Stage 3   50%      50%
Final     0%       100%

Each stage waits for health checks to pass before progressing. If anything fails at any point, the deployment stops and traffic returns to the stable version.


Traffic Management With Nginx

Nginx is doing the heavy lifting here as the reverse proxy. The traffic weights are updated dynamically as the deployment progresses. At 10%, most users don't even notice. By the time you hit 50/50, you've already validated the new release under real production conditions.
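One straightforward way to express a 90/10 split in Nginx is a weighted upstream. My exact config may differ, but a minimal sketch (assuming stable on port 8080 and the canary on 8081) looks like this:

```nginx
upstream app {
    server 127.0.0.1:8080 weight=9;   # stable: ~90% of requests
    server 127.0.0.1:8081 weight=1;   # canary: ~10% of requests
}

server {
    listen 80;
    location / {
        proxy_pass http://app;
    }
}
```

Shifting traffic then means rewriting the weights and running `nginx -s reload`, which swaps in the new config without dropping in-flight connections.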

Managing this dynamically — updating Nginx config, reloading it, and verifying the changes — was one of the more interesting engineering challenges. Getting the automation right took a few iterations, but the result is a clean, reliable traffic shifting mechanism.
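The staged shift itself is just a loop. Here's a sketch of the control flow, with `set_weight` and `healthy` as hypothetical stand-ins: the real versions rewrite the Nginx weights and reload, and probe the canary's health endpoint.

```shell
#!/bin/sh
set_weight() { echo "canary weight -> ${1}%"; }   # stand-in: would edit nginx.conf + reload
healthy()    { return 0; }                        # stand-in: would curl the canary health URL

for weight in 10 25 50 100; do
  set_weight "$weight"
  # (soak period under the new split elided)
  if ! healthy; then
    set_weight 0                # send everything back to stable
    echo "rollback"
    exit 1
  fi
done
echo "promoted"
```

Because each stage re-checks health before moving on, a regression that only shows up under heavier load still gets caught before full promotion.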


Health Monitoring and Rollback

After each traffic shift, the pipeline runs health checks against the canary:

  • Is the container still running?
  • Is the app returning successful HTTP responses?
  • Are there any error spikes?

If something looks off, the pipeline doesn't wait around. It marks the deployment as unhealthy, stops the traffic shift, and triggers a rollback. The stable version takes over again without any user-facing downtime.
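The rollback decision itself is a simple threshold: some number of consecutive failed checks flips the deployment to unhealthy. A sketch of that logic, with `check` as a hypothetical stand-in for the real HTTP probe (here it always fails, to show the rollback path):

```shell
#!/bin/sh
MAX_FAILS=3
check() { return 1; }           # stand-in: would be e.g. curl -fsS against the canary port

failures=0
for attempt in 1 2 3; do
  if check; then
    failures=0                  # a success resets the streak
  else
    failures=$((failures + 1))
  fi
done

if [ "$failures" -ge "$MAX_FAILS" ]; then
  echo "ROLLBACK"
else
  echo "PROMOTE"
fi
```

Counting consecutive failures rather than single ones keeps one flaky request from tearing down an otherwise healthy release.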

Having automated rollback as a first-class citizen in the pipeline — not something you script in a panic after things go wrong — made a huge difference in how confident I felt pushing new releases.


Slack Notifications

No one wants to babysit a deployment dashboard. So I set up Slack notifications at every meaningful stage:

  • Deployment started
  • Canary promoted (traffic shift in progress)
  • Deployment succeeded
  • Rollback triggered (and why)

The team gets real-time visibility without having to check GitHub Actions manually. It also creates a passive audit trail — you can scroll back through Slack and see exactly when each release went out and whether it succeeded.
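The notification itself is just a POST to a Slack incoming webhook. A minimal sketch; the helper name and message text are illustrative:

```shell
#!/bin/sh
# Build the JSON body for a Slack incoming webhook message.
notify_payload() { printf '{"text":"%s"}' "$1"; }

notify_payload "Canary deployment started"
echo

# The pipeline would then send it (SLACK_WEBHOOK_URL held as a GitHub secret):
#   curl -X POST -H 'Content-Type: application/json' \
#        -d "$(notify_payload "Deployment succeeded")" "$SLACK_WEBHOOK_URL"
```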


Infrastructure With Terraform

The EC2 instances, networking, IAM roles, and ECR repository are all managed through Terraform. Infrastructure as code means the entire setup is version controlled, reproducible, and easy to tear down or replicate in a new environment.

For authentication, I used OIDC federation between GitHub Actions and AWS IAM. This means no long-lived AWS access keys stored in GitHub secrets. Instead, GitHub gets a short-lived token scoped to exactly what it needs. It's a cleaner, more secure approach that more people should be using.
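In the workflow, OIDC auth comes down to granting the job an ID token and using the official credentials action; the role ARN and region below are placeholders:

```yaml
permissions:
  id-token: write      # lets the job request an OIDC token from GitHub
  contents: read

steps:
  - name: Configure AWS credentials via OIDC
    uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
      aws-region: us-east-1
```

The IAM role's trust policy restricts which repository (and optionally which branch) can assume it, so the short-lived credentials are scoped on both sides.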


Challenges and Lessons Learned

This wasn't a smooth build from day one. A few things that tripped me up:

Dynamic traffic routing

Updating Nginx weights and reloading mid-deployment required careful scripting to avoid race conditions or partial states.

Rollback timing

Deciding when to trigger a rollback (how many failed health checks? how long to wait?) involved some trial and error before landing on something reliable.

Container lifecycle management

Keeping old containers around long enough for a safe rollback while cleaning them up after a successful promotion needed a thoughtful cleanup strategy.

Each of these pushed me to think more carefully about what "production-safe" actually means in practice — not just in theory.


Final Thoughts

Building this pipeline gave me a much deeper appreciation for why canary deployments are so common in modern cloud environments. It's not just about being cautious — it's about having confidence that you can ship fast and still catch problems before they become incidents.

The combination of GitHub Actions for orchestration, Docker for packaging, Nginx for traffic control, Trivy for security, and Terraform for infrastructure gives you a solid, production-grade foundation. And once it's all wired together, deploying a new version feels a lot less like rolling the dice.

If you're still deploying directly to production on every push, this might be a good weekend project to change that.

GitHub link: https://github.com/Shawmeer/Canary-pipeline
Check out more blogs at https://khanalsamir.com
