Oluwagbade Odimayo

From Minikube to AWS EKS: How I Built a Zero-Downtime Blue-Green Deployment Pipeline for ShopSwift

I built ShopSwift, a Node.js/Express e-commerce API, and wrapped it in a production-grade blue-green deployment pipeline: Docker, Kubernetes, Minikube local validation, NGINX Ingress, GitHub Actions CI, AWS EKS, Amazon ECR, and Prometheus + Grafana monitoring. The final design recorded zero failed requests across every switch and rollback. Here is exactly how I did it - including the architecture mistake that caused a 503, and the fix that made it truly zero-downtime.


The Real Problem With Shipping Software

Releasing code is where theory meets reality.

A feature can pass every local test, build cleanly in CI, and still fail the moment real traffic touches it. When it does, the question is not what broke - it is how quickly you can recover without taking users down with you.

Traditional rolling deployments reduce this risk but do not eliminate it. During a rollout, old and new code can run simultaneously, creating version skew. If the new version is bad, rollback means redeploying the old one - which takes time users will feel.

Blue-green deployment takes a different approach. Two environments run in parallel. One is live. The other is where the new release lands. Traffic switches only after validation. Rollback is a routing change, not a redeployment.

The question I wanted to answer with this project was practical:

Can I build a blue-green pipeline that delivers genuinely zero failed requests through a traffic switch, a rollback, and a simulated broken release - locally and in the cloud?

The answer is yes. But the path had a real failure in it. That failure made the project better.

Repository: github.com/gbadedata/shopswift-blue-green


What This Project Covers

| Phase | What was built |
|-------|----------------|
| 1 | Node.js + Express API with Jest tests |
| 2 | Docker image with environment-based versioning |
| 3 | Git history and GitHub baseline |
| 4 | Minikube blue baseline with NGINX Ingress |
| 5 | Green deployment + traffic switch (with a 503 failure and fix) |
| 6 | Zero-downtime rollback from Green back to Blue |
| 7 | Broken release simulation with readiness probe protection |
| 8 | GitHub Actions CI (tests, Docker build, Trivy, kubeconform) |
| 9 | AWS EKS cloud deployment with Amazon ECR |
| 10 | Prometheus + Grafana monitoring stack |
| - | AWS teardown and cost sweep |

The Application: ShopSwift

ShopSwift is a small Node.js and Express e-commerce API. The application itself is intentionally simple. The deployment system around it is not.

| Endpoint | Purpose |
|----------|---------|
| / | Landing |
| /health | Liveness probe |
| /ready | Readiness probe |
| /version | Active version and environment |
| /products | Simulated catalogue |
| /products/:id | Product detail |
| /cart | Simulated cart |
| /checkout | Simulated checkout |
| /metrics | Prometheus metrics |

The /version endpoint was the most important one for deployment validation. It returned the running version and environment label, making it trivial to confirm exactly which environment was serving traffic at any moment:

// Blue environment
{
  "app": "ShopSwift",
  "version": "v1.0.0",
  "environment": "blue",
  "commit": "aws-blue",
  "port": 3000,
  "status": "running"
}
// Green environment
{
  "app": "ShopSwift",
  "version": "v2.0.0",
  "environment": "green",
  "commit": "aws-green",
  "port": 3000,
  "status": "running"
}
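A minimal sketch of how a payload like the ones above could be assembled from environment variables. This is an illustration, not the repository's actual handler: `APP_VERSION` and `APP_ENV` appear in the article, while `APP_COMMIT`, `PORT`, and the function name `buildVersionInfo` are assumed names.

```javascript
// Hypothetical helper: builds a /version payload from environment variables.
// APP_VERSION and APP_ENV come from the article; APP_COMMIT and PORT are assumed.
function buildVersionInfo(env) {
  return {
    app: "ShopSwift",
    version: env.APP_VERSION || "unknown",
    environment: env.APP_ENV || "unknown",
    commit: env.APP_COMMIT || "unknown",
    port: Number(env.PORT) || 3000,
    status: "running",
  };
}

// Example: the values a Blue pod might be started with.
const blueInfo = buildVersionInfo({
  APP_VERSION: "v1.0.0",
  APP_ENV: "blue",
  APP_COMMIT: "aws-blue",
});
console.log(JSON.stringify(blueInfo, null, 2));
```

Keeping the payload builder a pure function of the environment makes it easy to unit test without starting the server.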

Technology Stack

| Layer | Tool |
|-------|------|
| Application | Node.js, Express |
| Testing | Jest, Supertest |
| Containerization | Docker |
| Local Kubernetes | Minikube |
| Cloud Kubernetes | AWS EKS |
| Container Registry | Amazon ECR |
| Ingress | NGINX Ingress Controller |
| CI/CD | GitHub Actions |
| Security Scanning | Trivy |
| Manifest Validation | kubeconform |
| Monitoring | Prometheus, Grafana |
| Cloud Tooling | AWS CLI, eksctl |
| Package Management | Helm |

Architecture: The Final Routing Model

After testing and refinement (including one important failure), the traffic routing model settled into this:

User request
      |
AWS Load Balancer (cloud) or kubectl port-forward (local)
      |
NGINX Ingress Controller
      |
shopswift-ingress
      |
shopswift-active-service
      |
selector: environment=blue  OR  environment=green
      |
Blue pods           Green pods

The key design principle: NGINX Ingress never changes. The only thing that changes during a switch is the label selector on shopswift-active-service.

This distinction matters - and it came from a real failure. More on that below.
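To make the selector mechanics concrete, here is a simplified model of how a label selector decides which pods sit behind a Service. It is a sketch of the matching rule only, not of Kubernetes internals; the pod names are hypothetical and the `app` and `environment` labels are assumed to match the project's manifests.

```javascript
// Simplified model of Service label selection: a pod is selected only when
// every key/value pair in the selector matches the pod's labels.
function selectPods(pods, selector) {
  return pods.filter((pod) =>
    Object.entries(selector).every(([key, value]) => pod.labels[key] === value)
  );
}

// Hypothetical pod names; the labels mirror the selector keys used in this project.
const allPods = [
  { name: "shopswift-blue-1", labels: { app: "shopswift", environment: "blue" } },
  { name: "shopswift-blue-2", labels: { app: "shopswift", environment: "blue" } },
  { name: "shopswift-green-1", labels: { app: "shopswift", environment: "green" } },
];

const beforeSwitch = selectPods(allPods, { app: "shopswift", environment: "blue" });
const afterSwitch = selectPods(allPods, { app: "shopswift", environment: "green" });

console.log(beforeSwitch.map((p) => p.name)); // Blue pods only
console.log(afterSwitch.map((p) => p.name)); // Green pods only
```

Changing one value in the selector changes the whole set of pods behind the Service, which is why the Ingress itself never needs to be edited.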


Phase 1: The Application

I started with the Express API and immediately wrote tests using Jest and Supertest before touching Docker or Kubernetes.

Test Suites: 1 passed
Tests:       7 passed

This was deliberate. Kubernetes deployment should not begin with an untested application. The health, readiness, and version endpoints needed to be correct before any of the deployment logic could trust them.


Phase 2: Dockerizing ShopSwift

The Docker image was designed to support both Blue and Green from the same codebase using environment variables:

# Blue container
docker run -e APP_VERSION=v1.0.0 -e APP_ENV=blue shopswift:v1.0.0

# Green container
docker run -e APP_VERSION=v2.0.0 -e APP_ENV=green shopswift:v2.0.0

No separate codebases. No duplicated Dockerfiles. One build definition, with version and environment set at runtime.

I also wrote smoke test scripts to validate all endpoints quickly after each build - a habit that paid dividends throughout the project.

Challenge: npm ci Caught a Lockfile Mismatch

The Docker build failed at:

RUN npm ci --omit=dev

The cause: package-lock.json was out of sync with package.json.

This is precisely why npm ci exists. Unlike npm install, it treats a lockfile mismatch as a hard failure rather than silently correcting it. I regenerated the lockfile and rebuilt - and the build became reproducible.

Lesson: A reproducible build that fails loudly is better than a lenient one that silently diverges.
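The kind of drift that `npm ci` rejects can be approximated in a few lines. The sketch below only compares the dependency ranges in package.json with the root entry of the lockfile; it assumes lockfileVersion 2 or 3 (where the root package is stored under `packages[""]`) and is far less thorough than npm's own check. The function name and the version ranges in the sample data are made up for illustration.

```javascript
// Rough approximation of one consistency check behind `npm ci`: every
// dependency range in package.json should match what the lockfile recorded
// for the root package. Assumes lockfileVersion 2 or 3 (root under packages[""]).
function findLockfileMismatches(pkg, lock) {
  const root = (lock.packages && lock.packages[""]) || {};
  const mismatches = [];
  for (const field of ["dependencies", "devDependencies"]) {
    const wanted = pkg[field] || {};
    const locked = root[field] || {};
    const names = new Set([...Object.keys(wanted), ...Object.keys(locked)]);
    for (const name of names) {
      if (wanted[name] !== locked[name]) {
        mismatches.push({ field, name, packageJson: wanted[name], lockfile: locked[name] });
      }
    }
  }
  return mismatches;
}

// Illustrative data only: these version ranges are invented.
const samplePkg = {
  dependencies: { express: "^4.19.0" },
  devDependencies: { jest: "^29.7.0" },
};
const sampleLock = {
  lockfileVersion: 3,
  packages: {
    "": { dependencies: { express: "^4.18.0" }, devDependencies: { jest: "^29.7.0" } },
  },
};

console.log(findLockfileMismatches(samplePkg, sampleLock));
```

In practice the fix is the one described above: regenerate the lockfile and commit it, rather than loosening the build step.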


Phase 3: Git Baseline

After the local application and Docker image were validated, I pushed to GitHub. The commit history became part of the project evidence:

feat: build ShopSwift app and Docker baseline
feat: deploy ShopSwift blue on Minikube with NGINX Ingress
feat: implement zero-downtime switch with active service selector
feat: simulate broken green release and validate readiness protection
feat: add GitHub Actions CI pipeline
feat: deploy and validate blue-green on AWS EKS
feat: add Prometheus and Grafana monitoring
docs: rewrite README with complete deployment documentation

For a project like this, traceability is part of the deliverable.


Phase 4: Minikube Blue Baseline

Before spending time or money on AWS, I validated the full deployment architecture locally with Minikube.

Kubernetes resources deployed:

Namespace:  ecommerce-bluegreen
Deployment: shopswift-blue
Service:    shopswift-blue-service
Ingress:    shopswift-ingress

Challenge: WSL + Docker Driver = Unreliable Ingress Access

Running Minikube with the Docker driver inside WSL meant that accessing shopswift.local directly was unreliable - a known networking limitation of this environment. The app and Service were fine; the issue was local DNS and networking.

The solution was to port-forward the NGINX Ingress Controller and pass the correct Host header:

kubectl port-forward -n ingress-nginx service/ingress-nginx-controller 8080:80 &

curl -H "Host: shopswift.local" http://localhost:8080/version

This still exercised the full NGINX Ingress routing path - just without relying on local DNS resolution. It was the right tradeoff for a local validation environment.
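The same request can be made from Node.js, which is handy for scripted checks. This is a sketch using only the built-in `http` module; the helper names `ingressRequestOptions` and `fetchJson` are hypothetical, and port 8080 assumes the port-forward shown above is running.

```javascript
const http = require("node:http");

// Builds request options that hit the port-forwarded controller on localhost
// while presenting the hostname the Ingress rule expects.
function ingressRequestOptions(path, { port = 8080, host = "shopswift.local" } = {}) {
  return {
    hostname: "localhost",
    port,
    path,
    method: "GET",
    headers: { Host: host },
  };
}

// Resolves with the HTTP status code and parsed JSON body of one request.
function fetchJson(options) {
  return new Promise((resolve, reject) => {
    const req = http.request(options, (res) => {
      let body = "";
      res.setEncoding("utf8");
      res.on("data", (chunk) => (body += chunk));
      res.on("end", () => {
        try {
          resolve({ status: res.statusCode, json: JSON.parse(body) });
        } catch (err) {
          reject(err);
        }
      });
    });
    req.on("error", reject);
    req.end();
  });
}

// Usage (requires the kubectl port-forward to be running):
// fetchJson(ingressRequestOptions("/version")).then(console.log, console.error);
```

Setting the `Host` header explicitly is what lets NGINX match the Ingress rule even though the TCP connection goes to localhost.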


Phase 5: Deploying Green and Switching Traffic

With Blue stable, I deployed Green:

Deployment: shopswift-green
Service:    shopswift-green-service

Before switching traffic, I tested Green internally through its own Service. This is non-negotiable in a proper blue-green workflow - Green pods running does not mean Green is ready to serve users.

Green internal check confirmed:

{
  "app": "ShopSwift",
  "version": "v2.0.0",
  "environment": "green",
  "commit": "minikube-green",
  "port": 3000,
  "status": "running"
}
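The internal check above can also be expressed as an automated gate. The sketch below compares a /version payload against the release that is expected to be live; `checkRelease` is a hypothetical helper, not part of the repository's scripts.

```javascript
// Returns a list of reasons the candidate should NOT receive traffic yet.
// An empty list means the /version payload matches what was expected.
function checkRelease(payload, expected) {
  const problems = [];
  for (const key of ["version", "environment", "status"]) {
    if (payload[key] !== expected[key]) {
      problems.push(`${key}: expected ${expected[key]}, got ${payload[key]}`);
    }
  }
  return problems;
}

const greenPayload = {
  app: "ShopSwift",
  version: "v2.0.0",
  environment: "green",
  commit: "minikube-green",
  port: 3000,
  status: "running",
};
const expectedGreen = { version: "v2.0.0", environment: "green", status: "running" };

console.log(checkRelease(greenPayload, expectedGreen)); // []
```

A switch script could refuse to patch the selector whenever this list is non-empty.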

Both environments were now running. Time to switch traffic.


The Failure That Made This Project Better

My first switching approach was to patch the Ingress backend directly:

# Before
backend:
  service:
    name: shopswift-blue-service

# After patching
backend:
  service:
    name: shopswift-green-service

Logical. Clean-looking. But during a continuous zero-downtime test:

FAILED request: status=503
Failed requests: 1 of 26

A single 503 during a traffic switch means the design cannot honestly be called zero-downtime. I did not hide this result. I used it to understand what was happening.

The likely cause: when the Ingress backend is patched, NGINX reloads its configuration. During that reload - even briefly - upstream connections can fail. One request landed in that gap.
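The continuous test itself follows a simple rule: send requests in a loop while the switch happens and treat any non-2xx response as a failure. The repository does this with a shell script; the sketch below restates the idea in Node.js, with `probe` left as an injected function so the counting logic can be checked without a cluster. The function names and the 200 ms delay are assumptions.

```javascript
// Sends `total` sequential probes and counts every non-2xx result as a failure.
// `probe` is any async function that resolves to an HTTP status code.
async function runAvailabilityTest(probe, total, delayMs = 200) {
  const statuses = [];
  for (let i = 0; i < total; i++) {
    try {
      statuses.push(await probe());
    } catch {
      statuses.push(0); // connection errors count as failures too
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return summarize(statuses);
}

// Pure helper so the pass/fail rule is easy to test in isolation.
function summarize(statuses) {
  const failed = statuses.filter((s) => s < 200 || s >= 300).length;
  return { total: statuses.length, failed, passed: failed === 0 };
}

// Replaying the first attempt's outcome: 25 successes and one 503.
const firstAttempt = summarize([...Array(25).fill(200), 503]);
console.log(firstAttempt); // { total: 26, failed: 1, passed: false }
```

The strict rule matters: a single failed request is enough to mark the run as failed, which is exactly how the 503 was caught.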


The Fix: The Stable Active Service Pattern

Instead of touching the Ingress, I introduced a stable intermediary:

# shopswift-active-service - this never changes in Ingress
apiVersion: v1
kind: Service
metadata:
  name: shopswift-active-service
spec:
  selector:
    app: shopswift
    environment: blue  # <-- only this changes during a switch

The Ingress always points to shopswift-active-service. To switch traffic, I only patch the selector:

# Switch to Green
kubectl patch service shopswift-active-service \
  -n ecommerce-bluegreen \
  --type='merge' \
  -p '{"spec":{"selector":{"app":"shopswift","environment":"green"}}}'

# Roll back to Blue
kubectl patch service shopswift-active-service \
  -n ecommerce-bluegreen \
  --type='merge' \
  -p '{"spec":{"selector":{"app":"shopswift","environment":"blue"}}}'

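A Service selector update is a single API write. Kubernetes recomputes the Service's endpoints, and ingress-nginx picks up endpoint changes dynamically without reloading NGINX, so there is no reload gap for a request to fall into. This is the architecture improvement that made zero-downtime achievable.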
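If the switch is driven from a script rather than typed by hand, generating the patch body avoids quoting mistakes. A minimal sketch, assuming the Service name, namespace, and labels shown above; `buildSelectorPatch` and `kubectlPatchArgs` are hypothetical helper names.

```javascript
// Builds the merge-patch body that repoints shopswift-active-service.
function buildSelectorPatch(environment) {
  if (environment !== "blue" && environment !== "green") {
    throw new Error(`unknown environment: ${environment}`);
  }
  return JSON.stringify({ spec: { selector: { app: "shopswift", environment } } });
}

// Returns the kubectl argument list matching the commands above.
function kubectlPatchArgs(environment) {
  return [
    "patch", "service", "shopswift-active-service",
    "-n", "ecommerce-bluegreen",
    "--type=merge",
    "-p", buildSelectorPatch(environment),
  ];
}

console.log(buildSelectorPatch("green"));
// {"spec":{"selector":{"app":"shopswift","environment":"green"}}}
```

The argument list can be handed to `child_process.execFile("kubectl", kubectlPatchArgs("green"), callback)`, which sidesteps shell quoting entirely. Rejecting anything other than blue or green keeps a typo from pointing the Service at no pods at all.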

Result After the Fix

Blue to Green switch:
Total requests:  26
Failed requests: 0
Zero-downtime availability test: PASSED

Phase 6: Zero-Downtime Rollback

Rollback in a blue-green system should not require rebuilding or redeploying the previous version. Blue never stopped running. It was just not receiving traffic.

To roll back, I patched the selector back to Blue:

Green to Blue rollback:
Total requests:  25
Failed requests: 0
Zero-downtime availability test: PASSED

Key operational insight: rollback is a routing decision, not a deployment operation. Blue stayed warm throughout Green's tenure as active. Recovery time is measured in seconds, not minutes.


Phase 7: Simulating a Broken Release

A deployment strategy that only handles successful releases is incomplete.

I simulated a broken Green release by configuring the /ready endpoint to return HTTP 503:

FORCE_NOT_READY=true

The readiness probe failed on every check, so Kubernetes never added the broken pods to the Service's endpoints. The rollout timed out. Traffic never left Blue.

Live /version response during the broken Green simulation:

{
  "app": "ShopSwift",
  "version": "v1.0.0",
  "environment": "blue",
  "commit": "minikube-blue",
  "port": 3000,
  "status": "running"
}

The live smoke test still passed. Users were never affected.

Readiness probes are not decoration. They are an operational safety control. A pod that fails its readiness probe never receives production traffic - regardless of whether it is running.
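A rough sketch of the two halves of this protection: the application deciding what /ready returns, and Kubernetes treating only 2xx and 3xx probe responses as ready. `FORCE_NOT_READY` is the flag from the simulation above; the response body shape, the helper names, and treating only the string "true" as enabled are assumptions.

```javascript
// Decides what /ready should return. FORCE_NOT_READY is the flag used in the
// simulation; treating only the string "true" as enabled is an assumption.
function readinessResponse(env) {
  if (env.FORCE_NOT_READY === "true") {
    return { status: 503, body: { ready: false } };
  }
  return { status: 200, body: { ready: true } };
}

// Simplified model of endpoint eligibility: an HTTP probe succeeds only when
// the status code is at least 200 and below 400.
function eligiblePods(pods) {
  return pods.filter((pod) => pod.readyStatus >= 200 && pod.readyStatus < 400);
}

const brokenGreen = readinessResponse({ FORCE_NOT_READY: "true" });
const healthyBlue = readinessResponse({});

const receivingTraffic = eligiblePods([
  { name: "blue", readyStatus: healthyBlue.status },
  { name: "green", readyStatus: brokenGreen.status },
]);
console.log(receivingTraffic.map((p) => p.name)); // [ 'blue' ]
```

The broken pod is still running in this model; it is simply never listed as an endpoint.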


Phase 8: GitHub Actions CI Pipeline

The GitHub Actions workflow validated every push:

# .github/workflows/ci.yml
steps:
  - Checkout repository
  - Setup Node.js 20
  - npm ci
  - Run Jest unit tests
  - Docker image build
  - Trivy vulnerability scan
  - kubeconform Kubernetes manifest validation

Challenge: Trivy Action Version Not Found

The initial workflow referenced:

uses: aquasecurity/trivy-action@0.24.0

GitHub Actions could not resolve this version. The fix was to pin a verified release tag from the Trivy Action releases page and pin the Trivy binary version to match.

Final CI result:

ShopSwift CI: PASSED

This phase closed an important gap. The project now had automated validation, not just manual testing.


Phase 9: AWS EKS Cloud Deployment

Local validation proved the architecture. AWS EKS proved it held on real cloud infrastructure.

Cloud stack:

| Component | Service |
|-----------|---------|
| Kubernetes | AWS EKS |
| Container registry | Amazon ECR |
| Ingress | NGINX Ingress Controller via Helm |
| Load balancer | AWS Network Load Balancer (provisioned by NGINX Helm chart) |
| Cluster provisioning | eksctl |

Cluster configuration:

Cluster: shopswift-bluegreen-eks
Region:  us-east-1
Nodes:   2 x t3.small

Images pushed to ECR:

677276115158.dkr.ecr.us-east-1.amazonaws.com/shopswift:v1.0.0
677276115158.dkr.ecr.us-east-1.amazonaws.com/shopswift:v2.0.0

AWS Blue baseline confirmed:

{
  "app": "ShopSwift",
  "version": "v1.0.0",
  "environment": "blue",
  "commit": "aws-blue",
  "port": 3000,
  "status": "running"
}

AWS Blue to Green

Total requests:  21
Failed requests: 0
AWS zero-downtime switch: PASSED

AWS Green to Blue Rollback

Total requests:  23
Failed requests: 0
AWS zero-downtime rollback: PASSED

The active Service selector pattern that was proven in Minikube held in AWS without modification. The architecture was portable.


Phase 10: Prometheus and Grafana Monitoring

Deployment is only part of production readiness. Observability is the other part.

I installed kube-prometheus-stack on EKS via Helm. By default the chart bundles Prometheus, Alertmanager, Grafana, kube-state-metrics, and node-exporter.

I also enabled NGINX Ingress metrics for request-level visibility.

Example PromQL queries run against the cluster:

up
kube_pod_info{namespace="ecommerce-bluegreen"}
kube_deployment_status_replicas_available{namespace="ecommerce-bluegreen"}
container_cpu_usage_seconds_total{namespace="ecommerce-bluegreen"}
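The same queries can be run outside the Prometheus UI through its HTTP API, which serves instant queries at `/api/v1/query`. A small sketch that builds correctly encoded query URLs; the base URL is an assumption (for example, a local port-forward to the Prometheus service), and `buildQueryUrl` is a hypothetical helper name.

```javascript
// Builds an instant-query URL for the Prometheus HTTP API (/api/v1/query).
// The base URL is an assumption, e.g. a local port-forward to Prometheus.
function buildQueryUrl(promql, baseUrl = "http://localhost:9090") {
  const url = new URL("/api/v1/query", baseUrl);
  url.searchParams.set("query", promql);
  return url.toString();
}

const promQueries = [
  "up",
  'kube_pod_info{namespace="ecommerce-bluegreen"}',
  'kube_deployment_status_replicas_available{namespace="ecommerce-bluegreen"}',
];

for (const q of promQueries) {
  console.log(buildQueryUrl(q));
}
```

On Node.js 18 or later, each URL can be passed to the built-in `fetch` and the JSON response inspected. Letting `URLSearchParams` do the encoding matters because PromQL label matchers contain braces, quotes, and equals signs.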

Grafana showed the ShopSwift namespace, Blue and Green pods, the ingress-nginx namespace, and the monitoring namespace - all with pod and node-level metrics.

One thing I want to be clear about: the monitoring evidence demonstrates Kubernetes infrastructure observability - pod health, replica counts, CPU usage, ingress traffic. It does not demonstrate business-level observability such as checkout conversion rates or revenue metrics. Overclaiming monitoring scope weakens a project. I documented this distinction explicitly.


AWS Teardown and Cost Control

Cloud engineering is not only about building. It is about cleaning up responsibly.

After capturing all evidence, I deleted every AWS resource associated with this project:

EKS cluster
Worker EC2 instances
AWS Network Load Balancer
ECR repositories
EBS volumes
Elastic IPs
NAT Gateway
CloudFormation stacks
Project IAM roles
Project security groups

Final verification:

EKS clusters:           []
Load balancers:         none
ECR repository:         not found
EC2 running instances:  none
Available EBS volumes:  none
Elastic IPs:            none

I also performed a broader AWS cost sweep and identified non-project items that needed review:

RDS manual snapshot: bio-platform-postgres (eu-west-2)
Old S3 Terraform state bucket
DynamoDB Terraform lock table
Possible customer-managed KMS keys

Knowing what you built is not enough. Knowing what is still running - and what it costs - is part of the engineering discipline.


Key Lessons

1. Zero downtime must be proven under traffic, not assumed

The first design produced a 503. The improved design produced zero failed requests across every switch. The difference was measured in continuous load tests, not in YAML reviews.

2. Readiness probes are operational safety controls

The broken Green release failed its readiness probe and never received live traffic. That is the intended behaviour. It worked exactly as designed.

3. Keep the stable thing stable

Patching the Ingress most likely triggered an NGINX reload, and one request was dropped. Keeping the Ingress stable and patching only the Service selector eliminated that failure mode. In distributed systems, minimise what changes during a live operation.

4. Validate locally before going to cloud

Minikube let me find the 503, design the fix, and prove the improved architecture - before a single dollar was spent on AWS. Local validation is not just a convenience; it is cost control.

5. Teardown is part of the engineering work

Every cloud resource left running is a cost risk. Cleanup is not an afterthought.

6. Monitoring claims should match monitoring evidence

The Grafana dashboards showed Kubernetes metrics. That is what I claimed. Nothing more.


What I Would Add Next

If this project were extended into a production system:


Final Outcome

This project proved a complete local-to-cloud DevOps release workflow:

Code
  |
Tests (Jest + Supertest)
  |
Docker image
  |
Minikube validation
  |
Blue-Green deployment (active Service selector pattern)
  |
Zero-downtime rollback
  |
Broken release protection (readiness probes)
  |
GitHub Actions CI (tests, Trivy, kubeconform)
  |
AWS EKS cloud deployment
  |
Prometheus + Grafana monitoring
  |
AWS teardown and cost sweep

The most valuable part of this project was not that everything worked on the first try. It did not.

The most valuable part was that a real deployment failure was detected, analysed, fixed, and validated again under load. That is the full engineering cycle - not just implementation, but discovery and improvement.


Repository

github.com/gbadedata/shopswift-blue-green

If you are reviewing this project, the strongest places to start:

  • k8s/active-service.yaml - the stable switching mechanism
  • scripts/zero-downtime-test.sh - the load test that caught the 503
  • k8s/broken-green/ - the readiness probe failure simulation
  • .github/workflows/ci.yml - the full CI pipeline
  • evidence/ - AWS EKS deployment, monitoring, and teardown evidence
  • README.md - complete deployment walkthrough with copy-paste commands

Built with Node.js, Docker, Kubernetes, GitHub Actions, AWS EKS, Prometheus, and Grafana. Questions or feedback? Drop a comment below.
