Oluwagbade Odimayo

Posted on Jun 3

From Minikube to AWS EKS: How I Built a Zero-Downtime Blue-Green Deployment Pipeline for ShopSwift

#docker #grafana #kubernetes #aws

I built ShopSwift, a Node.js/Express e-commerce API, and wrapped it in a production-grade blue-green deployment pipeline: Docker, Kubernetes, Minikube local validation, NGINX Ingress, GitHub Actions CI, AWS EKS, Amazon ECR, and Prometheus + Grafana monitoring. The final design recorded zero failed requests across every switch and rollback. Here is exactly how I did it - including the architecture mistake that caused a 503, and the fix that made it truly zero-downtime.

The Real Problem With Shipping Software

Releasing code is where theory meets reality.

A feature can pass every local test, build cleanly in CI, and still fail the moment real traffic touches it. When it does, the question is not what broke - it is how quickly you can recover without taking users down with you.

Traditional rolling deployments reduce this risk but do not eliminate it. During a rollout, old and new code can run simultaneously, creating version skew. If the new version is bad, rollback means redeploying the old one - which takes time users will feel.

Blue-green deployment takes a different approach. Two environments run in parallel. One is live. The other is where the new release lands. Traffic switches only after validation. Rollback is a routing change, not a redeployment.

The question I wanted to answer with this project was practical:

Can I build a blue-green pipeline that delivers genuinely zero failed requests through a traffic switch, a rollback, and a simulated broken release - locally and in the cloud?

The answer is yes. But the path had a real failure in it. That failure made the project better.

Repository: github.com/gbadedata/shopswift-blue-green

What This Project Covers

Phase	What was built
1	Node.js + Express API with Jest tests
2	Docker image with environment-based versioning
3	Git history and GitHub baseline
4	Minikube blue baseline with NGINX Ingress
5	Green deployment + traffic switch (with a 503 failure and fix)
6	Zero-downtime rollback from Green back to Blue
7	Broken release simulation with readiness probe protection
8	GitHub Actions CI (tests, Docker build, Trivy, kubeconform)
9	AWS EKS cloud deployment with Amazon ECR
10	Prometheus + Grafana monitoring stack
-	AWS teardown and cost sweep

The Application: ShopSwift

ShopSwift is a small Node.js and Express e-commerce API. The application itself is intentionally simple. The deployment system around it is not.

Endpoint	Purpose
`/`	Landing
`/health`	Liveness probe
`/ready`	Readiness probe
`/version`	Active version and environment
`/products`	Simulated catalogue
`/products/:id`	Product detail
`/cart`	Simulated cart
`/checkout`	Simulated checkout
`/metrics`	Prometheus metrics

The /version endpoint was the most important one for deployment validation. It returned the running version and environment label, making it trivial to confirm exactly which environment was serving traffic at any moment:

// Blue environment
{
  "app": "ShopSwift",
  "version": "v1.0.0",
  "environment": "blue",
  "commit": "aws-blue",
  "port": 3000,
  "status": "running"
}

// Green environment
{
  "app": "ShopSwift",
  "version": "v2.0.0",
  "environment": "green",
  "commit": "aws-green",
  "port": 3000,
  "status": "running"
}
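A minimal sketch of how a payload like this might be assembled from environment variables. APP_VERSION and APP_ENV are the variables used in the Docker commands in Phase 2; APP_COMMIT, PORT, the fallback values, and the helper name buildVersionPayload are assumptions for illustration, not necessarily what the repository uses.

```javascript
// Sketch only: APP_VERSION and APP_ENV appear in the article's Docker commands;
// APP_COMMIT and PORT are assumed names, and the fallbacks are placeholders.
function buildVersionPayload(env = process.env) {
  return {
    app: 'ShopSwift',
    version: env.APP_VERSION || 'unknown',
    environment: env.APP_ENV || 'unknown',
    commit: env.APP_COMMIT || 'unknown',
    port: Number(env.PORT) || 3000,
    status: 'running',
  };
}

// In an Express app this would typically back the route, for example:
//   app.get('/version', (req, res) => res.json(buildVersionPayload()));
console.log(buildVersionPayload({ APP_VERSION: 'v2.0.0', APP_ENV: 'green' }));
```

Keeping the payload in a plain function means it can be unit-tested without starting the server.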

Technology Stack

Layer	Tool
Application	Node.js, Express
Testing	Jest, Supertest
Containerization	Docker
Local Kubernetes	Minikube
Cloud Kubernetes	AWS EKS
Container Registry	Amazon ECR
Ingress	NGINX Ingress Controller
CI/CD	GitHub Actions
Security Scanning	Trivy
Manifest Validation	kubeconform
Monitoring	Prometheus, Grafana
Cloud Tooling	AWS CLI, eksctl
Package Management	Helm

Architecture: The Final Routing Model

After testing and refinement (including one important failure), the traffic routing model settled into this:

User request
      |
AWS Load Balancer (cloud) or kubectl port-forward (local)
      |
NGINX Ingress Controller
      |
shopswift-ingress
      |
shopswift-active-service
      |
selector: environment=blue  OR  environment=green
      |
Blue pods           Green pods

The key design principle: NGINX Ingress never changes. The only thing that changes during a switch is the label selector on shopswift-active-service.

This distinction matters - and it came from a real failure. More on that below.

Phase 1: The Application

I started with the Express API and immediately wrote tests using Jest and Supertest before touching Docker or Kubernetes.

Test Suites: 1 passed
Tests:       7 passed

This was deliberate. Kubernetes deployment should not begin with an untested application. The health, readiness, and version endpoints needed to be correct before any of the deployment logic could trust them.

Phase 2: Dockerizing ShopSwift

The Docker image was designed to support both Blue and Green from the same codebase using environment variables:

# Blue container
docker run -e APP_VERSION=v1.0.0 -e APP_ENV=blue shopswift:v1.0.0

# Green container
docker run -e APP_VERSION=v2.0.0 -e APP_ENV=green shopswift:v2.0.0

No separate codebases. No duplicated Dockerfiles. One build definition, tagged per release and configured at runtime.

I also wrote smoke test scripts to validate all endpoints quickly after each build - a habit that paid dividends throughout the project.

Challenge: `npm ci` Caught a Lockfile Mismatch

The Docker build failed at:

RUN npm ci --omit=dev

The cause: package-lock.json was out of sync with package.json.

This is precisely why npm ci exists. Unlike npm install, it treats a lockfile mismatch as a hard failure rather than silently correcting it. I regenerated the lockfile and rebuilt - and the build became reproducible.

Lesson: A reproducible build that fails loudly is better than a lenient one that silently diverges.

Phase 3: Git Baseline

After the local application and Docker image were validated, I pushed to GitHub. The commit history became part of the project evidence:

feat: build ShopSwift app and Docker baseline
feat: deploy ShopSwift blue on Minikube with NGINX Ingress
feat: implement zero-downtime switch with active service selector
feat: simulate broken green release and validate readiness protection
feat: add GitHub Actions CI pipeline
feat: deploy and validate blue-green on AWS EKS
feat: add Prometheus and Grafana monitoring
docs: rewrite README with complete deployment documentation

For a project like this, traceability is part of the deliverable.

Phase 4: Minikube Blue Baseline

Before spending time or money on AWS, I validated the full deployment architecture locally with Minikube.

Kubernetes resources deployed:

Namespace:  ecommerce-bluegreen
Deployment: shopswift-blue
Service:    shopswift-blue-service
Ingress:    shopswift-ingress

Challenge: WSL + Docker Driver = Unreliable Ingress Access

Running Minikube with the Docker driver inside WSL meant that accessing shopswift.local directly was unreliable - a known networking limitation of this environment. The app and Service were fine; the issue was local DNS and networking.

The solution was to port-forward the NGINX Ingress Controller and pass the correct Host header:

kubectl port-forward -n ingress-nginx service/ingress-nginx-controller 8080:80 &

curl -H "Host: shopswift.local" http://localhost:8080/version

This still exercised the full NGINX Ingress routing path - just without relying on local DNS resolution. It was the right tradeoff for a local validation environment.

Phase 5: Deploying Green and Switching Traffic

With Blue stable, I deployed Green:

Deployment: shopswift-green
Service:    shopswift-green-service

Before switching traffic, I tested Green internally through its own Service. This is non-negotiable in a proper blue-green workflow - the fact that Green pods are running does not mean Green is ready to serve users.

Green internal check confirmed:

{
  "app": "ShopSwift",
  "version": "v2.0.0",
  "environment": "green",
  "commit": "minikube-green",
  "port": 3000,
  "status": "running"
}
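A check like this can be turned into an explicit gate before any traffic switch. The sketch below compares a /version response against the release that is expected to be running; checkVersionPayload and the choice of fields to compare are hypothetical, not taken from the repository.

```javascript
// Sketch only: checkVersionPayload is a hypothetical helper.
// Returns a list of mismatches; an empty list means the payload matches.
function checkVersionPayload(payload, expected) {
  const problems = [];
  for (const field of ['version', 'environment', 'status']) {
    if (payload[field] !== expected[field]) {
      problems.push(`${field}: expected ${expected[field]}, got ${payload[field]}`);
    }
  }
  return problems;
}

// The Green response shown above, checked against the intended release.
const greenResponse = {
  app: 'ShopSwift',
  version: 'v2.0.0',
  environment: 'green',
  commit: 'minikube-green',
  port: 3000,
  status: 'running',
};
const mismatches = checkVersionPayload(greenResponse, {
  version: 'v2.0.0',
  environment: 'green',
  status: 'running',
});
console.log(mismatches.length === 0 ? 'Green matches the expected release' : mismatches);
```

Returning the mismatches instead of a boolean makes a failed gate self-explanatory in CI logs.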

Both environments were now running. Time to switch traffic.

The Failure That Made This Project Better

My first switching approach was to patch the Ingress backend directly:

# Before
backend:
  service:
    name: shopswift-blue-service

# After patching
backend:
  service:
    name: shopswift-green-service

Logical. Clean-looking. But during a continuous zero-downtime test:

FAILED request: status=503
Failed requests: 1 of 26

A single 503 during a traffic switch means the design cannot honestly be called zero-downtime. I did not hide this result. I used it to understand what was happening.

The likely cause: when the Ingress backend is patched, NGINX reloads its configuration. During that reload - even briefly - upstream connections can fail. One request landed in that gap.

The Fix: The Stable Active Service Pattern

Instead of touching the Ingress, I introduced a stable intermediary:

# shopswift-active-service - this never changes in Ingress
apiVersion: v1
kind: Service
metadata:
  name: shopswift-active-service
spec:
  selector:
    app: shopswift
    environment: blue  # <-- only this changes during a switch

The Ingress always points to shopswift-active-service. To switch traffic, I only patch the selector:

# Switch to Green
kubectl patch service shopswift-active-service \
  -n ecommerce-bluegreen \
  --type='merge' \
  -p '{"spec":{"selector":{"app":"shopswift","environment":"green"}}}'

# Roll back to Blue
kubectl patch service shopswift-active-service \
  -n ecommerce-bluegreen \
  --type='merge' \
  -p '{"spec":{"selector":{"app":"shopswift","environment":"blue"}}}'

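A Service selector update is a single API write, not a rollout. Kubernetes recomputes the Service's endpoints, and the NGINX Ingress Controller applies endpoint changes dynamically rather than through a full NGINX reload, so the Ingress configuration itself never changes. This is the architecture improvement that made zero-downtime achievable.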
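Because the whole switch depends on one label value, a typo in the patch would point the Service at no pods. A small sketch of generating the patch body with validation; buildSelectorPatch is a hypothetical helper, and the output is meant to be passed to the same kubectl patch command shown above.

```javascript
// Sketch only: buildSelectorPatch is a hypothetical helper, not part of the repository.
const ENVIRONMENTS = ['blue', 'green'];

function buildSelectorPatch(target) {
  if (!ENVIRONMENTS.includes(target)) {
    // A selector that matches no pods would leave the active Service without endpoints.
    throw new Error(`Unknown environment "${target}"; expected blue or green`);
  }
  return JSON.stringify({
    spec: { selector: { app: 'shopswift', environment: target } },
  });
}

// Prints the JSON string to use as the -p argument of `kubectl patch`.
console.log(buildSelectorPatch('green'));
```

Rejecting unknown targets before calling kubectl keeps a mistyped environment name from becoming an outage.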

Result After the Fix

Blue to Green switch:
Total requests:  26
Failed requests: 0
Zero-downtime availability test: PASSED
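
The tally above comes from sending requests continuously while the selector is patched. The repository does this with scripts/zero-downtime-test.sh; the following is a JavaScript sketch of the same idea, not that script. The probe function, the 200 ms delay, and the rule that any status outside 200-399 counts as a failure are assumptions.

```javascript
// Sketch only: the repository's own test is scripts/zero-downtime-test.sh.
// `probe` is any async function that resolves to an HTTP status code.

// Tally a list of status codes the way the test output reports them.
function summarize(statuses) {
  const failed = statuses.filter((s) => s < 200 || s >= 400).length;
  return {
    total: statuses.length,
    failed,
    verdict: failed === 0 ? 'PASSED' : 'FAILED',
  };
}

// Call the probe `total` times, pausing `delayMs` after each request.
// A thrown error (connection refused, timeout) is recorded as status 0.
async function runAvailabilityTest(probe, total, delayMs = 200) {
  const statuses = [];
  for (let i = 0; i < total; i += 1) {
    try {
      statuses.push(await probe());
    } catch (err) {
      statuses.push(0);
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return summarize(statuses);
}

console.log(summarize([200, 200, 503]));
```

In practice the probe would request /version through the Ingress while the selector patch runs in a second terminal, so that at least some requests land during the switch.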

Phase 6: Zero-Downtime Rollback

Rollback in a blue-green system should not require rebuilding or redeploying the previous version. Blue never stopped running. It was just not receiving traffic.

To roll back, I patched the selector back to Blue:

Green to Blue rollback:
Total requests:  25
Failed requests: 0
Zero-downtime availability test: PASSED

Key operational insight: rollback is a routing decision, not a deployment operation. Blue stayed warm throughout Green's tenure as active. Recovery time is measured in seconds, not minutes.

Phase 7: Simulating a Broken Release

A deployment strategy that only handles successful releases is incomplete.

I simulated a broken Green release by configuring the /ready endpoint to return HTTP 503:

FORCE_NOT_READY=true

The Kubernetes readiness probe detected this and never admitted the broken pods to the Service's endpoint pool. The rollout timed out. Traffic never left Blue.
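
A minimal sketch of the application side of this simulation, assuming the /ready handler reads the flag on each request. FORCE_NOT_READY is the variable named above; the helper name readinessResponse and the response bodies are assumptions.

```javascript
// Sketch only: FORCE_NOT_READY comes from the article; the response bodies are assumed.
function readinessResponse(env = process.env) {
  if (env.FORCE_NOT_READY === 'true') {
    // Any non-2xx/3xx status makes an HTTP readiness probe fail.
    return { statusCode: 503, body: { status: 'not ready' } };
  }
  return { statusCode: 200, body: { status: 'ready' } };
}

// In an Express app this would typically back the route, for example:
//   app.get('/ready', (req, res) => {
//     const { statusCode, body } = readinessResponse();
//     res.status(statusCode).json(body);
//   });
console.log(readinessResponse({ FORCE_NOT_READY: 'true' }).statusCode); // 503
```

Keeping /ready separate from /health is what lets a pod stay alive, and therefore debuggable, while being held out of the Service's endpoints.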

Live /version response during the broken Green simulation:

{
  "app": "ShopSwift",
  "version": "v1.0.0",
  "environment": "blue",
  "commit": "minikube-blue",
  "port": 3000,
  "status": "running"
}

The live smoke test still passed. Users were never affected.

Readiness probes are not decoration. They are an operational safety control. A pod that fails its readiness probe is kept out of the Service's endpoints, so it never receives production traffic through that Service - regardless of whether it is running.

Phase 8: GitHub Actions CI Pipeline

The GitHub Actions workflow validated every push:

# .github/workflows/ci.yml
steps:
  - Checkout repository
  - Setup Node.js 20
  - npm ci
  - Run Jest unit tests
  - Docker image build
  - Trivy vulnerability scan
  - kubeconform Kubernetes manifest validation

Challenge: Trivy Action Version Not Found

The initial workflow referenced:

uses: aquasecurity/trivy-action@0.24.0

GitHub Actions could not resolve this version. The fix was to pin a verified release tag from the Trivy Action releases page and pin the Trivy binary version to match.

Final CI result:

ShopSwift CI: PASSED

This phase closed an important gap. The project now had automated validation, not just manual testing.

Phase 9: AWS EKS Cloud Deployment

Local validation proved the architecture. AWS EKS proved it on real cloud infrastructure.

Cloud stack:

Component	Service
Kubernetes	AWS EKS
Container registry	Amazon ECR
Ingress	NGINX Ingress Controller via Helm
Load balancer	AWS Network Load Balancer (provisioned for the LoadBalancer Service that the NGINX Helm chart creates)
Cluster provisioning	eksctl

Cluster configuration:

Cluster: shopswift-bluegreen-eks
Region:  us-east-1
Nodes:   2 x t3.small

Images pushed to ECR:

677276115158.dkr.ecr.us-east-1.amazonaws.com/shopswift:v1.0.0
677276115158.dkr.ecr.us-east-1.amazonaws.com/shopswift:v2.0.0

AWS Blue baseline confirmed:

{
  "app": "ShopSwift",
  "version": "v1.0.0",
  "environment": "blue",
  "commit": "aws-blue",
  "port": 3000,
  "status": "running"
}

AWS Blue to Green

Total requests:  21
Failed requests: 0
AWS zero-downtime switch: PASSED

AWS Green to Blue Rollback

Total requests:  23
Failed requests: 0
AWS zero-downtime rollback: PASSED

The active Service selector pattern that was proven in Minikube held in AWS without modification. The architecture was portable.

Phase 10: Prometheus and Grafana Monitoring

Deployment is only part of production readiness. Observability is the other part.

I installed kube-prometheus-stack on EKS via Helm. The stack included:

Prometheus - metrics collection and storage
Grafana - visualization and dashboards
Alertmanager - alert routing
kube-state-metrics - Kubernetes object metrics
node-exporter - node-level metrics
Prometheus Operator - automated scrape configuration

I also enabled NGINX Ingress metrics for request-level visibility.

Example PromQL queries run against the cluster:

up

kube_pod_info{namespace="ecommerce-bluegreen"}

kube_deployment_status_replicas_available{namespace="ecommerce-bluegreen"}

container_cpu_usage_seconds_total{namespace="ecommerce-bluegreen"}

Grafana showed the ShopSwift namespace, Blue and Green pods, the ingress-nginx namespace, and the monitoring namespace - all with pod and node-level metrics.

One thing I want to be clear about: the monitoring evidence demonstrates Kubernetes infrastructure observability - pod health, replica counts, CPU usage, ingress traffic. It does not demonstrate business-level observability such as checkout conversion rates or revenue metrics. Overclaiming monitoring scope weakens a project. I documented this distinction explicitly.
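
For context on what application-level metrics would look like, Prometheus scrapes a plain-text exposition format from endpoints such as /metrics. The sketch below renders one counter by hand; the metric name shopswift_http_requests_total and its labels are hypothetical, and a real service would more likely use a client library such as prom-client than format the text itself.

```javascript
// Sketch only: the metric name and labels are hypothetical.

// Escape backslash, double quote, and newline as the text format requires.
function escapeLabelValue(value) {
  return String(value)
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n');
}

// Render one counter family: HELP line, TYPE line, then one line per sample.
function renderCounter(name, help, samples) {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const labelText = Object.entries(labels || {})
      .map(([key, val]) => `${key}="${escapeLabelValue(val)}"`)
      .join(',');
    // Omit the braces entirely when a sample has no labels.
    lines.push(labelText ? `${name}{${labelText}} ${value}` : `${name} ${value}`);
  }
  return `${lines.join('\n')}\n`;
}

console.log(
  renderCounter('shopswift_http_requests_total', 'Total HTTP requests.', [
    { labels: { environment: 'blue', route: '/checkout' }, value: 3 },
  ]),
);
```

An environment label of this kind is what would allow per-colour comparisons during a switch, which the infrastructure metrics above do not provide.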

AWS Teardown and Cost Control

Cloud engineering is not only about building. It is about cleaning up responsibly.

After capturing all evidence, I deleted every AWS resource associated with this project:

EKS cluster
Worker EC2 instances
AWS Network Load Balancer
ECR repositories
EBS volumes
Elastic IPs
NAT Gateway
CloudFormation stacks
Project IAM roles
Project security groups

Final verification:

EKS clusters:           []
Load balancers:         none
ECR repository:         not found
EC2 running instances:  none
Available EBS volumes:  none
Elastic IPs:            none

I also performed a broader AWS cost sweep and identified non-project items that needed review:

RDS manual snapshot: bio-platform-postgres (eu-west-2)
Old S3 Terraform state bucket
DynamoDB Terraform lock table
Possible customer-managed KMS keys

Knowing what you built is not enough. Knowing what is still running - and what it costs - is part of the engineering discipline.

Key Lessons

1. Zero downtime must be proven under traffic, not assumed

The first design produced a 503. The improved design produced zero failed requests across every switch. The difference was measured in continuous load tests, not in YAML reviews.

2. Readiness probes are operational safety controls

The broken Green release failed its readiness probe and never received live traffic. That is the intended behaviour. It worked exactly as designed.

3. Keep the stable thing stable

Patching the Ingress coincided with a dropped request, most likely because of a brief NGINX reload. Keeping the Ingress stable and patching only the Service selector eliminated that failure mode. In distributed systems, minimise what changes during a live operation.

4. Validate locally before going to cloud

Minikube let me find the 503, design the fix, and prove the improved architecture - before a single dollar was spent on AWS. Local validation is not just a convenience; it is cost control.

5. Teardown is part of the engineering work

Every cloud resource left running is a cost risk. Cleanup is not an afterthought.

6. Monitoring claims should match monitoring evidence

The Grafana dashboards showed Kubernetes metrics. That is what I claimed. Nothing more.

What I Would Add Next

If this project were extended into a production system:

Real DNS + TLS via cert-manager and ExternalDNS
Progressive delivery with Argo Rollouts or Flagger - automated canary analysis before promotion
Automated promotion gates - promote Green only if error rate and latency are within bounds
Slack/PagerDuty alerting via Alertmanager
Horizontal Pod Autoscaler for traffic-driven scaling
PodDisruptionBudgets to protect availability during node maintenance
NetworkPolicies for pod-level traffic isolation
External Secrets Operator for secrets management
IRSA (IAM Roles for Service Accounts) to replace node-level IAM
Centralized logging with Fluent Bit + OpenSearch or Loki
Business-level metrics - checkout attempts, success rates, latency by environment
Infrastructure as Code with Terraform or Pulumi

Final Outcome

This project proved a complete local-to-cloud DevOps release workflow:

Code
  |
Tests (Jest + Supertest)
  |
Docker image
  |
Minikube validation
  |
Blue-Green deployment (active Service selector pattern)
  |
Zero-downtime rollback
  |
Broken release protection (readiness probes)
  |
GitHub Actions CI (tests, Trivy, kubeconform)
  |
AWS EKS cloud deployment
  |
Prometheus + Grafana monitoring
  |
AWS teardown and cost sweep

The most valuable part of this project was not that everything worked on the first try. It did not.

The most valuable part was that a real deployment failure was detected, analysed, fixed, and validated again under load. That is the full engineering cycle - not just implementation, but discovery and improvement.

Repository

github.com/gbadedata/shopswift-blue-green

If you are reviewing this project, the strongest places to start:

k8s/active-service.yaml - the stable switching mechanism
scripts/zero-downtime-test.sh - the load test that caught the 503
k8s/broken-green/ - the readiness probe failure simulation
.github/workflows/ci.yml - the full CI pipeline
evidence/ - AWS EKS deployment, monitoring, and teardown evidence
README.md - complete deployment walkthrough with copy-paste commands

Built with Node.js, Docker, Kubernetes, GitHub Actions, AWS EKS, Prometheus, and Grafana. Questions or feedback? Drop a comment below.
