I built ShopSwift, a Node.js/Express e-commerce API, and wrapped it in a production-grade blue-green deployment pipeline: Docker, Kubernetes, Minikube local validation, NGINX Ingress, GitHub Actions CI, AWS EKS, Amazon ECR, and Prometheus + Grafana monitoring. The final design recorded zero failed requests across every switch and rollback. Here is exactly how I did it - including the architecture mistake that caused a 503, and the fix that made it truly zero-downtime.
The Real Problem With Shipping Software
Releasing code is where theory meets reality.
A feature can pass every local test, build cleanly in CI, and still fail the moment real traffic touches it. When it does, the question is not what broke - it is how quickly you can recover without taking users down with you.
Traditional rolling deployments reduce this risk but do not eliminate it. During a rollout, old and new code can run simultaneously, creating version skew. If the new version is bad, rollback means redeploying the old one - which takes time users will feel.
Blue-green deployment takes a different approach. Two environments run in parallel. One is live. The other is where the new release lands. Traffic switches only after validation. Rollback is a routing change, not a redeployment.
The question I wanted to answer with this project was practical:
Can I build a blue-green pipeline that delivers genuinely zero failed requests through a traffic switch, a rollback, and a simulated broken release - locally and in the cloud?
The answer is yes. But the path had a real failure in it. That failure made the project better.
Repository: github.com/gbadedata/shopswift-blue-green
What This Project Covers
| Phase | What was built |
|---|---|
| 1 | Node.js + Express API with Jest tests |
| 2 | Docker image with environment-based versioning |
| 3 | Git history and GitHub baseline |
| 4 | Minikube blue baseline with NGINX Ingress |
| 5 | Green deployment + traffic switch (with a 503 failure and fix) |
| 6 | Zero-downtime rollback from Green back to Blue |
| 7 | Broken release simulation with readiness probe protection |
| 8 | GitHub Actions CI (tests, Docker build, Trivy, kubeconform) |
| 9 | AWS EKS cloud deployment with Amazon ECR |
| 10 | Prometheus + Grafana monitoring stack |
| - | AWS teardown and cost sweep |
The Application: ShopSwift
ShopSwift is a small Node.js and Express e-commerce API. The application itself is intentionally simple. The deployment system around it is not.
| Endpoint | Purpose |
|---|---|
| / | Landing |
| /health | Liveness probe |
| /ready | Readiness probe |
| /version | Active version and environment |
| /products | Simulated catalogue |
| /products/:id | Product detail |
| /cart | Simulated cart |
| /checkout | Simulated checkout |
| /metrics | Prometheus metrics |
The /version endpoint was the most important one for deployment validation. It returned the running version and environment label, making it trivial to confirm exactly which environment was serving traffic at any moment:
// Blue environment
{
"app": "ShopSwift",
"version": "v1.0.0",
"environment": "blue",
"commit": "aws-blue",
"port": 3000,
"status": "running"
}
// Green environment
{
"app": "ShopSwift",
"version": "v2.0.0",
"environment": "green",
"commit": "aws-green",
"port": 3000,
"status": "running"
}
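A payload like the two above can be derived entirely from environment variables, which is what lets one image serve as either Blue or Green. The following is a minimal sketch, not the project's actual handler: `APP_VERSION` and `APP_ENV` come from the article, while `GIT_COMMIT`, `PORT`, and the function name `buildVersionInfo` are assumed for illustration.

```javascript
// Sketch: derive the /version payload from environment variables.
// APP_VERSION and APP_ENV appear in the article; GIT_COMMIT and PORT are assumed names.
function buildVersionInfo(env) {
  return {
    app: "ShopSwift",
    version: env.APP_VERSION || "unknown",
    environment: env.APP_ENV || "unknown",
    commit: env.GIT_COMMIT || "unknown",
    port: Number(env.PORT) || 3000,
    status: "running",
  };
}

// Example: the same function yields a Green payload when given Green settings.
const greenInfo = buildVersionInfo({
  APP_VERSION: "v2.0.0",
  APP_ENV: "green",
  GIT_COMMIT: "aws-green",
});
console.log(JSON.stringify(greenInfo, null, 2));
```

In an Express app this function would typically be called with `process.env` inside the route handler, so the response always reflects how the container was started.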
Technology Stack
| Layer | Tool |
|---|---|
| Application | Node.js, Express |
| Testing | Jest, Supertest |
| Containerization | Docker |
| Local Kubernetes | Minikube |
| Cloud Kubernetes | AWS EKS |
| Container Registry | Amazon ECR |
| Ingress | NGINX Ingress Controller |
| CI/CD | GitHub Actions |
| Security Scanning | Trivy |
| Manifest Validation | kubeconform |
| Monitoring | Prometheus, Grafana |
| Cloud Tooling | AWS CLI, eksctl |
| Package Management | Helm |
Architecture: The Final Routing Model
After testing and refinement (including one important failure), the traffic routing model settled into this:
User request
|
AWS Load Balancer (cloud) or kubectl port-forward (local)
|
NGINX Ingress Controller
|
shopswift-ingress
|
shopswift-active-service
|
selector: environment=blue OR environment=green
|
Blue pods Green pods
The key design principle: NGINX Ingress never changes. The only thing that changes during a switch is the label selector on shopswift-active-service.
This distinction matters - and it came from a real failure. More on that below.
Phase 1: The Application
I started with the Express API and immediately wrote tests using Jest and Supertest before touching Docker or Kubernetes.
Test Suites: 1 passed
Tests: 7 passed
This was deliberate. Kubernetes deployment should not begin with an untested application. The health, readiness, and version endpoints needed to be correct before any of the deployment logic could trust them.
Phase 2: Dockerizing ShopSwift
The Docker image was designed to support both Blue and Green from the same codebase using environment variables:
# Blue container
docker run -e APP_VERSION=v1.0.0 -e APP_ENV=blue shopswift:v1.0.0
# Green container
docker run -e APP_VERSION=v2.0.0 -e APP_ENV=green shopswift:v2.0.0
No separate codebases. No duplicated Dockerfiles. One image, configured at runtime.
I also wrote smoke test scripts to validate all endpoints quickly after each build - a habit that paid dividends throughout the project.
Challenge: npm ci Caught a Lockfile Mismatch
The Docker build failed at:
RUN npm ci --omit=dev
The cause: package-lock.json was out of sync with package.json.
This is precisely why npm ci exists. Unlike npm install, it treats a lockfile mismatch as a hard failure rather than silently correcting it. I regenerated the lockfile and rebuilt - and the build became reproducible.
Lesson: A reproducible build that fails loudly is better than a lenient one that silently diverges.
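To make the mismatch concrete, here is a simplified illustration of the kind of drift `npm ci` rejects. It is not how npm is implemented: it only compares the dependency ranges in package.json against the root entry of a lockfile, assuming the lockfileVersion 2 or 3 layout where that entry lives under `packages[""]`. The function name and the sample package names are hypothetical.

```javascript
// Sketch: list dependencies whose declared range differs between package.json
// and the root entry of package-lock.json (lockfileVersion 2 or 3 layout assumed).
function findLockfileDrift(pkg, lock) {
  const declared = pkg.dependencies || {};
  const root = (lock.packages && lock.packages[""]) || {};
  const locked = root.dependencies || {};
  const names = new Set([...Object.keys(declared), ...Object.keys(locked)]);
  // A name drifts if it is missing on one side or the ranges disagree.
  return [...names].filter((name) => declared[name] !== locked[name]).sort();
}

// Hypothetical example: a dependency was added to package.json without updating the lockfile.
const drift = findLockfileDrift(
  { dependencies: { express: "^4.0.0", "some-lib": "^1.0.0" } },
  { packages: { "": { dependencies: { express: "^4.0.0" } } } }
);
console.log(drift);
```

A non-empty result corresponds to the situation where `npm install` would quietly rewrite the lockfile while `npm ci` stops the build.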
Phase 3: Git Baseline
After the local application and Docker image were validated, I pushed to GitHub. The commit history became part of the project evidence:
feat: build ShopSwift app and Docker baseline
feat: deploy ShopSwift blue on Minikube with NGINX Ingress
feat: implement zero-downtime switch with active service selector
feat: simulate broken green release and validate readiness protection
feat: add GitHub Actions CI pipeline
feat: deploy and validate blue-green on AWS EKS
feat: add Prometheus and Grafana monitoring
docs: rewrite README with complete deployment documentation
For a project like this, traceability is part of the deliverable.
Phase 4: Minikube Blue Baseline
Before spending time or money on AWS, I validated the full deployment architecture locally with Minikube.
Kubernetes resources deployed:
Namespace: ecommerce-bluegreen
Deployment: shopswift-blue
Service: shopswift-blue-service
Ingress: shopswift-ingress
Challenge: WSL + Docker Driver = Unreliable Ingress Access
Running Minikube with the Docker driver inside WSL meant that accessing shopswift.local directly was unreliable - a known networking limitation of this environment. The app and Service were fine; the issue was local DNS and networking.
The solution was to port-forward the NGINX Ingress Controller and pass the correct Host header:
kubectl port-forward -n ingress-nginx service/ingress-nginx-controller 8080:80 &
curl -H "Host: shopswift.local" http://localhost:8080/version
This still exercised the full NGINX Ingress routing path - just without relying on local DNS resolution. It was the right tradeoff for a local validation environment.
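The same request can be made from Node.js, which is handy when a test script needs to go through the Ingress rather than straight to a Service. This is a sketch using only the built-in `http` module; the port and hostname mirror the commands above, and the function names are my own.

```javascript
const http = require("http");

// Build request options that reach the port-forwarded ingress controller
// while presenting the hostname the Ingress rule matches on.
function ingressRequestOptions(path) {
  return {
    host: "localhost",
    port: 8080,
    path,
    method: "GET",
    headers: { Host: "shopswift.local" },
  };
}

// Fetch a path through the ingress and resolve with the status code and body.
function getViaIngress(path) {
  return new Promise((resolve, reject) => {
    const req = http.request(ingressRequestOptions(path), (res) => {
      let body = "";
      res.setEncoding("utf8");
      res.on("data", (chunk) => (body += chunk));
      res.on("end", () => resolve({ status: res.statusCode, body }));
    });
    req.on("error", reject);
    req.end();
  });
}
```

With the port-forward running, `getViaIngress("/version").then(console.log)` should show which environment is currently active.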
Phase 5: Deploying Green and Switching Traffic
With Blue stable, I deployed Green:
Deployment: shopswift-green
Service: shopswift-green-service
Before switching traffic, I tested Green internally through its own Service. This is non-negotiable in a proper blue-green workflow - Green pods running does not mean Green is ready to serve users.
Green internal check confirmed:
{
"app": "ShopSwift",
"version": "v2.0.0",
"environment": "green",
"commit": "minikube-green",
"port": 3000,
"status": "running"
}
Both environments were now running. Time to switch traffic.
The Failure That Made This Project Better
My first switching approach was to patch the Ingress backend directly:
# Before
backend:
  service:
    name: shopswift-blue-service
# After patching
backend:
  service:
    name: shopswift-green-service
Logical. Clean-looking. But during a continuous zero-downtime test:
FAILED request: status=503
Failed requests: 1 of 26
A single 503 during a traffic switch means the design cannot honestly be called zero-downtime. I did not hide this result. I used it to understand what was happening.
The likely cause: when the Ingress backend is patched, NGINX reloads its configuration. During that reload - even briefly - upstream connections can fail. One request landed in that gap.
The Fix: The Stable Active Service Pattern
Instead of touching the Ingress, I introduced a stable intermediary:
# shopswift-active-service - this never changes in Ingress
apiVersion: v1
kind: Service
metadata:
  name: shopswift-active-service
spec:
  selector:
    app: shopswift
    environment: blue # <-- only this changes during a switch
The Ingress always points to shopswift-active-service. To switch traffic, I only patch the selector:
# Switch to Green
kubectl patch service shopswift-active-service \
-n ecommerce-bluegreen \
--type='merge' \
-p '{"spec":{"selector":{"app":"shopswift","environment":"green"}}}'
# Roll back to Blue
kubectl patch service shopswift-active-service \
-n ecommerce-bluegreen \
--type='merge' \
-p '{"spec":{"selector":{"app":"shopswift","environment":"blue"}}}'
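A Service selector update is a single API write. Kubernetes then recomputes the Service's endpoints, and the NGINX Ingress Controller can pick up endpoint changes without reloading NGINX, so there is no reload gap for a request to land in. This is the architecture improvement that made zero-downtime achievable.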
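The selector mechanics are simple enough to model in a few lines. The sketch below builds the same merge-patch body passed to `kubectl patch` above and shows how label matching decides which pods sit behind the active Service. The pod names are hypothetical and the matching logic is a simplified model of equality-based selectors, not Kubernetes code.

```javascript
// Build the merge-patch body used to point the active Service at one environment.
function selectorPatch(environment) {
  return JSON.stringify({ spec: { selector: { app: "shopswift", environment } } });
}

// A Service selector matches a pod when every selector key equals the pod's label.
function selects(selector, podLabels) {
  return Object.entries(selector).every(([key, value]) => podLabels[key] === value);
}

// Hypothetical pods, one per environment.
const pods = [
  { name: "shopswift-blue-1", labels: { app: "shopswift", environment: "blue" } },
  { name: "shopswift-green-1", labels: { app: "shopswift", environment: "green" } },
];

const selector = JSON.parse(selectorPatch("green")).spec.selector;
const active = pods.filter((pod) => selects(selector, pod.labels)).map((pod) => pod.name);
console.log(active); // prints [ 'shopswift-green-1' ]
```

Because both environments share the `app: shopswift` label, the `environment` key is the only thing separating them, which is why a one-field patch is enough to move all traffic.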
Result After the Fix
Blue to Green switch:
Total requests: 26
Failed requests: 0
Zero-downtime availability test: PASSED
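The repository's test is a shell script; the idea behind it can be sketched in JavaScript as a loop that fires sequential requests while the switch happens and tallies anything that is not a success. The pass criterion here (any status outside 200-399, or a thrown error, counts as failed) is my assumption, as are the function names.

```javascript
// Summarize a list of HTTP status codes; status 0 stands for a request that threw.
function summarize(statuses) {
  const failed = statuses.filter((status) => status < 200 || status >= 400).length;
  return { total: statuses.length, failed, passed: failed === 0 };
}

// Send `count` sequential requests, pausing `delayMs` between them.
// `getStatus` is any async function that resolves to an HTTP status code.
async function runAvailabilityTest(getStatus, count, delayMs) {
  const statuses = [];
  for (let i = 0; i < count; i++) {
    try {
      statuses.push(await getStatus());
    } catch {
      statuses.push(0); // connection errors count as failures
    }
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return summarize(statuses);
}
```

The selector patch is applied from another terminal while the loop is running, so some requests land before the switch, some during, and some after. A single non-success status is enough to fail the run.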
Phase 6: Zero-Downtime Rollback
Rollback in a blue-green system should not require rebuilding or redeploying the previous version. Blue never stopped running. It was just not receiving traffic.
To roll back, I patched the selector back to Blue:
Green to Blue rollback:
Total requests: 25
Failed requests: 0
Zero-downtime availability test: PASSED
Key operational insight: rollback is a routing decision, not a deployment operation. Blue stayed warm throughout Green's tenure as active. Recovery time is measured in seconds, not minutes.
Phase 7: Simulating a Broken Release
A deployment strategy that only handles successful releases is incomplete.
I simulated a broken Green release by configuring the /ready endpoint to return HTTP 503:
FORCE_NOT_READY=true
The readiness probe failed on every broken pod, so Kubernetes never added those pods to the Service's endpoints. The rollout timed out. Traffic never left Blue.
Live /version response during the broken Green simulation:
{
"app": "ShopSwift",
"version": "v1.0.0",
"environment": "blue",
"commit": "minikube-blue",
"port": 3000,
"status": "running"
}
The live smoke test still passed. Users were never affected.
Readiness probes are not decoration. They are an operational safety control. A pod that fails its readiness probe is kept out of Service endpoints and receives no production traffic through the Service - regardless of whether it is running.
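The interaction between the flag, the probe, and traffic can be modelled as follows. This is a sketch under assumptions: `FORCE_NOT_READY` is the variable named above, but the exact handler logic and the pod objects are hypothetical, and the filtering is a simplified model of how endpoints are chosen rather than Kubernetes code.

```javascript
// Status code for /ready: 503 when FORCE_NOT_READY is "true", otherwise 200.
function readinessStatus(env) {
  return env.FORCE_NOT_READY === "true" ? 503 : 200;
}

// Model of endpoint selection: a pod receives Service traffic only if its labels
// match the selector AND its readiness check passes.
function servingPods(podList, selector) {
  return podList
    .filter((pod) => Object.entries(selector).every(([key, value]) => pod.labels[key] === value))
    .filter((pod) => readinessStatus(pod.env) === 200)
    .map((pod) => pod.name);
}
```

In this model a running Green pod with `FORCE_NOT_READY=true` matches its own Service's selector but still yields an empty serving list, which is the behaviour the broken-release simulation relied on.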
Phase 8: GitHub Actions CI Pipeline
The GitHub Actions workflow validated every push:
# .github/workflows/ci.yml
steps:
- Checkout repository
- Setup Node.js 20
- npm ci
- Run Jest unit tests
- Docker image build
- Trivy vulnerability scan
- kubeconform Kubernetes manifest validation
Challenge: Trivy Action Version Not Found
The initial workflow referenced:
uses: aquasecurity/trivy-action@0.24.0
GitHub Actions could not resolve this version. The fix was to pin a verified release tag from the Trivy Action releases page and pin the Trivy binary version to match.
Final CI result:
ShopSwift CI: PASSED
This phase closed an important gap. The project now had automated validation, not just manual testing.
Phase 9: AWS EKS Cloud Deployment
Local validation proved the architecture. AWS EKS proved it held on a managed cloud cluster behind a real load balancer.
Cloud stack:
| Component | Service |
|---|---|
| Kubernetes | AWS EKS |
| Container registry | Amazon ECR |
| Ingress | NGINX Ingress Controller via Helm |
| Load balancer | AWS Network Load Balancer (provisioned by NGINX Helm chart) |
| Cluster provisioning | eksctl |
Cluster configuration:
Cluster: shopswift-bluegreen-eks
Region: us-east-1
Nodes: 2 x t3.small
Images pushed to ECR:
677276115158.dkr.ecr.us-east-1.amazonaws.com/shopswift:v1.0.0
677276115158.dkr.ecr.us-east-1.amazonaws.com/shopswift:v2.0.0
AWS Blue baseline confirmed:
{
"app": "ShopSwift",
"version": "v1.0.0",
"environment": "blue",
"commit": "aws-blue",
"port": 3000,
"status": "running"
}
AWS Blue to Green
Total requests: 21
Failed requests: 0
AWS zero-downtime switch: PASSED
AWS Green to Blue Rollback
Total requests: 23
Failed requests: 0
AWS zero-downtime rollback: PASSED
The active Service selector pattern that was proven in Minikube held in AWS without modification. The architecture was portable.
Phase 10: Prometheus and Grafana Monitoring
Deployment is only part of production readiness. Observability is the other part.
I installed kube-prometheus-stack on EKS via Helm. The stack included:
- Prometheus - metrics collection and storage
- Grafana - visualization and dashboards
- Alertmanager - alert routing
- kube-state-metrics - Kubernetes object metrics
- node-exporter - node-level metrics
- Prometheus Operator - automated scrape configuration
I also enabled NGINX Ingress metrics for request-level visibility.
Example PromQL queries run against the cluster:
up
kube_pod_info{namespace="ecommerce-bluegreen"}
kube_deployment_status_replicas_available{namespace="ecommerce-bluegreen"}
container_cpu_usage_seconds_total{namespace="ecommerce-bluegreen"}
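Queries like these can also be sent programmatically through the Prometheus HTTP API's instant-query endpoint, `/api/v1/query`. The sketch below only builds the request URL; the base address `http://localhost:9090` assumes a local port-forward to Prometheus and is not taken from the project.

```javascript
// Build an instant-query URL for the Prometheus HTTP API (/api/v1/query).
function promQueryUrl(baseUrl, promql) {
  const url = new URL("/api/v1/query", baseUrl);
  url.searchParams.set("query", promql); // handles encoding of braces and quotes
  return url.toString();
}

// Example with one of the queries above; the base URL is an assumed port-forward address.
const replicasUrl = promQueryUrl(
  "http://localhost:9090",
  'kube_deployment_status_replicas_available{namespace="ecommerce-bluegreen"}'
);
console.log(replicasUrl);
```

Fetching that URL returns JSON with a `status` field and a `data.result` array, which a script could inspect to confirm that both deployments report available replicas.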
Grafana showed the ShopSwift namespace, Blue and Green pods, the ingress-nginx namespace, and the monitoring namespace - all with pod and node-level metrics.
One thing I want to be clear about: the monitoring evidence demonstrates Kubernetes infrastructure observability - pod health, replica counts, CPU usage, ingress traffic. It does not demonstrate business-level observability such as checkout conversion rates or revenue metrics. Overclaiming monitoring scope weakens a project. I documented this distinction explicitly.
AWS Teardown and Cost Control
Cloud engineering is not only about building. It is about cleaning up responsibly.
After capturing all evidence, I deleted every AWS resource associated with this project:
EKS cluster
Worker EC2 instances
AWS Network Load Balancer
ECR repositories
EBS volumes
Elastic IPs
NAT Gateway
CloudFormation stacks
Project IAM roles
Project security groups
Final verification:
EKS clusters: []
Load balancers: none
ECR repository: not found
EC2 running instances: none
Available EBS volumes: none
Elastic IPs: none
I also performed a broader AWS cost sweep and identified non-project items that needed review:
RDS manual snapshot: bio-platform-postgres (eu-west-2)
Old S3 Terraform state bucket
DynamoDB Terraform lock table
Possible customer-managed KMS keys
Knowing what you built is not enough. Knowing what is still running - and what it costs - is part of the engineering discipline.
Key Lessons
1. Zero downtime must be proven under traffic, not assumed
The first design produced a 503. The improved design produced zero failed requests across every switch. The difference was measured in continuous load tests, not in YAML reviews.
2. Readiness probes are operational safety controls
The broken Green release failed its readiness probe and never received live traffic. That is the intended behaviour. It worked exactly as designed.
3. Keep the stable thing stable
Patching the Ingress caused a brief NGINX reload and dropped a request. Keeping the Ingress stable and patching only the Service selector eliminated that failure mode. In distributed systems, minimise what changes during a live operation.
4. Validate locally before going to cloud
Minikube let me find the 503, design the fix, and prove the improved architecture - before a single dollar was spent on AWS. Local validation is not just a convenience; it is cost control.
5. Teardown is part of the engineering work
Every cloud resource left running is a cost risk. Cleanup is not an afterthought.
6. Monitoring claims should match monitoring evidence
The Grafana dashboards showed Kubernetes metrics. That is what I claimed. Nothing more.
What I Would Add Next
If this project were extended into a production system:
- Real DNS + TLS via cert-manager and ExternalDNS
- Progressive delivery with Argo Rollouts or Flagger - automated canary analysis before promotion
- Automated promotion gates - promote Green only if error rate and latency are within bounds
- Slack/PagerDuty alerting via Alertmanager
- Horizontal Pod Autoscaler for traffic-driven scaling
- PodDisruptionBudgets to protect availability during node maintenance
- NetworkPolicies for pod-level traffic isolation
- External Secrets Operator for secrets management
- IRSA (IAM Roles for Service Accounts) to replace node-level IAM
- Centralized logging with Fluent Bit + OpenSearch or Loki
- Business-level metrics - checkout attempts, success rates, latency by environment
- Infrastructure as Code with Terraform or Pulumi
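As an illustration of the automated promotion gate idea in the list above, a gate can be as small as a function that compares observed metrics against limits and explains any refusal. Everything here is hypothetical: the metric names, the thresholds, and the function itself are a sketch of the concept, not something built in this project.

```javascript
// Sketch of a promotion gate: promote Green only if observed metrics are within limits.
function shouldPromote(metrics, limits) {
  const reasons = [];
  if (metrics.errorRate > limits.maxErrorRate) {
    reasons.push("error rate above limit");
  }
  if (metrics.p95LatencyMs > limits.maxP95LatencyMs) {
    reasons.push("p95 latency above limit");
  }
  return { promote: reasons.length === 0, reasons };
}

// Hypothetical limits and observations.
const decision = shouldPromote(
  { errorRate: 0.002, p95LatencyMs: 180 },
  { maxErrorRate: 0.01, maxP95LatencyMs: 250 }
);
console.log(decision);
```

In a real pipeline the metrics would come from Prometheus queries against Green's internal Service, and the selector patch would run only when `promote` is true.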
Final Outcome
This project proved a complete local-to-cloud DevOps release workflow:
Code
|
Tests (Jest + Supertest)
|
Docker image
|
Minikube validation
|
Blue-Green deployment (active Service selector pattern)
|
Zero-downtime rollback
|
Broken release protection (readiness probes)
|
GitHub Actions CI (tests, Trivy, kubeconform)
|
AWS EKS cloud deployment
|
Prometheus + Grafana monitoring
|
AWS teardown and cost sweep
The most valuable part of this project was not that everything worked on the first try. It did not.
The most valuable part was that a real deployment failure was detected, analysed, fixed, and validated again under load. That is the full engineering cycle - not just implementation, but discovery and improvement.
Repository
github.com/gbadedata/shopswift-blue-green
If you are reviewing this project, the strongest places to start:
- k8s/active-service.yaml - the stable switching mechanism
- scripts/zero-downtime-test.sh - the load test that caught the 503
- k8s/broken-green/ - the readiness probe failure simulation
- .github/workflows/ci.yml - the full CI pipeline
- evidence/ - AWS EKS deployment, monitoring, and teardown evidence
- README.md - complete deployment walkthrough with copy-paste commands
Built with Node.js, Docker, Kubernetes, GitHub Actions, AWS EKS, Prometheus, and Grafana. Questions or feedback? Drop a comment below.