Building a Self-Healing Blue/Green Deployment with Nginx & Docker

The 2:47 AM Lesson That Changed Everything

It was 2:47 AM.
Our backend service had gone quiet — not a hard crash, not an obvious error, just silence.
Nginx sat waiting, clients kept retrying, and every second felt longer than the last.

That was the night I learned the real definition of downtime:

“Downtime isn’t just when the server stops — it’s when users stop trusting your system.”

So when the HNG DevOps Stage 2 task came, I didn’t just want to make a deployment work.
I wanted to build something that heals itself.

My goal was clear:

  • Run two identical app containers — Blue and Green.
  • Let Nginx automatically fail over if one goes down.
  • Keep responses clean, fast, and consistent.
  • Never show a 5xx error again.

What Blue/Green Deployment Really Means

A Blue/Green Deployment is a release strategy where two identical environments live side-by-side:

| Environment | Role | Purpose |
| --- | --- | --- |
| Blue | Active | Serves live user traffic |
| Green | Standby | Runs silently, ready to take over |

When Blue fails or you roll out an update, Green steps up instantly.
Traffic switches without downtime.

Think of it like a plane with two engines — both spinning, but only one giving thrust.
If the active engine fails, you flip a switch, and the passengers never notice.
That’s exactly what we achieve in DevOps terms:

Blue = Active upstream
Green = Backup upstream
Nginx = Pilot deciding which one should fly

My Three-Layer Architecture Stack

Here’s the structure I built from the ground up:

[Architecture diagram: Nginx reverse proxy in front of the Blue and Green app containers, all on one Docker network]

All services live in a single Docker Network, communicating through internal DNS.
Nginx listens on port 8080, while both app containers expose /healthz and /version endpoints.
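A quick sanity check of that wiring from the host (assuming Nginx proxies both endpoints straight through to the active pool):

```bash
# Nginx listens on 8080; these hit whichever pool is currently active
curl -s http://localhost:8080/healthz
curl -s http://localhost:8080/version
```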


Inside the Intelligence of Nginx Failover

Failover magic starts inside the upstream block of the Nginx configuration:

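A minimal sketch of that upstream block, reconstructed from the settings described below (the app port is an assumption):

```nginx
upstream app_pool {
    # Blue is the primary; a single failure benches it for 3 seconds
    server app_blue:3000 max_fails=1 fail_timeout=3s;

    # Green receives traffic only while Blue is marked down
    server app_green:3000 backup;
}
```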

Let’s break this down:

  • app_blue is the primary server.
  • app_green is marked as backup — it stays idle until Blue fails.
  • max_fails=1 means one failure is enough to demote Blue.
  • fail_timeout=3s decides how long Blue stays on the bench.

When Blue fails even once, Nginx marks it down for the next 3 seconds and instantly reroutes traffic to Green — no manual restart, no DNS propagation, no human intervention.

That tiny keyword backup turns Nginx from a static load balancer into a self-healing reverse proxy.


Retry Logic — The Secret Behind Zero Downtime

To make the transition invisible to users, I implemented this retry policy:

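The exact values here are my reconstruction, but a retry policy in this spirit looks like:

```nginx
location / {
    proxy_pass http://app_pool;

    # Retry on connection errors, timeouts, and 5xx responses;
    # non_idempotent extends retries to POST requests as well
    proxy_next_upstream error timeout http_500 http_502 http_503 http_504 non_idempotent;

    # One original attempt plus a single retry, bounded in time
    proxy_next_upstream_tries 2;
    proxy_next_upstream_timeout 5s;
}
```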

Here’s what happens:

  1. A request hits Blue.
  2. Blue delays or sends a 500.
  3. Nginx checks: “Is this retriable?”
  4. It retries once, this time on Green.
  5. Green returns 200 OK — the user never notices.

It even retries POST requests (non_idempotent) safely.
Because in real life, reliability > rigidity.


Docker Compose — Orchestrating Everything

This was my docker-compose.yml setup:

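A trimmed-down sketch of the layout (the build paths, app port, color-tag variable, and healthcheck details are assumptions, not the exact file):

```yaml
services:
  app_blue:
    build: ./app
    environment:
      - APP_COLOR=blue          # illustrative color tag
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:3000/healthz"]
      interval: 5s
      retries: 3

  app_green:
    build: ./app
    environment:
      - APP_COLOR=green
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:3000/healthz"]
      interval: 5s
      retries: 3

  nginx:
    image: nginx:alpine
    ports:
      - "8080:80"               # Nginx listens on 8080 on the host
    environment:
      - ACTIVE_POOL=blue        # which pool is live by default
    depends_on:
      - app_blue
      - app_green
```

All three services share the default Compose network, which is what gives Nginx the internal DNS names app_blue and app_green.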

Two services (app_blue, app_green) are identical except for their color tags.
Nginx uses environment variables like ACTIVE_POOL to decide which one is live.



⚙️ The Active Pool Switch Script

To make switching seamless, I wrote a small shell script — render-and-reload.sh.
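The idea, sketched below with an assumed config path and app port, is to render the upstream block from ACTIVE_POOL and then hot-reload Nginx:

```sh
#!/bin/sh
# render-and-reload.sh (sketch): choose the primary pool from ACTIVE_POOL,
# rewrite the upstream config, and reload Nginx without dropping connections
set -eu

if [ "${ACTIVE_POOL:-blue}" = "green" ]; then
  PRIMARY=app_green; BACKUP=app_blue
else
  PRIMARY=app_blue; BACKUP=app_green
fi

cat > /etc/nginx/conf.d/upstream.conf <<EOF
upstream app_pool {
    server ${PRIMARY}:3000 max_fails=1 fail_timeout=3s;
    server ${BACKUP}:3000 backup;
}
EOF

nginx -s reload
```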

So with one command:

```bash
docker compose exec -T nginx sh -lc 'ACTIVE_POOL=green /opt/nginx/render-and-reload.sh'
```

I can flip traffic instantly from Blue to Green — no rebuild, no downtime.



The Moment of Truth — Chaos Testing

Once everything was running, I simulated failure and verified the behavior with a script that deliberately breaks things:

```bash
bash scripts/verify.sh
```
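The real verify.sh lives in the repo; a stripped-down version of the same idea (endpoint, request count, and the way Blue is killed are all assumptions) looks like this:

```bash
#!/bin/bash
# Stop Blue mid-traffic and count any non-200 responses from Nginx
set -eu

docker compose stop app_blue &    # simulate Blue dying while requests are in flight

failures=0
for _ in $(seq 1 50); do
  code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/healthz)
  [ "$code" = "200" ] || failures=$((failures + 1))
done

wait                                  # let the stop finish
echo "non-200 responses: $failures"   # expect 0 if failover and retries work
docker compose start app_blue         # bring Blue back for the next round
```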

The results were satisfying:

[verify.sh output: a stream of 200 OK responses]

Every request kept returning 200 OK even while Blue was being stressed.
That’s not just a passing test — that’s trust preserved.


Key Concepts I Learned (Definitions for Interns Coming After Me)

| Term | Definition |
| --- | --- |
| Upstream | A group of backend servers that Nginx load-balances between. |
| Backup | A server only used when primaries fail. |
| Failover | Automatic switch to a standby system upon failure. |
| Healthcheck | Automated test to ensure a container is alive and healthy. |
| Zero-Downtime Deployment | Releasing new versions without interrupting service availability. |

Why This Matters in the Real World

In real production systems, even a minute of downtime can cost users, transactions, or trust.
What this setup teaches is graceful failure — planning for the moment things go wrong.

And the best part?
It’s all built with open-source tools:

  • Docker for container orchestration
  • Nginx for reverse proxy and failover
  • Shell scripts for automation
  • No fancy load balancer bills, no proprietary magic

Final Thoughts

This project wasn’t just about passing a task — it was a lesson in resilience.

In all my DevOps practice, I had never thought about implementing this myself, because there are so many tools that already do it.
I learned that DevOps isn’t only about deployment pipelines or automation scripts.
It’s about ensuring that no matter what happens, your users stay unaffected.

When the next 2:47 AM crash happens — your system won’t panic.
It’ll heal itself.
GitHub Repo: DestinyObs/HNGi13-Stage2-Task


I’m DestinyObs | iDeploy | iSecure | iSustain

