Auto-Failover, Zero Downtime, and Manual Traffic Switching
One of the core responsibilities of a DevOps engineer is ensuring application availability in the presence of failure. Downtime is rarely caused by deployments themselves, but by how traffic is handled when something goes wrong.
In Stage 2 of my DevOps internship, I implemented a Blue/Green deployment architecture using Nginx upstreams and Docker Compose, focusing on:
- Zero failed client requests during outages
- Automatic failover within a single request
- Manual traffic switching without restarting containers
- No application code changes
- No image rebuilds
This article is a beginner-friendly but production-accurate walkthrough of the solution, explaining both the configuration and the runtime behavior in detail.
Problem Overview
We are provided with two identical Node.js services packaged as pre-built Docker images:
- Blue — primary (active)
- Green — backup
Each service exposes the following endpoints:
| Endpoint | Purpose |
|---|---|
| GET /version | Returns JSON + headers |
| GET /healthz | Liveness check |
| POST /chaos/start | Simulates failure |
| POST /chaos/stop | Restores service |
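Once the stack described below is running, the Blue instance is also reachable directly on its mapped host port (8081 in this setup, per the Compose file later in this article), which makes it easy to poke these endpoints by hand:

# Query the Blue instance directly on its host-mapped port
curl -s http://localhost:8081/version
curl -s http://localhost:8081/healthz

# Toggle the simulated failure on and off
curl -s -X POST "http://localhost:8081/chaos/start?mode=error"
curl -s -X POST http://localhost:8081/chaos/stop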
The task is to place Nginx in front of both services and guarantee:
- All traffic goes to Blue by default
- On Blue failure, Nginx automatically switches to Green
- No client request returns non-200 during failover
- Application response headers are forwarded unchanged
- Traffic can be manually toggled between Blue and Green
Architecture Overview
The final architecture is intentionally simple and production-aligned:
Client --> Nginx (8080) --> Blue App (8081) OR failover to Green App (8082)
Key characteristics:
- Nginx is the single public entrypoint
- Blue/Green run simultaneously
- Docker Compose orchestrates everything
- No Kubernetes, no service mesh, no rebuilds
Environment-Driven Configuration
All behavior is controlled via environment variables, making the setup CI-friendly and reproducible.
Key variables:
- BLUE_IMAGE, GREEN_IMAGE
- ACTIVE_POOL (blue or green)
- RELEASE_ID_BLUE, RELEASE_ID_GREEN
- PORT, BLUE_PORT, GREEN_PORT
- NGINX_PORT
This design ensures:
- No hardcoded values
- Safe traffic switching
- Easy automated verification
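For reference, a .env along these lines drives the whole setup. The image names and release IDs below are placeholders, the ports match the architecture diagram above, and the application port is an assumption:

# .env (illustrative values)
BLUE_IMAGE=registry.example.com/sample-app:blue     # placeholder image reference
GREEN_IMAGE=registry.example.com/sample-app:green   # placeholder image reference
ACTIVE_POOL=blue
RELEASE_ID_BLUE=release-blue-001                    # placeholder release ID
RELEASE_ID_GREEN=release-green-001                  # placeholder release ID
PORT=3000                                           # container port the Node.js app listens on (assumed)
BLUE_PORT=8081
GREEN_PORT=8082
NGINX_PORT=8080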
Docker Compose: Service Breakdown
Blue Application Service
app_blue:
  image: ${BLUE_IMAGE}
  container_name: app_blue
  restart: always
  environment:
    - PORT=${PORT}
    - RELEASE_ID=${RELEASE_ID_BLUE}
    - APP_POOL=blue
  expose:
    - "${PORT}"
  ports:
    - "${BLUE_PORT}:${PORT}"
  healthcheck:
    test: ["CMD-SHELL", "node -e \"process.exit(0)\""]
    interval: 5s
    timeout: 2s
    retries: 3
What this achieves:
- Runs the provided Blue image without modification
- Injects runtime metadata used in response headers
- Exposes the service internally to Nginx
- Maps a direct port (8081) for chaos testing
- Keeps the container healthy and restartable
The Green service is identical, differing only in image, release ID, and port.
This symmetry is critical for Blue/Green deployments.
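With both containers up, each pool can be checked directly through its host port before Nginx enters the picture; the headers confirm which pool and release answered:

# Hit each app on its mapped port and inspect only the response headers
curl -s -D - -o /dev/null http://localhost:8081/version   # expect X-App-Pool: blue
curl -s -D - -o /dev/null http://localhost:8082/version   # expect X-App-Pool: green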
Nginx Reverse Proxy
nginx:
  image: nginx:latest
  ports:
    - "${NGINX_PORT}:80"
  volumes:
    - ./nginx/nginx.tmpl:/etc/nginx/templates/default.conf.template:ro
    - ./nginx/entrypoint.sh:/docker-entrypoint.d/10-envsubst.sh:ro
    - ./nginx-logs:/var/log/nginx
  environment:
    - ACTIVE_POOL=${ACTIVE_POOL}
    - PORT=${PORT}
Key decisions:
- Nginx is the only public interface
- Configuration is templated, not static
- Logs are persisted for inspection and alerting
- No container restarts needed for traffic switching
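Two quick sanity checks follow from these decisions: the rendered configuration can be validated inside the running container, and the persisted logs can be followed from the host (the log file names assume Nginx's defaults):

# Validate the rendered Nginx configuration without touching the container lifecycle
docker compose exec nginx nginx -t

# Follow the persisted logs from the host
tail -f ./nginx-logs/access.log ./nginx-logs/error.log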
Nginx Upstreams: Blue/Green Routing
The heart of the solution lies in the Nginx upstream configuration.
Timeout and Retry Configuration
proxy_connect_timeout 1s;
proxy_read_timeout 5s;
proxy_send_timeout 3s;
proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
proxy_next_upstream_tries 2;
proxy_next_upstream_timeout 8s;
These values ensure:
- Failures are detected quickly (a dead connection is caught within 1 second, a hung response within 5)
- Retries happen automatically, but only once more and only inside the 8-second proxy_next_upstream_timeout window
- Total request time therefore stays bounded at roughly 10 seconds even in the worst case
- Clients never see partial or failed responses
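These bounds are easy to observe from the client side. A request through Nginx reports both its status code and total time, which should stay at 200 and within the retry budget even while Blue is failing:

# Status code and end-to-end time for one request through Nginx
curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" http://localhost:8080/version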
Primary / Backup Upstreams
upstream blue {
    server app_blue:${PORT} max_fails=1 fail_timeout=5s;
    server app_green:${PORT} backup;
}

upstream green {
    server app_green:${PORT} max_fails=1 fail_timeout=5s;
    server app_blue:${PORT} backup;
}
Why this works:
- max_fails=1 marks the primary unhealthy after a single failure
- fail_timeout=5s keeps a failed primary out of rotation for only 5 seconds, so traffic returns to it quickly once it recovers
- backup ensures Green is only used when Blue fails
- The same config supports both active pools
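The server block (rendered by envsubst, covered below) presumably points proxy_pass at whichever upstream name ACTIVE_POOL expands to. One way to confirm which pool is currently live is to grep the rendered config inside the container; the default.conf path assumes the official nginx image's template output directory:

# Show which upstream the rendered server block targets
# (path assumes templates are rendered into /etc/nginx/conf.d/)
docker compose exec nginx grep -n "proxy_pass" /etc/nginx/conf.d/default.conf
# Expected with ACTIVE_POOL=blue: proxy_pass http://blue;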
Deep Dive: Request Flow During Failure
This is the most important part of the system.
Normal Operation
- Client sends:
GET http://localhost:8080/version
- Nginx forwards to Blue
- Blue responds with 200
- Headers returned:
X-App-Pool: blue
X-Release-Id: <RELEASE_ID_BLUE>
Failure Scenario (Blue Down)
Chaos is induced directly on Blue:
POST http://localhost:8081/chaos/start?mode=error
Now let’s trace a single client request.
Step 1: Request Hits Nginx
The client is unaware of Blue or Green.
Step 2: Nginx Proxies to Blue
Blue is the primary upstream.
Step 3: Blue Fails
Blue returns a 5xx or times out.
Step 4: Nginx Intercepts the Failure
Because of:
proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
Nginx does not forward the failure to the client.
Step 5: Immediate Retry to Green
Within the same client request, Nginx retries the request to Green.
Step 6: Green Responds Successfully
Green returns:
HTTP 200
X-App-Pool: green
X-Release-Id: <RELEASE_ID_GREEN>
Result:
The client sees HTTP 200, even though Blue failed.
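The whole trace can be reproduced from a terminal: with chaos active on Blue, a single request through Nginx still comes back 200, now answered by Green.

# Break Blue directly on its mapped port
curl -s -X POST "http://localhost:8081/chaos/start?mode=error"

# One request through Nginx: the failed attempt and the retry happen inside this single call
curl -s -D - -o /dev/null http://localhost:8080/version
# Expect: HTTP/1.1 200 OK ... X-App-Pool: green

# Restore Blue afterwards
curl -s -X POST http://localhost:8081/chaos/stop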
Why Proxy Buffering Matters
proxy_buffering on;
This ensures Nginx does not stream partial responses.
If Blue fails mid-request, Nginx can safely retry Green without exposing errors to clients.
Header Preservation
Each application response includes:
- X-App-Pool
- X-Release-Id

Nginx forwards these headers unchanged:
proxy_pass_header X-App-Pool;
proxy_pass_header X-Release-Id;
This allows:
- CI validation
- Runtime verification
- Clear observability of which pool served the request
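A one-liner confirms the headers survive the proxy hop:

# Dump only the response headers for a request through Nginx and filter the two we care about
curl -s -D - -o /dev/null http://localhost:8080/version | grep -iE "x-app-pool|x-release-id"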
Manual Blue/Green Switching
Traffic switching is handled by configuration templating.
Entrypoint Script
envsubst '$ACTIVE_POOL $PORT $RELEASE_ID_BLUE $RELEASE_ID_GREEN' \
< default.conf.template > default.conf
This allows:
- Changing ACTIVE_POOL=green
- Regenerating the Nginx config
- Reloading Nginx without downtime

No containers are restarted.
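In practice, a switch looks roughly like the following; it assumes the mounted entrypoint script re-renders the template from the ACTIVE_POOL it sees in its environment, which is how the setup above is wired:

# Flip the active pool for this render only, re-run the template step inside the
# running container, then hot-reload Nginx. Nothing is recreated or restarted.
docker compose exec -e ACTIVE_POOL=green nginx sh /docker-entrypoint.d/10-envsubst.sh
docker compose exec nginx nginx -s reload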
Stability Under Sustained Failure
During a ~10 second request loop:
- Zero non-200 responses
- ≥95% responses from Green
- Blue remains isolated until healthy
This satisfies all grader stability requirements.
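A loop along these lines (close in spirit to the grader's check) makes those numbers concrete; it assumes bash and tallies status-code/pool combinations for about ten seconds:

# Fire requests through Nginx for ~10 seconds and count "status pool" combinations (bash)
end=$((SECONDS + 10))
while [ "$SECONDS" -lt "$end" ]; do
  curl -s -D - -o /dev/null http://localhost:8080/version \
    | tr -d '\r' \
    | awk 'NR==1 {code=$2} tolower($1)=="x-app-pool:" {pool=$2} END {print code, pool}'
done | sort | uniq -c
# Healthy output during Blue chaos: only "200 green" (plus at most a few "200 blue" from before the switch)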
Key Takeaways
This project demonstrates:
- Blue/Green deployment without Kubernetes
- Auto-failover within a single HTTP request
- Resilience implemented at the proxy layer
- Environment-driven infrastructure design
- Production-grade reliability using simple tools
Conclusion
High availability is not about avoiding failure—it’s about handling failure correctly.
By combining Nginx upstreams, Docker Compose, and strict timeout and retry controls, we achieve:
- Zero downtime
- Safe rollbacks
- Transparent failover
- CI-ready verification
This approach mirrors real production systems and is an excellent foundation for any DevOps engineer.
If you’re learning DevOps, mastering patterns like this matters far more than chasing tools. Reliability is a design choice.
Explore the code here