Great Stack to Doesn't Work #8
Load Balancer: "Traffic Incoming, Nothing Standing"
A survival guide for when everything goes wrong in production.
Your application handles 1,000 requests per second without breaking a sweat. You put a load balancer in front of it. Now it handles 200 requests per second and half of them return 502.
The load balancer, the thing you deployed to improve reliability, just became the single point of failure. Not because it's broken — because you configured it wrong.
Nginx vs HAProxy vs Envoy: The Decision Tree
These three dominate the load balancer space. They overlap significantly but each has a sweet spot.
Nginx: The Swiss army knife. Web server, reverse proxy, load balancer, static file server. If you're already running Nginx for your web server and need basic load balancing (round-robin, least connections, IP hash), adding upstream configuration is trivial. Configuration is file-based, hot-reloadable with nginx -s reload.
Best for: teams that want simplicity and are already in the Nginx ecosystem. Small to medium traffic. Static configuration that doesn't change often.
HAProxy: Purpose-built for load balancing. More sophisticated health checking, connection management, and traffic routing than Nginx. The stats page gives you real-time visibility into backend health, connection counts, and error rates. ACL-based routing is powerful for complex traffic patterns.
Best for: high-traffic environments where you need fine-grained control over connection behavior, advanced health checking, and detailed operational metrics.
Envoy: Built for service mesh and microservices. Dynamic configuration via xDS APIs (no file reloads). First-class support for gRPC, HTTP/2, and WebSocket. Built-in distributed tracing, circuit breaking, and rate limiting. Heavier and more complex than Nginx or HAProxy.
Best for: microservices architectures, especially when used as a sidecar proxy (as in Istio). Dynamic environments where backends change frequently. Teams that need service mesh capabilities.
The honest answer: for 80% of deployments, Nginx or HAProxy is sufficient. Envoy adds capabilities most teams don't need and complexity every team feels.
Connection Pooling and Keepalive: The Performance Multiplier
Every new TCP connection requires a three-way handshake: SYN, SYN-ACK, ACK. On a local network, that's ~0.5ms. Through TLS, add another 1-2ms for the TLS handshake. When your load balancer opens a new connection to a backend for every request, those milliseconds multiply by thousands of requests per second.
Upstream keepalive maintains persistent connections between the load balancer and your backends. Instead of opening a new connection per request, the load balancer reuses an existing one.
Nginx:
upstream backend {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
    keepalive 64; # Keep 64 idle connections per worker
    keepalive_timeout 60s; # Close idle connections after 60 seconds
    keepalive_requests 1000; # Max requests per connection before recycling
}
server {
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection ""; # Required for keepalive
    }
}
The proxy_http_version 1.1 and proxy_set_header Connection "" lines are critical. Without them, Nginx defaults to HTTP/1.0 for upstream connections, which doesn't support keepalive. This is the most common configuration mistake — keepalive is configured on the upstream block but disabled by the proxy settings.
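One way to check that reuse is actually happening is to log upstream timings. A minimal sketch, assuming you can add a log format in the http context; the format name upstream_timing and the log path are placeholders, not part of the setup above:

```nginx
# Goes in the http {} context. Format name and log path are illustrative.
log_format upstream_timing '$remote_addr "$request" $status '
                           'upstream=$upstream_addr '
                           'connect=$upstream_connect_time '
                           'response=$upstream_response_time';

server {
    # Write one line per request using the format above.
    access_log /var/log/nginx/upstream_timing.log upstream_timing;

    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```

With keepalive working, connect= should read 0.000 for most requests, because no new connection had to be established. If every line shows a non-zero connect time, the pool is probably not being used.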
Buffer Sizes: The Silent 502 Generator
When your backend sends a response, Nginx buffers it before forwarding to the client. If the response exceeds the buffer size, Nginx writes to a temporary file on disk. If even that fails (disk full, permissions, or buffering disabled), you get a 502.
proxy_buffer_size 16k; # Buffer for the first part of the response (headers)
proxy_buffers 8 16k; # 8 buffers of 16k each for the body
proxy_busy_buffers_size 32k; # How much can be sent to the client while still buffering
Default proxy_buffer_size is 4k or 8k depending on the platform. If your backend returns large headers (big cookies, verbose auth tokens, lots of custom headers), 4k isn't enough. The response header doesn't fit, Nginx rejects the response, and the client gets a 502.
How to diagnose: if you see upstream sent too big header while reading response header from upstream in Nginx error logs, increase proxy_buffer_size.
For large response bodies (reports, data exports, file downloads), consider proxy_buffering off to stream directly from backend to client without buffering. This reduces memory usage but means the backend connection stays open for the entire transfer duration.
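A sketch of scoping this to one path rather than disabling buffering everywhere. The /exports/ prefix and the 300s timeout are made-up examples, so substitute whatever route and limits fit your large responses:

```nginx
server {
    # Hypothetical path for large downloads: stream instead of buffering.
    location /exports/ {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;       # pass data to the client as it arrives
        proxy_read_timeout 300s;   # assumed value: max gap between two reads from the backend
    }

    # Everything else keeps the default buffered behavior.
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
```

Keeping the streaming behavior on a single location means a slow client on a big export ties up one backend connection, while ordinary requests still benefit from buffering.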
Rate Limiting: Protecting Your Backends
Rate limiting at the load balancer layer protects your backends from traffic spikes, abuse, and accidental DDoS from misbehaving clients.
Request-based rate limiting (Nginx):
limit_req_zone $binary_remote_addr zone=api:10m rate=100r/s;
server {
    location /api/ {
        limit_req zone=api burst=200 nodelay;
        proxy_pass http://backend;
    }
}
This allows 100 requests per second per IP. burst=200 lets up to 200 requests above that rate through during a brief spike, and nodelay processes those burst requests immediately instead of queuing them. Anything beyond the burst is rejected, with a 503 by default.
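Rejected requests get a 503 unless you say otherwise, which looks the same as a backend failure on a dashboard. A sketch of returning 429 instead, using the same zone as above; whether your clients handle 429 sensibly is an assumption you'd want to verify:

```nginx
limit_req_zone $binary_remote_addr zone=api:10m rate=100r/s;

server {
    location /api/ {
        limit_req zone=api burst=200 nodelay;
        limit_req_status 429;        # default is 503
        limit_req_log_level warn;    # default is error
        proxy_pass http://backend;
    }
}
```

Separating 429s from 5xx responses makes it easier to tell "we throttled this client" apart from "something is broken" when reading graphs at 3 a.m.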
Connection-based rate limiting:
limit_conn_zone $binary_remote_addr zone=conn:10m;
server {
    location / {
        limit_conn conn 50; # Max 50 concurrent connections per IP
        proxy_pass http://backend;
    }
}
Connection limits protect against slowloris attacks and clients that open hundreds of connections without closing them.
Choose the right key for rate limiting. $binary_remote_addr works for direct client connections. Behind a CDN or another proxy, all requests come from the CDN's IP — you need to rate limit on a header like X-Forwarded-For or a custom API key header instead.
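Two possible ways to do that, sketched under assumptions: 203.0.113.0/24 stands in for your CDN's published address ranges, and X-API-Key is a hypothetical header name. The first option relies on ngx_http_realip_module, which most distribution packages include but which is not built by default when compiling from source.

```nginx
# Both options go in the http {} context.

# Option 1: restore the real client address, then keep keying on it.
# Only trust X-Forwarded-For when the request comes from the CDN itself.
set_real_ip_from 203.0.113.0/24;   # placeholder range, replace with your CDN's
real_ip_header X-Forwarded-For;
real_ip_recursive on;
limit_req_zone $binary_remote_addr zone=per_client:10m rate=100r/s;

# Option 2: key on an API key header instead of an address.
# Requests with an empty key are not counted against the limit.
limit_req_zone $http_x_api_key zone=per_key:10m rate=100r/s;
```

Because empty keys are not counted, option 2 on its own lets unauthenticated requests through unthrottled, so it usually makes sense alongside an address-based limit rather than instead of one.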
SSL Termination: More Than Just Certificates
SSL termination at the load balancer means clients connect via HTTPS to the load balancer, and the load balancer connects to backends via HTTP. This offloads the crypto work from your backends and centralizes certificate management.
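A minimal sketch of what termination looks like in Nginx. The hostname and certificate paths are placeholders, and the forwarded headers assume your backends want to know the original scheme and client address:

```nginx
server {
    listen 443 ssl;
    server_name example.com;                                   # placeholder hostname

    ssl_certificate     /etc/nginx/tls/example.com.fullchain.pem;  # placeholder path
    ssl_certificate_key /etc/nginx/tls/example.com.key;            # placeholder path

    location / {
        # TLS ends here; the hop to the backend is plain HTTP.
        proxy_pass http://backend;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```

The X-Forwarded-Proto header matters more than it looks: without it, backends that generate absolute URLs or redirects tend to emit http:// links.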
OCSP stapling eliminates the latency of clients checking certificate revocation status:
ssl_stapling on;
ssl_stapling_verify on;
resolver 8.8.8.8 8.8.4.4 valid=300s;
SSL session caching avoids repeating the full TLS handshake for returning clients:
ssl_session_cache shared:SSL:50m;
ssl_session_timeout 1d;
ssl_session_tickets off; # Or on, but rotate keys
Protocol and cipher selection:
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
ssl_prefer_server_ciphers on;
Disable TLS 1.0 and 1.1 — they're deprecated. Prefer TLS 1.3 when clients support it; the handshake is faster (1-RTT vs 2-RTT) and the cipher suites are simpler.
WebSocket Proxying: The Upgrade Dance
WebSocket connections start as HTTP and upgrade to a persistent bidirectional channel. Load balancers need explicit configuration to handle the upgrade.
location /ws {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 86400s; # 24 hours: don't time out idle connections
    proxy_send_timeout 86400s;
}
Without the Upgrade and Connection headers, the load balancer treats the WebSocket request as a regular HTTP request and closes the connection after the first response.
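If the same location also serves ordinary HTTP requests, hardcoding Connection "upgrade" is not ideal. A commonly used variant, sketched here, derives the header from whether the client actually asked for an upgrade; the variable name $connection_upgrade is a convention, not a built-in:

```nginx
# http {} context: pick the Connection header per request.
map $http_upgrade $connection_upgrade {
    default upgrade;   # client sent an Upgrade header
    ''      close;     # plain HTTP request, no upgrade requested
}

server {
    location /ws {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
    }
}
```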
The proxy_read_timeout is critical. Default is 60 seconds. WebSocket connections are often idle for long periods (waiting for events). A 60-second timeout kills idle connections, forcing clients to reconnect constantly.
For health checks on WebSocket backends, use a separate HTTP health check endpoint. Don't use a WebSocket handshake as the health check; it's fragile and slow.
Health Checks: Active vs Passive
Passive health checks monitor real traffic. For example: if a backend returns 5 errors in 10 seconds, the load balancer marks it unhealthy and stops sending it traffic for 30 seconds. This is reactive. You don't detect problems until real users are affected.
Nginx (open-source) only supports passive health checks:
upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
}
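What counts as a "fail" here is decided by proxy_next_upstream. Connection errors and timeouts always count; HTTP error statuses from the backend only count if you list them. A sketch that also treats 502 and 503 as failures, with the retry limit being an assumed value:

```nginx
upstream backend {
    server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=3 fail_timeout=30s;
}

server {
    location / {
        proxy_pass http://backend;
        # error and timeout are always counted; 502/503 only because they are listed.
        proxy_next_upstream error timeout http_502 http_503;
        proxy_next_upstream_tries 2;   # assumed: try at most one other backend
    }
}
```

Capping the tries keeps a single bad request from being replayed against every server in the pool.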
Active health checks send synthetic requests to backends on a schedule. If a backend fails the health check, it's removed from the pool before any user traffic reaches it. This is proactive.
HAProxy:
backend api_servers
    option httpchk GET /health
    http-check expect status 200
    server srv1 10.0.0.1:8080 check inter 5s fall 3 rise 2
    server srv2 10.0.0.2:8080 check inter 5s fall 3 rise 2
inter 5s: check every 5 seconds. fall 3: mark unhealthy after 3 consecutive failures. rise 2: mark healthy again after 2 consecutive successes.
The ideal is both: active health checks to detect unhealthy backends proactively, passive health checks as a safety net for failures that the health check endpoint doesn't catch (like the health endpoint returning 200 while the application is actually deadlocked — the theme of Episode #4).
War Story: The Night of 502s
E-commerce platform. Nginx load balancer. 8 backend servers. Normal evening traffic: 15,000 requests per second. Black Friday preview campaign launched at 20:00.
20:05 — Traffic spikes to 45,000 rps. Load balancer CPU is fine. Backends are handling it.
20:12 — 502 errors start appearing. 2%, then 5%, then 15%. Backend servers show 40% CPU usage. They're not overloaded.
20:20 — On-call engineer checks Nginx error logs: no live upstreams while connecting to upstream. All 8 backends are marked as unhealthy.
What happened: the backends were responding, but slowly. Under 3x normal traffic, response times went from 50ms to 800ms. Nginx's proxy_read_timeout was set to the default: 60 seconds. That wasn't the problem. The problem was proxy_connect_timeout: 5 seconds. Under load, the backends' TCP accept queues filled up. New connections took 6 seconds to establish. Nginx marked them as failed (connect timeout). After max_fails=3 — three timeouts in fail_timeout=30s — Nginx marked the backend as unhealthy.
All 8 backends hit the same threshold within minutes of each other. All marked unhealthy. No backends left. 100% 502s.
The fix:
- Increased proxy_connect_timeout to 15 seconds.
- Increased backend somaxconn and application listen backlog to 65535.
- Increased keepalive connections to reduce new connection overhead.
- Raised max_fails to 5 (and yes, Nginx's default max_fails is 1).
The backends were never overloaded. They were slow to accept new connections under burst load, and the load balancer's aggressive failure detection made the problem worse.
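Roughly what the Nginx side of that fix could look like. This is a reconstruction, not the team's actual config: the addresses are placeholders for the 8-server pool, and the keepalive value is assumed because the story only says it was increased.

```nginx
upstream backend {
    # Placeholder addresses; the real pool had 8 servers.
    server 10.0.0.1:8080 max_fails=5 fail_timeout=30s;
    server 10.0.0.2:8080 max_fails=5 fail_timeout=30s;
    keepalive 128;   # assumed value
}

server {
    location / {
        proxy_pass http://backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # needed for upstream keepalive
        proxy_connect_timeout 15s;        # was 5s
    }
}
```

The somaxconn change lives on the backend hosts, not in Nginx: on Linux it is the net.core.somaxconn sysctl, and the application has to request a matching backlog when it calls listen.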
War Story: The SSL Certificate Surprise
Less technical, more organizational. The SSL certificate for the production domain expired at 06:00 on a Tuesday. Auto-renewal was configured but pointed to a DNS provider account whose password someone had changed 3 months earlier. The renewal failed silently. The monitoring check for certificate expiry was set to alert at 7 days, but the team had suppressed that alert because "it auto-renews, we don't need the noise."
At 06:00, every HTTPS request to the platform failed. Browser users got a scary red warning page. API clients got TLS handshake errors. Mobile apps crashed because they enforced certificate pinning.
Time to diagnosis: 12 minutes (fast — someone was already awake).
Time to fix: 4 hours. Issuing a new certificate required DNS validation, which required accessing the DNS provider, which required a password reset, which required access to an email account that was tied to an employee who had left the company.
Lessons:
- Never suppress certificate expiry alerts. Set them at 30 days, 14 days, and 3 days.
- Monitor the actual renewal process, not just the expiry date. If renewal fails, alert immediately.
- DNS provider credentials are as critical as production credentials. Store them in the same secret manager.
- Certificate pinning in mobile apps means you can't recover quickly by switching to a different certificate. Pin to the CA rather than the leaf certificate, and ship a backup pin so a replacement certificate is accepted without an app update.
Key Takeaways
The load balancer is infrastructure you interact with through configuration, not code. Every default is a decision someone made for a general case that probably doesn't match your specific case.
Connection keepalive between the load balancer and backends is the single highest-impact configuration change for most setups. Followed by correct buffer sizes, then timeouts.
Health checks should be active, not just passive. Passive checks detect problems after users are affected. Active checks detect problems before.
And manage your SSL certificates like the critical infrastructure they are. An expired certificate is a total outage with a 4-hour recovery time if you haven't prepared.
Over to You
Nginx, HAProxy, or Envoy? What's your go-to and why? Any load balancer misconfiguration horror stories?
If you enjoyed this, I write about production engineering, AI systems, and the messy reality of building software at scale.
This is part of the Great Stack to Doesn't Work series, a survival guide for when everything goes wrong in production. Follow the series to catch every episode.