Hey Dev Community!
Welcome!
Introduction
Load balancing is the practice of distributing client requests across multiple backend resources to maximize throughput, minimize latency, and provide fault tolerance. At cloud scale this is not a single component but a layered system: edge DNS and anycast, global routing, regional load balancers, and local reverse proxies cooperate with health checks, autoscaling, telemetry, and traffic shaping to keep services alive under billions of requests.
This article is a zero-to-one-hundred practical guide. It explains core algorithms, trade-offs, operational controls, testing strategies, and step-by-step code examples you can run and extend. It covers stateless and stateful workloads, HTTP/1.1 versus HTTP/2 and HTTP/3 considerations, TLS termination, sticky sessions, consistent hashing, autoscaling hooks, and production hardening.
Quick decision checklist
- Traffic shape: bursty vs steady
- Request duration: short (milliseconds) vs long (seconds)
- Statefulness: stateless vs session affinity required
- Geography: single region vs global users
- Failure model: acceptable downtime and recovery time objective (RTO)
- SLOs: p99 latency and availability targets
Core algorithms overview
| Algorithm | Key benefit | Weakness | Best use |
|---|---|---|---|
| Round Robin | Simple, low overhead | Ignores instantaneous load | Homogeneous stateless pools |
| Least Connections | Adapts to variable durations | Needs accurate counters | Long-lived connections |
| Weighted | Capacity aware routing | Requires tuning and telemetry | Mixed-capacity clusters |
| Consistent Hashing | Minimal remap on churn | More complex; needs vnodes | Stateful caches and sharded state |
Deep technical notes on algorithms
Round Robin
- Implementation: rotate an index across healthy backends.
- Pros: minimal state, deterministic.
- Cons: poor when request durations vary or backends differ in capacity. Use behind autoscalers or when backends are homogeneous.
Least Connections
- Implementation: maintain an active connection/stream counter per backend and choose the minimum.
- For HTTP/1.1 track TCP connections; for HTTP/2 and HTTP/3 track active streams.
- Requires atomic counters and careful decrement on request completion or connection close.
Weighted Selection
- Assign a weight to each backend proportional to capacity (CPU, memory, NIC).
- Selection probability ∝ weight.
- Weights can be static or dynamically adjusted from telemetry.
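To make the weighted bullets above concrete, here is a minimal sketch of weight-proportional random selection. The pool `WEIGHTS`, its values, and the helper names `pick_weighted` and `sample_distribution` are illustrative assumptions, not taken from any particular load balancer.

```python
import random
from collections import Counter

# Hypothetical pool: weights stand in for relative capacity.
WEIGHTS = {"b1": 5, "b2": 3, "b3": 2}

def pick_weighted(weights, rng=random):
    """Pick one backend with probability proportional to its weight."""
    backends = list(weights)
    return rng.choices(backends, weights=[weights[b] for b in backends], k=1)[0]

def sample_distribution(n=10_000, seed=42):
    """Draw n picks with a seeded generator and count picks per backend."""
    rng = random.Random(seed)
    return Counter(pick_weighted(WEIGHTS, rng) for _ in range(n))

if __name__ == "__main__":
    for backend, count in sorted(sample_distribution().items()):
        print(backend, count)
```

With weights 5:3:2, roughly half of the picks should land on b1. Some proxies use a deterministic variant such as smooth weighted round robin to avoid the short-term clumping that random selection can produce.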
Consistent Hashing
- Map keys (client IP, session id, cache key) to a ring of virtual nodes.
- When nodes join or leave, only \(O(k/n)\) keys move, where \(k\) is the number of keys and \(n\) the number of nodes.
- Use many virtual nodes per physical node to smooth distribution.
Architecture layers and responsibilities
- Edge Layer
  - DNS geo-routing, anycast IPs, CDN edges.
  - Goal: route users to the nearest healthy region and absorb volumetric attacks.
- Global Routing
  - Multi-region failover, health-aware DNS (Route53 style), BGP anycast.
  - Goal: regional failover and traffic steering.
- Regional Load Balancer
  - L4/L7 balancing across AZs, TLS termination, DDoS mitigation.
  - Examples: AWS ELB/ALB/NLB, GCP Load Balancer.
- Local Reverse Proxy
  - Sidecar or node-level proxy (Envoy, Nginx, HAProxy) for fine-grained routing, retries, circuit breaking.
  - Goal: per-node resilience and observability.
- Backend Pool
  - Application instances, containers, or serverless functions.
  - Autoscaling and health checks keep pool size appropriate.
Practical implementation components
- Router core: selection algorithm and request forwarding.
- Health checker: liveness and readiness probes, active and passive checks.
- Connection tracker: atomic counters for least-conn.
- Session affinity: sticky cookies, IP affinity, or consistent hashing.
- Autoscaler hooks: metrics → scale decisions (RPS, queue length, CPU).
- Observability: metrics, traces, logs, dashboards, alerts.
- Safety controls: circuit breakers, rate limiting, backpressure.
Hands-on examples
- Minimal Round Robin router with health checks in Python
A simple, runnable starting point. Not production hardened, but it demonstrates the core ideas.
rr_lb.py

    import threading
    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    import requests  # third-party: pip install requests

    BACKENDS = ["http://127.0.0.1:9001", "http://127.0.0.1:9002"]
    alive = {b: True for b in BACKENDS}
    idx = 0
    lock = threading.Lock()

    # Headers that are not copied verbatim: requests has already decoded the
    # body, so the backend's encoding and length headers no longer apply.
    SKIP_HEADERS = ("content-encoding", "transfer-encoding", "connection", "content-length")

    def health_check_loop(interval=5):
        """Probe each backend's /health endpoint and record the result."""
        while True:
            for b in BACKENDS:
                try:
                    r = requests.get(b + "/health", timeout=1)
                    alive[b] = (r.status_code == 200)
                except requests.RequestException:
                    alive[b] = False
            time.sleep(interval)

    def next_backend():
        """Return the next healthy backend in rotation, or None if all are down."""
        global idx
        with lock:
            n = len(BACKENDS)
            for _ in range(n):
                b = BACKENDS[idx % n]
                idx += 1
                if alive.get(b):
                    return b
        return None

    class ProxyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            backend = next_backend()
            if not backend:
                self.send_response(503)
                self.end_headers()
                self.wfile.write(b"No healthy backends")
                return
            try:
                r = requests.get(backend + self.path, timeout=5)
            except requests.RequestException as e:
                # The backend could not be reached or timed out.
                self.send_response(502)
                self.end_headers()
                self.wfile.write(str(e).encode())
                return
            self.send_response(r.status_code)
            for k, v in r.headers.items():
                if k.lower() not in SKIP_HEADERS:
                    self.send_header(k, v)
            # Send the length of the decoded body that is actually written.
            self.send_header("Content-Length", str(len(r.content)))
            self.end_headers()
            self.wfile.write(r.content)

    if __name__ == "__main__":
        threading.Thread(target=health_check_loop, daemon=True).start()
        server = HTTPServer(("0.0.0.0", 8080), ProxyHandler)
        print("LB listening on :8080")
        server.serve_forever()

Notes
- Tune the timeouts and add connection pooling (for example, a shared requests.Session) for performance.
- Replace Python requests with an async HTTP client for high throughput.
- Add logging and metrics (request count, latency, backend selection).
- Least Connections sketch with atomic counters
Key idea: increment counter on request start, decrement on finish. Use thread-safe primitives.
least_conn.py sketch

    import threading
    from collections import defaultdict

    BACKENDS = ["b1", "b2", "b3"]
    alive = {b: True for b in BACKENDS}
    conn_count = defaultdict(int)
    lock = threading.Lock()

    def choose_least_conn():
        """Pick the healthy backend with the fewest active requests and reserve a slot.

        Selection and increment share one critical section, so concurrent
        requests cannot all pick the same backend before any counter moves.
        """
        with lock:
            candidates = [b for b in BACKENDS if alive[b]]
            if not candidates:
                return None
            b = min(candidates, key=lambda x: conn_count[x])
            conn_count[b] += 1
            return b

    def release(b):
        """Give the slot back once the request has finished."""
        with lock:
            conn_count[b] -= 1

    def handle_request(req):
        b = choose_least_conn()
        if not b:
            return 503
        try:
            # proxy to backend b
            pass
        finally:
            release(b)

Production considerations
- Use atomic counters or per-worker counters aggregated periodically to avoid lock contention.
- For HTTP/2 and HTTP/3 track active streams rather than TCP sockets.
- Consistent Hashing implementation with virtual nodes
Useful for session affinity and distributed caches.
consistent_hash.py

    import bisect
    import hashlib

    class ConsistentHashRing:
        def __init__(self, vnodes=128):
            self.ring = []       # sorted hash positions of all virtual nodes
            self.node_map = {}   # hash position -> physical node
            self.vnodes = vnodes

        def _hash(self, key):
            # MD5 is used only to spread keys evenly, not for security.
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def add_node(self, node):
            for i in range(self.vnodes):
                h = self._hash(f"{node}-{i}")
                bisect.insort(self.ring, h)
                self.node_map[h] = node

        def remove_node(self, node):
            for i in range(self.vnodes):
                h = self._hash(f"{node}-{i}")
                idx = bisect.bisect_left(self.ring, h)
                if idx < len(self.ring) and self.ring[idx] == h:
                    self.ring.pop(idx)
                    del self.node_map[h]

        def get_node(self, key):
            if not self.ring:
                return None
            h = self._hash(key)
            # First virtual node clockwise from h, wrapping around the ring.
            idx = bisect.bisect(self.ring, h) % len(self.ring)
            return self.node_map[self.ring[idx]]

Usage
- Map session id or user id to a backend.
- When nodes change, only a fraction of keys move.
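The claim that only a fraction of keys move can be checked empirically. The sketch below is a compact, function-based variant of the ring (same MD5 hashing and `node-i` virtual node naming) that compares it with naive `hash % n` placement when a fifth backend joins. The session keys, backend names, and helper names are made up for illustration.

```python
import bisect
import hashlib

def _hash(key):
    # MD5 is used only for key spreading, not for security.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def build_ring(nodes, vnodes=128):
    """Return (sorted_hashes, owners) for the given physical nodes."""
    points = sorted((_hash(f"{n}-{i}"), n) for n in nodes for i in range(vnodes))
    return [h for h, _ in points], [n for _, n in points]

def lookup(ring, key):
    """Owner of the first virtual node clockwise from the key's hash."""
    hashes, owners = ring
    idx = bisect.bisect(hashes, _hash(key)) % len(hashes)
    return owners[idx]

def moved_fraction_ring(before_nodes, after_nodes, keys):
    before, after = build_ring(before_nodes), build_ring(after_nodes)
    return sum(lookup(before, k) != lookup(after, k) for k in keys) / len(keys)

def moved_fraction_modulo(n_before, n_after, keys):
    return sum(_hash(k) % n_before != _hash(k) % n_after for k in keys) / len(keys)

KEYS = [f"session-{i}" for i in range(5000)]  # hypothetical session ids
OLD = ["b1", "b2", "b3", "b4"]
NEW = OLD + ["b5"]

ring_moved = moved_fraction_ring(OLD, NEW, KEYS)
modulo_moved = moved_fraction_modulo(len(OLD), len(NEW), KEYS)
print(f"ring: {ring_moved:.1%} of keys moved, modulo: {modulo_moved:.1%} of keys moved")
```

Going from four to five nodes, the ring should move about one fifth of the keys, and every moved key lands on the new node. Modulo placement reshuffles most keys, which is what makes it a poor fit for caches and sharded state.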
- Health check patterns
- Active probes: periodic HTTP GET to /health or TCP connect.
- Passive checks: detect repeated failures from backend and mark unhealthy.
- Health state machine: require N consecutive failures to mark unhealthy and M consecutive successes to mark healthy.
- Grace periods: after startup, allow a warmup window before marking healthy.
Example health check policy pseudocode:
    if consecutive failures >= 3 -> mark unhealthy
    if consecutive successes >= 2 -> mark healthy
    retry backoff: 1s, 2s, 4s
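The policy above can be sketched as a small state machine with hysteresis. This is one possible reading of the pseudocode, assuming the thresholds count consecutive results and the backoff is capped at 4 seconds. `HealthState` and `probe_delay` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class HealthState:
    healthy: bool = True
    failures: int = 0          # consecutive failed probes
    successes: int = 0         # consecutive successful probes
    fail_threshold: int = 3    # N failures -> unhealthy
    rise_threshold: int = 2    # M successes -> healthy

    def record(self, ok: bool) -> bool:
        """Feed one probe result and return the resulting health flag."""
        if ok:
            self.successes += 1
            self.failures = 0
            if not self.healthy and self.successes >= self.rise_threshold:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            if self.healthy and self.failures >= self.fail_threshold:
                self.healthy = False
        return self.healthy

def probe_delay(consecutive_failures, base=1.0, cap=4.0):
    """Seconds to wait before the next probe: 1s, 2s, 4s, then capped."""
    if consecutive_failures <= 0:
        return base
    return min(cap, base * 2 ** (consecutive_failures - 1))
```

A single failed probe does not eject a backend, and a single success does not readmit it, which keeps a flapping instance from oscillating in and out of the pool.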
- TLS termination and connection handling
- Terminate TLS at edge to offload CPU and centralize certificate management.
- Pass-through TLS (L4) when backend needs client certs or end-to-end encryption.
- Connection draining: when removing a backend, stop new connections and wait for existing ones to finish or timeout.
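Connection draining, the last bullet above, can be illustrated with a per-backend counter and a condition variable. This is a sketch under simplifying assumptions: one process, threads as the concurrency model, and a hypothetical `DrainableBackend` class.

```python
import threading

class DrainableBackend:
    def __init__(self, name):
        self.name = name
        self.draining = False
        self.active = 0                      # in-flight requests
        self._cond = threading.Condition()

    def try_begin(self):
        """Admit a new request unless the backend is draining."""
        with self._cond:
            if self.draining:
                return False
            self.active += 1
            return True

    def end(self):
        """Mark one in-flight request as finished."""
        with self._cond:
            self.active -= 1
            if self.active == 0:
                self._cond.notify_all()

    def drain(self, timeout):
        """Stop admitting requests, then wait for in-flight ones to finish.

        Returns True if the backend emptied before the timeout, False otherwise.
        """
        with self._cond:
            self.draining = True
            return self._cond.wait_for(lambda: self.active == 0, timeout=timeout)
```

If `drain` returns False, the caller decides whether to wait longer or to force-close the remaining connections.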
- Sticky sessions and stateful workloads
- Cookie-based affinity: LB sets a cookie that pins client to a backend. Works but reduces flexibility.
- IP affinity: map client IP to backend; fails with NAT or mobile clients.
- Consistent hashing: better for caches and sharded state; avoids sticky cookie pitfalls.
- Session replication: replicate or externalize session state (for example in Redis or Memcached) so that app servers can stay stateless.
- Autoscaling and capacity planning
Simple capacity formula
If you expect peak RPS \(R\) and average request latency \(L\) (seconds), and each server can comfortably handle \(C\) concurrent requests, the required number of servers \(N\) is approximately:
\[
N = \left\lceil \frac{R \cdot L}{C} \right\rceil
\]
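As a worked example of the capacity formula: the product of peak RPS and latency is the number of requests in flight at once (Little's law), and dividing by per-server concurrency gives the server count. The numbers below are invented for illustration, and the function name `required_servers` is hypothetical.

```python
import math

def required_servers(peak_rps, avg_latency_s, concurrency_per_server):
    """N = ceil(R * L / C), with R in requests/second and L in seconds."""
    in_flight = peak_rps * avg_latency_s   # average concurrent requests
    return math.ceil(in_flight / concurrency_per_server)

# 12,000 RPS at 250 ms average latency is 3,000 requests in flight.
print(required_servers(12_000, 0.25, 50))   # 3000 / 50 = 60 servers
print(required_servers(12_000, 0.25, 64))   # 3000 / 64 = 46.875, rounded up to 47
```

The result is a floor, not a target: it leaves no headroom for bursts, failed instances, or tail latency, so real deployments usually provision above it.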
Autoscaling signals
- RPS per instance
- CPU utilization
- Request queue length or backlog
- Custom metric: p95 latency
Autoscaler design
- Use short cooldowns for bursty traffic, and add predictive scaling where it is available.
- Combine reactive autoscaling with scheduled scaling for known traffic patterns.
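A reactive scaling decision with a cooldown can be sketched as follows. The proportional rule has the same shape as the one documented for the Kubernetes Horizontal Pod Autoscaler (current replicas scaled by metric over target, rounded up). The class name, parameters, and limits are assumptions for illustration.

```python
import math

def desired_replicas(current, metric, target, min_replicas=1, max_replicas=100):
    """Scale the replica count by metric/target, rounded up and clamped."""
    want = math.ceil(current * metric / target)
    return max(min_replicas, min(max_replicas, want))

class CooldownScaler:
    def __init__(self, target, cooldown_s, min_replicas=1, max_replicas=100):
        self.target = target
        self.cooldown_s = cooldown_s
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.last_change = float("-inf")   # no scaling action taken yet

    def decide(self, current, metric, now):
        """Return the replica count to run, honoring the cooldown window."""
        want = desired_replicas(current, metric, self.target,
                                self.min_replicas, self.max_replicas)
        if want == current:
            return current
        if now - self.last_change < self.cooldown_s:
            return current                 # still cooling down, hold steady
        self.last_change = now
        return want
```

Here `metric` is the average per-instance value of the chosen signal (for example RPS per instance) and `now` is a timestamp in seconds, passed in explicitly so the logic is easy to test.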
- Observability and SLOs
Essential metrics
- RPS (requests per second)
- p50/p95/p99 latency
- Error rate (4xx/5xx)
- Backend saturation (CPU, memory, queue length)
- Healthy backend count
- Request distribution across backends
Tracing
- Use distributed tracing (OpenTelemetry) to correlate client → LB → backend and identify hotspots.
Alerts
- Error rate > threshold
- p99 latency > SLO
- Healthy backends < minimum
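The three alert conditions above can be evaluated from a metrics snapshot roughly as follows. This is a sketch, not a replacement for a real alerting system: the nearest-rank percentile, the alert names, and the thresholds are all assumptions.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = -(-p * len(ordered) // 100)   # ceil(p * n / 100) using integer math
    return ordered[max(rank, 1) - 1]

def firing_alerts(latencies_ms, total, errors, healthy, *,
                  slo_p99_ms, max_error_rate, min_healthy):
    """Return the names of the alert conditions that currently hold."""
    alerts = []
    if total and errors / total > max_error_rate:
        alerts.append("error_rate")
    if latencies_ms and percentile(latencies_ms, 99) > slo_p99_ms:
        alerts.append("p99_latency")
    if healthy < min_healthy:
        alerts.append("healthy_backends")
    return alerts
```

A handful of very slow requests barely shifts the median but is enough to push p99 over the SLO, which is why the latency alert is defined on the tail.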
- Testing strategies
- Load testing: wrk, k6, locust, Gatling. Use realistic distributions (Poisson for arrivals, Pareto for heavy tails).
- Chaos testing: kill instances, add latency, partition networks (Chaos Monkey).
- Canary deployments: route small % of traffic to new version and monitor metrics before ramping.
Example wrk command:
wrk -t12 -c400 -d2m http://lb.example.com/api/endpoint
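For the canary bullet above, one common way to route a small percentage of traffic is to hash a stable identifier into 100 buckets. The sketch below assumes a user id is available per request. `route_version` and the version labels are hypothetical names.

```python
import hashlib

def route_version(user_id, canary_percent):
    """Send a stable subset of users to the canary, everyone else to stable."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100          # 0..99, fixed for a given user
    return "canary" if bucket < canary_percent else "stable"

if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    share = sum(route_version(u, 5) == "canary" for u in users) / len(users)
    print(f"canary share at 5%: {share:.1%}")
```

Because the bucket depends only on the user id, a user stays on the same version across requests, and raising the percentage only adds users to the canary without moving anyone back.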
- Security and rate limiting
- DDoS mitigation at edge (CDN, WAF, rate limiting).
- Per-client rate limits to protect backends.
- Authentication and authorization at edge or service mesh.
- TLS best practices: modern ciphers, OCSP stapling, certificate rotation.
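Per-client rate limiting is often implemented with a token bucket. The sketch below keeps one bucket per client id in memory. The class names, the rate, and the burst capacity are illustrative assumptions, and a multi-node deployment would need shared state that this example does not cover.

```python
import time

class TokenBucket:
    def __init__(self, rate, capacity, now=None):
        self.rate = rate                    # tokens added per second
        self.capacity = capacity            # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None, cost=1.0):
        """Refill based on elapsed time, then spend tokens if enough remain."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

class PerClientLimiter:
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}                   # client id -> TokenBucket

    def allow(self, client_id, now=None):
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = self.buckets[client_id] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)
```

The optional `now` argument exists so the limiter can be driven with a fake clock in tests. In normal use it is omitted and `time.monotonic()` is used.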
- Integrations and real-world components
Nginx example for simple L7 balancing
    http {
        upstream backend {
            server 10.0.0.1:8080;
            server 10.0.0.2:8080;
        }
        server {
            listen 443 ssl;
            ssl_certificate /etc/ssl/cert.pem;
            # An ssl listener also needs a private key; this path is a placeholder.
            ssl_certificate_key /etc/ssl/key.pem;
            location / {
                proxy_pass http://backend;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
                proxy_connect_timeout 1s;
                proxy_read_timeout 5s;
            }
        }
    }

HAProxy snippet for leastconn
    frontend http-in
        bind *:80
        default_backend servers

    backend servers
        balance leastconn
        server s1 10.0.0.1:8080 check
        server s2 10.0.0.2:8080 check

Envoy features
- Advanced routing, retries, circuit breakers, HTTP/2 and gRPC support, observability hooks.
Kubernetes
- Service types: ClusterIP, NodePort, LoadBalancer.
- Ingress controllers (NGINX, Traefik, Istio/Envoy) provide L7 routing and LB features.
- Use PodDisruptionBudgets and readiness probes for safe rolling updates.
AWS quick notes
- NLB: L4, high throughput, preserves client IP.
- ALB: L7, path/host routing, WebSocket support.
- ELB Classic: legacy.
- Route53: DNS routing policies (latency, geolocation, failover).
- Combine with CloudFront for edge caching and DDoS protection.
- Advanced topics
Backpressure and queueing
- When backends are saturated, queueing at LB increases latency. Prefer shedding load or autoscaling rather than unbounded queues.
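Load shedding can be as simple as capping the number of in-flight requests and rejecting the rest immediately. The sketch below uses a non-blocking semaphore for that. `AdmissionController`, `handle`, and the 503 convention are assumptions for illustration.

```python
import threading

class AdmissionController:
    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self):
        """Take a slot if one is free; never block or queue."""
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()

def handle(controller, work):
    """Run work() if capacity allows, otherwise shed the request with 503."""
    if not controller.try_acquire():
        return 503
    try:
        return work()
    finally:
        controller.release()
```

Rejecting quickly keeps latency bounded for the requests that are admitted, and the 503 gives clients and upstream layers a clear signal to back off or retry elsewhere.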
Circuit breakers and retries
- Implement circuit breakers to avoid cascading failures. Use exponential backoff and jitter for retries. Ensure idempotency for retried operations.
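A minimal circuit breaker and a jittered backoff schedule might look like the sketch below. It is deliberately simplified: the half-open state admits every request rather than a single probe, and the thresholds, timeouts, and names are assumptions rather than settings from any specific proxy.

```python
import random
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"               # closed -> open -> half_open -> closed
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"    # let trial traffic through
                return True
            return False
        return True

    def record_success(self):
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

def backoff_delays(attempts, base=0.1, cap=2.0, rng=random):
    """Full jitter: attempt i waits a random time in [0, min(cap, base * 2**i)]."""
    return [rng.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]
```

The breaker stops traffic to a failing backend so it can recover, while the jitter spreads client retries out in time so they do not arrive in synchronized waves.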
Connection multiplexing
- Use HTTP/2 or connection pools to reduce connection overhead. For long-lived connections, track active streams.
Cross-zone balancing
- Cross-AZ balancing reduces hotspots but increases cross-AZ traffic costs. Evaluate trade-offs.
Stateful microservices
- For stateful services prefer consistent hashing or external state stores (Redis) rather than sticky sessions.
- End-to-end example plan to build a production LB
Phase 0 Prototype
- Implement Round Robin proxy (example above).
- Add health checks and basic metrics (requests, latency).
Phase 1 Harden
- Replace blocking HTTP client with async client.
- Add connection pooling, timeouts, retries with idempotency checks.
- Add logging and Prometheus metrics.
Phase 2 Scale
- Implement Least Connections or Weighted selection using atomic counters or per-worker counters aggregated.
- Add consistent hashing option for session affinity.
- Add TLS termination and certificate management.
Phase 3 Global
- Add DNS geo-routing and anycast for edge.
- Integrate with autoscaler using LB metrics.
- Add chaos testing and canary deployment pipeline.
Phase 4 Production
- Use Envoy or HAProxy for advanced features.
- Add WAF, rate limiting, DDoS protection, and robust observability dashboards.
- Define SLOs and runbook for incidents.
- Common pitfalls and how to avoid them
- No health checks → route to dead backends. Always implement active and passive checks.
- Blocking proxies → poor throughput. Use async or compiled proxies for high RPS.
- Sticky sessions without replication → poor resilience. Prefer external session stores or consistent hashing.
- Unbounded retries → amplify failures. Use circuit breakers and retry budgets.
- Ignoring tail latency → p99 matters more than p50. Monitor and optimize tail behavior.
- Checklist before going to production
- Health checks with sensible thresholds and backoff
- Connection draining and graceful shutdown implemented
- TLS termination and certificate rotation plan
- Autoscaling policies tested under load
- Observability: metrics, traces, logs, dashboards, alerts
- Load and chaos testing passed with rollback plan
- Security: rate limits, WAF, DDoS protections
Conclusion
Load balancing is a system design problem that blends algorithms, engineering, and operations. Start with simple, well-instrumented building blocks (Round Robin + health checks), measure continuously, and evolve to more advanced strategies (Least Connections, Weighted, Consistent Hashing) as traffic patterns and statefulness demand. At global scale, layering DNS, anycast, regional LBs, autoscaling, and robust observability is what keeps services like AWS alive under billions of requests.
Appendix A Example tools and commands
wrk load test
wrk -t12 -c400 -d2m http://lb.example.com/api/endpoint
k6 script example
    import http from 'k6/http';
    import { sleep } from 'k6';

    export default function () {
        http.get('http://lb.example.com/api/endpoint');
        sleep(1);
    }

Prometheus metric names to collect
- http_requests_total
- http_request_duration_seconds_bucket
- backend_up
- backend_active_connections
Appendix B Further reading and next steps
- Implement the provided Python examples and replace blocking calls with an async client such as aiohttp (optionally running on uvloop) for higher throughput.
- Try Envoy as a next step for production features like retries, circuit breakers, and advanced routing.
- Run load tests with realistic traffic shapes and perform chaos experiments to validate resilience.
Call to Action - Closing
If you enjoyed this blog, please react, save, and follow us for more. Then drop a comment telling us which part you are most interested in, and we will deep dive into it in upcoming articles and blogs.
Have a great time!