
Javad

Distributed Systems & Networking: How Is AWS Still Alive While Responding to Billions of Requests per Second? 🤔

Hey Dev Community!
Welcome!

Introduction
Load balancing is the practice of distributing client requests across multiple backend resources to maximize throughput, minimize latency, and provide fault tolerance. At cloud scale this is not a single component but a layered system: edge DNS and anycast, global routing, regional load balancers, and local reverse proxies cooperate with health checks, autoscaling, telemetry, and traffic shaping to keep services alive under billions of requests.

This article is a zero-to-one-hundred practical guide. It explains core algorithms, trade-offs, operational controls, testing strategies, and step-by-step code examples you can run and extend. It covers stateless and stateful workloads, HTTP/1.1 vs HTTP/2/HTTP/3 considerations, TLS termination, sticky sessions, consistent hashing, autoscaling hooks, and production hardening.


Quick decision checklist

  • Traffic shape: bursty vs steady
  • Request duration: short (milliseconds) vs long (seconds)
  • Statefulness: stateless vs session affinity required
  • Geography: single region vs global users
  • Failure model: acceptable downtime and recovery time objective (RTO)
  • SLOs: p99 latency and availability targets

Core algorithms overview

Algorithm           Key benefit                    Weakness                        Best use
Round Robin         Simple, low overhead           Ignores instantaneous load      Homogeneous stateless pools
Least Connections   Adapts to variable durations   Needs accurate counters         Long‑lived connections
Weighted            Capacity-aware routing         Requires tuning and telemetry   Mixed-capacity clusters
Consistent Hashing  Minimal remap on churn         More complex; needs vnodes      Stateful caches and sharded state

Deep technical notes on algorithms

Round Robin

  • Implementation: rotate an index across healthy backends.
  • Pros: minimal state, deterministic.
  • Cons: poor when request durations vary or backends differ in capacity. Use behind autoscalers or when backends are homogeneous.

Least Connections

  • Implementation: maintain an active connection/stream counter per backend and choose the minimum.
  • For HTTP/1.1, track TCP connections; for HTTP/2/HTTP/3, track active streams.
  • Requires atomic counters and careful decrement on request completion or connection close.

Weighted Selection

  • Assign a weight to each backend proportional to capacity (CPU, memory, NIC).
  • Selection probability ∝ weight.
  • Weights can be static or dynamically adjusted from telemetry (a selection sketch follows).
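
As a quick illustration, here is a minimal weighted random selection sketch in Python; the backend addresses and weights are made up for the example.

weighted_pick.py (sketch)
import random

# Hypothetical pool: backend address -> capacity weight.
BACKENDS = {"10.0.0.1:8080": 3, "10.0.0.2:8080": 1}

def pick_weighted():
    nodes = list(BACKENDS)
    weights = [BACKENDS[n] for n in nodes]
    # Selection probability is proportional to weight: here the first
    # backend receives roughly 75% of requests, the second roughly 25%.
    return random.choices(nodes, weights=weights, k=1)[0]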

Consistent Hashing

  • Map keys (client IP, session id, cache key) to a ring of virtual nodes.
  • When nodes join or leave, only O(k/n) keys move (for k keys and n nodes).
  • Use many virtual nodes per physical node to smooth distribution.

Architecture layers and responsibilities

  1. Edge Layer

    • DNS geo‑routing, anycast IPs, CDN edges.
    • Goal: route users to the nearest healthy region and absorb volumetric attacks.
  2. Global Routing

    • Multi‑region failover, health‑aware DNS (Route53 style), BGP anycast.
    • Goal: regional failover and traffic steering.
  3. Regional Load Balancer

    • L4/L7 balancing across AZs, TLS termination, DDoS mitigation.
    • Examples: AWS ELB/ALB/NLB, GCP Load Balancer.
  4. Local Reverse Proxy

    • Sidecar or node-level proxy (Envoy, Nginx, HAProxy) for fine-grained routing, retries, circuit breaking.
    • Goal: per‑node resilience and observability.
  5. Backend Pool

    • Application instances, containers, or serverless functions.
    • Autoscaling and health checks keep pool size appropriate.

Practical implementation components

  • Router core: selection algorithm and request forwarding.
  • Health checker: liveness and readiness probes, active and passive checks.
  • Connection tracker: atomic counters for least‑conn.
  • Session affinity: sticky cookies, IP affinity, or consistent hashing.
  • Autoscaler hooks: metrics → scale decisions (RPS, queue length, CPU).
  • Observability: metrics, traces, logs, dashboards, alerts.
  • Safety controls: circuit breakers, rate limiting, backpressure.

Hands‑on examples

1. Minimal Round Robin router with health checks in Python

A simple, runnable starting point. It is not production hardened, but it demonstrates the core ideas.
rr_lb.py
import time
import requests
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BACKENDS = ["http://127.0.0.1:9001", "http://127.0.0.1:9002"]
alive = {b: True for b in BACKENDS}
idx = 0
lock = threading.Lock()

# Background health-check loop: probe each backend's /health endpoint
# and update the shared `alive` map.
def health_check_loop(interval=5):
    while True:
        for b in BACKENDS:
            try:
                r = requests.get(b + "/health", timeout=1)
                alive[b] = (r.status_code == 200)
            except Exception:
                alive[b] = False
        time.sleep(interval)

def next_backend():
    global idx
    with lock:
        n = len(BACKENDS)
        for _ in range(n):
            b = BACKENDS[idx % n]
            idx += 1
            if alive.get(b):
                return b
    return None

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        backend = next_backend()
        if not backend:
            self.send_response(503)
            self.end_headers()
            self.wfile.write(b"No healthy backends")
            return
        try:
            r = requests.get(backend + self.path, timeout=5)
            self.send_response(r.status_code)
            for k,v in r.headers.items():
                if k.lower() not in ("content-encoding","transfer-encoding","connection"):
                    self.send_header(k, v)
            self.end_headers()
            self.wfile.write(r.content)
        except Exception as e:
            self.send_response(502)
            self.end_headers()
            self.wfile.write(str(e).encode())

if name == "main":
    threading.Thread(target=healthcheckloop, daemon=True).start()
    server = HTTPServer(("0.0.0.0", 8080), ProxyHandler)
    print("LB listening on :8080")
    server.serve_forever()

Notes

  • Add timeouts and connection pooling for performance.
  • Replace the blocking requests client with an async HTTP client for high throughput (a minimal sketch follows these notes).
  • Add logging and metrics (request count, latency, backend selection).
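
As a sketch of that async direction, here is a minimal non-blocking round-robin forwarder using aiohttp (one possible client/server choice; health checks, header forwarding, and connection pooling are omitted for brevity).

async_rr.py (sketch)
import aiohttp
from aiohttp import web

BACKENDS = ["http://127.0.0.1:9001", "http://127.0.0.1:9002"]
idx = 0

async def handle(request):
    global idx
    backend = BACKENDS[idx % len(BACKENDS)]  # naive round robin, no health checks
    idx += 1
    # A production version would reuse one ClientSession instead of
    # creating one per request, and would forward headers and body.
    async with aiohttp.ClientSession() as session:
        async with session.get(backend + request.path_qs,
                               timeout=aiohttp.ClientTimeout(total=5)) as resp:
            body = await resp.read()
            return web.Response(status=resp.status, body=body)

app = web.Application()
app.router.add_get("/{tail:.*}", handle)

if __name__ == "__main__":
    web.run_app(app, port=8080)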

2. Least Connections sketch with atomic counters

Key idea: increment a counter on request start and decrement it on finish, using thread-safe primitives.
least_conn.py sketch
import threading
from collections import defaultdict

BACKENDS = ["b1","b2","b3"]
alive = {b: True for b in BACKENDS}
conn_count = defaultdict(int)
lock = threading.Lock()

def choose_least_conn():
    with lock:
        candidates = [b for b in BACKENDS if alive[b]]
        if not candidates:
            return None
        return min(candidates, key=lambda x: conn_count[x])

def handle_request(req):
    b = choose_least_conn()
    if not b:
        return 503
    with lock:
        conn_count[b] += 1
    try:
        # proxy to backend b
        pass
    finally:
        with lock:
            conn_count[b] -= 1

Production considerations

  • Use atomic counters or per‑worker counters aggregated periodically to avoid lock contention.
  • For HTTP/2/HTTP/3, track active streams rather than TCP sockets.

3. Consistent Hashing implementation with virtual nodes

Useful for session affinity and distributed caches.
consistent_hash.py
import hashlib
import bisect

class ConsistentHashRing:
    def __init__(self, vnodes=128):
        self.ring = []
        self.node_map = {}
        self.vnodes = vnodes

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            h = self._hash(f"{node}-{i}")
            bisect.insort(self.ring, h)
            self.node_map[h] = node

    def remove_node(self, node):
        to_remove = []
        for i in range(self.vnodes):
            h = self._hash(f"{node}-{i}")
            to_remove.append(h)
        for h in to_remove:
            idx = bisect.bisect_left(self.ring, h)
            if idx < len(self.ring) and self.ring[idx] == h:
                self.ring.pop(idx)
                del self.node_map[h]

    def get_node(self, key):
        if not self.ring:
            return None
        h = self._hash(key)
        idx = bisect.bisect(self.ring, h) % len(self.ring)
        return self.node_map[self.ring[idx]]

Usage

  • Map session id or user id to a backend.
  • When nodes change, only a fraction of keys move.
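
A usage sketch for the ring above, with illustrative node names:

ring = ConsistentHashRing(vnodes=128)
for node in ("backend-1", "backend-2", "backend-3"):
    ring.add_node(node)

# The same session id maps to the same backend while membership is stable.
print(ring.get_node("session-42"))

# Removing a node remaps only the keys that pointed at its vnodes.
ring.remove_node("backend-2")
print(ring.get_node("session-42"))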

4. Health check patterns

  • Active probes: periodic HTTP GET to /health or TCP connect.
  • Passive checks: detect repeated failures from a backend and mark it unhealthy.
  • Health state machine: require N consecutive failures to mark unhealthy and M consecutive successes to mark healthy.
  • Grace periods: after startup, allow a warmup window before marking healthy.

Example health check policy pseudocode:

if failures >= 3 -> mark unhealthy
if successes >= 2 -> mark healthy
retry backoff: 1s, 2s, 4s
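
A minimal Python sketch of that policy; the thresholds of 3 and 2 match the pseudocode above and should be tuned per service.

health_state.py (sketch)
class HealthState:
    def __init__(self, fail_threshold=3, ok_threshold=2):
        self.fail_threshold = fail_threshold  # N consecutive failures -> unhealthy
        self.ok_threshold = ok_threshold      # M consecutive successes -> healthy
        self.failures = 0
        self.successes = 0
        self.healthy = True

    def record(self, probe_ok):
        if probe_ok:
            self.successes += 1
            self.failures = 0
            if not self.healthy and self.successes >= self.ok_threshold:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            if self.healthy and self.failures >= self.fail_threshold:
                self.healthy = False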

5. TLS termination and connection handling

  • Terminate TLS at the edge to offload CPU and centralize certificate management.
  • Pass-through TLS (L4) when the backend needs client certificates or end-to-end encryption.
  • Connection draining: when removing a backend, stop new connections and wait for existing ones to finish or time out (a drain sketch follows this list).
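
A minimal draining sketch, assuming a hypothetical backend object with an `accepting` flag the router honors and a per-backend in-flight request counter maintained by the router:

drain.py (sketch)
import time

def drain_backend(backend, in_flight, timeout=30):
    backend.accepting = False  # router stops sending new requests here
    deadline = time.monotonic() + timeout
    while in_flight[backend.name] > 0 and time.monotonic() < deadline:
        time.sleep(0.1)  # wait for existing requests to complete
    backend.shutdown()   # or forcibly close whatever remains at the deadline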

6. Sticky sessions and stateful workloads

  • Cookie-based affinity: the LB sets a cookie that pins a client to a backend. Works, but reduces flexibility.
  • IP affinity: map client IP to backend; fails with NAT or mobile clients.
  • Consistent hashing: better for caches and sharded state; avoids sticky-cookie pitfalls.
  • Session replication: replicate session state across backends (Redis, Memcached) to keep app servers stateless.

7. Autoscaling and capacity planning

Simple capacity formula: if you expect peak RPS \(R\), average request latency \(L\) seconds, and each server can comfortably handle \(C\) concurrent requests, the required server count \(N\) is approximately:

\[
N = \left\lceil \frac{R \cdot L}{C} \right\rceil
\]
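
A quick worked example under assumed numbers (50,000 peak RPS, 200 ms average latency, 100 comfortable concurrent requests per server):

import math

R = 50_000  # assumed peak requests per second
L = 0.2     # assumed average request latency, seconds
C = 100     # assumed comfortable concurrency per server

N = math.ceil(R * L / C)  # ceil(50000 * 0.2 / 100) = 100
print(N)  # 100 servers, before adding headroom for failures and bursts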

Autoscaling signals

  • RPS per instance
  • CPU utilization
  • Request queue length or backlog
  • Custom metric: p95 latency

Autoscaler design

  • Use short cooldowns for bursty traffic with predictive scaling if possible.
  • Combine reactive autoscaling with scheduled scaling for known traffic patterns.

8. Observability and SLOs

Essential metrics:

  • RPS (requests per second)
  • p50/p95/p99 latency
  • Error rate (4xx/5xx)
  • Backend saturation (CPU, memory, queue length)
  • Healthy backend count
  • Request distribution across backends

Tracing

  • Use distributed tracing (OpenTelemetry) to correlate client → LB → backend and identify hotspots.

Alerts

  • Error rate > threshold
  • p99 latency > SLO
  • Healthy backends < minimum

9. Testing strategies

  • Load testing: wrk, k6, Locust, Gatling. Use realistic distributions (Poisson for arrivals, Pareto for heavy tails).
  • Chaos testing: kill instances, add latency, partition networks (Chaos Monkey).
  • Canary deployments: route a small percentage of traffic to the new version and monitor metrics before ramping.

Example wrk command:

wrk -t12 -c400 -d2m http://lb.example.com/api/endpoint

10. Security and rate limiting

  • DDoS mitigation at the edge (CDN, WAF, rate limiting).
  • Per‑client rate limits to protect backends (a token-bucket sketch follows this list).
  • Authentication and authorization at the edge or in a service mesh.
  • TLS best practices: modern ciphers, OCSP stapling, certificate rotation.
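
A minimal per-client token-bucket sketch; the 10 req/s rate and burst of 20 are illustrative, not recommendations.

token_bucket.py (sketch)
import threading
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second (steady rate)
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self):
        with self.lock:
            now = time.monotonic()
            # Refill in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# One bucket per client IP: 10 req/s steady, bursts up to 20.
buckets = {}
def allow_request(client_ip):
    bucket = buckets.setdefault(client_ip, TokenBucket(rate=10, capacity=20))
    return bucket.allow()  # respond 429 when this returns False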

11. Integrations and real-world components

Nginx example for simple L7 balancing

http {
  upstream backend {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
  }
  server {
    listen 443 ssl;
    ssl_certificate /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;  # required alongside ssl_certificate
    location / {
      proxy_pass http://backend;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_connect_timeout 1s;
      proxy_read_timeout 5s;
    }
  }
}

HAProxy snippet for leastconn

frontend http-in
  bind *:80
  default_backend servers

backend servers
  balance leastconn
  server s1 10.0.0.1:8080 check
  server s2 10.0.0.2:8080 check

Envoy features

  • Advanced routing, retries, circuit breakers, HTTP/2 and gRPC support, observability hooks.

Kubernetes

  • Service types: ClusterIP, NodePort, LoadBalancer.
  • Ingress controllers (NGINX, Traefik, Istio/Envoy) provide L7 routing and LB features.
  • Use PodDisruptionBudgets and readiness probes for safe rolling updates.

AWS quick notes

  • NLB: L4, high throughput, preserves client IP.
  • ALB: L7, path/host routing, WebSocket support.
  • ELB Classic: legacy.
  • Route53: DNS routing policies (latency, geolocation, failover).
  • Combine with CloudFront for edge caching and DDoS protection.

12. Advanced topics

Backpressure and queueing

  • When backends are saturated, queueing at LB increases latency. Prefer shedding load or autoscaling rather than unbounded queues.

Circuit breakers and retries

  • Implement circuit breakers to avoid cascading failures. Use exponential backoff and jitter for retries. Ensure idempotency for retried operations.
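
A minimal circuit-breaker sketch with assumed thresholds (5 consecutive failures trip it open; after 30 seconds one half-open probe is allowed). Production setups usually rely on a proxy's built-in breaker, such as Envoy's, rather than hand-rolling one.

circuit_breaker.py (sketch)
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds before a half-open probe
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True   # closed: requests flow normally
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            return True   # half-open: let a probe request through
        return False      # open: fail fast, do not touch the backend

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip the breaker open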

Connection multiplexing

  • Use HTTP/2 or connection pools to reduce connection overhead. For long-lived connections, track active streams.

Cross‑zone balancing

  • Cross‑AZ balancing reduces hotspots but increases cross‑AZ traffic costs. Evaluate trade-offs.

Stateful microservices

  • For stateful services prefer consistent hashing or external state stores (Redis) rather than sticky sessions.

13. End‑to‑end example plan to build a production LB

Phase 0 Prototype

  • Implement Round Robin proxy (example above).
  • Add health checks and basic metrics (requests, latency).

Phase 1 Harden

  • Replace blocking HTTP client with async client.
  • Add connection pooling, timeouts, retries with idempotency checks.
  • Add logging and Prometheus metrics.

Phase 2 Scale

  • Implement Least Connections or Weighted selection using atomic counters or per-worker counters aggregated.
  • Add consistent hashing option for session affinity.
  • Add TLS termination and certificate management.

Phase 3 Global

  • Add DNS geo‑routing and anycast for edge.
  • Integrate with autoscaler using LB metrics.
  • Add chaos testing and canary deployment pipeline.

Phase 4 Production

  • Use Envoy or HAProxy for advanced features.
  • Add WAF, rate limiting, DDoS protection, and robust observability dashboards.
  • Define SLOs and runbook for incidents.

14. Common pitfalls and how to avoid them

  • No health checks → routing to dead backends. Always implement active and passive checks.
  • Blocking proxies → poor throughput. Use async or compiled proxies for high RPS.
  • Sticky sessions without replication → poor resilience. Prefer external session stores or consistent hashing.
  • Unbounded retries → amplified failures. Use circuit breakers and retry budgets.
  • Ignoring tail latency → p99 matters more than p50. Monitor and optimize tail behavior.

15. Checklist before going to production

  • Health checks with sensible thresholds and backoff
  • Connection draining and graceful shutdown implemented
  • TLS termination and a certificate rotation plan
  • Autoscaling policies tested under load
  • Observability: metrics, traces, logs, dashboards, alerts
  • Load and chaos testing passed, with a rollback plan
  • Security: rate limits, WAF, DDoS protections
Conclusion
Load balancing is a system design problem that blends algorithms, engineering, and operations. Start with simple, well‑instrumented building blocks (Round Robin + health checks), measure continuously, and evolve to more advanced strategies (Least Connections, Weighted, Consistent Hashing) as traffic patterns and statefulness demand. At global scale, layering DNS, anycast, regional LBs, autoscaling, and robust observability is what keeps services like AWS alive under billions of requests.


Appendix A Example tools and commands

wrk load test

wrk -t12 -c400 -d2m http://lb.example.com/api/endpoint

k6 script example

import http from 'k6/http';
import { sleep } from 'k6';
export default function () {
  http.get('http://lb.example.com/api/endpoint');
  sleep(1);
}

Prometheus metric names to collect

  • http_requests_total
  • http_request_duration_seconds_bucket
  • backend_up
  • backend_active_connections

Appendix B Further reading and next steps

  • Implement the provided Python examples and replace blocking calls with async frameworks (aiohttp, uvloop) for higher throughput.
  • Try Envoy as a next step for production features like retries, circuit breakers, and advanced routing.
  • Run load tests with realistic traffic shapes and perform chaos experiments to validate resilience.

Call to Action - Closing
If you enjoyed this blog, please react, save, and follow for more. Then drop a comment telling us which part you are most interested in, and we will deep dive into it in upcoming articles.
Have a nice time!
