
Javad

Distributed Systems & Networking: How Is AWS Still Alive While Responding to Billions of Requests per Second? 🤔

Hey Dev Community!
Welcome!

Introduction
Load balancing is the practice of distributing client requests across multiple backend resources to maximize throughput, minimize latency, and provide fault tolerance. At cloud scale this is not a single component but a layered system: edge DNS and anycast, global routing, regional load balancers, and local reverse proxies cooperate with health checks, autoscaling, telemetry, and traffic shaping to keep services alive under billions of requests.

This article is a zero-to-one-hundred practical guide. It explains core algorithms, trade-offs, operational controls, testing strategies, and step-by-step code examples you can run and extend. It covers stateless and stateful workloads, HTTP/1.1 vs HTTP/2/HTTP/3 considerations, TLS termination, sticky sessions, consistent hashing, autoscaling hooks, and production hardening.


Quick decision checklist

  • Traffic shape: bursty vs steady
  • Request duration: short (milliseconds) vs long (seconds)
  • Statefulness: stateless vs session affinity required
  • Geography: single region vs global users
  • Failure model: acceptable downtime and recovery time objective (RTO)
  • SLOs: p99 latency and availability targets

Core algorithms overview

Algorithm           Key benefit                    Weakness                        Best use
Round Robin         Simple, low overhead           Ignores instantaneous load      Homogeneous stateless pools
Least Connections   Adapts to variable durations   Needs accurate counters         Long‑lived connections
Weighted            Capacity-aware routing         Requires tuning and telemetry   Mixed-capacity clusters
Consistent Hashing  Minimal remap on churn         More complex; needs vnodes      Stateful caches and sharded state

Deep technical notes on algorithms

Round Robin

  • Implementation: rotate an index across healthy backends.
  • Pros: minimal state, deterministic.
  • Cons: poor when request durations vary or backends differ in capacity. Use behind autoscalers or when backends are homogeneous.

Least Connections

  • Implementation: maintain an active connection/stream counter per backend and choose the minimum.
  • For HTTP/1.1, track TCP connections; for HTTP/2/HTTP/3, track active streams.
  • Requires atomic counters and careful decrement on request completion or connection close.

Weighted Selection

  • Assign a weight to each backend proportional to capacity (CPU, memory, NIC).
  • Selection probability ∝ weight.
  • Weights can be static or dynamically adjusted from telemetry (a selection sketch follows).
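
As a quick illustration, here is a minimal weighted random selection sketch in Python; the backend addresses and weights are made up for the example.

weighted_pick.py (sketch)
import random

# Hypothetical pool: backend address -> capacity weight.
BACKENDS = {"10.0.0.1:8080": 3, "10.0.0.2:8080": 1}

def pick_weighted():
    nodes = list(BACKENDS)
    weights = [BACKENDS[n] for n in nodes]
    # Selection probability is proportional to weight: here the first
    # backend receives roughly 75% of requests, the second roughly 25%.
    return random.choices(nodes, weights=weights, k=1)[0]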

Consistent Hashing

  • Map keys (client IP, session id, cache key) to a ring of virtual nodes.
  • When nodes join or leave, only O(k/n) keys move (for k keys and n nodes).
  • Use many virtual nodes per physical node to smooth distribution.

Architecture layers and responsibilities

  1. Edge Layer

    • DNS geo‑routing, anycast IPs, CDN edges.
    • Goal: route users to the nearest healthy region and absorb volumetric attacks.
  2. Global Routing

    • Multi‑region failover, health‑aware DNS (Route53 style), BGP anycast.
    • Goal: regional failover and traffic steering.
  3. Regional Load Balancer

    • L4/L7 balancing across AZs, TLS termination, DDoS mitigation.
    • Examples: AWS ELB/ALB/NLB, GCP Load Balancer.
  4. Local Reverse Proxy

    • Sidecar or node-level proxy (Envoy, Nginx, HAProxy) for fine-grained routing, retries, circuit breaking.
    • Goal: per‑node resilience and observability.
  5. Backend Pool

    • Application instances, containers, or serverless functions.
    • Autoscaling and health checks keep pool size appropriate.

Practical implementation components

  • Router core: selection algorithm and request forwarding.
  • Health checker: liveness and readiness probes, active and passive checks.
  • Connection tracker: atomic counters for least‑conn.
  • Session affinity: sticky cookies, IP affinity, or consistent hashing.
  • Autoscaler hooks: metrics → scale decisions (RPS, queue length, CPU).
  • Observability: metrics, traces, logs, dashboards, alerts.
  • Safety controls: circuit breakers, rate limiting, backpressure.

Hands‑on examples

1. Minimal Round Robin router with health checks in Python

A simple, runnable starting point. It is not production hardened, but it demonstrates the core ideas.
rr_lb.py
import time
import requests
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

BACKENDS = ["http://127.0.0.1:9001", "http://127.0.0.1:9002"]
alive = {b: True for b in BACKENDS}
idx = 0
lock = threading.Lock()

# Background health-check loop: probe each backend's /health endpoint
# and update the shared `alive` map.
def health_check_loop(interval=5):
    while True:
        for b in BACKENDS:
            try:
                r = requests.get(b + "/health", timeout=1)
                alive[b] = (r.status_code == 200)
            except Exception:
                alive[b] = False
        time.sleep(interval)

def next_backend():
    global idx
    with lock:
        n = len(BACKENDS)
        for _ in range(n):
            b = BACKENDS[idx % n]
            idx += 1
            if alive.get(b):
                return b
    return None

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        backend = next_backend()
        if not backend:
            self.send_response(503)
            self.end_headers()
            self.wfile.write(b"No healthy backends")
            return
        try:
            r = requests.get(backend + self.path, timeout=5)
            self.send_response(r.status_code)
            for k,v in r.headers.items():
                if k.lower() not in ("content-encoding","transfer-encoding","connection"):
                    self.send_header(k, v)
            self.end_headers()
            self.wfile.write(r.content)
        except Exception as e:
            self.send_response(502)
            self.end_headers()
            self.wfile.write(str(e).encode())

if name == "main":
    threading.Thread(target=healthcheckloop, daemon=True).start()
    server = HTTPServer(("0.0.0.0", 8080), ProxyHandler)
    print("LB listening on :8080")
    server.serve_forever()

Notes

  • Add timeouts and connection pooling for performance.
  • Replace the blocking requests client with an async HTTP client for high throughput (a minimal sketch follows these notes).
  • Add logging and metrics (request count, latency, backend selection).
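
As a sketch of that async direction, here is a minimal non-blocking round-robin forwarder using aiohttp (one possible client/server choice; health checks, header forwarding, and connection pooling are omitted for brevity).

async_rr.py (sketch)
import aiohttp
from aiohttp import web

BACKENDS = ["http://127.0.0.1:9001", "http://127.0.0.1:9002"]
idx = 0

async def handle(request):
    global idx
    backend = BACKENDS[idx % len(BACKENDS)]  # naive round robin, no health checks
    idx += 1
    # A production version would reuse one ClientSession instead of
    # creating one per request, and would forward headers and body.
    async with aiohttp.ClientSession() as session:
        async with session.get(backend + request.path_qs,
                               timeout=aiohttp.ClientTimeout(total=5)) as resp:
            body = await resp.read()
            return web.Response(status=resp.status, body=body)

app = web.Application()
app.router.add_get("/{tail:.*}", handle)

if __name__ == "__main__":
    web.run_app(app, port=8080)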

2. Least Connections sketch with atomic counters

Key idea: increment a counter on request start and decrement it on finish, using thread-safe primitives.
least_conn.py sketch
import threading
from collections import defaultdict

BACKENDS = ["b1","b2","b3"]
alive = {b: True for b in BACKENDS}
conn_count = defaultdict(int)
lock = threading.Lock()

def choose_least_conn():
    with lock:
        candidates = [b for b in BACKENDS if alive[b]]
        if not candidates:
            return None
        return min(candidates, key=lambda x: conn_count[x])

def handle_request(req):
    b = choose_least_conn()
    if not b:
        return 503
    with lock:
        conn_count[b] += 1
    try:
        # proxy to backend b
        pass
    finally:
        with lock:
            conn_count[b] -= 1

Production considerations

  • Use atomic counters or per‑worker counters aggregated periodically to avoid lock contention.
  • For HTTP/2/HTTP/3, track active streams rather than TCP sockets.

3. Consistent Hashing implementation with virtual nodes

Useful for session affinity and distributed caches.
consistent_hash.py
import hashlib
import bisect

class ConsistentHashRing:
    def __init__(self, vnodes=128):
        self.ring = []
        self.node_map = {}
        self.vnodes = vnodes

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            h = self._hash(f"{node}-{i}")
            bisect.insort(self.ring, h)
            self.node_map[h] = node

    def remove_node(self, node):
        to_remove = []
        for i in range(self.vnodes):
            h = self._hash(f"{node}-{i}")
            to_remove.append(h)
        for h in to_remove:
            idx = bisect.bisect_left(self.ring, h)
            if idx < len(self.ring) and self.ring[idx] == h:
                self.ring.pop(idx)
                del self.node_map[h]

    def get_node(self, key):
        if not self.ring:
            return None
        h = self._hash(key)
        idx = bisect.bisect(self.ring, h) % len(self.ring)
        return self.node_map[self.ring[idx]]

Usage

  • Map session id or user id to a backend.
  • When nodes change, only a fraction of keys move.
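
A usage sketch for the ring above, with illustrative node names:

ring = ConsistentHashRing(vnodes=128)
for node in ("backend-1", "backend-2", "backend-3"):
    ring.add_node(node)

# The same session id maps to the same backend while membership is stable.
print(ring.get_node("session-42"))

# Removing a node remaps only the keys that pointed at its vnodes.
ring.remove_node("backend-2")
print(ring.get_node("session-42"))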

4. Health check patterns

  • Active probes: periodic HTTP GET to /health or TCP connect.
  • Passive checks: detect repeated failures from a backend and mark it unhealthy.
  • Health state machine: require N consecutive failures to mark unhealthy and M consecutive successes to mark healthy.
  • Grace periods: after startup, allow a warmup window before marking healthy.

Example health check policy pseudocode:

if failures >= 3 -> mark unhealthy
if successes >= 2 -> mark healthy
retry backoff: 1s, 2s, 4s
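
A minimal Python sketch of that policy; the thresholds of 3 and 2 match the pseudocode above and should be tuned per service.

health_state.py (sketch)
class HealthState:
    def __init__(self, fail_threshold=3, ok_threshold=2):
        self.fail_threshold = fail_threshold  # N consecutive failures -> unhealthy
        self.ok_threshold = ok_threshold      # M consecutive successes -> healthy
        self.failures = 0
        self.successes = 0
        self.healthy = True

    def record(self, probe_ok):
        if probe_ok:
            self.successes += 1
            self.failures = 0
            if not self.healthy and self.successes >= self.ok_threshold:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0
            if self.healthy and self.failures >= self.fail_threshold:
                self.healthy = False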

5. TLS termination and connection handling

  • Terminate TLS at the edge to offload CPU and centralize certificate management.
  • Pass-through TLS (L4) when the backend needs client certificates or end-to-end encryption.
  • Connection draining: when removing a backend, stop new connections and wait for existing ones to finish or time out (a drain sketch follows this list).
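
A minimal draining sketch, assuming a hypothetical backend object with an `accepting` flag the router honors and a per-backend in-flight request counter maintained by the router:

drain.py (sketch)
import time

def drain_backend(backend, in_flight, timeout=30):
    backend.accepting = False  # router stops sending new requests here
    deadline = time.monotonic() + timeout
    while in_flight[backend.name] > 0 and time.monotonic() < deadline:
        time.sleep(0.1)  # wait for existing requests to complete
    backend.shutdown()   # or forcibly close whatever remains at the deadline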

6. Sticky sessions and stateful workloads

  • Cookie-based affinity: the LB sets a cookie that pins a client to a backend. Works, but reduces flexibility.
  • IP affinity: map client IP to backend; fails with NAT or mobile clients.
  • Consistent hashing: better for caches and sharded state; avoids sticky-cookie pitfalls.
  • Session replication: replicate session state across backends (Redis, Memcached) to keep app servers stateless.

7. Autoscaling and capacity planning

Simple capacity formula: if you expect peak RPS \(R\), average request latency \(L\) seconds, and each server can comfortably handle \(C\) concurrent requests, the required server count \(N\) is approximately:

\[
N = \left\lceil \frac{R \cdot L}{C} \right\rceil
\]
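
A quick worked example under assumed numbers (50,000 peak RPS, 200 ms average latency, 100 comfortable concurrent requests per server):

import math

R = 50_000  # assumed peak requests per second
L = 0.2     # assumed average request latency, seconds
C = 100     # assumed comfortable concurrency per server

N = math.ceil(R * L / C)  # ceil(50000 * 0.2 / 100) = 100
print(N)  # 100 servers, before adding headroom for failures and bursts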

Autoscaling signals

  • RPS per instance
  • CPU utilization
  • Request queue length or backlog
  • Custom metric: p95 latency

Autoscaler design

  • Use short cooldowns for bursty traffic with predictive scaling if possible.
  • Combine reactive autoscaling with scheduled scaling for known traffic patterns.

8. Observability and SLOs

Essential metrics:

  • RPS (requests per second)
  • p50/p95/p99 latency
  • Error rate (4xx/5xx)
  • Backend saturation (CPU, memory, queue length)
  • Healthy backend count
  • Request distribution across backends

Tracing

  • Use distributed tracing (OpenTelemetry) to correlate client → LB → backend and identify hotspots.

Alerts

  • Error rate > threshold
  • p99 latency > SLO
  • Healthy backends < minimum

9. Testing strategies

  • Load testing: wrk, k6, Locust, Gatling. Use realistic distributions (Poisson for arrivals, Pareto for heavy tails).
  • Chaos testing: kill instances, add latency, partition networks (Chaos Monkey).
  • Canary deployments: route a small percentage of traffic to the new version and monitor metrics before ramping.

Example wrk command:

wrk -t12 -c400 -d2m http://lb.example.com/api/endpoint

10. Security and rate limiting

  • DDoS mitigation at the edge (CDN, WAF, rate limiting).
  • Per‑client rate limits to protect backends (a token-bucket sketch follows this list).
  • Authentication and authorization at the edge or in a service mesh.
  • TLS best practices: modern ciphers, OCSP stapling, certificate rotation.
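
A minimal per-client token-bucket sketch; the 10 req/s rate and burst of 20 are illustrative, not recommendations.

token_bucket.py (sketch)
import threading
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second (steady rate)
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()
        self.lock = threading.Lock()

    def allow(self):
        with self.lock:
            now = time.monotonic()
            # Refill in proportion to elapsed time, capped at capacity.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# One bucket per client IP: 10 req/s steady, bursts up to 20.
buckets = {}
def allow_request(client_ip):
    bucket = buckets.setdefault(client_ip, TokenBucket(rate=10, capacity=20))
    return bucket.allow()  # respond 429 when this returns False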

11. Integrations and real-world components

Nginx example for simple L7 balancing

http {
  upstream backend {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
  }
  server {
    listen 443 ssl;
    ssl_certificate /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;  # required alongside ssl_certificate
    location / {
      proxy_pass http://backend;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_connect_timeout 1s;
      proxy_read_timeout 5s;
    }
  }
}

HAProxy snippet for leastconn

frontend http-in
  bind *:80
  default_backend servers

backend servers
  balance leastconn
  server s1 10.0.0.1:8080 check
  server s2 10.0.0.2:8080 check

Envoy features

  • Advanced routing, retries, circuit breakers, HTTP/2 and gRPC support, observability hooks.

Kubernetes

  • Service types: ClusterIP, NodePort, LoadBalancer.
  • Ingress controllers (NGINX, Traefik, Istio/Envoy) provide L7 routing and LB features.
  • Use PodDisruptionBudgets and readiness probes for safe rolling updates.

AWS quick notes

  • NLB: L4, high throughput, preserves client IP.
  • ALB: L7, path/host routing, WebSocket support.
  • ELB Classic: legacy.
  • Route53: DNS routing policies (latency, geolocation, failover).
  • Combine with CloudFront for edge caching and DDoS protection.

12. Advanced topics

Backpressure and queueing

  • When backends are saturated, queueing at LB increases latency. Prefer shedding load or autoscaling rather than unbounded queues.

Circuit breakers and retries

  • Implement circuit breakers to avoid cascading failures. Use exponential backoff and jitter for retries. Ensure idempotency for retried operations.
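
A minimal circuit-breaker sketch with assumed thresholds (5 consecutive failures trip it open; after 30 seconds one half-open probe is allowed). Production setups usually rely on a proxy's built-in breaker, such as Envoy's, rather than hand-rolling one.

circuit_breaker.py (sketch)
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds before a half-open probe
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True   # closed: requests flow normally
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            return True   # half-open: let a probe request through
        return False      # open: fail fast, do not touch the backend

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip the breaker open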

Connection multiplexing

  • Use HTTP/2 or connection pools to reduce connection overhead. For long-lived connections, track active streams.

Cross‑zone balancing

  • Cross‑AZ balancing reduces hotspots but increases cross‑AZ traffic costs. Evaluate trade-offs.

Stateful microservices

  • For stateful services prefer consistent hashing or external state stores (Redis) rather than sticky sessions.

13. End‑to‑end example plan to build a production LB

Phase 0 Prototype

  • Implement Round Robin proxy (example above).
  • Add health checks and basic metrics (requests, latency).

Phase 1 Harden

  • Replace blocking HTTP client with async client.
  • Add connection pooling, timeouts, retries with idempotency checks.
  • Add logging and Prometheus metrics.

Phase 2 Scale

  • Implement Least Connections or Weighted selection using atomic counters or per-worker counters aggregated.
  • Add consistent hashing option for session affinity.
  • Add TLS termination and certificate management.

Phase 3 Global

  • Add DNS geo‑routing and anycast for edge.
  • Integrate with autoscaler using LB metrics.
  • Add chaos testing and canary deployment pipeline.

Phase 4 Production

  • Use Envoy or HAProxy for advanced features.
  • Add WAF, rate limiting, DDoS protection, and robust observability dashboards.
  • Define SLOs and runbook for incidents.

14. Common pitfalls and how to avoid them

  • No health checks → routing to dead backends. Always implement active and passive checks.
  • Blocking proxies → poor throughput. Use async or compiled proxies for high RPS.
  • Sticky sessions without replication → poor resilience. Prefer external session stores or consistent hashing.
  • Unbounded retries → amplified failures. Use circuit breakers and retry budgets.
  • Ignoring tail latency → p99 matters more than p50. Monitor and optimize tail behavior.

15. Checklist before going to production

  • Health checks with sensible thresholds and backoff
  • Connection draining and graceful shutdown implemented
  • TLS termination and a certificate rotation plan
  • Autoscaling policies tested under load
  • Observability: metrics, traces, logs, dashboards, alerts
  • Load and chaos testing passed, with a rollback plan
  • Security: rate limits, WAF, DDoS protections
Conclusion
Load balancing is a system design problem that blends algorithms, engineering, and operations. Start with simple, well‑instrumented building blocks (Round Robin + health checks), measure continuously, and evolve to more advanced strategies (Least Connections, Weighted, Consistent Hashing) as traffic patterns and statefulness demand. At global scale, layering DNS, anycast, regional LBs, autoscaling, and robust observability is what keeps services like AWS alive under billions of requests.


Appendix A Example tools and commands

wrk load test

wrk -t12 -c400 -d2m http://lb.example.com/api/endpoint

k6 script example

import http from 'k6/http';
import { sleep } from 'k6';
export default function () {
  http.get('http://lb.example.com/api/endpoint');
  sleep(1);
}

Prometheus metric names to collect

  • http_requests_total
  • http_request_duration_seconds_bucket
  • backend_up
  • backend_active_connections

Appendix B Further reading and next steps

  • Implement the provided Python examples and replace blocking calls with async frameworks (aiohttp, uvloop) for higher throughput.
  • Try Envoy as a next step for production features like retries, circuit breakers, and advanced routing.
  • Run load tests with realistic traffic shapes and perform chaos experiments to validate resilience.

Call to Action - Closing
If you enjoyed this blog, please react, save, and follow for more. Then drop a comment telling us which part you are most interested in, and we will deep dive into it in upcoming articles.
Have a nice time!
