Covers: L4 vs L7, Algorithms, Sticky Sessions, Active-Active vs Active-Passive, Health Checks, SSL Termination
The Day a Load Balancer Saved Black Friday
November 2018. A major US retailer's website. 6:00 AM.
Black Friday traffic floods in. Within minutes, 3 of their 12 application servers are overwhelmed — CPU pegged at 100%, response times climbing past 10 seconds. Users are abandoning. Revenue is evaporating at roughly $50,000 per minute.
Then nothing catastrophic happens.
Because a load balancer is watching. It detects the three struggling servers via health checks, marks them as unhealthy, and silently reroutes all traffic to the remaining nine. Users experience a brief slowdown — then recovery. The site stays up.
That load balancer made a decision in milliseconds that would have taken an on-call engineer 15 minutes to even diagnose.
Load balancing is the single most important component between your users and your servers. Understanding it deeply — not just "it distributes traffic" — is what separates engineers who design systems from engineers who just build features.
What a Load Balancer Actually Does
At its core, a load balancer sits between clients and servers and does three things:
Client Request
↓
[Load Balancer]
1. Decides which server gets this request (routing algorithm)
2. Checks if that server is healthy (health checks)
3. Forwards the request and relays the response
↓
[Server 1] [Server 2] [Server 3] [Server N]
But that description skips the important details — how it decides, how it checks health, and what layer of the network it operates at. Those details determine everything about its performance, intelligence, and cost.
L4 vs L7: The Most Important Distinction
Load balancers operate at different layers of the network stack. The two that matter for system design are Layer 4 (Transport) and Layer 7 (Application).
L4 Load Balancing (Transport Layer)
L4 load balancers work at the TCP/IP level. They see:
- Source and destination IP addresses
- Source and destination port numbers
- Protocol (TCP or UDP)
They do not see: HTTP headers, URL paths, cookies, request bodies — anything above the transport layer.
Client: "I want to connect to 203.0.113.1:443"
L4 LB: "I'll forward this TCP connection to Server 3"
(doesn't know or care what HTTP request is inside)
How it works:
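L4 LBs forward traffic in one of two ways. In proxy mode (e.g. HAProxy in TCP mode), the LB accepts the client's TCP connection, opens its own connection to the backend server, and copies bytes between the two. In pass-through mode (e.g. Google Maglev, AWS NLB), the LB never terminates the connection; it forwards each packet to the backend chosen for that flow. Either way, it never parses the HTTP inside.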
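The proxying behavior described above can be sketched with Python's asyncio. This is a minimal illustration, not production code: the function names (`pipe`, `handle_client`, `start_proxy`), the round-robin backend choice, and the buffer size are assumptions, and there is no health checking, timeout handling, or connection limit.

```python
import asyncio
import itertools


async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy raw bytes one way until EOF. The payload is never parsed."""
    try:
        while (chunk := await reader.read(65536)):
            writer.write(chunk)
            await writer.drain()
        if writer.can_write_eof():
            writer.write_eof()  # half-close: tell the far side we are done sending
    except OSError:
        pass  # one side vanished; handle_client closes both sockets


async def handle_client(client_reader, client_writer, backends) -> None:
    host, port = next(backends)  # choose a backend (round robin in this sketch)
    try:
        backend_reader, backend_writer = await asyncio.open_connection(host, port)
    except OSError:
        client_writer.close()  # backend unreachable: drop the client connection
        return
    # Two independent byte streams: client -> backend and backend -> client.
    await asyncio.gather(
        pipe(client_reader, backend_writer),
        pipe(backend_reader, client_writer),
    )
    backend_writer.close()
    client_writer.close()


async def start_proxy(listen_host, listen_port, backend_addrs):
    """Start a hypothetical L4 proxy that spreads connections over backend_addrs."""
    backends = itertools.cycle(backend_addrs)
    return await asyncio.start_server(
        lambda r, w: handle_client(r, w, backends), listen_host, listen_port
    )
```

Because `pipe` only moves bytes, the same proxy works unchanged for HTTP, SMTP, or a database protocol. That is the L4 trade-off in miniature: protocol-agnostic, but blind to content.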
Advantages:
- Extremely fast — no content parsing, just byte forwarding
- Works for any TCP/UDP protocol (HTTP, gRPC, SMTP, gaming protocols, databases)
- Lower latency and higher throughput than L7
Disadvantages:
- Cannot make routing decisions based on content (can't route /api/* to one server and /static/* to another)
- Cannot route or pick certificates by hostname (reading the SNI field from the TLS handshake can partially help)
- Sticky sessions only possible by IP, not by session cookie
Real examples: Google Maglev, AWS Network Load Balancer (NLB), HAProxy in TCP mode.
When to use: High-throughput, low-latency workloads. Streaming. Gaming servers. Database connections. Any non-HTTP protocol.
L7 Load Balancing (Application Layer)
L7 load balancers fully parse the HTTP request. They see:
- URL paths and query parameters
- HTTP headers (including cookies, Authorization)
- Request body
- Hostname (for virtual hosting)
Client: "GET /api/users/123 HTTP/1.1"
"Host: api.example.com"
"Cookie: session_id=abc123"
L7 LB: "This is an API request → route to API server pool"
"This cookie maps to Server 2 (sticky session)"
"I'll terminate SSL, inspect the request, then forward"
Advantages:
- Content-based routing: Route /api/* to API servers, /static/* to file servers, /admin/* to admin servers
- SSL termination: Decrypt HTTPS at the LB, forward plain HTTP to backend (backend doesn't handle encryption overhead)
- Cookie-based sticky sessions: Route the same user to the same server based on their session cookie
- Header manipulation: Add, remove, or rewrite headers before forwarding
- A/B testing: Route 5% of traffic to the new version, 95% to the old
- Health checks: Can check that /health returns HTTP 200, not just that the TCP port is open
Disadvantages:
- More CPU-intensive (must parse every HTTP request)
- Slightly higher latency than L4
- Only works for HTTP/HTTPS (and protocols built on them, like gRPC)
Real examples: AWS Application Load Balancer (ALB), Nginx, HAProxy in HTTP mode, Cloudflare.
When to use: Web applications. REST APIs. Microservices with path-based routing. Anything where you need to make routing decisions based on request content.
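Content-based routing boils down to parsing the request head and matching on it. The sketch below assumes HTTP/1.1 in plain text (TLS already terminated) and uses hypothetical pool names, a hypothetical admin hostname, and a deliberately naive parser; real L7 LBs use hardened parsers and also speak HTTP/2.

```python
ROUTES = [                        # hypothetical pools; first matching prefix wins
    ("/api/", "api_pool"),
    ("/static/", "static_pool"),
    ("/admin/", "admin_pool"),
]
DEFAULT_POOL = "web_pool"
ADMIN_HOST = "admin.example.com"  # hypothetical host-based rule


def parse_request_head(raw: bytes):
    """Return (method, path, headers) from an HTTP/1.1 request head."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    request_line, *header_lines = head.split("\r\n")
    method, target, _version = request_line.split(" ", 2)
    path = target.split("?", 1)[0]  # the query string doesn't affect routing here
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, headers


def choose_pool(raw: bytes) -> str:
    """Pick a backend pool by Host header first, then by URL path prefix."""
    _method, path, headers = parse_request_head(raw)
    host = headers.get("host", "").split(":", 1)[0]  # drop any :port
    if host == ADMIN_HOST:
        return "admin_pool"
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL
```

For the request in the example above (`GET /api/users/123` with `Host: api.example.com`), `choose_pool` returns `"api_pool"`, and the parsed headers also expose the `Cookie` value a sticky-session rule would need.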
The 5 Load Balancing Algorithms
Once the LB knows it needs to forward a request, which server gets it? That's determined by the routing algorithm.
Algorithm 1: Round Robin
The simplest algorithm. Requests go to servers in order, cycling through the list.
Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A (back to start)
Request 5 → Server B
...
When it works: All servers have identical hardware and handle requests of similar duration.
When it breaks: If Server A is handling a 30-second database migration request while Servers B and C are idle, Round Robin keeps sending new requests to Server A regardless. It knows nothing about server load.
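Round Robin needs nothing but a position counter. A minimal sketch (class and method names are hypothetical):

```python
class RoundRobin:
    """Cycle through servers in order, ignoring how busy each one is."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("need at least one server")
        self.servers = list(servers)
        self.position = 0

    def pick(self):
        server = self.servers[self.position]
        self.position = (self.position + 1) % len(self.servers)
        return server
```

With servers A, B and C, five calls to `pick()` return A, B, C, A, B. Nothing in `pick()` consults server load, which is exactly the weakness described above.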
Algorithm 2: Least Connections
Route each new request to the server with the fewest active connections.
Server A: 150 active connections
Server B: 23 active connections ← next request goes here
Server C: 87 active connections
When it works: Requests have variable duration (some take 1ms, others take 5 seconds). Least Connections naturally routes to less busy servers.
When it breaks: Connection count isn't always a perfect proxy for load. A server with 10 heavy database queries might be more loaded than a server with 100 trivial cache reads.
One of the most commonly used algorithms in production for web application workloads. AWS ALB offers a variant called "Least Outstanding Requests" (its default is Round Robin).
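Least Connections only needs a live count of in-flight requests per server: increment when a request is routed, decrement when it finishes. A minimal single-threaded sketch with hypothetical names (a real LB would also need locking or per-core counters):

```python
class LeastConnections:
    """Send each new request to the server with the fewest in-flight requests."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # min() keeps the first server it sees on ties, so list order breaks ties.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call when the request finishes: success, error, or timeout alike.
        self.active[server] -= 1
```

Seeded with the counts from the example above (150, 23, 87), `acquire()` returns Server B. A missed `release()` on an error path leaks a count and slowly starves that server, so the decrement belongs in a `finally` block or on connection close.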
Algorithm 3: IP Hash
Hash the client's IP address to determine which server gets the request. The same IP always maps to the same server.
Client IP: 192.168.1.1
hash(192.168.1.1) % 3 = 2 → always Server C
Client IP: 10.0.0.5
hash(10.0.0.5) % 3 = 0 → always Server A
When it works: You need server affinity (the same client always hits the same server) but can't use cookie-based sticky sessions. Useful for stateful legacy applications that can't be made stateless.
When it breaks: If a large corporate network has thousands of users behind a single NAT IP, they all hash to the same server. Severe imbalance.
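A sketch of IP Hash, assuming SHA-256 as the hash (any stable hash works, so the server each IP lands on will not match the illustrative numbers above). The function name is hypothetical. One trap worth showing: Python's built-in `hash()` is randomized per process for strings, so two LB nodes using it would disagree about where a client goes.

```python
import hashlib
import ipaddress


def ip_hash_pick(client_ip: str, servers: list) -> str:
    """Map a client IP to a server; the same IP always gets the same server
    as long as the server list does not change."""
    # Stable digest: every LB node computes the same value, even after restarts.
    packed = ipaddress.ip_address(client_ip).packed
    digest = hashlib.sha256(packed).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]
```

The modulo also explains a second weakness: adding a fourth server changes `% 3` into `% 4`, and roughly three quarters of clients land on a different server. Consistent hashing (Day 8) exists to limit that reshuffling.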
Algorithm 4: Weighted Round Robin / Weighted Least Connections
Assign a weight to each server proportional to its capacity. Servers with higher weights receive proportionally more requests.
Server A (16 CPU): weight 4 → gets 4 requests per cycle
Server B (8 CPU): weight 2 → gets 2 requests per cycle
Server C (4 CPU): weight 1 → gets 1 request per cycle
When it works: Mixed hardware fleet. Gradually rolling out new, more powerful servers. Canary deployments (with the weights above, a weight-1 server running the new version gets 1/7 of traffic).
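Implemented naively, weights 4, 2, 1 could send four requests in a row to Server A. The "smooth" weighted round robin used by nginx interleaves them instead. A minimal sketch of that scheme (class name hypothetical):

```python
class SmoothWeightedRoundRobin:
    """Weighted round robin that interleaves picks instead of sending bursts.

    On each pick, every server's running score grows by its weight; the
    highest score wins and is reduced by the total weight. Over one cycle
    (the sum of the weights) each server is picked exactly `weight` times.
    """

    def __init__(self, weights: dict):
        self.weights = dict(weights)
        self.total = sum(self.weights.values())
        self.score = {server: 0 for server in self.weights}

    def pick(self):
        for server, weight in self.weights.items():
            self.score[server] += weight
        best = max(self.score, key=self.score.get)  # ties go to the first server listed
        self.score[best] -= self.total
        return best
```

With weights A=4, B=2, C=1, seven picks return A, B, A, C, A, B, A: four, two and one requests, spread out rather than bunched.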
Algorithm 5: Random
Pick a server at random. Over many requests this spreads load about as evenly as Round Robin, but it is simpler to implement in distributed LB systems where multiple LB nodes can't easily synchronize a round-robin counter.
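Random selection is a one-liner; the interesting property is that it keeps no state. A sketch with an injectable random generator so results are reproducible (function name hypothetical):

```python
import random


def random_pick(servers, rng=random):
    """Choose uniformly at random. There is no shared counter, so independent
    LB nodes need no coordination to spread load evenly on average."""
    return rng.choice(servers)
```

Over 30,000 picks across three servers, each server's share lands close to one third.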
Sticky Sessions: When State Forces You to Stay
In a perfectly stateless system, any server can handle any request. But the real world has legacy applications, WebSocket connections, and stateful computations that break if different requests go to different servers.
Sticky sessions (session affinity) ensure all requests from the same user go to the same backend server.
Cookie-Based Stickiness (L7 only)
The LB injects a cookie into the first response that identifies which backend server handled it:
First request:
Client → LB → Server B (LB inserts cookie: SERVERID=server_b)
All subsequent requests:
Client sends Cookie: SERVERID=server_b
LB reads cookie → routes to Server B
The problem: If Server B dies, every user stuck to it is routed to a new server that knows nothing about them, and any session state held in Server B's memory is lost.
Mitigation: Use sticky sessions only for truly stateful operations (WebSocket upgrades, long-running computations). Store actual session data in Redis so even if stickiness breaks, the new server can reconstruct the session from Redis.
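The cookie flow above, including the failure case, can be sketched as a pure function. It assumes the `SERVERID` cookie name from the example, a caller-supplied list of healthy servers, and a hypothetical `fallback_pick` function (any algorithm from the previous section); it is not any particular load balancer's implementation.

```python
from http.cookies import CookieError, SimpleCookie

COOKIE_NAME = "SERVERID"  # cookie name from the example above


def route_sticky(cookie_header, healthy, fallback_pick):
    """Return (server, set_cookie); set_cookie is None if no new cookie is needed."""
    cookies = SimpleCookie()
    try:
        cookies.load(cookie_header or "")
    except CookieError:
        pass  # treat a malformed header as "no cookie"
    morsel = cookies.get(COOKIE_NAME)
    if morsel is not None and morsel.value in healthy:
        return morsel.value, None  # existing stickiness is still valid
    # First visit, or the sticky server is gone: choose again and re-stick the user.
    server = fallback_pick(healthy)
    return server, f"{COOKIE_NAME}={server}; Path=/; HttpOnly"
```

When the cookie names a server that is no longer healthy, the user is silently re-stuck to a new server. That is the moment where session data kept in Redis, rather than in server memory, preserves the user's state.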
IP-Based Stickiness (L4 and L7)
Same as IP Hash — client IP maps to a fixed backend. Simpler but less precise (NAT issues).
Active-Active vs Active-Passive HA for Load Balancers
Load balancers themselves can fail. A single LB is a single point of failure — defeating its purpose. So LBs are deployed in high-availability pairs.
Active-Passive
DNS → [LB Primary (Active)] → Servers
↓ heartbeat
[LB Secondary (Passive)]
- Primary handles all traffic
- Secondary monitors Primary via heartbeat
- If Primary fails, Secondary detects it (usually within 1-2 seconds) and takes over via IP failover — the virtual IP address moves from Primary to Secondary
- Users experience a brief blip (1-5 seconds) during failover
Wasted capacity: The Passive LB sits idle during normal operation.
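The Secondary's decision rule can be sketched as a small state machine. Real deployments usually rely on VRRP (for example via keepalived) rather than hand-rolled code; the 2-second timeout and the class and method names here are hypothetical, and time is passed in explicitly so the logic is easy to test.

```python
class StandbyLB:
    """Secondary LB in an active-passive pair (hypothetical sketch).

    Promotes itself to active once no heartbeat has arrived for `dead_after`
    seconds. It ignores split-brain, where both nodes believe they are active.
    """

    def __init__(self, dead_after: float = 2.0):
        self.dead_after = dead_after
        self.last_heartbeat = None
        self.role = "passive"

    def on_heartbeat(self, now: float) -> None:
        self.last_heartbeat = now

    def tick(self, now: float) -> str:
        silent_for = None if self.last_heartbeat is None else now - self.last_heartbeat
        if self.role == "passive" and silent_for is not None and silent_for > self.dead_after:
            # In a real pair, this is where the virtual IP moves to this node.
            self.role = "active"
        return self.role
```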
Active-Active
DNS (round-robin) → [LB Node 1] → Servers
→ [LB Node 2] → Servers
- Both LBs handle traffic simultaneously
- If one fails, DNS or an upstream router stops sending traffic to it
- No wasted capacity
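- Slightly more complex (both LBs must share sticky-session state, or route consistently, so a user's requests land on the same backend whichever LB handles them)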
Used by: AWS (all their managed LBs are Active-Active internally), Cloudflare, every large-scale web infrastructure.
Health Checks: How LBs Know Servers Are Alive
A load balancer that routes to dead servers is worse than no load balancer. Health checks are how LBs detect and respond to server failures automatically.
Passive Health Checks
The LB observes responses to real requests. If a server returns 5xx errors or times out repeatedly, it's marked unhealthy.
Server A returns 503 for 5 consecutive requests
→ LB marks Server A as unhealthy
→ New requests stop going to Server A
→ Server A is periodically retried (e.g., every 30 seconds)
→ When Server A returns 200, it's marked healthy again
Advantage: No extra traffic. Works automatically.
Disadvantage: Real user requests are used as health probes — some users get errors.
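Passive checking is bookkeeping on top of real traffic. A sketch using the numbers from the example above (eject after 5 consecutive failures, retry after 30 seconds); the names are hypothetical and the clock is passed in for testability.

```python
class PassiveHealth:
    """Per-server passive health state driven by real responses (hypothetical sketch)."""

    def __init__(self, max_failures: int = 5, retry_after: float = 30.0):
        self.max_failures = max_failures
        self.retry_after = retry_after
        self.failures = 0       # consecutive 5xx responses or timeouts
        self.ejected_at = None  # time the server was last marked unhealthy

    def record(self, status, now: float) -> None:
        """Record one real request. status is an HTTP code, or None for a timeout."""
        if status is None or status >= 500:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.ejected_at = now  # (re)start the cooldown
        else:
            self.failures = 0
            self.ejected_at = None  # a good response makes it healthy again

    def available(self, now: float) -> bool:
        """Route here if healthy, or if the cooldown has expired (the periodic retry)."""
        return self.ejected_at is None or now - self.ejected_at >= self.retry_after
```

A failed retry after the cooldown re-ejects the server immediately, because the failure streak is still above the threshold.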
Active Health Checks
The LB proactively sends health check requests to each backend every few seconds:
Every 10 seconds:
LB → GET /health → Server A → HTTP 200 ✓ (healthy)
LB → GET /health → Server B → timeout ✗ (unhealthy)
LB → GET /health → Server C → HTTP 200 ✓ (healthy)
Result: Server B removed from rotation based on the probe result
(without waiting for user requests to fail)
L4 health check: Just checks if the TCP port is open (server process is running).
L7 health check: Sends an HTTP request and checks the response code and body. More thorough — a server might accept TCP connections but return 500 for all requests.
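The difference between the two check types can be sketched with the Python standard library. The host, port, `/health` path and 2-second timeout are assumptions, and the function names are hypothetical. A server that accepts TCP connections but answers every request with 500 passes `tcp_check` and fails `http_check`.

```python
import http.client
import socket


def tcp_check(host: str, port: int, timeout: float = 2.0) -> bool:
    """L4 check: healthy if a TCP handshake completes (something is listening)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(host: str, port: int, path: str = "/health", timeout: float = 2.0) -> bool:
    """L7 check: healthy only if GET <path> returns HTTP 200."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
```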
Best practice: Implement a /health or /healthz endpoint in every service that:
- Returns HTTP 200 when healthy
- Returns HTTP 503 when unhealthy (overloaded, dependencies down)
- Checks actual dependencies (database connection, cache connection) not just "process is running"
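A minimal `/health` endpoint following those rules, built on Python's `http.server`. The two dependency checks are hypothetical placeholders meant to be replaced with real probes (for example a `SELECT 1` against the database or a PING to the cache).

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_database() -> bool:  # hypothetical placeholder for a real DB probe
    return True


def check_cache() -> bool:  # hypothetical placeholder for a real cache probe
    return True


CHECKS = {"database": check_database, "cache": check_cache}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        results = {}
        for name, check in CHECKS.items():
            try:
                results[name] = bool(check())
            except Exception:
                results[name] = False  # a crashing check counts as a failure
        healthy = all(results.values())
        body = json.dumps({"healthy": healthy, "checks": results}).encode()
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):  # keep probe traffic out of the logs
        pass
```

To try it locally, run `HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()` (the port is arbitrary). One design caution: if every instance checks the same shared database, a brief database outage makes every server fail at once and the LB may eject the whole fleet, so some teams keep the LB-facing check to instance-local dependencies and report shared ones separately.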
SSL Termination: Encryption at the Edge
HTTPS requires TLS/SSL encryption. Encrypting and decrypting traffic costs CPU, and the public-key operations in each TLS handshake are the most expensive part. If every backend server handles its own SSL, you're spending server capacity on crypto.
SSL termination means the load balancer handles all encryption/decryption:
Client (HTTPS) ←——encrypted——→ [Load Balancer] ←——plain HTTP——→ Backend Servers
LB decrypts HTTPS from client → forwards plain HTTP to backend
Backend responds in plain HTTP → LB encrypts → sends HTTPS to client
Advantages:
- Backend servers don't spend CPU on SSL — they do application work
- Certificates are managed in one place (LB), not on every server
- Can inspect, modify, and log HTTP traffic at the LB (impossible with encrypted traffic end-to-end)
Security concern: Traffic between LB and backend is unencrypted. If the internal network is untrusted, use SSL re-encryption — LB terminates client SSL and establishes a new SSL connection to the backend.
mTLS (mutual TLS): For zero-trust architectures (Day 7), both sides authenticate. The backend verifies the LB's certificate AND the LB verifies the backend's certificate. No unverified internal connections.
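Termination, re-encryption, and mTLS come down to how each TLS context is configured. A sketch using Python's `ssl` module; the certificate, key, and CA file paths are hypothetical parameters (left as `None` here, which falls back to the system trust store for verification), and the function names are illustrative. It only builds contexts, and a server-side context still needs a certificate loaded before it can accept real connections.

```python
import ssl


def frontend_context(certfile=None, keyfile=None) -> ssl.SSLContext:
    """Server-side context the LB uses to terminate TLS from clients."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if certfile:
        ctx.load_cert_chain(certfile, keyfile)  # certificates live only on the LB
    return ctx


def reencrypt_context(cafile=None, client_cert=None, client_key=None) -> ssl.SSLContext:
    """Client-side context for LB -> backend re-encryption.

    Verifies the backend's certificate and hostname. Supplying client_cert and
    client_key makes the LB present its own certificate: the LB half of mTLS.
    """
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=cafile)
    if client_cert:
        ctx.load_cert_chain(client_cert, client_key)
    return ctx


def backend_mtls_context(cafile=None, certfile=None, keyfile=None) -> ssl.SSLContext:
    """Backend side of mTLS: reject any peer (including the LB) without a valid certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH, cafile=cafile)
    ctx.verify_mode = ssl.CERT_REQUIRED
    if certfile:
        ctx.load_cert_chain(certfile, keyfile)
    return ctx
```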
Designing a Load Balancer: The Interview Question
"Design a load balancer that can handle 1 million requests per second."
Here's how to think through it:
Layer choice:
Use L7 for web traffic (need content-based routing, SSL termination)
Use L4 for raw TCP throughput or non-HTTP protocols
Algorithm choice:
Least Connections for variable-duration requests (web APIs)
Round Robin for uniform-duration requests (static content)
Weighted for heterogeneous server fleet
HA design:
Active-Active pair behind DNS or upstream router
Shared session state in Redis if sticky sessions needed
Health checks:
Active HTTP health checks to /health every 10 seconds
Remove server after 3 consecutive failures (avoid flapping)
Re-add after 2 consecutive successes
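The 3-failures / 2-successes rule is a small hysteresis state machine. The parameter names `fall` and `rise` mirror HAProxy's server options (whose defaults are also 3 and 2); the class itself is a hypothetical sketch.

```python
class HealthState:
    """Eject after `fall` consecutive failed probes, re-admit after `rise`
    consecutive successes, so a single flaky probe can't flap a server."""

    def __init__(self, fall: int = 3, rise: int = 2):
        self.fall, self.rise = fall, rise
        self.healthy = True
        self.streak = 0  # consecutive probe results that disagree with the current state

    def observe(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            self.streak = 0  # result agrees with the current state
        else:
            self.streak += 1
            needed = self.fall if self.healthy else self.rise
            if self.streak >= needed:
                self.healthy = probe_ok
                self.streak = 0
        return self.healthy
```

Feeding it fail, fail, ok, fail, fail, fail, ok, ok keeps the server in rotation through the first two failures, ejects it on the third consecutive one, and re-admits it only after two consecutive successes.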
Scale the LB itself:
At 1M RPS, even an L7 LB can be a bottleneck
Solution: DNS load balancing across multiple LB clusters
Or: Anycast routing (Google's approach — same IP, closest datacenter answers)
Single point of failure audit:
LB → Active-Active pair ✓
Servers → Multiple instances ✓
DNS → Multiple nameservers ✓
Network → Multiple ISP connections ✓
Real Systems
AWS Elastic Load Balancer (ELB):
Three main types (AWS also offers a Gateway Load Balancer for inline network appliances such as firewalls):
- ALB (Application) — L7, HTTP/HTTPS/gRPC, path-based routing, best for microservices
- NLB (Network) — L4, TCP/UDP, extreme throughput, lowest latency
- CLB (Classic) — Legacy, avoid for new systems
Nginx:
Arguably the most versatile load balancer. Handles L7 load balancing, static file serving, SSL termination, and reverse proxying, all in one piece of software. Used by over 400 million websites.
Google Maglev:
Google's custom L4 load balancer, designed to handle millions of packets per second per machine. Routers spread packets across a pool of Maglev machines (via ECMP), and every machine uses the same consistent hashing scheme (Day 8) to pick a backend, so a flow reaches the same backend no matter which Maglev machine receives it, even if Maglev machines are added or removed.
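Maglev's consistent hashing fills a fixed-size lookup table: each backend gets its own pseudo-random permutation of the slots and, taking turns, claims its next preferred empty slot. A sketch of that table-population step as described in Google's Maglev paper (NSDI 2016); the salted SHA-256 hashes, table sizes, and function names are assumptions, and real Maglev also keeps a per-flow connection table that this sketch omits.

```python
import hashlib


def _hash(value: str, salt: str) -> int:
    # Stand-in for Maglev's hash functions: salted SHA-256 truncated to 64 bits.
    return int.from_bytes(hashlib.sha256(f"{salt}:{value}".encode()).digest()[:8], "big")


def _is_prime(n: int) -> bool:
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def maglev_table(backends: list, m: int = 65537) -> list:
    """Populate a Maglev-style lookup table with m slots (m must be prime)."""
    if not backends or not _is_prime(m):
        raise ValueError("need at least one backend and a prime table size")
    n = len(backends)
    offset = [_hash(b, "offset") % m for b in backends]
    skip = [_hash(b, "skip") % (m - 1) + 1 for b in backends]  # 1..m-1, coprime with m
    next_j = [0] * n  # how far each backend has walked its permutation
    table = [None] * m
    filled = 0
    while True:
        for i in range(n):  # backends take turns claiming one slot each
            slot = (offset[i] + next_j[i] * skip[i]) % m
            while table[slot] is not None:  # skip slots already claimed
                next_j[i] += 1
                slot = (offset[i] + next_j[i] * skip[i]) % m
            table[slot] = backends[i]
            next_j[i] += 1
            filled += 1
            if filled == m:
                return table


def lookup(table: list, flow_key: str) -> str:
    """Every machine holding the same table maps a flow to the same backend."""
    return table[_hash(flow_key, "flow") % len(table)]
```

Because backends take turns, each ends up with an almost equal share of slots (counts differ by at most one). The paper's design accepts somewhat more disruption than classic consistent hashing when backends change, in exchange for this near-perfect balance.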
Key Takeaways
- A load balancer distributes traffic, detects failures via health checks, and reroutes automatically — in milliseconds.
- L4 (TCP): Faster, protocol-agnostic, no content visibility. Use for high-throughput non-HTTP workloads.
- L7 (HTTP): Smarter, content-aware, supports path routing and cookie stickiness. Use for web apps and APIs.
- 5 algorithms: Round Robin (simple), Least Connections (variable load), IP Hash (affinity), Weighted (mixed fleet), Random (distributed LBs).
- Sticky sessions are a necessary evil for stateful apps — but always back session data with Redis so failover doesn't lose user state.
- Active-Active LB pairs for HA — no wasted capacity, seamless failover.
- Health checks are what make load balancers intelligent: always implement /health endpoints that check real dependencies.
- SSL termination at the LB reduces backend CPU overhead and centralizes certificate management.
What's Next
Topic 11 covers CDNs — how a network of globally distributed servers brings your content milliseconds away from every user on earth, and why Netflix pre-positions content inside ISP networks before you even request it.
Tags: system-design load-balancing networking backend distributed-systems software-architecture interview-prep