Diganto Paul

Posted on Jul 2

Load Balancing in System Design: A Practical Guide

#systemdesign #loadbalancing

How to distribute traffic, choose the right algorithm, and keep systems resilient at scale

Introduction

Every system that outgrows a single server eventually faces the same question: how do you spread incoming requests across multiple machines without breaking things? Load balancing is the answer — but doing it well requires more than just "add a load balancer." This guide covers the core concepts, algorithms, and architectural decisions behind effective load balancing in modern system design.

1. Why Load Balancing Matters

A load balancer sits between clients and backend servers, distributing requests so that:

No single server is overwhelmed while others sit idle
Failed servers are automatically removed from rotation
The system can scale horizontally by adding more servers behind the balancer
Latency is reduced by routing to the nearest or fastest available server

Without load balancing, scaling means buying a bigger machine (vertical scaling), a strategy that quickly hits a hardware ceiling and still leaves a single point of failure.

2. Layer 4 vs. Layer 7 Load Balancing

| Type | Operates At | Routing Decisions Based On | Examples |
| --- | --- | --- | --- |
| Layer 4 (Transport) | TCP/UDP | IP address, port | AWS NLB, HAProxy (TCP mode), IPVS |
| Layer 7 (Application) | HTTP/HTTPS | URL path, headers, cookies, content | NGINX, AWS ALB, Envoy |

Layer 4 is faster and simpler: it forwards connections based on addresses and ports without inspecting the payload. Layer 7 is smarter: it can route /api/* to one service and /static/* to another, terminate TLS, and inspect request content. Many modern architectures use Layer 7 for flexibility, falling back to Layer 4 when raw throughput matters most.
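To make the Layer 7 idea concrete, here is a minimal sketch of path-based routing. The pool names and the `pick_pool` helper are hypothetical and only illustrate the decision a Layer 7 balancer can make because it sees the request path; a Layer 4 balancer never gets that far into the request.

```python
from typing import List, Tuple

# Hypothetical routing table: path prefix -> backend pool name.
ROUTES: List[Tuple[str, str]] = [
    ("/api/", "api-service"),
    ("/static/", "static-service"),
]
DEFAULT_POOL = "web-service"  # assumed fallback pool for everything else


def pick_pool(path: str) -> str:
    """Return the backend pool for a request path (longest matching prefix wins)."""
    matches = [(prefix, pool) for prefix, pool in ROUTES if path.startswith(prefix)]
    if not matches:
        return DEFAULT_POOL
    return max(matches, key=lambda m: len(m[0]))[1]
```

Calling `pick_pool("/api/users")` selects the API pool, while an unmatched path such as `/` falls through to the default pool.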

3. Load Balancing Algorithms

Choosing the right algorithm depends on your traffic pattern and backend characteristics.

Round Robin: requests are distributed sequentially across servers. Simple, but ignores server load.
Least Connections: routes to the server with the fewest active connections. Better for long-lived or uneven requests.
Weighted Round Robin / Weighted Least Connections: variants that send proportionally more traffic to higher-capacity servers.
IP Hash: hashes the client IP to pick a server, giving session affinity without cookies.
Least Response Time: sends traffic to the fastest-responding, least-loaded server. More adaptive, but it requires tracking per-server latency.
Random with Two Choices: picks two servers at random and routes to the less loaded one; a good balance of simplicity and effectiveness at scale.

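# Pool of backend servers. least_conn picks the server with the fewest
# active connections, taking the weights below into account.
upstream backend {
    least_conn;
    server 10.0.0.1:8080 weight=3;  # highest capacity, largest share
    server 10.0.0.2:8080 weight=2;
    server 10.0.0.3:8080 weight=1;  # lowest capacity, smallest share
}

server {
    listen 80;
    location / {
        # Forward every request to the upstream pool defined above.
        proxy_pass http://backend;
    }
}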
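For intuition about what the algorithms listed above actually compute, here is a small Python sketch of three of them. The `Balancer` class and its method names are made up for illustration; real balancers track connections inside the proxy itself, so treat this as a toy model rather than how NGINX implements `least_conn`.

```python
import itertools
import random
from typing import Dict, List, Optional


class Balancer:
    """Toy model: tracks active connections per server and offers three strategies."""

    def __init__(self, servers: List[str], rng: Optional[random.Random] = None):
        self.servers = list(servers)
        self.active: Dict[str, int] = {s: 0 for s in self.servers}
        self._cycle = itertools.cycle(self.servers)
        self._rng = rng or random.Random()

    def round_robin(self) -> str:
        # Next server in a fixed rotation, regardless of load.
        return next(self._cycle)

    def least_connections(self) -> str:
        # Server with the fewest in-flight requests (ties go to the first listed).
        return min(self.servers, key=lambda s: self.active[s])

    def two_choices(self) -> str:
        # Sample two distinct servers and keep the less loaded one.
        # Needs at least two servers in the pool.
        a, b = self._rng.sample(self.servers, 2)
        return a if self.active[a] <= self.active[b] else b

    def acquire(self, server: str) -> None:
        # Call when a request is sent to the server.
        self.active[server] += 1

    def release(self, server: str) -> None:
        # Call when the request finishes.
        self.active[server] -= 1
```

Note that "two choices" only compares two counters per request, which is why it stays cheap even with a very large pool.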

Tip: Start with round robin or least connections. Reach for adaptive algorithms only once you have metrics showing simple strategies aren't enough.

4. Health Checks

A load balancer is only as good as its ability to detect unhealthy servers.

Active health checks — the balancer periodically pings a /health endpoint.
Passive health checks — the balancer observes real traffic failures (timeouts, 5xx errors) and reacts.

healthCheck:
  path: /health
  intervalSeconds: 10
  timeoutSeconds: 2
  unhealthyThreshold: 3
  healthyThreshold: 2

Best practices:

Keep health check endpoints lightweight — don't run full dependency checks on every ping.
Use both active and passive checks together; passive checks catch issues active checks miss between intervals.
Set thresholds to avoid flapping (a server bouncing in and out of rotation).
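The threshold logic behind the config above can be sketched in a few lines. This is an illustrative model, assuming the thresholds count consecutive probe results (the usual meaning, but check your balancer's documentation); the `HealthTracker` name is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Flips a server's state only after N consecutive contrary probe results."""

    unhealthy_threshold: int = 3  # failures in a row before removal
    healthy_threshold: int = 2    # successes in a row before re-adding
    healthy: bool = True
    _streak: int = 0              # consecutive results that disagree with the state

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the server's current state."""
        if probe_ok == self.healthy:
            self._streak = 0      # result agrees with the state: reset the streak
            return self.healthy
        self._streak += 1
        limit = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= limit:
            self.healthy = probe_ok  # enough consecutive evidence: flip the state
            self._streak = 0
        return self.healthy
```

A single failed probe between successes never removes the server, which is exactly what prevents flapping.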

5. Session Persistence (Sticky Sessions)

Some applications need a client to keep hitting the same backend server, typically because session state is stored in memory rather than a shared store.

| Approach | How It Works | Trade-off |
| --- | --- | --- |
| Cookie-based affinity | LB injects a cookie tying client to a server | Breaks if that server goes down |
| IP hash | Client IP maps deterministically to a server | Uneven for clients behind shared IPs (e.g., NAT) |
| Stateless design | Session state moved to Redis/DB, no affinity needed | Requires architectural change, but most scalable |
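As a rough illustration of the IP hash row, the mapping can be as simple as hashing the address and taking it modulo the pool size. The `pick_by_ip` helper is hypothetical, and the choice of SHA-256 is an assumption made for the sketch; the point is only that the hash must be stable across processes, which Python's built-in `hash()` is not for strings.

```python
import hashlib
from typing import List


def pick_by_ip(client_ip: str, servers: List[str]) -> str:
    """Map a client IP to a server using a stable hash of the address."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer, then wrap it onto the pool.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

Two caveats follow directly from the code: every client behind the same NAT address lands on the same server, and changing the pool size remaps most clients, which is the problem consistent hashing is designed to reduce.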

Recommendation: where possible, design services to be stateless and externalize session data. It removes the need for sticky sessions entirely and makes failover seamless.

6. Global vs. Local Load Balancing

Local load balancing distributes traffic across servers within a single data center or region.
Global load balancing distributes traffic across regions, typically using DNS or network-level routing:
- GeoDNS: answers DNS queries with the region nearest to the user's resolver
- Anycast routing: the same IP is announced from multiple locations, and BGP delivers packets to the topologically closest one
- Latency-based routing: routes based on measured latency (e.g., AWS Route 53)

A typical production setup layers both:

Client → Global LB (DNS/Anycast) → Regional Load Balancer → Service Instances

This combination minimizes latency for users while still balancing load within each region.
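The global tier's decision can be pictured as "lowest latency among healthy regions." The sketch below assumes latency measurements and health state are already available; the region names and the `pick_region` function are invented for the example and do not mirror any provider's API.

```python
from typing import Dict, Set


def pick_region(latency_ms: Dict[str, float], healthy: Set[str]) -> str:
    """Choose the healthy region with the lowest measured latency."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)
```

If the closest region is marked unhealthy, the same call simply returns the next-best one, which is how global failover falls out of the routing rule.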

7. Load Balancers as a Single Point of Failure

Ironically, a load balancer can itself become the bottleneck or single point of failure if not designed carefully.

Mitigations:

Run load balancers in active-active or active-passive pairs.
Use a floating/virtual IP (via keepalived or a cloud-managed VIP) that fails over automatically.
For DNS-based global balancing, keep TTLs low enough to allow fast failover, but not so low that DNS query volume becomes a cost or performance issue.

            ┌─────────────┐
            │     VIP     │
            └──────┬──────┘
          ┌────────┴────────┐
          │                 │
   ┌──────▼──────┐   ┌──────▼──────┐
   │  LB Node A  │   │  LB Node B  │  (active-passive, keepalived)
   │  (active)   │   │  (standby)  │
   └──────┬──────┘   └─────────────┘
          │
   ┌──────▼──────────────────┐
   │   Backend Server Pool   │
   └─────────────────────────┘
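The failover in the diagram boils down to the standby watching for heartbeats from the active node and claiming the VIP when they stop. The following is a simplified model of that idea, not keepalived's actual implementation (which uses VRRP advertisements and priorities); the `StandbyNode` class and the three-second timeout are assumptions for the sketch.

```python
class StandbyNode:
    """Standby balancer that takes over the VIP when the active peer goes silent."""

    def __init__(self, dead_after_s: float = 3.0):
        self.dead_after_s = dead_after_s  # silence longer than this means "peer is down"
        self.last_heartbeat = 0.0
        self.holds_vip = False

    def on_heartbeat(self, now: float) -> None:
        # The active peer is alive, so the standby releases the VIP if it held it.
        self.last_heartbeat = now
        self.holds_vip = False

    def tick(self, now: float) -> bool:
        # Called periodically; claims the VIP once the peer has been silent too long.
        if now - self.last_heartbeat > self.dead_after_s:
            self.holds_vip = True
        return self.holds_vip
```

The timeout is the trade-off knob: shorter values fail over faster but risk taking over during a brief network hiccup.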

8. Rate Limiting and Load Shedding

Load balancing isn't just about distribution — it's also about protecting backends from being overwhelmed entirely.

Rate limiting — cap requests per client/IP/API key to prevent abuse or runaway clients.
Circuit breaking — stop routing to a backend that's failing repeatedly, giving it time to recover.
Load shedding — intentionally drop or reject low-priority requests when the system is at capacity, preserving service for critical traffic.

rateLimiting:
  requestsPerSecond: 100
  burst: 20
  keyBy: client_ip
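One common way to enforce a config like this is a token bucket per key. The sketch below assumes `burst` means the bucket capacity and `requestsPerSecond` the refill rate, which is one reasonable reading of the settings above rather than a documented one; `TokenBucket` and `allow_request` are hypothetical names. Time is passed in explicitly so the behavior is easy to test.

```python
from typing import Dict


class TokenBucket:
    """Allows short bursts up to `burst`, then a steady `rate` requests per second."""

    def __init__(self, rate: float = 100.0, burst: int = 20, now: float = 0.0):
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0      # spend one token for this request
            return True
        return False                # bucket empty: reject (e.g., with HTTP 429)


buckets: Dict[str, TokenBucket] = {}  # one bucket per client IP (keyBy: client_ip)


def allow_request(client_ip: str, now: float) -> bool:
    """Look up (or create) the caller's bucket and check the request against it."""
    bucket = buckets.setdefault(client_ip, TokenBucket(now=now))
    return bucket.allow(now)
```

With these numbers, a client can fire 20 requests at once, after which further requests are admitted only as tokens refill at 100 per second.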

These mechanisms turn a load balancer from a passive traffic router into an active guardian of system stability.
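Circuit breaking, mentioned in the list above, can be modeled as a small state machine. This is a minimal sketch with assumed defaults (five failures, a 30-second cool-down) and a hypothetical `CircuitBreaker` class; production implementations usually add a stricter half-open state that admits only a single trial request.

```python
from typing import Optional


class CircuitBreaker:
    """Stops sending traffic to a backend after repeated consecutive failures."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        # Open circuit: allow trial traffic only after the cool-down has passed.
        return now - self.opened_at >= self.reset_after_s

    def record(self, success: bool, now: float) -> None:
        if success:
            self.failures = 0
            self.opened_at = None   # backend recovered: close the circuit
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now    # trip (or re-trip) the breaker
```

The cool-down is what gives a struggling backend room to recover instead of being hit with retries the moment it comes back.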

Closing Thoughts

Good load balancing is invisible when it works — traffic flows smoothly, failures go unnoticed by users, and the system scales without drama. The key is treating the load balancer as a first-class architectural component, not an afterthought: pick the right layer and algorithm for your traffic, make health checks meaningful, remove single points of failure, and pair distribution with protection through rate limiting and circuit breaking. Get these fundamentals right, and load balancing becomes one less thing to worry about as your system grows.

DEV Community