DEV Community

Diganto Paul

Load Balancing in System Design: A Practical Guide

How to distribute traffic, choose the right algorithm, and keep systems resilient at scale


Introduction

Every system that outgrows a single server eventually faces the same question: how do you spread incoming requests across multiple machines without breaking things? Load balancing is the answer — but doing it well requires more than just "add a load balancer." This guide covers the core concepts, algorithms, and architectural decisions behind effective load balancing in modern system design.


1. Why Load Balancing Matters

A load balancer sits between clients and backend servers, distributing requests so that:

  • No single server is overwhelmed while others sit idle
  • Failed servers are automatically removed from rotation
  • The system can scale horizontally by adding more servers behind the balancer
  • Latency is reduced by routing to the nearest or fastest available server

Without load balancing, scaling means buying a bigger machine (vertical scaling), a strategy that hits a hardware ceiling quickly and still leaves a single point of failure.


2. Layer 4 vs. Layer 7 Load Balancing

| Type | Operates At | Routing Decisions Based On | Examples |
|---|---|---|---|
| Layer 4 (Transport) | TCP/UDP | IP address, port | AWS NLB, HAProxy (TCP mode), IPVS |
| Layer 7 (Application) | HTTP/HTTPS | URL path, headers, cookies, content | NGINX, AWS ALB, Envoy |

Layer 4 is faster and simpler: it forwards connections based on IP address and port without parsing the payload. Layer 7 is smarter: it can route /api/* to one service and /static/* to another, terminate TLS, and inspect request content. Most modern architectures use Layer 7 for flexibility, falling back to Layer 4 when raw throughput matters most.
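
To make the Layer 7 routing decision concrete, here is a minimal Python sketch of prefix-based routing. The pool names (`api-service`, `static-service`, `web-service`) and the `choose_pool` helper are hypothetical, chosen only to mirror the /api/* and /static/* example above; a real balancer would express this in its own configuration language.

```python
from typing import Optional

# Hypothetical routing table: path prefix -> backend pool name.
ROUTES = {
    "/api/": "api-service",
    "/static/": "static-service",
}
DEFAULT_POOL = "web-service"  # assumed fallback for unmatched paths


def choose_pool(path: str) -> str:
    """Return the pool for the longest matching path prefix."""
    best: Optional[str] = None
    for prefix in ROUTES:
        if path.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return ROUTES[best] if best is not None else DEFAULT_POOL
```

A Layer 4 balancer cannot make this choice, because the request path is only visible once the HTTP request has been parsed.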


3. Load Balancing Algorithms

Choosing the right algorithm depends on your traffic pattern and backend characteristics.

  • Round Robin: requests are distributed sequentially across servers. Simple, but ignores server load.
  • Least Connections: routes to the server with the fewest active connections. Better for long-lived or uneven requests.
  • Weighted Round Robin / Weighted Least Connections: accounts for servers with different capacities by giving larger servers a proportionally larger share.
  • IP Hash: routes based on a hash of the client IP, giving session affinity without cookies.
  • Least Response Time: sends traffic to the fastest-responding, least-loaded server. More adaptive, more overhead.
  • Random with Two Choices: picks two servers at random and routes to the less loaded one; a good balance of simplicity and effectiveness at scale.
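
For example, a weighted least-connections pool in NGINX:

# Both blocks belong inside the http { } context of nginx.conf.
upstream backend {
    # Prefer the server with the fewest active connections,
    # taking the weights below into account.
    least_conn;
    server 10.0.0.1:8080 weight=3;  # highest capacity
    server 10.0.0.2:8080 weight=2;
    server 10.0.0.3:8080 weight=1;  # lowest capacity
}

server {
    listen 80;
    location / {
        # Forward every request to the upstream group defined above.
        proxy_pass http://backend;
    }
}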
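
The selection logic behind three of the algorithms listed above fits in a few lines. The following Python sketch is illustrative only: the `Server` class and its `active` counter are assumptions made for the example, and a real balancer would update connection counts as requests start and finish.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    active: int = 0  # connections currently open on this server


def round_robin(servers):
    """Return an endless iterator that rotates through servers, ignoring load."""
    return itertools.cycle(servers)


def least_connections(servers):
    """Pick the server with the fewest active connections."""
    return min(servers, key=lambda s: s.active)


def two_choices(servers, rng=random):
    """Sample two distinct servers at random and keep the less loaded one."""
    a, b = rng.sample(servers, 2)
    return a if a.active <= b.active else b


pool = [Server("a", active=5), Server("b", active=1), Server("c", active=3)]
rotation = round_robin(pool)
first_four = [next(rotation).name for _ in range(4)]  # rotation wraps back to "a"
```

Note that `two_choices` never needs a global view of the pool: it compares only two servers per request, which is why it stays cheap as the pool grows.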

Tip: Start with round robin or least connections. Reach for adaptive algorithms only once you have metrics showing simple strategies aren't enough.


4. Health Checks

A load balancer is only as good as its ability to detect unhealthy servers.

  • Active health checks — the balancer periodically pings a /health endpoint.
  • Passive health checks — the balancer observes real traffic failures (timeouts, 5xx errors) and reacts.

An illustrative active health check configuration (field names vary between balancers):

healthCheck:
  path: /health
  intervalSeconds: 10
  timeoutSeconds: 2
  unhealthyThreshold: 3
  healthyThreshold: 2

Best practices:

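  • Keep health check endpoints lightweight: don't run full dependency checks on every ping.
  • Use both active and passive checks together; passive checks catch failures that occur between active probes.
  • Require several consecutive failures before removing a server and several consecutive successes before restoring it (the unhealthyThreshold and healthyThreshold values above). This avoids flapping, where a server bounces in and out of rotation.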
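
The threshold behaviour can be sketched as a small state machine. This Python example is a simplified illustration, not any particular balancer's implementation; the `HealthTracker` class is hypothetical, and its defaults mirror the unhealthyThreshold and healthyThreshold values shown earlier.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    unhealthy_threshold: int = 3  # consecutive failures before removal
    healthy_threshold: int = 2    # consecutive successes before re-adding
    healthy: bool = True
    streak: int = 0               # consecutive results that contradict `healthy`

    def record(self, check_passed: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if check_passed == self.healthy:
            self.streak = 0  # result agrees with the current state
            return self.healthy
        self.streak += 1
        limit = self.healthy_threshold if check_passed else self.unhealthy_threshold
        if self.streak >= limit:
            self.healthy = check_passed  # enough evidence: flip the state
            self.streak = 0
        return self.healthy
```

With a 10-second interval and an unhealthy threshold of 3, a failing server is removed after roughly 20 to 30 seconds, depending on where the failure falls relative to the probe schedule. A single failed probe followed by a success resets the streak, so the server stays in rotation.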

5. Session Persistence (Sticky Sessions)

Some applications need a client to keep hitting the same backend server, typically because session state is stored in memory rather than a shared store.

| Approach | How It Works | Trade-off |
|---|---|---|
| Cookie-based affinity | LB injects a cookie tying client to a server | Breaks if that server goes down |
| IP hash | Client IP maps deterministically to a server | Uneven for clients behind shared IPs (e.g., NAT) |
| Stateless design | Session state moved to Redis/DB, no affinity needed | Requires architectural change, but most scalable |

Recommendation: where possible, design services to be stateless and externalize session data. It removes the need for sticky sessions entirely and makes failover seamless.
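
For comparison, the IP hash approach from the table can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the server list reuses the addresses from the NGINX example, and the `server_for` helper with its SHA-256 and modulo scheme is chosen for the example rather than taken from any specific balancer.

```python
import hashlib

SERVERS = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]


def server_for(client_ip: str, servers=SERVERS) -> str:
    """Map a client IP to a server using a stable hash.

    hashlib is used instead of the built-in hash(), which is salted
    per process for strings and would not be stable across restarts.
    """
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]
```

The same client IP always lands on the same server, but only while the pool size is unchanged. With a plain modulo, adding or removing a server remaps most clients, which is one more reason externalized session state tends to be the more robust choice.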


6. Global vs. Local Load Balancing

  • Local load balancing distributes traffic across servers within a single data center or region.
  • Global load balancing distributes traffic across regions, typically using DNS or network-level routing:
    • GeoDNS: resolves the hostname to the region nearest the user's apparent location
    • Anycast routing: the same IP is announced from multiple locations (via BGP), and the network delivers packets to the topologically closest one
    • Latency-based routing: routes based on measured latency (e.g., AWS Route 53)

A typical production setup layers both:

Client → Global LB (DNS/Anycast) → Regional Load Balancer → Service Instances

This combination minimizes latency for users while still balancing load within each region.
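
As a rough illustration of the global tier, latency-based routing reduces to picking the healthy region with the lowest measured latency. The region names, latency figures, and `pick_region` helper below are hypothetical; managed DNS services make this decision from their own measurements.

```python
def pick_region(latency_ms: dict, healthy: set) -> str:
    """Return the healthy region with the lowest measured latency."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Hypothetical measurements for one client, in milliseconds.
measured = {"us-east": 82.0, "eu-west": 24.0, "ap-south": 140.0}
```

Filtering by health before comparing latency means a regional outage shifts users to the next-best region instead of sending them to a dead endpoint.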


7. Load Balancers as a Single Point of Failure

Ironically, a load balancer can itself become the bottleneck or single point of failure if not designed carefully.

Mitigations:

  • Run load balancers in active-active or active-passive pairs.
  • Use a floating/virtual IP (via keepalived or a cloud-managed VIP) that fails over automatically.
  • For DNS-based global balancing, keep TTLs low enough to allow fast failover, but not so low that DNS query volume becomes a cost or performance issue.

         ┌───────────┐
         │    VIP    │
         └─────┬─────┘
       ┌───────┴───────┐
       │               │
 ┌─────▼─────┐   ┌─────▼─────┐
 │ LB Node A │   │ LB Node B │  (active-passive, keepalived)
 │ (active)  │   │ (standby) │
 └─────┬─────┘   └───────────┘
       │
 ┌─────▼─────────────────┐
 │  Backend Server Pool  │
 └───────────────────────┘
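
The standby node's role in the diagram can be sketched as a heartbeat counter. This Python example is a deliberately simplified model, not how keepalived or VRRP is implemented: the `StandbyNode` class, the `max_missed` parameter, and the rule that the standby releases the VIP as soon as the active node returns are all assumptions made for illustration.

```python
class StandbyNode:
    """Standby balancer that claims the VIP after too many missed heartbeats."""

    def __init__(self, max_missed: int = 3):
        self.max_missed = max_missed
        self.missed = 0
        self.holds_vip = False

    def tick(self, heartbeat_received: bool) -> bool:
        """Call once per heartbeat interval; returns True while holding the VIP."""
        if heartbeat_received:
            self.missed = 0
            self.holds_vip = False  # active node is back, so release the VIP
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.holds_vip = True  # active node presumed dead: take over
        return self.holds_vip
```

Requiring several missed heartbeats before takeover plays the same role as health check thresholds: it keeps a brief network blip from triggering an unnecessary failover.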

8. Rate Limiting and Load Shedding

Load balancing isn't just about distribution — it's also about protecting backends from being overwhelmed entirely.

  • Rate limiting — cap requests per client/IP/API key to prevent abuse or runaway clients.
  • Circuit breaking — stop routing to a backend that's failing repeatedly, giving it time to recover.
  • Load shedding — intentionally drop or reject low-priority requests when the system is at capacity, preserving service for critical traffic.

An illustrative rate limiting configuration (field names vary between balancers):

rateLimiting:
  requestsPerSecond: 100
  burst: 20
  keyBy: client_ip
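
One common way to implement a limit like this is a token bucket, sketched below in Python. This is an assumption about the mechanism rather than something the configuration implies: here `burst` is treated as the bucket capacity, the clock is passed in explicitly to keep the example deterministic, and the `TokenBucket` and `allow_request` names are hypothetical.

```python
class TokenBucket:
    """Token bucket: refills at `rate` tokens/second, holds at most `burst`."""

    def __init__(self, rate: float, burst: int, now: float = 0.0):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.last = now

    def allow(self, now: float) -> bool:
        """Return True and spend a token if a request at time `now` may pass."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per key, mirroring `keyBy: client_ip` in the configuration.
buckets: dict = {}


def allow_request(client_ip: str, now: float) -> bool:
    if client_ip not in buckets:
        buckets[client_ip] = TokenBucket(rate=100, burst=20, now=now)
    return buckets[client_ip].allow(now)
```

A client that sends 25 requests in the same instant gets 20 through and 5 rejected; after that it is held to the sustained refill rate, while other clients are unaffected because each key has its own bucket.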

These mechanisms turn a load balancer from a passive traffic router into an active guardian of system stability.
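
Circuit breaking, described in the list above, can be sketched in a similar way. The following Python example is a minimal illustration with hypothetical names and defaults (`CircuitBreaker`, `max_failures`, `cooldown`); production implementations usually add a stricter half-open state and per-backend metrics.

```python
class CircuitBreaker:
    """Open after `max_failures` consecutive failures; retry after `cooldown` seconds."""

    def __init__(self, max_failures: int = 5, cooldown: float = 30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # time the breaker opened, or None when closed

    def allow(self, now: float) -> bool:
        """Return True if a request may be sent to the backend at time `now`."""
        if self.opened_at is None:
            return True
        # After the cooldown, let trial requests through again.
        return now - self.opened_at >= self.cooldown

    def record(self, success: bool, now: float) -> None:
        """Report the outcome of a request that was allowed through."""
        if success:
            self.failures = 0
            self.opened_at = None  # backend recovered: close the breaker
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now  # open, or re-open after a failed trial
```

While the breaker is open the backend receives no traffic at all, which is what gives it room to recover; a failed trial request after the cooldown restarts the waiting period.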


Closing Thoughts

Good load balancing is invisible when it works — traffic flows smoothly, failures go unnoticed by users, and the system scales without drama. The key is treating the load balancer as a first-class architectural component, not an afterthought: pick the right layer and algorithm for your traffic, make health checks meaningful, remove single points of failure, and pair distribution with protection through rate limiting and circuit breaking. Get these fundamentals right, and load balancing becomes one less thing to worry about as your system grows.
