Load balancing is a fundamental technique in modern distributed systems: it distributes incoming network traffic across multiple backend servers or resources so that no single server becomes overwhelmed, improving responsiveness, availability, and scalability. At its core, a load balancer acts as a traffic cop that sits between clients and the application servers, routing each request to the most appropriate server based on predefined rules and real-time conditions.
Why Load Balancing Is Essential in System Design
In any production-grade application that serves millions of users, relying on a single server is impractical and risky. A sudden spike in traffic, such as during a flash sale or viral event, can cause that server to slow down, crash, or become unresponsive. Load balancing solves this by enabling horizontal scaling — the ability to add more servers dynamically — while maintaining a seamless user experience. It also provides fault tolerance: if one server fails, the load balancer automatically stops sending traffic to it and redirects requests to healthy servers. This ensures the system remains highly available even during hardware failures, maintenance windows, or unexpected load surges.
Core Components of a Load Balancer
A typical load balancer consists of the following essential elements (sketched in code after the list):
- Frontend Listener: The entry point that accepts incoming client requests on specific ports (usually 80 for HTTP or 443 for HTTPS).
- Backend Pool: A group of healthy application servers (often called targets or origins) that actually process the requests.
- Health Check Mechanism: Continuous monitoring that probes each backend server to verify it is responding correctly. A failed health check removes the server from the active pool until it recovers.
- Routing Engine: The brain that applies load balancing algorithms and rules to decide which server receives each request.
- Session Persistence Layer (optional): Ensures that a user’s subsequent requests are routed to the same server when necessary (also known as sticky sessions).
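To make these components concrete, here is a minimal Python sketch of a backend pool with a health-check probe and a trivial routing engine. The server addresses are placeholders, and a real balancer would run the probes continuously in the background rather than on every request.

```python
import itertools
import socket

# Backend Pool: placeholder application servers (addresses are illustrative).
BACKENDS = [("10.0.0.1", 8080), ("10.0.0.2", 8080), ("10.0.0.3", 8080)]

def is_healthy(host, port, timeout=1.0):
    """Health Check Mechanism: a plain TCP connect probe.
    Production checks usually hit an HTTP endpoint such as /healthz."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def healthy_pool():
    """Servers failing the probe drop out of the pool until they recover."""
    return [b for b in BACKENDS if is_healthy(*b)]

_rr = itertools.count()

def route():
    """Routing Engine: plain round robin over the currently healthy pool."""
    pool = healthy_pool()
    if not pool:
        raise RuntimeError("no healthy backends")
    return pool[next(_rr) % len(pool)]
```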
How Load Balancing Works Step by Step
1. A client (browser, mobile app, or another service) sends a request to the public IP or domain of the load balancer.
2. The load balancer inspects the request headers, source IP, or other metadata.
3. Using its configured algorithm and current server metrics (CPU load, active connections, response time), the load balancer selects the optimal backend server.
4. The request is forwarded (proxied) to the chosen server.
5. The backend server processes the request and sends the response back through the load balancer to the client.
6. Along the way, the load balancer may also perform TLS termination, compression, or request rewriting before forwarding.
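The flow above can be traced end to end in a toy Layer 7 proxy. This is a minimal sketch, not a production balancer: the backend addresses and listening port are assumptions, only GET is handled, and health checks are omitted.

```python
import itertools
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical backend pool; the app servers are assumed to be running
# locally on these ports.
BACKENDS = ["http://127.0.0.1:8081", "http://127.0.0.1:8082"]
_rr = itertools.count()

class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Steps 2-3: pick a backend (plain round robin here).
        backend = BACKENDS[next(_rr) % len(BACKENDS)]

        # Step 4: forward the request, preserving client identity.
        headers = {"X-Forwarded-For": self.client_address[0]}
        host = self.headers.get("Host")
        if host:
            headers["Host"] = host
        upstream = urllib.request.Request(backend + self.path, headers=headers)

        try:
            with urllib.request.urlopen(upstream, timeout=5) as resp:
                body = resp.read()
                # Step 5: relay the backend's response to the client.
                self.send_response(resp.status)
                self.send_header("Content-Type",
                                 resp.headers.get("Content-Type", "text/plain"))
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)
        except OSError:
            # A real balancer would retry another backend or relay the error.
            self.send_error(502, "Bad Gateway")

if __name__ == "__main__":
    # Step 1: clients send requests to the balancer's address (port 8080 here).
    ThreadingHTTPServer(("0.0.0.0", 8080), ProxyHandler).serve_forever()
```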
Types of Load Balancers
Load balancers are broadly classified by the OSI layer at which they operate:
Layer 4 (Transport Layer) Load Balancers: Operate at the TCP/UDP level. They forward packets based on IP address and port without inspecting the actual content of the request. Examples include AWS Network Load Balancer and HAProxy in TCP mode. They are extremely fast and suitable for high-throughput scenarios but cannot make routing decisions based on HTTP headers or URL paths.
Layer 7 (Application Layer) Load Balancers: Operate at the HTTP/HTTPS level. They can read the full request, including URL, headers, cookies, and method. This allows advanced routing such as sending image requests to one pool and API requests to another. Examples include AWS Application Load Balancer, NGINX, and Envoy. They support content-based routing, rate limiting, and header manipulation but introduce slightly higher latency due to inspection.
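As a rough illustration of the kind of decision only a Layer 7 balancer can make, the sketch below chooses a backend pool by URL prefix. The pool names and prefixes are hypothetical.

```python
# Hypothetical pools for content-based (Layer 7) routing.
POOLS = {
    "/api/":    ["api-1:8080", "api-2:8080"],
    "/static/": ["static-1:8080"],
}
DEFAULT_POOL = ["web-1:8080"]

def pick_pool(path: str) -> list[str]:
    """Route by URL prefix -- a decision that requires reading the HTTP
    request itself, which is exactly what a Layer 4 balancer cannot do."""
    for prefix, pool in POOLS.items():
        if path.startswith(prefix):
            return pool
    return DEFAULT_POOL
```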
Load balancers can also be deployed as:
- Hardware appliances (F5 BIG-IP, Citrix ADC) — expensive but offer high performance and specialized ASIC chips.
- Software solutions (NGINX, HAProxy, Traefik) — run on commodity servers or containers.
- Cloud-managed services (AWS ELB, Google Cloud Load Balancing, Azure Load Balancer) — fully managed with auto-scaling built in.
Popular Load Balancing Algorithms
The choice of algorithm directly impacts system performance. The most widely used ones are described below, with code sketches after the list:
Round Robin: Requests are distributed sequentially across the backend servers in a cyclic order. Simple and fair when all servers have identical capacity.
Weighted Round Robin: Each server is assigned a weight based on its capacity. A more powerful server receives proportionally more requests.
Least Connections: The load balancer routes the next request to the server currently handling the fewest active connections. Excellent for uneven workloads.
Least Response Time: Routes to the server with the lowest average response time, combining connection count and latency.
IP Hash: Uses the client’s IP address to consistently route requests to the same server. Useful for session persistence without cookies.
Random: Selects a server at random. Surprisingly effective and simple to implement.
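To make the differences concrete, here are minimal Python sketches of several of these algorithms. The server names, weights, and connection counts are placeholders; in a real balancer the connection counts would be updated as requests start and finish.

```python
import hashlib
import itertools
import random

SERVERS = ["app-1", "app-2", "app-3"]           # placeholder server names
WEIGHTS = {"app-1": 3, "app-2": 2, "app-3": 1}  # relative capacities
ACTIVE  = {"app-1": 0, "app-2": 0, "app-3": 0}  # live connection counts

_rr = itertools.count()

def round_robin():
    """Cycle through the servers in order."""
    return SERVERS[next(_rr) % len(SERVERS)]

# Naive weighted round robin: repeat each server by its weight.
# (Real implementations such as NGINX interleave more smoothly.)
_weighted = [s for s in SERVERS for _ in range(WEIGHTS[s])]
_wrr = itertools.count()

def weighted_round_robin():
    return _weighted[next(_wrr) % len(_weighted)]

def least_connections():
    """Pick the server currently handling the fewest active connections."""
    return min(SERVERS, key=ACTIVE.__getitem__)

def ip_hash(client_ip: str):
    """Hash the client IP so the same client consistently hits the same server."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

def random_choice():
    return random.choice(SERVERS)
```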
Complete NGINX Configuration Example
Below is an NGINX configuration that demonstrates weighted Least Connections balancing, passive health checks, and TLS termination. Key directives are explained after the listing.
```nginx
# Global settings
worker_processes auto;

events {
    worker_connections 1024;
}

http {
    # Define the upstream (backend pool)
    upstream backend_servers {
        # Least Connections algorithm with weights
        least_conn;

        server app-server-1.example.com:8080 weight=3 max_fails=3 fail_timeout=30s;
        server app-server-2.example.com:8080 weight=2 max_fails=3 fail_timeout=30s;
        server app-server-3.example.com:8080 weight=1 max_fails=3 fail_timeout=30s;

        # Active health checks require NGINX Plus or a third-party module.
        # Open-source NGINX relies on the passive max_fails/fail_timeout
        # checks above, or external tooling such as consul-template.
        keepalive 32;
    }

    server {
        listen 80;
        listen 443 ssl;
        server_name myapp.com;

        # TLS termination happens here
        ssl_certificate     /etc/nginx/ssl/fullchain.pem;
        ssl_certificate_key /etc/nginx/ssl/privkey.pem;

        location / {
            # Forward the request to the upstream pool
            proxy_pass http://backend_servers;

            # Required for the upstream keepalive setting to take effect
            proxy_http_version 1.1;
            proxy_set_header Connection "";

            # Preserve the original host and client IP for the backend
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # Add Secure and HttpOnly flags to upstream cookies
            # (sticky sessions themselves require ip_hash/hash in the
            # upstream block, or the sticky directive in NGINX Plus)
            proxy_cookie_path / "/; secure; HttpOnly";

            # Timeout settings for reliability
            proxy_connect_timeout 5s;
            proxy_send_timeout 10s;
            proxy_read_timeout 10s;
        }
    }
}
```
Explanation of key directives:

- `upstream backend_servers` defines the pool of servers.
- `least_conn` activates the Least Connections algorithm.
- `weight=3` gives app-server-1 three times more traffic than app-server-3.
- `max_fails=3 fail_timeout=30s` removes a server from rotation for 30 seconds after three consecutive failures.
- `proxy_pass http://backend_servers` forwards traffic to the chosen server.
- The header directives ensure the backend knows the original client information.
Advanced Load Balancing Concepts
Consistent Hashing is often combined with load balancing to minimize disruption when servers are added or removed. Instead of rehashing everything, only a small portion of traffic is affected.
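Here is a minimal sketch of a consistent hash ring (server names are placeholders). Each server is hashed onto the ring at many virtual points so keys spread evenly, and removing a server remaps only the keys that landed on its arcs.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """A minimal consistent hash ring with virtual nodes (replicas)."""

    def __init__(self, servers, replicas=100):
        self._ring = []  # sorted list of (hash, server) points
        self._replicas = replicas
        for s in servers:
            self.add(s)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, server):
        # Place the server at many virtual points to smooth the distribution.
        for i in range(self._replicas):
            self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()

    def remove(self, server):
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def get(self, key):
        """Walk clockwise from the key's hash to the first server point."""
        hashes = [h for h, _ in self._ring]
        idx = bisect.bisect(hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["app-1", "app-2", "app-3"])
print(ring.get("user-42"))   # the same key always maps to the same server
ring.remove("app-2")         # only keys on app-2's arcs are remapped
print(ring.get("user-42"))
```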
Global Server Load Balancing (GSLB) extends the concept across multiple data centers, using DNS-based routing (GeoDNS) or Anycast IP routing to direct users to the nearest healthy region.
Auto-scaling integration allows the load balancer to dynamically register new instances launched by Kubernetes Horizontal Pod Autoscaler or AWS Auto Scaling Groups.
If you found this deep dive into load balancing valuable and want to master the remaining 99 system design concepts with equally detailed explanations, code examples, and diagrams, grab the complete System Design eBook at https://codewithdhanian.gumroad.com/l/urcjee. You can also support the creation of more high-quality technical content by buying me a coffee at https://ko-fi.com/codewithdhanian.
