DEV Community

CodeWithDhanian

Load Balancers in System Design

Load balancers form the backbone of scalable and reliable distributed systems. A load balancer is a dedicated component or service that distributes incoming network traffic across multiple backend servers or resources to ensure no single server becomes overwhelmed. By intelligently routing requests, load balancers prevent bottlenecks, improve response times, enhance fault tolerance, and maintain high availability even during traffic spikes or server failures.

In modern system design, load balancers sit between clients and application servers, acting as a traffic cop that decides which backend should handle each request. They operate at different layers of the network stack, support sophisticated routing logic, and integrate seamlessly with other core concepts such as caching strategies, microservices architecture, and API gateways. Understanding load balancers deeply is essential because they directly influence horizontal scaling, stateless design, and overall system resilience.

What Is a Load Balancer and How Does It Work

At its core, a load balancer receives incoming client requests and forwards them to one of several healthy backend servers according to a predefined load balancing algorithm. Once the backend processes the request, the response typically travels back through the load balancer to the client, although some advanced configurations allow direct server-to-client responses.

The load balancer maintains a list of registered backend servers, continuously monitors their health, and applies rules for routing. This process happens transparently from the client’s perspective; the client believes it is communicating with a single endpoint while the load balancer orchestrates the distribution behind the scenes.

Load balancers can be deployed as hardware appliances, software solutions running on virtual machines, or fully managed cloud services. Regardless of the form, every load balancer performs three fundamental operations: receiving traffic, selecting a backend, and forwarding the request while preserving necessary connection state when required.

Types of Load Balancers

Load balancers are broadly classified by the network layer at which they operate and by their deployment model.

Layer 4 Load Balancers

Layer 4 load balancers operate at the transport layer of the OSI model, handling raw TCP and UDP traffic without inspecting the application payload. They make routing decisions based solely on IP address, port, and protocol information. Because they do not parse HTTP headers or cookies, Layer 4 load balancers are extremely fast and impose minimal latency.

A classic example is a Layer 4 load balancer forwarding TCP connections to backend servers listening on port 8080. Once the TCP handshake completes, the load balancer simply relays packets between client and server without further intervention. This makes Layer 4 load balancers ideal for high-throughput scenarios such as gaming servers, databases, or any protocol that does not require content-based routing.
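Because a Layer 4 decision uses only transport-layer fields, backend selection can be sketched as a pure function of the connection tuple. This is an illustrative sketch, not any real load balancer's code; the pool addresses are hypothetical:

```python
import hashlib

BACKENDS = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]  # hypothetical pool

def pick_backend_l4(src_ip: str, src_port: int, dst_port: int, proto: str = "tcp") -> str:
    # Only transport-layer fields are hashed; the application payload is never inspected.
    key = f"{src_ip}:{src_port}:{dst_port}:{proto}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return BACKENDS[digest % len(BACKENDS)]
```

Because the function is deterministic, every packet of a given connection maps to the same backend, which is exactly what lets a Layer 4 balancer relay traffic without per-packet bookkeeping.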

Layer 7 Load Balancers

Layer 7 load balancers, also known as application load balancers, inspect the actual content of the request at the application layer. They can read HTTP headers, cookies, query parameters, and even the request body to make intelligent routing decisions. This capability enables advanced features such as path-based routing, host-based routing, and content-based routing.

For instance, a Layer 7 load balancer can send requests to /api/v1/users to one set of servers while routing /api/v1/orders to another set, effectively implementing microservice-level traffic segmentation. Because they terminate and re-establish connections, Layer 7 load balancers can also perform SSL/TLS termination, header manipulation, and response compression directly at the edge.
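The path-based routing just described amounts to a longest-prefix match over a routing table. The table below is hypothetical, mirroring the /api/v1/users versus /api/v1/orders example:

```python
# Hypothetical routing table: path prefix -> pool of backend servers.
ROUTES = {
    "/api/v1/users": ["users-1:8080", "users-2:8080"],
    "/api/v1/orders": ["orders-1:8080"],
    "/": ["default-1:8080"],
}

def route_l7(path: str) -> list:
    # The longest matching prefix wins, so /api/v1/users/42 reaches the users pool
    # while anything unmatched falls through to the "/" default pool.
    best = max((prefix for prefix in ROUTES if path.startswith(prefix)), key=len)
    return ROUTES[best]
```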

Load Balancing Algorithms

The intelligence of a load balancer resides primarily in its load balancing algorithm. Each algorithm balances traffic differently and suits specific workloads.

Round Robin distributes requests sequentially across the pool of servers. The first request goes to server A, the second to server B, and so on, cycling back to A after the last server. This algorithm is simple and fair when all servers have identical capacity and the workload is uniform.

Least Connections directs new requests to the server currently handling the fewest active connections. It performs exceptionally well when request processing times vary significantly, preventing any single server from becoming overloaded.

Weighted Round Robin extends the basic round-robin approach by assigning different weights to servers based on their capacity. A server with twice the resources of others receives twice as many requests.

IP Hash generates a hash from the client’s IP address and maps it to a specific backend server. This ensures that the same client consistently reaches the same server, which is useful for applications that maintain session state on the backend without shared storage.

Least Response Time selects the server that has the lowest average response time and the fewest active connections, combining latency awareness with connection load.
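The first four algorithms above can each be expressed in a few lines. The server names, weights, and connection counts below are illustrative placeholders:

```python
import itertools

servers = ["A", "B", "C"]

# Round Robin: cycle through the pool in order, wrapping back to the start.
rr = itertools.cycle(servers)
assert [next(rr) for _ in range(4)] == ["A", "B", "C", "A"]

# Weighted Round Robin: a server with weight 2 appears twice per cycle.
weights = {"A": 2, "B": 1}
wrr = itertools.cycle([s for s, w in weights.items() for _ in range(w)])

# Least Connections: pick the server with the fewest active connections.
active_connections = {"A": 5, "B": 2, "C": 9}
assert min(active_connections, key=active_connections.get) == "B"

# IP Hash: within one process, the same client IP always maps to the same server.
# (Python's built-in hash() is seeded per run; real balancers use a stable hash.)
def ip_hash(client_ip: str) -> str:
    return servers[hash(client_ip) % len(servers)]
```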

More advanced load balancers also support consistent hashing, which minimizes disruption when servers are added or removed from the pool by ensuring that most existing connections continue to route to the same servers.
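The stability property of consistent hashing can be sketched with a ring of virtual nodes. This is a minimal illustration under simplified assumptions (MD5 positions, 100 virtual nodes per server), not a production implementation:

```python
import bisect
import hashlib

def _h(key: str) -> int:
    # Stable hash so ring positions do not change between runs.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` positions on the ring to even out the split.
        self._ring = sorted((_h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._hashes = [h for h, _ in self._ring]

    def get(self, key: str):
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, _h(key)) % len(self._ring)
        return self._ring[idx][1]
```

When a fourth node joins a three-node ring, only roughly a quarter of the keys move to the new node; the rest keep routing to their existing servers, which is precisely the disruption-minimizing behavior described above.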

Health Checks and Fault Tolerance

A production-grade load balancer never blindly forwards traffic to a backend. Instead, it performs regular health checks to determine server availability. Health checks can be HTTP-based (probing a /health endpoint), TCP-based (attempting a connection on a specific port), or custom script-based.

When a server fails a configurable number of consecutive health checks, the load balancer marks it as unhealthy and stops sending new traffic to it. Once the server recovers and passes subsequent health checks, it is automatically returned to the active pool. This mechanism provides zero-downtime failover and is the foundation of high-availability architectures.
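The failover behavior described above boils down to a small state machine driven by consecutive probe results. The thresholds here are illustrative defaults, not values from any specific load balancer:

```python
class HealthTracker:
    """Marks a server unhealthy after N consecutive failed probes,
    and healthy again after M consecutive successful probes."""

    def __init__(self, fail_threshold=3, rise_threshold=2):
        self.fail_threshold = fail_threshold
        self.rise_threshold = rise_threshold
        self.healthy = True
        self._fails = 0
        self._passes = 0

    def record(self, probe_ok: bool) -> bool:
        if probe_ok:
            self._fails = 0
            self._passes += 1
            if not self.healthy and self._passes >= self.rise_threshold:
                self.healthy = True
        else:
            self._passes = 0
            self._fails += 1
            if self.healthy and self._fails >= self.fail_threshold:
                self.healthy = False
        return self.healthy
```

Requiring several consecutive failures before eviction prevents a single dropped probe from needlessly draining a healthy server, while the separate rise threshold keeps a flapping server out of the pool until it is demonstrably stable.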

Session Persistence and Sticky Sessions

Some applications require that a client’s subsequent requests reach the same backend server because session data resides in memory rather than a shared cache or database. Session persistence, commonly called sticky sessions, solves this requirement.

Layer 7 load balancers achieve sticky sessions by inspecting cookies or injecting their own session cookie that encodes the backend server identifier. All future requests carrying that cookie are routed to the same server for the duration of the session. While sticky sessions violate pure stateless design principles, they remain necessary for legacy applications or specific performance optimizations.
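The cookie-injection flow can be sketched as follows. The srv_id cookie name matches the convention used later in this article's Nginx example; the server IDs and round-robin assignment are illustrative assumptions:

```python
import itertools

SERVERS = ["app1", "app2", "app3"]  # hypothetical backend identifiers
_assign = itertools.cycle(SERVERS)  # how new sessions get a server

def route_sticky(cookies: dict):
    """Route by the srv_id cookie if present and valid; otherwise assign a
    backend and return the cookie the load balancer would set in the response."""
    server = cookies.get("srv_id")
    if server not in SERVERS:              # missing or stale cookie: new session
        server = next(_assign)
        return server, {"srv_id": server}  # would become a Set-Cookie header
    return server, {}                      # existing session stays on its server
```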

SSL/TLS Termination and Security Features

Modern load balancers handle SSL/TLS termination at the edge, offloading the computationally expensive encryption and decryption from backend application servers. The load balancer presents the public certificate to clients, decrypts incoming traffic, and forwards plaintext requests to the backends over a private network. This approach centralizes certificate management and allows backends to focus purely on business logic.

Additional security capabilities include rate limiting, IP whitelisting, WAF (Web Application Firewall) integration, and automatic DDoS mitigation. By terminating connections early, the load balancer can drop malicious traffic before it ever reaches the application layer.

Complete Implementation Example Using Nginx

Nginx is one of the most widely deployed open-source load balancers and reverse proxies. Below is a complete Nginx configuration that demonstrates Layer 7 load balancing with several advanced features. One caveat: the sticky directive used for session persistence is available only in the commercial NGINX Plus build; open-source Nginx can approximate persistence with the ip_hash or hash directives instead.

events {
    worker_connections 1024;
}

http {
    upstream backend_servers {
        # Least Connections algorithm
        least_conn;

        # Backend server definitions with weights
        server app-server-1.example.internal:8080 weight=2 max_fails=3 fail_timeout=30s;
        server app-server-2.example.internal:8080 weight=2 max_fails=3 fail_timeout=30s;
        server app-server-3.example.internal:8080 weight=1 max_fails=3 fail_timeout=30s;

        # Sticky sessions using a cookie (NGINX Plus only; open-source
        # builds can use ip_hash or hash for persistence instead)
        sticky cookie srv_id expires=1h path=/;
    }

    server {
        listen 443 ssl;
        server_name api.example.com;

        # SSL/TLS termination
        ssl_certificate /etc/nginx/ssl/api.example.com.crt;
        ssl_certificate_key /etc/nginx/ssl/api.example.com.key;
        ssl_protocols TLSv1.2 TLSv1.3;
        ssl_ciphers HIGH:!aNULL:!MD5;

        # Health check location (Nginx Plus or lua module required for active checks)
        location /health {
            access_log off;
            return 200 "healthy\n";
        }

        location / {
            # Forward requests to upstream
            proxy_pass http://backend_servers;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # Connection timeouts and buffering
            proxy_connect_timeout 60s;
            proxy_send_timeout 60s;
            proxy_read_timeout 60s;
            proxy_buffering on;
            proxy_buffer_size 8k;
            proxy_buffers 8 8k;

            # Enable compression
            gzip on;
            gzip_types text/plain text/css application/json;
        }
    }
}

This configuration uses the least_conn algorithm, defines server weights, implements sticky sessions via a cookie named srv_id, performs basic failure detection with max_fails and fail_timeout, and terminates SSL/TLS at the load balancer. Each directive is carefully chosen to balance performance, reliability, and security.

Complete Implementation Example Using HAProxy

HAProxy is another battle-tested load balancer favored in high-performance environments. The following is a full HAProxy configuration file showcasing Layer 7 load balancing with SSL termination, active health checks, and cookie persistence in a single instance. The http-request return and http-check send directives require HAProxy 2.2 or newer.

global
    log 127.0.0.1 local0
    maxconn 4096
    stats socket /var/run/haproxy.sock mode 660 level admin

defaults
    mode http
    timeout connect 5000ms
    timeout client 30000ms
    timeout server 30000ms

frontend https_frontend
    bind *:443 ssl crt /etc/haproxy/certs/api.example.com.pem
    acl is_health path /health
    use_backend health_check if is_health
    default_backend app_servers

backend health_check
    http-request return status 200 content-type "text/plain" string "healthy"

backend app_servers
    balance leastconn
    # option httpchk enables HTTP-mode health checks; the http-check
    # directives below customize the probe and the expected response
    option httpchk
    http-check send meth GET uri /health
    http-check expect status 200
    cookie SERVERID insert indirect nocache
    server app1 10.0.0.1:8080 check cookie s1 weight 2
    server app2 10.0.0.2:8080 check cookie s2 weight 2
    server app3 10.0.0.3:8080 check cookie s3 weight 1

HAProxy’s configuration clearly separates the frontend (where SSL termination occurs) from the backend (where health checks and load distribution happen). The leastconn balance method, cookie-based persistence, and active http-check ensure robust operation under heavy load.

Integration Patterns in Modern Architectures

In microservices architecture, the load balancer often lives inside an API gateway or service mesh. Tools such as Kubernetes Ingress controllers, Istio, or AWS Application Load Balancer combine load balancing, routing, and observability in a single control plane. The load balancer becomes the single entry point for external traffic, enforcing rate limiting, authentication, and routing rules before traffic reaches individual service pods.

When scaling to hundreds of instances, consistent hashing combined with service discovery ensures minimal reshuffling of traffic when nodes join or leave the cluster. This combination keeps the system both responsive and stable.

The load balancer also plays a critical role in disaster recovery strategies. By maintaining multiple active-active data centers and using DNS-based global load balancing or anycast routing, traffic can be redirected within seconds of a regional outage.

Every essential aspect of load balancers — from algorithm selection to health monitoring, from SSL termination to session management — must be designed with the broader system design principles in mind. A well-implemented load balancer transforms a fragile collection of servers into a resilient, scalable, and highly available distributed system.

The diagram below summarizes the load balancing system architecture discussed in this article.

[Image: Load balancing system architecture overview]
To master more concepts like this, consider acquiring the System Design Handbook available at https://codewithdhanian.gumroad.com/l/ntmcf.
