The journey to a million users begins with a simple load balancer
If you’ve deployed a web app, you’ve already built the first 10% of a scalable system.
The next 10% is learning what happens when one server isn’t enough, and what failure feels like in production.
If you’ve never deployed a web app, start here: Reddit Guide
In this post, I’ll recreate the "load balancer" chapter from Alex Xu’s System Design Interview on a $20 VPS, using Caddy as an L7 reverse proxy. The goal isn’t novelty; it’s building intuition for retries, health checks, and failover.
Baseline Assumptions
I had my portfolio site deployed on a Hetzner VPS. Mine is a TanStack Start app, but I'll assume you have something similar to the following:
An app that runs in a container on port 3000.
A CI process builds/pushes to GHCR.
A server pulls the latest image via Docker Compose.
Once you have an app that lives on a server somewhere, we are good to go.
Load Balancing
Now what if I post a cool project on Hacker News and it goes viral and I get thousands of people looking at my site? My poor half-CPU server would probably melt. I could scale vertically and just add more compute. But let's say it went really viral and I got millions of people looking at my cool project! Well, you can't have a single machine with infinite CPU.
That's where the load balancer comes in. A load balancer is also a failure detector and a policy engine: it decides where traffic goes and how quickly it gives up when a backend misbehaves. It can do a few things:
It can distribute traffic among several servers. This assumes your app is largely stateless (or that state lives in shared systems like a DB/Redis).
If one server is down, your app can still work, because it's running on the others.
Without a load balancer:
One server → single point of failure (SPOF)
One server → limited CPU
One server → deployment risk
Load balancing is the first real step from "app" to "system." Think of it as traffic control: each request gets directed to a healthy server.
Of course this can cause some funny behavior! If server 1 had one version of an app and server 2 had another version, then you can easily run into a situation where you see a bug happen in one user's session and not another. This is where observability comes into play, but that is not in the scope of this blog post.
There are two types of load balancers: L4 and L7.
L4 Load Balancing
L4 load balancing operates at the transport layer. It forwards TCP/UDP connections but is blind to HTTP: it routes traffic to different servers without knowing what's in the request. You can't inspect the URL, headers, or MIME type, or any of the request's contents. The trade-off is that it's fast.
L7 Load Balancing
L7 operates at the application layer. It understands HTTP, so it can route by path or headers and can run HTTP health checks. For example, we can route requests to /api to our API server, while all other requests go to our web app server.
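As a concrete sketch, path-based L7 routing in Caddy can look like this (the hostnames and ports here are placeholders, not from my setup):

```caddyfile
example.com {
    # Requests under /api go to the API backend
    handle /api/* {
        reverse_proxy api-server:4000
    }

    # Everything else falls through to the web app
    handle {
        reverse_proxy web-app:3000
    }
}
```

handle blocks are mutually exclusive, so only the first matching one runs; the /api/* matcher catches API traffic before the catch-all.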
For the purposes of this article, we will be focusing on L7 load balancing using Caddy reverse proxy. I'm using L7 because it’s the most common "first load balancer" in web stacks and it's where practical concerns like health checks and timeouts show up fast.
Wiring two backends behind one entrypoint
So, how do you set up a load balancer? Well, I spun up another small Hetzner server in the same network zone (in my case, eu-central). I copied the bustamam-tech portion of the docker-compose file to the new server. I ran docker compose up -d to spin up my portfolio site on that server.
Then, I set up a private network on Hetzner. In your Hetzner project, go to Networks and click Create Network.
You can name your network whatever you want. I called mine load-balancer-test. You can change the name later. Ensure the network zone matches the same network zone of the two servers. Then, for IP range, you can leave that as default.
In case you're curious what the IP range means, this Stack Overflow post explains it, and you can dig deeper from there. Suffice to say, it allocates a block of private IP addresses for the resources you put into your network: the /16 means you get around 65k addresses, which ought to be plenty.
Then, click into your network and click on "Attach Resources." Click on your servers and you should see the private IP addresses your resources have.
Note: I used /24 for my IP range; this gives me 256 unique IP addresses, which should suffice for my experimentation.
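To make the /16 vs /24 numbers concrete, the address count falls out of the prefix length. A tiny sketch (the helper name is mine, not from any library):

```typescript
// Number of addresses in an IPv4 CIDR block: 2^(32 - prefixLength).
// (A few of these are reserved for network/gateway use, but this is the ballpark.)
function addressesInCidr(prefixLength: number): number {
  return 2 ** (32 - prefixLength);
}

console.log(addressesInCidr(16)); // 65536 — the "around 65k" of the default /16
console.log(addressesInCidr(24)); // 256  — plenty for a couple of servers
```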
Now your servers can talk to each other! Let's add a basic /api/whoami route to our app so we know which server we're talking to.
import { createFileRoute } from "@tanstack/react-router";
import { json } from "@tanstack/react-start";

function getServerId(): string {
  return process.env.SERVER_ID ?? "unknown";
}

export const Route = createFileRoute("/api/whoami")({
  server: {
    handlers: {
      GET: async () => {
        const serverId = getServerId();
        return json({
          message: `hello from server ${serverId}`,
          serverId,
          pid: process.pid,
          time: new Date().toISOString(),
        });
      },
    },
  },
});
Great, now let's update the docker-compose.yml files on both servers.
server-1:
services:
  bustamam-tech:
    image: ghcr.io/abustamam/bustamam-tech:latest
    container_name: bustamam-tech
    environment:
      SERVER_ID: bustamam-tech-1
    expose:
      - "3000"
    restart: unless-stopped
Notice the addition of the SERVER_ID env var.
server-2:
services:
  bustamam-tech:
    image: ghcr.io/abustamam/bustamam-tech:latest
    container_name: bustamam-tech
    environment:
      SERVER_ID: bustamam-tech-2
    ports:
      - "10.0.0.3:3100:3000"
    restart: unless-stopped
On server-2, we bind the container to its private IP so other machines in the network can reach it.
Note: 10.0.0.3 is the private IP address of server-2.
Now, try an API call from one server to the other (for example, from bustamam-tech-1 I curled http://10.0.0.3:3100/api/whoami and got a response):
root@bustamam-tech-1: $ curl http://10.0.0.3:3100/api/whoami
{"message":"hello from server bustamam-tech-2","serverId":"bustamam-tech-2","pid":1,"time":"2026-02-24T18:15:13.089Z"}
Hooray! There's a lot of neat stuff you can do just by setting up a private network (shared DBs, etc), but we'll probably tackle that later in the series.
Note: For cost, I colocated the load balancer and one backend on the same box. This is a SPOF (single point of failure). In production, the LB should be an independent failure domain (or managed).
To set up your load balancer using Caddy, update your Caddyfile to reference your other server:
bustamam.tech {
    reverse_proxy bustamam-tech:3000 10.0.0.3:3100 {
        lb_policy round_robin
    }
}
Notice that we added 10.0.0.3:3100 as a target. If we were using a dedicated server for load balancing, we'd need to replace bustamam-tech:3000 with that server's private IP in the same fashion; but since Caddy runs on the same host (and Docker network) as the bustamam-tech container, we can refer to it by its service name.
lb_policy round_robin means the load balancer forwards one request to each server in turn before starting over from the first. There are plenty of other algorithms; another commonly used one is least_conn, which routes each request to the upstream with the fewest active connections. For more information, check out the Caddy docs.
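To build intuition for what round_robin actually does, here's a toy picker (the idea, not Caddy's implementation; the upstream addresses are just the ones from this post):

```typescript
// Toy round-robin upstream picker: cycles through upstreams in order,
// wrapping back to the first after the last.
function makeRoundRobin(upstreams: string[]): () => string {
  let next = 0;
  return () => {
    const picked = upstreams[next];
    next = (next + 1) % upstreams.length;
    return picked;
  };
}

const pick = makeRoundRobin(["bustamam-tech:3000", "10.0.0.3:3100"]);
console.log(pick()); // bustamam-tech:3000
console.log(pick()); // 10.0.0.3:3100
console.log(pick()); // bustamam-tech:3000 — back to the start
```

A real LB layers health state on top of this: unhealthy upstreams get skipped in the rotation.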
And that's it! Restart Caddy (docker compose restart caddy), then, from a terminal outside of your Hetzner network (like running locally), run the following bash command:
for i in {1..10}; do curl {your_whoami_route}; echo; done
If you see the responses alternating between servers, congrats! You just set up your own load balancer.
Failover: the first real production footgun
Let's test failover. If server 2 goes down, then the load balancer should only ever connect to server 1.
Run docker compose down on server 2 to simulate the server being down. Then run the bash script again.
for i in {1..10}; do curl {your_whoami_route}; echo; done
Uh oh! Notice the 5 seconds of latency? Failover only kicks in after the request to the dead server times out. We need a way for our load balancer to know whether a server is up. There are a few ways to do this, but a common solution is a health check: a route that basically says "I'm alive!".
Why health checks? A health check doesn't actually do anything, right? It's just a route that returns a static OK.
Well, this is why we have health checks:
Without health checks, the LB only discovers failure after a request fails.
That means user-facing latency spikes.
Health checks convert reactive detection into proactive detection.
Without health checks, failover happens only after a user request times out. That means your users become your monitoring system, which is not an ideal experience for them.
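To make that concrete, here's a toy circuit-breaker-style tracker showing how an LB can remember that an upstream is down instead of rediscovering it on every request. This is a sketch of the idea, not Caddy's code; the thresholds mirror the max_fails / fail_duration knobs we'll configure later:

```typescript
// Toy health tracker: an upstream is marked "down" once it accumulates
// maxFails failures, and stays down for failDurationMs before retrying.
type UpstreamHealth = { fails: number; downUntil: number };

function recordResult(
  h: UpstreamHealth,
  ok: boolean,
  now: number,
  maxFails = 1,
  failDurationMs = 10_000,
): UpstreamHealth {
  if (ok) return { fails: 0, downUntil: 0 }; // any success resets the state
  const fails = h.fails + 1;
  return {
    fails,
    downUntil: fails >= maxFails ? now + failDurationMs : h.downUntil,
  };
}

const isHealthy = (h: UpstreamHealth, now: number) => now >= h.downUntil;

let health: UpstreamHealth = { fails: 0, downUntil: 0 };
health = recordResult(health, false, 1_000); // one failure, max_fails = 1
console.log(isHealthy(health, 2_000));  // false — skipped for the next 10s
console.log(isHealthy(health, 12_000)); // true — window expired, try again
```

The win is that the failed probe (or failed request) is absorbed once, and every subsequent request skips the dead upstream for the duration of the window.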
Well, we already have a /api/whoami route that doesn't do anything but return an environment variable. Let's use that.
bustamam.tech {
    reverse_proxy bustamam-tech:3000 10.0.0.3:3100 {
        lb_policy round_robin

        # Active health checking
        health_uri /api/whoami
    }
}
Note: in many production environments, the health check lives at a route like /api/healthz or /api/healthcheck that just returns { OK: true } or something like that. I'll leave that as an exercise for the reader to implement if interested.
Restart your Caddy service (docker compose restart caddy), then try it again.
for i in {1..10}; do curl {your_whoami_route}; echo; done
Yay! All of our traffic is being routed to server 1!
So, what happened during the failover process?
Caddy picked server-2 (round robin)
server-2 was down
the client waited for any connect/response timeouts
only then did Caddy try the next upstream
Without health checks, your first failure is paid for by a real user.
The final config I went with is:
bustamam.tech {
    reverse_proxy bustamam-tech:3000 10.0.0.3:3100 {
        lb_policy round_robin

        # total retry window across upstreams
        lb_try_duration 3s
        # how often to retry upstreams within that window
        lb_try_interval 250ms

        # active health checking
        health_uri /api/healthz # note I changed this from /api/whoami
        health_interval 5s
        health_timeout 2s

        # how long to keep an upstream marked "down" after failures (circuit-breaker window)
        fail_duration 10s
        # threshold of failures before marking an upstream down
        max_fails 1

        # fail fast when an upstream is unresponsive
        transport http {
            # TCP connect timeout to the upstream
            dial_timeout 1s
            # slow-backend detection (time waiting for the first byte)
            response_header_timeout 2s
        }
    }
}
Again, refer to the Caddy docs for more information on the config. These settings are basically: bounded retries, active health checks, and a circuit breaker, plus aggressive timeouts so failure is detected quickly. Rule of thumb: timeouts first, retries second. Retries without timeouts just turn slow failures into traffic jams.
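To sanity-check the numbers in that config, a quick back-of-envelope (my arithmetic about the knobs above, not Caddy internals):

```typescript
// Rough worst-case retry math for the config above.
const tryDurationMs = 3_000; // lb_try_duration
const tryIntervalMs = 250;   // lb_try_interval
const dialTimeoutMs = 1_000; // dial_timeout

// Upper bound on retry attempts within the retry window:
const maxAttempts = Math.floor(tryDurationMs / tryIntervalMs);
console.log(maxAttempts); // 12

// Even if every attempt hits a dead upstream and burns the full connect
// timeout, the total wait is capped by the try window:
const worstCaseWaitMs = Math.min(tryDurationMs, maxAttempts * dialTimeoutMs);
console.log(worstCaseWaitMs); // 3000 — bounded, unlike the 5s stall we saw earlier
```

That's the point of "timeouts first, retries second": the retry budget is bounded, so a dead backend costs at most a few seconds instead of stacking up open connections.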
And there we have it! A complete load balancer in just a few lines of config.
Wrapping it up
At this point, you've moved from deploying an app to operating a system.
We explored L7 load balancing. We set up a second server to host our web app, we used Caddy to implement failover, and we watched it work before our eyes. Best of all, this didn't require a lot of code!
Notably though, we did not cover:
Database replication
Session stickiness
Deployment coordination
Distributed logging
Observability
We will tackle these later in the series.
Caddy is great for reverse proxies and basic L7 load balancing. But many companies will expect you to also know how to set up a load balancer in nginx or HAProxy. Next: nginx or HAProxy, and why teams choose one over the other (operability, observability, failure semantics).





