The journey to a million users begins with a simple load balancer
If you’ve deployed a web app, you’ve already built the first 10% of a scalable system.
The next 10% is learning what happens when one server isn’t enough, and what failure feels like in production.
If you’ve never deployed a web app, start here: Reddit Guide
In this post, I’ll recreate the "load balancer" chapter from Alex Xu’s System Design Interview on a $20 VPS, using Caddy as an L7 reverse proxy. The goal isn’t novelty; it’s building intuition for retries, health checks, and failover.
Baseline Assumptions
I had my portfolio site deployed on a Hetzner VPS. Mine is a TanStack Start app, but I'll assume you have something similar to the following:
An app that runs in a container on port 3000.
A CI process builds/pushes to GHCR.
A server pulls the latest image via Docker Compose.
Once you have an app that lives on a server somewhere, we are good to go.
Load Balancing
Now what if I post a cool project on Hacker News and it goes viral and I get thousands of people looking at my site? My poor half-CPU server would probably melt. I could scale vertically and just add more compute. But let's say it went really viral and I got millions of people looking at my cool project! Well, you can't have a single machine with infinite CPU.
That's where the load balancer comes in. A load balancer is also a failure detector and a policy engine: it decides where traffic goes and how quickly it gives up when a backend misbehaves. It can do a few things:
It can distribute traffic among several servers. This assumes your app is largely stateless (or that state lives in shared systems like a DB/Redis).
If one server is down, your app can still work, because it's running on the others.
Without a load balancer:
One server → single point of failure (SPOF)
One server → limited CPU
One server → deployment risk
Load balancing is the first real step from "app" to "system." Think of it as traffic control: each request gets directed to a healthy server.
Of course this can cause some funny behavior! If server 1 had one version of an app and server 2 had another version, then you can easily run into a situation where you see a bug happen in one user's session and not another. This is where observability comes into play, but that is not in the scope of this blog post.
There are two types of load balancers: L4 and L7.
L4 Load Balancing
L4 load balancing operates at the transport layer. It forwards TCP/UDP connections but is blind to HTTP: it routes traffic to different servers without knowing what's in the request. You can't inspect the URL, headers, or MIME type, or any of the request's contents. The trade-off is that it's fast.
L7 Load Balancing
L7 operates at the application layer. It understands HTTP, so it can route by path or headers and can run HTTP health checks. For example, we can route requests to /api to our API server, while all other requests go to our web app server.
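As a concrete sketch, path-based L7 routing in Caddy can look like this (the hostnames and ports here are placeholders, not from my setup):

```caddyfile
example.com {
    # Requests under /api go to the API backend
    handle /api/* {
        reverse_proxy api-server:4000
    }

    # Everything else falls through to the web app
    handle {
        reverse_proxy web-app:3000
    }
}
```

handle blocks are mutually exclusive, so only the first matching one runs; the /api/* matcher catches API traffic before the catch-all.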
For the purposes of this article, we will be focusing on L7 load balancing using Caddy reverse proxy. I'm using L7 because it’s the most common "first load balancer" in web stacks and it's where practical concerns like health checks and timeouts show up fast.
Wiring two backends behind one entrypoint
So, how do you set up a load balancer? Well, I spun up another small Hetzner server in the same network zone (in my case, eu-central). I copied the bustamam-tech portion of the docker-compose file to the new server. I ran docker compose up -d to spin up my portfolio site on that server.
Then, I set up a private network on Hetzner. In your Hetzner project, go to Networks and click Create Network.
You can name your network whatever you want. I called mine load-balancer-test. You can change the name later. Ensure the network zone matches the same network zone of the two servers. Then, for IP range, you can leave that as default.
In case you're curious what the IP range means, this Stack Overflow post explains it, and you can dig deeper from there. Suffice to say, it allocates a block of private IP addresses for the resources you put into your network: the /16 means you get around 65k addresses, which ought to be plenty.
Then, click into your network and click on "Attach Resources." Click on your servers and you should see the private IP addresses your resources have.
Note: I used /24 for my IP range; this gives me 256 unique IP addresses, which should suffice for my experimentation.
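To make the /16 vs /24 numbers concrete, the address count falls out of the prefix length. A tiny sketch (the helper name is mine, not from any library):

```typescript
// Number of addresses in an IPv4 CIDR block: 2^(32 - prefixLength).
// (A few of these are reserved for network/gateway use, but this is the ballpark.)
function addressesInCidr(prefixLength: number): number {
  return 2 ** (32 - prefixLength);
}

console.log(addressesInCidr(16)); // 65536 — the "around 65k" of the default /16
console.log(addressesInCidr(24)); // 256  — plenty for a couple of servers
```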
Now your servers can talk to each other! Let's add a basic /api/whoami route to our app so we know which server we're talking to.
import { createFileRoute } from "@tanstack/react-router";
import { json } from "@tanstack/react-start";

function getServerId(): string {
  return process.env.SERVER_ID ?? "unknown";
}

export const Route = createFileRoute("/api/whoami")({
  server: {
    handlers: {
      GET: async () => {
        const serverId = getServerId();
        return json({
          message: `hello from server ${serverId}`,
          serverId,
          pid: process.pid,
          time: new Date().toISOString(),
        });
      },
    },
  },
});
Great, now let's update the docker-compose.yml files on both servers.
server-1:
services:
  bustamam-tech:
    image: ghcr.io/abustamam/bustamam-tech:latest
    container_name: bustamam-tech
    environment:
      SERVER_ID: bustamam-tech-1
    expose:
      - "3000"
    restart: unless-stopped
Notice the addition of the SERVER_ID env var.
server-2:
services:
  bustamam-tech:
    image: ghcr.io/abustamam/bustamam-tech:latest
    container_name: bustamam-tech
    environment:
      SERVER_ID: bustamam-tech-2
    ports:
      - "10.0.0.3:3100:3000"
    restart: unless-stopped
On server-2, we bind the container to its private IP so other machines in the network can reach it.
Note: 10.0.0.3 is the private IP address of server-2.
Now, try an API call from one server to the other (for example, from bustamam-tech-1 I curled http://10.0.0.3:3100/api/whoami and got a response):
root@bustamam-tech-1: $ curl http://10.0.0.3:3100/api/whoami
{"message":"hello from server bustamam-tech-2","serverId":"bustamam-tech-2","pid":1,"time":"2026-02-24T18:15:13.089Z"}
Hooray! There's a lot of neat stuff you can do just by setting up a private network (shared DBs, etc), but we'll probably tackle that later in the series.
Note: For cost, I colocated the load balancer and one backend on the same box. This is a SPOF (single point of failure). In production, the LB should be an independent failure domain (or managed).
To set up your load balancer using Caddy, update your Caddyfile to reference your other server:
bustamam.tech {
    reverse_proxy bustamam-tech:3000 10.0.0.3:3100 {
        lb_policy round_robin
    }
}
Notice that we added 10.0.0.3:3100 as a target. If we were using a dedicated server for load balancing, we'd need to replace bustamam-tech:3000 with that server's private IP in the same fashion; but since Caddy runs on the same host (and Docker network) as the bustamam-tech container, we can refer to it by its service name.
lb_policy round_robin means the load balancer forwards one request to each server in turn before starting over from the first. There are plenty of other algorithms; another commonly used one is least_conn, which routes each request to the upstream with the fewest active connections. For more information, check out the Caddy docs.
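To build intuition for what round_robin actually does, here's a toy picker (the idea, not Caddy's implementation; the upstream addresses are just the ones from this post):

```typescript
// Toy round-robin upstream picker: cycles through upstreams in order,
// wrapping back to the first after the last.
function makeRoundRobin(upstreams: string[]): () => string {
  let next = 0;
  return () => {
    const picked = upstreams[next];
    next = (next + 1) % upstreams.length;
    return picked;
  };
}

const pick = makeRoundRobin(["bustamam-tech:3000", "10.0.0.3:3100"]);
console.log(pick()); // bustamam-tech:3000
console.log(pick()); // 10.0.0.3:3100
console.log(pick()); // bustamam-tech:3000 — back to the start
```

A real LB layers health state on top of this: unhealthy upstreams get skipped in the rotation.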
And that's it! Restart Caddy (docker compose restart caddy), then, from a terminal outside of your Hetzner network (like running locally), run the following bash command:
for i in {1..10}; do curl {your_whoami_route}; echo; done
If you see the responses alternating between servers, congrats! You just set up your own load balancer.
Failover: the first real production footgun
Let's test failover. If server 2 goes down, then the load balancer should only ever connect to server 1.
Run docker compose down on server 2 to simulate the server being down. Then run the bash script again.
for i in {1..10}; do curl {your_whoami_route}; echo; done
Uh oh! Notice the 5 seconds of latency? Failover only kicks in after the request to the dead server times out. We need a way for our load balancer to know whether a server is up. There are a few ways to do this, but a common solution is a health check: a route that basically says "I'm alive!".
Why health checks? A health check doesn't actually do anything, right? It's just a route that returns a static OK.
Well, this is why we have health checks:
Without health checks, the LB only discovers failure after a request fails.
That means user-facing latency spikes.
Health checks convert reactive detection into proactive detection.
Without health checks, failover happens only after a user request times out. That means your users become your monitoring system, which is not an ideal experience for them.
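To make that concrete, here's a toy circuit-breaker-style tracker showing how an LB can remember that an upstream is down instead of rediscovering it on every request. This is a sketch of the idea, not Caddy's code; the thresholds mirror the max_fails / fail_duration knobs we'll configure later:

```typescript
// Toy health tracker: an upstream is marked "down" once it accumulates
// maxFails failures, and stays down for failDurationMs before retrying.
type UpstreamHealth = { fails: number; downUntil: number };

function recordResult(
  h: UpstreamHealth,
  ok: boolean,
  now: number,
  maxFails = 1,
  failDurationMs = 10_000,
): UpstreamHealth {
  if (ok) return { fails: 0, downUntil: 0 }; // any success resets the state
  const fails = h.fails + 1;
  return {
    fails,
    downUntil: fails >= maxFails ? now + failDurationMs : h.downUntil,
  };
}

const isHealthy = (h: UpstreamHealth, now: number) => now >= h.downUntil;

let health: UpstreamHealth = { fails: 0, downUntil: 0 };
health = recordResult(health, false, 1_000); // one failure, max_fails = 1
console.log(isHealthy(health, 2_000));  // false — skipped for the next 10s
console.log(isHealthy(health, 12_000)); // true — window expired, try again
```

The win is that the failed probe (or failed request) is absorbed once, and every subsequent request skips the dead upstream for the duration of the window.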
Well, we already have a /api/whoami route that doesn't do anything but return an environment variable. Let's use that.
bustamam.tech {
    reverse_proxy bustamam-tech:3000 10.0.0.3:3100 {
        lb_policy round_robin

        # Active health checking
        health_uri /api/whoami
    }
}
Note: in many production environments, the health check lives at a route like /api/healthz or /api/healthcheck that just returns { OK: true } or something like that. I'll leave that as an exercise for the reader to implement if interested.
Restart your Caddy service (docker compose restart caddy), then try it again.
for i in {1..10}; do curl {your_whoami_route}; echo; done
Yay! All of our traffic is being routed to server 1!
So, what happened during the failover process?
Caddy picked server-2 (round robin)
server-2 was down
the client waited for any connect/response timeouts
only then did Caddy try the next upstream
Without health checks, your first failure is paid for by a real user.
The final config I went with is:
bustamam.tech {
    reverse_proxy bustamam-tech:3000 10.0.0.3:3100 {
        lb_policy round_robin

        # total retry window across upstreams
        lb_try_duration 3s
        # how often to retry upstreams within that window
        lb_try_interval 250ms

        # active health checking
        health_uri /api/healthz # note I changed this from /api/whoami
        health_interval 5s
        health_timeout 2s

        # how long to keep an upstream marked "down" after failures (circuit-breaker window)
        fail_duration 10s
        # threshold of failures before marking an upstream down
        max_fails 1

        # fail fast when an upstream is unresponsive
        transport http {
            # TCP connect timeout to the upstream
            dial_timeout 1s
            # slow-backend detection (time waiting for the first byte)
            response_header_timeout 2s
        }
    }
}
Again, refer to the Caddy docs for more information on the config. These settings are basically: bounded retries, active health checks, and a circuit breaker, plus aggressive timeouts so failure is detected quickly. Rule of thumb: timeouts first, retries second. Retries without timeouts just turn slow failures into traffic jams.
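To sanity-check the numbers in that config, a quick back-of-envelope (my arithmetic about the knobs above, not Caddy internals):

```typescript
// Rough worst-case retry math for the config above.
const tryDurationMs = 3_000; // lb_try_duration
const tryIntervalMs = 250;   // lb_try_interval
const dialTimeoutMs = 1_000; // dial_timeout

// Upper bound on retry attempts within the retry window:
const maxAttempts = Math.floor(tryDurationMs / tryIntervalMs);
console.log(maxAttempts); // 12

// Even if every attempt hits a dead upstream and burns the full connect
// timeout, the total wait is capped by the try window:
const worstCaseWaitMs = Math.min(tryDurationMs, maxAttempts * dialTimeoutMs);
console.log(worstCaseWaitMs); // 3000 — bounded, unlike the 5s stall we saw earlier
```

That's the point of "timeouts first, retries second": the retry budget is bounded, so a dead backend costs at most a few seconds instead of stacking up open connections.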
And there we have it! A complete load balancer in just a few lines of config.
Wrapping it up
At this point, you've moved from deploying an app to operating a system.
We explored L7 load balancing. We set up a second server to host our web app, we used Caddy to implement failover, and we watched it work before our eyes. Best of all, this didn't require a lot of code!
Notably though, we did not cover:
Database replication
Session stickiness
Deployment coordination
Distributed logging
Observability
We will tackle these later in the series.
Caddy is great for reverse proxies and basic L7 load balancing. But many companies will expect you to also know how to set up a load balancer in nginx or HAProxy. Next: nginx or HAProxy, and why teams choose one over the other (operability, observability, failure semantics).





