
Rasheed Bustamam

Posted on • Originally published at bustamam.hashnode.dev

NGINX Load Balancing, Failover & TLS on a VPS

Using the tools of titans

In our previous post, we built an L7 load balancer using Caddy reverse proxy. In this post, we'll migrate that configuration over to nginx so we can compare tradeoffs. But first, what is nginx?

What is Nginx?

NGINX, or nginx because I don't like screaming (pronounced engine-x, though some folks will say en-jinx), is a high-performance HTTP server and reverse proxy commonly used for load balancing, TLS termination, and serving static content.

Note: for the purposes of this guide, "nginx" always refers to nginx OSS, not nginx's enterprise offering, nginx Plus. When reading docs about nginx, ensure you are reading docs about nginx OSS, typically hosted at nginx.org.

Where Caddy optimizes for simplicity and automatic TLS, nginx exposes lower-level control over request routing, buffering, and upstream behavior. In our previous setup, Caddy handled reverse proxying and active health checks across two upstream nodes.

In this migration, we move to nginx to gain explicit control over upstream pools, failure detection, and connection timeouts.

Preconditions

I'm assuming you read the last post. If not, here are our baseline assumptions:

  • Domain bustamam.tech A record points to server-1 public IP

  • server-1 and server-2 are on the same Hetzner private network

  • server-2 exposes app on private IP/port: 10.0.0.3:3100 -> container:3000

To confirm, from an ssh session in server-1, run this:

curl -s http://10.0.0.3:3100/api/whoami

If that returns server-2 (or whatever your SERVER_ID is) then we can continue.

$ curl http://10.0.0.3:3100/api/whoami
{"message":"hello from server bustamam-tech-2","serverId":"bustamam-tech-2","pid":1,"time":"2026-02-24T22:34:37.462Z"}

Basic nginx scaffold

Right now, our app looks something like this:

Internet (HTTPS)
        ↓
     Caddy
        ↓
   App containers

Caddy:

  • Listens on 80/443 (http/s)

  • Owns the TLS cert

  • Decrypts HTTPS to HTTP

  • Forwards HTTP to upstream containers

  • Does L7 load balancing between them

We are going to introduce nginx as the new edge reverse proxy.

That means nginx will do exactly the same thing, and we'll remove Caddy from the loop.

The new architecture becomes:

Internet (HTTPS)
        ↓
     nginx
        ↓
   App containers

We are not:

  • Changing the app

  • Changing Docker build

  • Changing the private network

  • Moving certs to backend servers

  • Doing TLS passthrough

We are just replacing Caddy with nginx as the TLS-terminating L7 proxy.

TLS termination at nginx does the following:

  • It keeps certificates in one place

  • It allows HTTP-aware load balancing

  • It lets nginx inspect requests if needed

  • It simplifies backend containers (they only speak HTTP)

This is a common production pattern.

In order for all of this to work, we need three things:

1. nginx needs config files

So it knows:

  • which domain it serves

  • where to proxy traffic

  • where the cert files are

You can check out the documentation on nginx web servers here.

2. nginx needs certificates

Let's Encrypt cert + private key must live in a mounted volume.

Documentation on nginx certs here.

3. nginx needs to expose ports 80 and 443

Because it becomes the public entrypoint.

Let's start with the config. On your load balancer server, run the following:

mkdir -p nginx/conf.d

This is where your nginx site configs will live. The location is arbitrary -- we will map it to a docker volume.

OK, let's spin nginx up. Update your docker-compose.yml file:

services:
  # caddy, your app, etc
  nginx:
    image: nginx:1.27-alpine
    container_name: nginx
    ports:
      - "8080:8080"
    volumes:
      - ./nginx/conf.d:/etc/nginx/conf.d:ro
    depends_on: [bustamam-tech]
    restart: unless-stopped

Then, create a file called 00-shadow.conf in your conf.d directory.

Note: conf file names are not super important, but nginx does load them in lexicographic (alphabetical) order, so it's common practice to prepend 00 for sorting purposes.

# Shadow nginx: runs on :8080 so we can test without touching Caddy (:80/:443)

upstream bustamam_upstreams {
  # round robin + retry-on-failure behavior are nginx defaults
  server bustamam-tech:3000;
  server 10.0.0.3:3100;

  # Note: we're currently relying on nginx's default passive health checks
}


server {
  listen 8080;
  server_name _;

  location / {
    proxy_pass http://bustamam_upstreams;

    # Minimum headers to keep apps happy
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
  }
}

What this does: nginx is now an L7 proxy and can load balance, but it's intentionally naive.

Let's spin up our nginx service.

docker compose up nginx -d

# after it's running

docker compose ps

# you should see your nginx service, as well as any other service that might be running

Now you can do the loop:

for i in {1..10}; do curl -s http://bustamam.tech:8080/api/whoami; echo; done

Note: this is http, not https. Note the port as well: it matches the listen port in the .conf file.

You should get alternating server IDs. If you don't, double check your config!
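Rather than eyeballing the loop output, you can tally responses per server. Here's the idea against three captured example lines (the serverIds here are illustrative); in practice, pipe the curl loop into the same filter:

```shell
# Count how many responses each serverId produced. The printf lines stand in
# for captured curl output; swap in the curl loop to run this for real.
printf '%s\n' \
  '{"serverId":"bustamam-tech-1"}' \
  '{"serverId":"bustamam-tech-2"}' \
  '{"serverId":"bustamam-tech-1"}' \
  | grep -o '"serverId":"[^"]*"' \
  | sort | uniq -c
```

Against the live loop, roughly even counts mean round robin is doing its job.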

Testing Failover

Let's pull down server-2 for a second: run docker compose down on server-2. Then try the curl loop again.

[screenshot: verifying failover]

Uh-oh! We hang whenever round robin would have sent us to server-2! Let's fix that.

Note: nginx has some pretty long defaults (proxy_connect_timeout alone defaults to 60 seconds), so while it may feel like forever, it's likely about a minute. While it is said that patience is a virtue, a user won't use an app that takes 60 seconds to load or fetch data! Timeouts are your first production knob.

Handling Failover

Let's update our config so we don't wait forever when a destination server is down.

  location / {
    proxy_pass http://bustamam_upstreams;

    proxy_connect_timeout 1s; # timeout for connecting to the upstream https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_connect_timeout
    proxy_read_timeout 5s; # timeout for reading the response from the upstream https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout

    # Minimum headers to keep apps happy
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
  }  

Restart nginx on server-1:

docker compose restart nginx

Note: for the remainder of this post, we will assume that a config change is followed by container restart.

And try it again!

[screenshot: verifying failover]

Great! But wait: if server-2 is down, how long are we waiting before nginx sends the request to server-1? Let's add some observability. Update your location config so we have access to the upstream IP addresses:

  location / {
    proxy_pass http://bustamam_upstreams;

    proxy_connect_timeout 1s; # timeout for connecting to the upstream https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_connect_timeout
    proxy_read_timeout 5s; # timeout for reading the response from the upstream https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout

    # Minimum headers to keep apps happy
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    add_header X-Upstream $upstream_addr always;
  }

And let's try a slightly different bash script.

for i in {1..10}; do
  echo "---- $i ----"

  curl -s -D headers.txt \
       -w "\ncode=%{http_code} time=%{time_total}\n" \
       http://bustamam.tech:8080/api/whoami

  grep -i x-upstream headers.txt
  echo
done

[screenshot: verifying latency]

Aha! Even though we're getting 200s from the correct server, look at the third request. We added a whole second of latency, and the X-Upstream header shows that a request was first attempted against server-2. Even when the request succeeds, failover can still cost you a timeout. Success isn't the same as fast.

Let's flesh this out a bit more. Let's update our upstream config. Defaults exist, but we want our system to be able to explain itself:

# Shadow nginx: runs on :8080 so we can test without touching Caddy (:80/:443)

upstream bustamam_upstreams {
  # primary server with default settings
  # note that because this service lives on this machine, if this server is down, the nginx container will also be down.
  server bustamam-tech:3000;

  # secondary server with custom settings
  # max_fails=1
  #   If 1 request fails within the fail_timeout window,
  #   mark this upstream as "unavailable".
  #
  # fail_timeout=10s
  #   How long to consider that backend "down" before retrying it.
  #
  server 10.0.0.3:3100 max_fails=1 fail_timeout=10s;
}

server {
  listen 8080;
  server_name _;

  location / {
    proxy_pass http://bustamam_upstreams;

    proxy_connect_timeout 1s; # timeout for connecting to the upstream https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_connect_timeout
    proxy_read_timeout 5s; # timeout for reading the response from the upstream https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout

    # Minimum headers to keep apps happy
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    add_header X-Upstream $upstream_addr always;


    # Note: this is nginx's default https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_next_upstream
    proxy_next_upstream error timeout;

    # how many retries to attempt before giving up https://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_next_upstream_tries
    proxy_next_upstream_tries 2; # default is 0, which means unbounded retries!
  }
}

Let's quickly talk about max_fails, fail_timeout, proxy_next_upstream and proxy_next_upstream_tries.

  • max_fails / fail_timeout is upstream-level passive failure marking.

  • proxy_next_upstream / tries is request-level retry routing.

Think of this as two layers: upstream marking (which servers are considered eligible) and per-request retries (what nginx does when a request fails mid-flight).

Note: unbounded does not mean infinite. To be more explicit, 0 means no explicit limit (i.e., not bounded by tries). In practice, retries are still bounded by timeouts and available upstreams, but it's not a safe default if you're trying to reason about worst-case latency.
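As an illustration of the upstream-marking layer, here's a toy model of the bookkeeping (not nginx's actual implementation): with max_fails=1 and fail_timeout=10, a single failure makes the upstream ineligible until the window expires.

```shell
# Toy model of passive marking: a failure at time $now marks the upstream
# "down" until now + fail_timeout; after that it's eligible for a probe again.
fail_timeout=10
now=0
down_until=0

record_failure() { down_until=$(( now + fail_timeout )); }  # max_fails=1: one failure trips it
available()      { [ "$now" -ge "$down_until" ]; }

record_failure                                   # failure at t=0
now=5;  available || echo "t=5: still marked down"
now=10; available && echo "t=10: eligible for a probe again"
```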

Feel free to play around with numbers! For example:

server 10.0.0.3:3100 max_fails=1 fail_timeout=60s;

Now if this server fails once, it won't be tried again for another 60 seconds. But it also means that if the server came back up, no one could reach it for up to 60 seconds. This is where metrics and understanding your system as a whole are important. For a toy project like this, even 10+ minutes would be fine.

The trade-off is fast recovery (low timeout) vs minimizing probe spikes (high timeout).

You can also tighten up the connect timeout:

proxy_connect_timeout 200ms;

But this only works if your private network is reliably fast. proxy_connect_timeout covers establishing the TCP connection; if that ever takes more than 200ms, you may mark an otherwise healthy server as dead due to network jitter.

Note: If you're running into issues with your config, make sure you know the difference between the upstream definition (where servers live) and proxy behavior (how requests fail over). Mixing them up leads to configs that look reasonable but don't load, and the failure mode is "nothing works, and you're not sure why" unless you validate with nginx -t.

So, let's summarize what nginx is doing so far.

  1. nginx tries server-2 (round robin)

  2. it fails, and nginx marks it "down" for ~10s

  3. for the next ~10 seconds, nginx only uses server-1 (fast)

  4. once the 10s window expires, nginx will probe server-2 again by selecting it for a real request

  5. that request pays the 1s connect timeout (your ~1.5s)

  6. nginx retries server-1 and succeeds

  7. server-2 gets marked down again for another 10 seconds

  8. if server-2 ever comes back up, the next successful probe marks it healthy again
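The walk-through above also gives us the worst-case math. Assuming (a simplification) that each attempt at a down upstream burns the full proxy_connect_timeout before nginx retries:

```shell
# Worst-case added latency: failed connect attempts each cost the full
# proxy_connect_timeout, and proxy_next_upstream_tries caps total attempts.
connect_timeout=1   # proxy_connect_timeout, in seconds
tries=2             # proxy_next_upstream_tries

added_latency() {   # $1 = down upstreams selected before a healthy one
  local failed=$(( $1 < tries ? $1 : tries ))
  echo $(( connect_timeout * failed ))
}

added_latency 1   # the probe request in step 5: prints 1 (the ~1s spike)
added_latency 2   # both upstreams down: prints 2, then nginx returns an error
```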

So it seems like we're at parity with Caddy, right? Well, unfortunately, no. We still need TLS termination. Let's handle that next.

TLS Termination

Right now:

  • Caddy terminates TLS on :443 and proxies to your backends.

  • nginx is shadow-testing on :8080 (plain HTTP).

TLS termination means:

The client's HTTPS connection ends at nginx. nginx decrypts the request, then forwards it to your upstreams over plain HTTP (usually over a private network/VPC).

So:

  • Browser ⇄ HTTPS ⇄ nginx (edge)

  • nginx ⇄ HTTP ⇄ upstreams (private)

That's what we mean when we say "terminate TLS at the load balancer."

Our plan is to replace Caddy. We want the following:

  1. nginx serves HTTP on :80 and handles the Let's Encrypt ACME challenge

  2. certbot obtains certs via webroot

  3. nginx serves HTTPS on :443 using those certs and proxies to upstreams

  4. shut down Caddy (to free 80/443), bring up nginx+certbot

Redirect http to https

Alright, we'll need some new directories for configs and certs.

mkdir -p nginx/www nginx/letsencrypt

Your directory structure should look something like this:

[screenshot: directory structure for configs]

I have a few extra files from messing around with configs. And again, the directory names are arbitrary; we'll get them mapped in Docker. It's important to understand that certbot doesn't "talk to nginx." They just share a filesystem: certbot writes files, nginx serves them. That's it.

  • nginx/www is where the ACME challenge files are written. When Let's Encrypt validates your domain, it requests http://bustamam.tech/.well-known/acme-challenge/<token>. Certbot writes that token file into your www/ directory, and nginx will serve that directory.

  • nginx/letsencrypt is where certs live (shared with nginx). When certbot succeeds, it writes cert files into /etc/letsencrypt/live/bustamam.tech/. So whatever local directory maps to /etc/letsencrypt must also be shared between certbot (read/write) and nginx (read-only).

Note: for more information on ACME and other Let's Encrypt challenges, check out their documentation on challenge types
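You can see that shared-filesystem contract in isolation with a local simulation (the /tmp path is illustrative; in the real setup, certbot writes into nginx/www, which nginx mounts as /var/www/certbot):

```shell
# Simulate the certbot/nginx handshake: one side writes the challenge token,
# the other reads it from the same directory. No ACME, no network involved.
demo=/tmp/acme-demo/.well-known/acme-challenge
mkdir -p "$demo"

echo "token-contents" > "$demo/test-token"   # what certbot does
cat "$demo/test-token"                       # what nginx would serve back to Let's Encrypt
```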

Let's delete everything in conf.d and start with a fresh config: bustamam.tech.conf (or whatever you wanna name it)

# ================================
# Upstreams
# ================================
upstream bustamam_upstreams {
  # Primary (local container)
  server bustamam-tech:3000;

  # Secondary (remote server over private network)
  server 10.0.0.3:3100 max_fails=1 fail_timeout=10s;
}

# ================================
# HTTP (port 80)
# - Serve ACME challenge
# - Redirect everything else to HTTPS
# ================================
server {
  listen 80;
  server_name bustamam.tech;

  # Let's Encrypt HTTP-01 challenge files live here
  location /.well-known/acme-challenge/ {
    root /var/www/certbot;
  }

  # Everything else goes to HTTPS
  location / {
    return 301 https://$host$request_uri;
  }
}

Footgun: We are purposely deferring https until later in this article. If you enable the listen 443 ssl server block before the certs exist, nginx may fail to start, and you'll see port 80 "hang" because nothing is listening. The bootstrap sequence is: HTTP first → obtain cert → enable HTTPS.

OK, now we need to update our docker-compose.yml file:

  nginx:
    image: nginx:1.27-alpine
    container_name: nginx
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx/conf.d:/etc/nginx/conf.d:ro
      - ./nginx/www:/var/www/certbot:ro
      - ./nginx/letsencrypt:/etc/letsencrypt:ro
    depends_on:
      - bustamam-tech
    restart: unless-stopped

  certbot:
    image: certbot/certbot:latest
    container_name: certbot
    volumes:
      - ./nginx/www:/var/www/certbot:rw
      - ./nginx/letsencrypt:/etc/letsencrypt:rw
    restart: "no"

Important to note:

  • nginx mounts certs directory read-only

  • certbot mounts cert directory read-write

Now let's bring our creation to life.

Bring nginx up on port 80, test http

Caddy is currently occupying ports 80 and 443. So if you have Caddy running, bring it down with docker compose down caddy. Then, bring up nginx. If it's already running, run docker compose restart nginx. Otherwise, docker compose up nginx -d.

Then test http connection:

curl -I http://bustamam.tech

You should see a 301 redirect to https, which is exactly what we want.

Note: if this hangs, you may need to check which services are actually listening on those ports. Try running this on the host machine: sudo ss -lntp | grep -E ':80|:443' and start there.

But we don't have https set up. Let's go do that.

Set up https

Let's update our conf file:

# ================================
# Upstreams
# ================================
upstream bustamam_upstreams {
  # Primary (local container)
  server bustamam-tech:3000;

  # Secondary (remote server over private network)
  server 10.0.0.3:3100 max_fails=1 fail_timeout=10s;
}

# ================================
# HTTP (port 80)
# - Serve ACME challenge
# - Redirect everything else to HTTPS
# ================================
server {
  listen 80;
  server_name bustamam.tech;

  # Let's Encrypt HTTP-01 challenge files live here
  location /.well-known/acme-challenge/ {
    root /var/www/certbot;
  }

  # Everything else goes to HTTPS
  location / {
    return 301 https://$host$request_uri;
  }
}

# ================================
# HTTPS (port 443)
# - Terminate TLS here
# - Reverse proxy to upstreams over HTTP
# ================================
server {
  listen 443 ssl;
  server_name bustamam.tech;

  # TLS certs (provided by certbot via shared volume)
  ssl_certificate     /etc/letsencrypt/live/bustamam.tech/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/bustamam.tech/privkey.pem;

  # A minimal modern TLS posture
  ssl_protocols TLSv1.2 TLSv1.3;

  location / {
    proxy_pass http://bustamam_upstreams;

    # Fail fast
    proxy_connect_timeout 1s;
    proxy_read_timeout 5s;
    proxy_send_timeout 5s;

    # Deterministic retry behavior (make defaults explicit)
    proxy_next_upstream error timeout http_502 http_503 http_504;
    proxy_next_upstream_tries 2;

    # Forwarding headers
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

    # Debug: show which upstream served (or was attempted)
    add_header X-Upstream $upstream_addr always;
  }
}

The http part (port 80) is the same. The https block is a barebones skeleton with some sensible defaults. The ssl_certificate files don't exist yet, though, so let's create them.

Obtain the certificates

Let's start with a test cert. In your host machine, run this command:

docker compose run --rm certbot certonly \
  --webroot -w /var/www/certbot \
  -d bustamam.tech \
  --test-cert \
  --agree-tos \
  -m rasheed.bustamam@gmail.com \
  --no-eff-email

It'll probably pull the image from Docker Hub, and when it succeeds, you should see a bunch of stuff appear under your letsencrypt directory:

[screenshot: certs and other files in letsencrypt dir]

If yes, then rerun the command without the --test-cert flag.

docker compose run --rm certbot certonly \
  --webroot -w /var/www/certbot \
  -d bustamam.tech \
  --agree-tos \
  -m rasheed.bustamam@gmail.com \
  --no-eff-email

It's possible this will ask you to reuse your current cert, or create a new one. Choose to create a new one; you can't use a test cert in production environments.

Now let's restart nginx so it can read our new certs!

Activate https in nginx

Just run

docker compose restart nginx

And test:

curl -I https://bustamam.tech

[screenshot: load balancing still working, and https working as well using curl]

Let's test our whoami route too:

curl -s https://bustamam.tech/api/whoami

[screenshot: whoami route correctly load balances as well]

Now we have https working and our load balancer is still working!

Now, I have to note -- since we are managing our own certs, we also have to renew them:

docker compose run --rm certbot renew --webroot -w /var/www/certbot
docker exec nginx nginx -s reload

You can run this as a cron job if you'd like, but that's beyond the scope of this article.
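For reference, a minimal crontab sketch might look like this (the schedule and the /opt/app project directory are illustrative; adjust them to wherever your docker-compose.yml lives):

```shell
# Attempt renewal twice a day (certbot only renews when the cert is near
# expiry), then reload nginx so it picks up any new cert files.
0 3,15 * * * cd /opt/app && docker compose run --rm certbot renew --webroot -w /var/www/certbot && docker exec nginx nginx -s reload
```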

Comparison to Caddy

Now that we've finally reached parity with Caddy, let's compare!

As a reminder, this was our Caddyfile:

bustamam.tech {
    reverse_proxy bustamam-tech:3000 10.0.0.3:3100 {
        lb_policy round_robin

        # total retry window across upstreams
        lb_try_duration 3s             

        # how often to retry upstreams within that window
        lb_try_interval 250ms

        # Active health checking
        health_uri /api/healthz
        health_interval 5s
        health_timeout 2s

        # How long to consider a backend "down" after failures (circuit breaker window)
        # duration to keep an upstream marked as unhealthy
        fail_duration 10s              

        # threshold of failures before marking an upstream down
        max_fails 1                    

        # Fail fast when an upstream is unresponsive
        transport http {
            # TCP connect timeout to the upstream
            dial_timeout 1s            

            # slow backend detection (time waiting for first byte)
            response_header_timeout 2s 
        }
    }
}

With Caddy, we got active health checking and automatic TLS issuance and renewal. And then this was nginx:

# ================================
# Upstreams
# ================================
upstream bustamam_upstreams {
  # Primary (local container)
  server bustamam-tech:3000;

  # Secondary (remote server over private network)
  server 10.0.0.3:3100 max_fails=1 fail_timeout=10s;
}

# ================================
# HTTP (port 80)
# - Serve ACME challenge
# - Redirect everything else to HTTPS
# ================================
server {
  listen 80;
  server_name bustamam.tech;

  # Let's Encrypt HTTP-01 challenge files live here
  location /.well-known/acme-challenge/ {
    root /var/www/certbot;
  }

  # Everything else goes to HTTPS
  location / {
    return 301 https://$host$request_uri;
  }
}

# ================================
# HTTPS (port 443)
# - Terminate TLS here
# - Reverse proxy to upstreams over HTTP
# ================================
server {
  listen 443 ssl;
  server_name bustamam.tech;

  # TLS certs (provided by certbot via shared volume)
  ssl_certificate     /etc/letsencrypt/live/bustamam.tech/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/bustamam.tech/privkey.pem;

  # A minimal modern TLS posture
  ssl_protocols TLSv1.2 TLSv1.3;

  location / {
    proxy_pass http://bustamam_upstreams;

    # Fail fast
    proxy_connect_timeout 1s;
    proxy_read_timeout 5s;
    proxy_send_timeout 5s;

    # Deterministic retry behavior (make defaults explicit)
    proxy_next_upstream error timeout http_502 http_503 http_504;
    proxy_next_upstream_tries 2;

    # Forwarding headers
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

    # Debug: show which upstream served (or was attempted)
    add_header X-Upstream $upstream_addr always;
  }
}

With nginx, we got passive health checking, and we needed certbot to manage our certs for us.

So you may be asking, "Why is nginx better than Caddy?" The answer is that it isn't, not necessarily. Caddy is the better default for small systems. nginx is better when you need explicit control, standardized ops, or you're operating inside a bigger ecosystem.

Doing nginx here isn't "because it's better," it's because it teaches you how the edge actually works when the platform stops holding your hand.

Caddy can keep a backend out of rotation before a user hits it. nginx usually learns a backend is dead because a user hit it (or because passive marking was configured).


It's important to call out that we don't want to be comparing "lines of config" when evaluating tools. It's a matter of what you own vs what you delegate.

Caddy: batteries included, opinionated defaults

We got, almost for free:

  • automatic TLS issuance/renewal

  • active health checks

  • nice LB ergonomics (health_uri, fail_duration, etc.)

  • fewer footguns

  • 10-ish lines of config

So for a $20 VPS and learning, Caddy is amazing.

nginx OSS: modular and explicit

We had to build the edge out of primitives:

  • TLS is not automatic (had to use certbot)

  • health checks are passive unless you add extra machinery

  • reload behavior and config validation are on you

  • you need to understand contexts (upstream vs location) or you break it

  • about 60-ish lines of config

That pain is the point: nginx forces us to learn the contract between:

  • TCP port binding

  • TLS termination

  • request routing

  • retries/timeouts

  • failure detection

  • certificate lifecycle

This is the systems knowledge that we're trying to learn in the first place.

When to Choose nginx over Caddy

Ideally, your team is already using one and you just need to learn it :)

But for greenfield projects, or for understanding when to migrate from Caddy to nginx:

1) When you need a boring industry standard

nginx is everywhere. If you join a team with existing nginx infra, knowing it is immediate leverage.

2) When you need predictable, explicit behavior at the edge

In nginx you can be extremely specific about:

  • what counts as retryable

  • how many tries

  • timeouts per phase (connect/send/read)

  • failure semantics per upstream

Caddy has knobs too, but nginx's model maps closely to how a lot of production stacks think.

3) When the ecosystem around it matters

nginx has deep integration patterns with:

  • legacy deployments

  • enterprise tooling

  • common security hardening playbooks

  • common debugging muscle memory (every SRE has done nginx -T, nginx -t, reloads, etc.)

4) When performance tuning at massive scale is the job

At large companies, nobody is choosing nginx because "it's faster" per request in isolation. They're choosing it because:

  • they know how to operate it safely

  • they know how it fails

  • it has predictable resource profiles and instrumentation patterns

The interesting part isn't that we got it working. It's that we can now explain worst-case latency: connect timeout + number of tries + fail_timeout window. That's the difference between "it seems fine" and "I can predict how it fails."

Conclusion

For my $20 VPS and my hobby projects, Caddy is obviously the better tool. It's simpler, safer, and gives me active health checks and automatic TLS with almost no ceremony.

I rebuilt it in nginx anyway because nginx makes the hidden parts visible: TLS bootstrapping, reload semantics, passive vs active failure detection, and how retries interact with timeouts. Those are the concepts that scale, and that's the whole point of this series.

In the next post, we'll actually go in the opposite direction -- we'll use a managed service to do all of this for us. See you there!
