Running Ollama behind a reverse proxy is the simplest way to get HTTPS, optional access control, and predictable streaming behaviour.
This post focuses on Caddy and Nginx ingress for the Ollama API, not on client code.
If you already have Python or Go clients talking to Ollama, this post is the missing piece: ingress and transport for the same API.
- For how Ollama fits alongside vLLM, Docker Model Runner, LocalAI, and cloud hosting trade-offs, see LLM Hosting in 2026: Local, Self-Hosted & Cloud Infrastructure Compared.
- For request examples and client code, see the Ollama CLI Cheatsheet.
- For UI and multi-user layers, see the Open WebUI overview, quickstart and alternatives.
- For the bigger picture on self-hosting and data control, see LLM self-hosting and AI sovereignty.
- For a reproducible single-node Ollama service in Docker Compose (persistent volumes, OLLAMA_HOST, NVIDIA GPUs, upgrades), see Ollama in Docker Compose with GPU and Persistent Model Storage.
Why you should proxy Ollama instead of exposing port 11434
Ollama is designed to run locally first. Out of the box it binds to localhost on port 11434, which is great for a developer workstation and a not-so-subtle hint that the raw port is not meant to be internet-facing.
I treat port 11434 as an internal, high-cost API. If it is reachable from the public internet, anyone who finds it can burn your CPU or GPU time, fill your disk by pulling models, or just keep connections open until something times out. A reverse proxy does not make Ollama safer by magic, but it gives you a place to put the controls that matter at the edge: TLS, authentication, timeouts, rate limits, and logs.
This matters because the local Ollama API does not ship with a built-in authentication layer. If you expose it, you typically add auth at the edge or keep it private and reachable only over a trusted network.
The second reason is UX. Ollama streams responses by default. If the proxy buffers or compresses in the wrong spot, streaming feels broken and UIs look like they are "thinking" with no output.
Minimal architecture and binding strategy
A clean minimum looks like this:
```
Client (curl, Python, Go, UI)
        |
        | HTTPS (optional Basic Auth or SSO)
        v
Reverse proxy (Caddy or Nginx)
        |
        | HTTP (private LAN, localhost, or Docker network)
        v
Ollama server (ollama serve on 127.0.0.1:11434)
```
Two practical rules keep this boring in the best way.
First, keep Ollama private and move exposure to the proxy. If Caddy or Nginx runs on the same host, proxy to 127.0.0.1:11434 and do not change Ollama's bind address. If the proxy runs elsewhere (separate host, separate VM, or a container network), bind Ollama to a private interface, not 0.0.0.0 on the public NIC, and lean on a firewall.
Second, decide early whether browsers will call Ollama directly. If a browser-based tool hits Ollama from a different origin, you may need to deal with CORS. If everything is served from one domain via the proxy (recommended for sanity), you can often avoid CORS entirely and keep Ollama strict.
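Before wiring up the proxy, it is worth confirming where Ollama is actually listening. A quick check, assuming a systemd-based install (the drop-in approach and the OLLAMA_HOST value are illustrative):

```shell
# Confirm Ollama is bound to loopback only (expect 127.0.0.1:11434, not 0.0.0.0).
ss -ltnp | grep 11434

# To pin the bind address explicitly, use a systemd drop-in:
sudo systemctl edit ollama
#   [Service]
#   Environment="OLLAMA_HOST=127.0.0.1:11434"
sudo systemctl restart ollama

# The API should answer locally, and only locally.
curl -sS http://127.0.0.1:11434/api/version
```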
Reverse proxy configs for streaming and WebSockets
Ollama's API is regular HTTP, and its streaming is newline-delimited JSON (NDJSON). That means you want a proxy that does three things well:
- It does not buffer streaming responses.
- It does not kill long-running requests just because the model took a while to speak.
- If a UI uses WebSockets (some do), it forwards the Upgrade handshake cleanly.
You can keep this simple. In many cases, "correct WebSockets handling" is just having a config that is Upgrade-safe even if the upstream does not use WebSockets today.
Caddy Caddyfile example
Caddy is the "less config, more defaults" option. If you put a public domain name in the site address, Caddy will typically obtain and renew certificates automatically.
Minimal reverse proxy, HTTPS, and streaming-friendly settings:
```caddyfile
# ollama.example.com A/AAAA -> your proxy host
ollama.example.com {
    # Optional Basic Auth at the edge.
    # Generate a password hash with:
    #   caddy hash-password --algorithm bcrypt
    #
    # basic_auth {
    #     alice $2a$12$REDACTED...
    # }

    reverse_proxy 127.0.0.1:11434 {
        # Some setups prefer pinning the upstream Host header.
        # Ollama's own docs show this pattern for Nginx.
        header_up Host localhost:11434

        # For streaming or chat-like workloads, prefer low latency.
        # NDJSON streaming usually flushes immediately anyway, but this makes it explicit.
        flush_interval -1

        transport http {
            # Avoid upstream gzip negotiation if it interferes with streaming.
            compression off

            # Give Ollama time to load a model and produce the first chunk.
            response_header_timeout 10m
            dial_timeout 10s
        }
    }
}
```
If you already have an SSO gateway (oauth2-proxy, Authelia, authentik outpost, etc), Caddy has an opinionated forward auth directive. The pattern is "auth first, then proxy":
```caddyfile
ollama.example.com {
    forward_auth 127.0.0.1:4180 {
        uri /oauth2/auth

        # Copy the identity headers your gateway returns, if you need them.
        copy_headers X-Auth-Request-User X-Auth-Request-Email Authorization
    }

    reverse_proxy 127.0.0.1:11434
}
```
Nginx server block example
Nginx gives you a bit more rope. The upside is that the knobs are explicit, and it has built-in primitives for rate limiting and connection limiting. The footgun is buffering: Nginx buffers proxied responses by default, which is the opposite of what you want for NDJSON streaming.
This example includes:
- HTTP to HTTPS redirect
- TLS certificate paths (Certbot style)
- WebSocket-safe Upgrade forwarding
- Streaming-friendly proxy_buffering off
- Longer timeouts than the default 60s
```nginx
# /etc/nginx/conf.d/ollama.conf

# WebSocket-safe Connection header handling
map $http_upgrade $connection_upgrade {
    default upgrade;
    ""      close;
}

# Optional request rate limiting (IP-based)
# limit_req_zone $binary_remote_addr zone=ollama_rate:10m rate=10r/s;

server {
    listen 80;
    server_name ollama.example.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl http2;
    server_name ollama.example.com;

    ssl_certificate     /etc/letsencrypt/live/ollama.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ollama.example.com/privkey.pem;

    # Optional Basic Auth at the edge.
    # auth_basic "Ollama";
    # auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        # Optional rate limit
        # limit_req zone=ollama_rate burst=20 nodelay;

        proxy_pass http://127.0.0.1:11434;

        # Match Ollama docs pattern when proxying to localhost.
        proxy_set_header Host localhost:11434;

        # WebSocket Upgrade handling (harmless if unused).
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;

        # Critical for NDJSON streaming.
        proxy_buffering off;

        # Prevent 60s idle timeouts while waiting for tokens.
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
    }
}
```
If you want an SSO-style gate in Nginx, the equivalent pattern is auth_request. Nginx sends a subrequest to your auth service, and only proxies to Ollama when auth returns 2xx.
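A minimal sketch of that pattern, assuming an auth service on 127.0.0.1:4180 that answers on /oauth2/auth (adjust paths and port to whatever your gateway exposes):

```nginx
# Internal-only subrequest target; never reachable by clients directly.
location = /oauth2/auth {
    internal;
    proxy_pass http://127.0.0.1:4180/oauth2/auth;
    # Auth subrequests should carry headers, not the request body.
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
}

location / {
    # Only proxy to Ollama when the subrequest returns 2xx.
    auth_request /oauth2/auth;
    proxy_pass http://127.0.0.1:11434;
    # ...same streaming and timeout settings as above...
}
```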
TLS automation and renewal gotchas
For TLS, the operational split is simple.
With Caddy, TLS is usually "part of the reverse proxy". Automatic HTTPS is one of its flagship features, so certificate issuance and renewal are coupled to keeping Caddy running, having working DNS, and exposing ports 80 and 443.
With Nginx, TLS is usually "a separate ACME client plus Nginx". The common failure mode is not crypto, it is plumbing:
- Port 80 not reachable for HTTP-01 challenges.
- Certificates stored in a container but not persisted.
- Rate limits when doing repeated fresh installs or test deploys.
A subtle point that matters for long-lived services is that certificate lifetimes are short by design. Treat renewals as a background automation requirement, not an annual calendar event.
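With Certbot, the plumbing above usually reduces to two habits: issue once, then rehearse renewal explicitly rather than trusting it. The plugin choice and domain are illustrative:

```shell
# Issue a certificate and let Certbot wire it into the Nginx config
# (or use --webroot / --standalone if you prefer to keep Nginx untouched).
sudo certbot --nginx -d ollama.example.com

# Renewal runs from a systemd timer or cron entry; dry-run it to verify
# the HTTP-01 path still works before the real deadline arrives.
sudo certbot renew --dry-run
```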
Authentication, abuse control, and verification
This is the part that makes an internet-facing LLM endpoint feel professional.
Authentication options, from blunt to elegant
Basic Auth at the proxy is blunt, but surprisingly effective for a private endpoint. It is also easy to apply to both HTTP requests and WebSocket upgrades.
If you want browser-friendly login flows, forward auth and auth_request are the common pattern. Your proxy stays stateless, and an auth gateway owns sessions and MFA. The trade-off is more moving parts.
If you are already running Open WebUI, you can also rely on its app-level authentication and keep Ollama itself private. The proxy then protects Open WebUI, not Ollama directly.
If you do not need public access at all, a network-only approach can be cleaner. For example, Tailscale Serve can expose a local service inside your tailnet without opening inbound ports on your router.
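As a sketch, recent Tailscale releases can publish the local port inside your tailnet with one command; the exact flags have changed across versions, so check `tailscale serve --help` on your installation:

```shell
# Serve local port 11434 over HTTPS to your tailnet, staying in the background.
# Tailscale terminates TLS with a cert for your tailnet domain.
tailscale serve --bg 11434

# Confirm what is being served and to whom.
tailscale serve status
```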
Abuse basics for an expensive API
Ollama is a powerful local API, and its surface goes beyond generation. It has endpoints for chat, embeddings, listing models, and version checks. Treat the whole API as sensitive.
Official API reference (endpoints and streaming): https://docs.ollama.com/api
At the proxy layer, there are three low-effort controls that reduce day-one pain:
- Rate limiting per IP on generation endpoints.
- Connection limits to stop a small number of clients holding everything open.
- Conservative timeouts that match your model and hardware reality, not generic web defaults.
At the Ollama layer, the server itself can reject overload with a 503 once its request queue fills, and it has server-side knobs for queueing and concurrency. Proxy rate limiting keeps you from getting there as often.
Verification checklist
Use the same checks you would use for any streaming API.
- Basic connectivity and TLS

```bash
curl -sS https://ollama.example.com/api/version
curl -sS https://ollama.example.com/api/tags | head
```

- Streaming works end to end (no buffering)

```bash
curl -N https://ollama.example.com/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"mistral","prompt":"Write 10 words only.","stream":true}'
```

If you are behind Basic Auth:

```bash
curl -N -u alice:REDACTED https://ollama.example.com/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"mistral","prompt":"Write 10 words only.","stream":true}'
```
- Browser UI sanity
- Load your chat UI and trigger a response.
- If the UI uses WebSockets, confirm you do not see 400 or 426 errors and the connection stays open during generation.
If the curl output only appears at the end, it is almost always buffering at the proxy. Re-check proxy_buffering off in Nginx, and consider forcing low-latency flushing in Caddy for the Ollama site block.
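A quick way to tell buffered from streamed output is to timestamp each NDJSON line as it arrives: with working streaming the timestamps spread out over the generation, with buffering they all land at once. Host and model are the same assumptions as above:

```shell
curl -sN https://ollama.example.com/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model":"mistral","prompt":"Count to five.","stream":true}' \
| while IFS= read -r line; do
    # Prefix each chunk with a millisecond timestamp (GNU date).
    printf '%s  %.60s\n' "$(date +%T.%3N)" "$line"
  done
```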