DEV Community

Meyr

vLLM + LiteLLM Production Deployment: Building a Self-Hosted OpenAI-Compatible Gateway

How to put LiteLLM in front of vLLM as a single, stable, OpenAI-compatible endpoint — and the four failure modes that will eat your afternoon if nobody warns you.

The problem

You stood up vLLM, it serves an OpenAI-compatible API on :8000, and your first app talks to it directly. Then a second app shows up. Then a third. Now every app hard-codes a backend hostname, a port, a model ID, and (if you did the responsible thing) an API key. Swap the model, move the box, or add a second GPU node, and you're editing config in five places and restarting things you forgot existed.

Hitting vLLM directly gives you no per-app keys, no rate limits, no fallback if a backend is down, no usage or cost tracking, and no model aliasing — so the day you rename served-model-name from qwen3.6-35b to gpt-oss-120b, every downstream client 404s at once. (Ask me how I know.)

A gateway fixes the topology: apps point at one stable endpoint with one key shape, and the gateway routes, authenticates, and tracks. LiteLLM is the pragmatic pick because it speaks OpenAI in and OpenAI/Anthropic out, so vLLM sits behind it unchanged.

The architecture

The whole idea fits in one line: apps → gateway.lan:4000 (LiteLLM) → llm-backend-01 / llm-backend-02 on :8000 (vLLM).

Apps never learn that llm-backend-01 exists. They call gateway.lan:4000/v1/... with a model alias (say, coder), and LiteLLM maps that alias to a real backend + the real upstream model name + the upstream key. Add a node, rename a model, rotate a backend key — you change the gateway's config, and not one client config moves. That decoupling is the entire value, and it's free. Everything below makes it real.
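
For apps built on the official OpenAI SDKs, pointing at the gateway can be purely an environment change. A minimal sketch, assuming an SDK version that reads OPENAI_BASE_URL and OPENAI_API_KEY from the environment (the current official Python and Node clients do) and the gateway.lan hostname used above. The fallback key is only a placeholder; the real gateway key is the LITELLM_MASTER_KEY defined in the build below.

```shell
# Point an OpenAI-SDK app at the gateway instead of a specific vLLM box.
# Assumption: the app's SDK honors these two variables; older clients may
# need base_url / api_key passed explicitly in code instead.
export OPENAI_BASE_URL="http://gateway.lan:4000/v1"
export OPENAI_API_KEY="${LITELLM_MASTER_KEY:-sk-gateway-CHANGE-ME}"

# The app then requests a model alias such as "coder", never the upstream name.
```

Nothing in the app mentions a backend host, so moving or renaming a backend never touches this file.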

The minimal working build

This gets you a working gateway in front of one or two vLLM backends. You need: a host that can route to your vLLM box(es) on :8000, Docker + Compose, and vLLM already serving (with --served-model-name and --api-key set — note the exact served name; it matters a lot in Trap 2).

1. config.yaml

This is the whole brain of the gateway. model_name is the alias your apps call. The string after openai/ in litellm_params.model is what LiteLLM sends upstream and must equal vLLM's --served-model-name. The openai/ prefix just tells LiteLLM "this backend speaks OpenAI-compatible," which vLLM does.

# config.yaml
model_list:
  # --- backend 1: a general/reasoning model, aliased to "default" ---
  - model_name: default                       # what apps call
    litellm_params:
      model: openai/gpt-oss-120b              # MUST match --served-model-name on the backend
      api_base: http://llm-backend-01:8000/v1 # routable address of the vLLM host (NOT localhost — see Trap 1)
      api_key: os.environ/BACKEND_01_KEY      # vLLM's --api-key for this box

  # --- backend 2: a coding model, aliased to "coder" ---
  - model_name: coder
    litellm_params:
      model: openai/qwen3-coder-30b
      api_base: http://llm-backend-02:8000/v1
      api_key: os.environ/BACKEND_02_KEY

litellm_settings:
  drop_params: true          # silently drop params a given backend doesn't support
  num_retries: 2             # cheap resilience; full fallback chains are a bigger topic

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY   # the key YOUR apps present to the gateway

2. docker-compose.yml

# docker-compose.yml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-stable
    container_name: litellm
    restart: unless-stopped
    ports:
      - "4000:4000"
    volumes:
      - ./config.yaml:/app/config.yaml:ro
    environment:
      LITELLM_MASTER_KEY: ${LITELLM_MASTER_KEY}
      BACKEND_01_KEY: ${BACKEND_01_KEY}
      BACKEND_02_KEY: ${BACKEND_02_KEY}
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    # If vLLM runs on the SAME host as this container, see Trap 1 before you
    # touch api_base — localhost will not mean what you think it means.
    # extra_hosts:
    #   - "host.docker.internal:host-gateway"

3. .env

# .env  (never commit this)
LITELLM_MASTER_KEY=sk-gateway-CHANGE-ME
BACKEND_01_KEY=sk-backend01-CHANGE-ME       # == vLLM --api-key on llm-backend-01
BACKEND_02_KEY=sk-backend02-CHANGE-ME       # == vLLM --api-key on llm-backend-02

4. Bring it up and prove a request routes through

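set -a; . ./.env; set +a          # export the keys into this shell so the curl calls below can use them
docker compose up -d
docker compose logs -f litellm    # watch for the router loading both deployments (Ctrl-C stops following)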

# (a) ask the GATEWAY what models it exposes — you should see your aliases, not the upstream names
curl -s http://localhost:4000/v1/models \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq '.data[].id'
# -> "default"
# -> "coder"

# (b) prove a request routes alias -> backend -> back
curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"coder","messages":[{"role":"user","content":"reply with one word: hi"}]}' \
  | jq -r '.choices[0].message.content'
# -> hi

If (a) lists your aliases and (b) returns a word, you have a working gateway. Every app now points at http://gateway.lan:4000/v1, presents LITELLM_MASTER_KEY, and asks for default or coder. The backends, their real model names, and their real keys are now an implementation detail you can change without telling anyone.
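
Check (a) can be turned into a pass/fail assertion for a deploy script. A minimal sketch: missing_aliases is a hypothetical helper (not part of LiteLLM) that reads model IDs from stdin, one per line, and reports any expected alias that is absent.

```shell
# missing_aliases ALIAS...
# Reads model IDs from stdin, one per line, and prints a line for every
# expected alias that is not present. Returns non-zero if any are missing.
missing_aliases() {
  local actual name rc=0
  actual=$(cat)
  for name in "$@"; do
    if ! printf '%s\n' "$actual" | grep -Fxq -- "$name"; then
      echo "missing alias: $name"
      rc=1
    fi
  done
  return "$rc"
}
```

Fed from the same call as (a), it would look like: curl -s http://localhost:4000/v1/models -H "Authorization: Bearer $LITELLM_MASTER_KEY" | jq -r '.data[].id' | missing_aliases default coder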

The traps

This is the part the quickstarts skip. All four cost me real time.

Trap 1 — api_base: http://localhost:8000 from inside the container

Symptom: LiteLLM logs Connection refused / Cannot connect to host localhost:8000, but curl http://localhost:8000/v1/models from the host shell works fine. Maddening, because "it's literally right there."

Cause: localhost inside the LiteLLM container is the container, not your host or your vLLM box. The proxy is looking for vLLM inside its own namespace and finding nothing.

Fix: give api_base an address that resolves from inside the container:

  • vLLM on a different host → use that host's name/IP: http://llm-backend-01:8000/v1 (make sure the container can resolve it — DNS or an extra_hosts entry).
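  • vLLM on the same host as the container → http://host.docker.internal:8000/v1 with extra_hosts: ["host.docker.internal:host-gateway"], or hit the Docker bridge gateway directly, e.g. http://172.17.0.1:8000/v1 (the usual default-bridge address; confirm yours with ip addr show docker0). Either way, vLLM has to listen on an interface the container can reach: a server bound only to 127.0.0.1 stays invisible from inside the container.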

I hit the identical class of bug wiring a sidecar container to another service: the working URL was the bridge/container address (172.17.0.x), never localhost. When a container can't reach a service that's plainly running, suspect the network namespace first.
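
A quick lint can catch this before the container ever starts. The sketch below assumes the config.yaml layout shown earlier (one api_base per line); find_loopback_api_bases is a hypothetical helper name.

```shell
# find_loopback_api_bases CONFIG
# Prints every api_base line (with its line number) that points at a loopback
# address, which inside a container means the container itself.
# Returns non-zero when nothing suspicious is found (grep's convention).
find_loopback_api_bases() {
  grep -nE '^[[:space:]]*api_base:[[:space:]]*https?://(localhost|127\.[0-9.]+|\[::1\])([/:[:space:]]|$)' -- "$1"
}
```

Run against the config.yaml from this post, find_loopback_api_bases config.yaml prints nothing; any line it does print is a candidate for one of the two fixes above.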

Trap 2 — served-model-name ≠ the upstream model in litellm_params → silent 404

Symptom: the app works perfectly against vLLM directly, but through the gateway you get 404 model not found (or an empty/400 that gives nothing away). It feels intermittent because some of your apps were already calling the right string.

Cause: there are two names in play and people conflate them. model_name is the alias apps call. The part after openai/ in litellm_params.model is sent verbatim as the model field to vLLM, and vLLM only answers to its exact --served-model-name. Rename the model on the backend and forget to update the gateway, and every routed call 404s while a direct call (with the new name) succeeds.

Fix: pin the chain explicitly and verify both ends.

# what does the backend actually serve itself as?
curl -s http://llm-backend-01:8000/v1/models \
  -H "Authorization: Bearer $BACKEND_01_KEY" | jq -r '.data[].id'
# -> gpt-oss-120b      <-- this string must appear after openai/ in config.yaml

The full chain, spelled out: --served-model-name gpt-oss-120b on vLLM, model: openai/gpt-oss-120b in config.yaml, and model_name: default (or whatever your apps prefer) as the alias. This is also why the gateway is worth it: when I swapped a backend model, the served name changed and would have broken every client at once; behind the gateway it was a one-line config.yaml edit.
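
Verifying both ends is easy to script. A minimal sketch, assuming the config.yaml layout used above (one model: openai/<name> line per deployment); upstream_models and config_routes_to are hypothetical helper names.

```shell
# upstream_models CONFIG
# Prints the name after "openai/" for every litellm_params.model entry,
# i.e. the exact string LiteLLM will send to vLLM.
upstream_models() {
  sed -n 's|^[[:space:]]*model:[[:space:]]*openai/\([^[:space:]#]*\).*|\1|p' "$1"
}

# config_routes_to CONFIG SERVED_NAME
# Succeeds only if SERVED_NAME (what the backend reports) appears verbatim
# among the upstream names in CONFIG.
config_routes_to() {
  upstream_models "$1" | grep -Fxq -- "$2"
}
```

Paired with the curl above, the check becomes: config_routes_to config.yaml "$(curl -s http://llm-backend-01:8000/v1/models -H "Authorization: Bearer $BACKEND_01_KEY" | jq -r '.data[0].id')" || echo "served name is not routed by the gateway"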

Trap 3 — you fix the key/route, and LiteLLM keeps serving the broken one

Symptom: you correct an api_key or api_base, restart vLLM, re-test… still 401 (or still routing to the old backend). The change visibly "didn't take."

Cause: LiteLLM caches its resolved deployments in the router. This bites hardest in DB-backed mode (store_model_in_db: true, the common production setup) — editing a model in the DB or UI does not hot-reload the live router. The classic version: you add an --api-key to a backend that used to be keyless, update the gateway, and the gateway keeps serving the cached keyless deployment, which now returns 401.

Fix: force the router to reload.

docker restart litellm

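This got me twice: once after rotating a backend key, once after a model swap. Bake it into the runbook: any change to a model's key, base URL, or params is followed by a docker restart litellm, or you're debugging a ghost. One caveat for the file-based build above: a key that lives in .env only reaches the container when the container is recreated (docker compose up -d), because a plain restart keeps the environment the container was created with.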
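
Which reload is enough depends on where the change lives. A container's environment is fixed when the container is created, so a plain restart rereads the mounted config.yaml but not values that Compose injects from .env. The runbook rule can be sketched as a lookup; reload_command_for is a hypothetical helper, and the file names are the ones used in this build.

```shell
# reload_command_for FILE
# Prints the command that makes a change to FILE take effect in the gateway.
reload_command_for() {
  case "$(basename -- "$1")" in
    # config.yaml is bind-mounted and read at process start: a restart is enough.
    config.yaml) echo "docker restart litellm" ;;
    # .env values are baked in at container creation: the container must be recreated.
    .env)        echo "docker compose up -d --force-recreate litellm" ;;
    *)           echo "no reload rule for: $1" >&2; return 1 ;;
  esac
}
```

For example, reload_command_for .env prints the recreate command, which is the one that matters after rotating BACKEND_01_KEY.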

Trap 4 — "healthy" ≠ "ready": the backend is still loading while the gateway is green

Symptom: systemctl is-active vllm says active, the gateway's /health is green, but requests routed through LiteLLM 500 or connection-reset for the first 30–90 seconds after a backend restart — or forever, on a bad boot.

Cause: a vLLM systemd unit reports active the instant the process starts, long before the model is servable. The backend isn't ready until it has loaded tens of GB of weights and, on first boot, JIT-compiled its sampler kernel. On Blackwell + FlashInfer that JIT step needs ninja on the unit's PATH and a writable HOME for its cache — get either wrong and the port may open but never serve, while systemd still cheerfully says active until it crash-loops.

Fix: never gate readiness on systemctl or process liveness. Poll the real endpoint until it answers:

# readiness gate: succeeds only when the model is actually loaded and serving
until curl -fsS http://llm-backend-01:8000/v1/models \
        -H "Authorization: Bearer $BACKEND_01_KEY" >/dev/null 2>&1; do
  echo "backend warming up..."; sleep 3
done
echo "backend ready"

On the vLLM unit, make sure PATH includes the venv's bin (so ninja resolves for the kernel JIT) and HOME points at a writable directory (so the compiled-kernel cache warms once and survives restarts instead of recompiling every boot). Watch journalctl -u vllm -f for Application startup complete; that's your real "up," not is-active.
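
The until loop above waits forever on a bad boot. A bounded variant is sketched here under the assumption that a fixed deadline suits your deploy script; wait_until_ready is a hypothetical helper, and the probe is whatever command you pass it.

```shell
# wait_until_ready TIMEOUT_S INTERVAL_S PROBE [ARGS...]
# Runs PROBE repeatedly until it succeeds or roughly TIMEOUT_S seconds of
# sleeping have elapsed. Only sleep time is counted, so the real wall-clock
# limit is slightly longer when the probe itself is slow.
wait_until_ready() {
  local timeout_s=$1 interval_s=$2 waited=0
  shift 2
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "not ready after ${waited}s" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  echo "ready after ~${waited}s"
}
```

With the same probe as the gate above: wait_until_ready 300 3 curl -fsS http://llm-backend-01:8000/v1/models -H "Authorization: Bearer $BACKEND_01_KEY". A non-zero exit then distinguishes "still warming up" from "never came up", which is the case worth paging on.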

Honorable mention (a general LiteLLM trap, not from my own scars): LiteLLM's token-counting/cost path can try to fetch tiktoken encoding files at runtime, which fails on an air-gapped or offline host. If you run isolated, pre-seed the encodings and set the offline/cache env vars before you go dark.

What this deliberately doesn't cover

The build above is the honest minimum: one stable endpoint, model aliasing, upstream keys, two backends, and the four traps that make it actually stay up. It is not the production system. Everything that turns "it works on my LAN" into "it survives a team, a budget, and a 2 am page" is out of scope here on purpose:

  • Per-team / per-app virtual keys, budgets, and spend caps
  • Rate limiting (RPM/TPM) per key and per model
  • Fallback + retry chains and automatic failover across backends
  • Load balancing across multiple vLLM replicas of the same model
  • Observability — Prometheus + Grafana metrics, Langfuse request tracing, real cost/usage accounting
  • Auth in front of the gateway — SSO/proxy, TLS termination, network hardening
  • DB-backed config (postgres), the admin API/UI workflow, and the cache-reload gotchas that come with it
  • RAG and tool/web-search wiring through the gateway
  • The full 2 am failure catalogue — KV-cache OOM, crash-loops, driver/kernel drift after patching, silent context truncation, and how to tell which one you're looking at

I'm assembling all of that (copy-ready configs, the routing/fallback/observability stack, and the complete failure catalogue with symptom→fix for each) into a production playbook for running a private, self-hosted LLM gateway: the version that survives a team, a budget, and a 2 am page.

It's in progress now. If you want it when it lands, you can pre-order here: https://meyr.gumroad.com/l/vllm-playbook. Early buyers get it at a discount off the launch price. No spam, no drip sequence; you'll get one email when it's ready.

If you've hit a failure mode that isn't in the four above, drop it in the comments — I'm actively cataloguing them, and the gnarly ones will go in the playbook (credited if you want).
