<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Meyr</title>
    <description>The latest articles on DEV Community by Meyr (@meyr_cruywagen).</description>
    <link>https://dev.to/meyr_cruywagen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4000420%2F61ddb8ee-7e52-4a79-9d98-c7711766dd9f.png</url>
      <title>DEV Community: Meyr</title>
      <link>https://dev.to/meyr_cruywagen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/meyr_cruywagen"/>
    <language>en</language>
    <item>
      <title>vLLM + LiteLLM Production Deployment: Building a Self-Hosted OpenAI-Compatible Gateway</title>
      <dc:creator>Meyr</dc:creator>
      <pubDate>Thu, 25 Jun 2026 06:37:26 +0000</pubDate>
      <link>https://dev.to/meyr_cruywagen/vllm-litellm-production-deployment-building-a-self-hosted-openai-compatible-gateway-50e1</link>
      <guid>https://dev.to/meyr_cruywagen/vllm-litellm-production-deployment-building-a-self-hosted-openai-compatible-gateway-50e1</guid>
      <description>&lt;p&gt;&lt;em&gt;How to put LiteLLM in front of vLLM as a single, stable, OpenAI-compatible endpoint — and the four failure modes that will eat your afternoon if nobody warns you.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;You stood up vLLM, it serves an OpenAI-compatible API on &lt;code&gt;:8000&lt;/code&gt;, and your first app talks to it directly. Then a second app shows up. Then a third. Now every app hard-codes a backend hostname, a port, a model ID, and (if you did the responsible thing) an API key. Swap the model, move the box, or add a second GPU node, and you're editing config in five places and restarting things you forgot existed.&lt;/p&gt;

&lt;p&gt;Hitting vLLM directly gives you no per-app keys, no rate limits, no fallback if a backend is down, no usage or cost tracking, and no model aliasing — so the day you rename &lt;code&gt;served-model-name&lt;/code&gt; from &lt;code&gt;qwen3.6-35b&lt;/code&gt; to &lt;code&gt;gpt-oss-120b&lt;/code&gt;, every downstream client 404s at once. (Ask me how I know.)&lt;/p&gt;

&lt;p&gt;A gateway fixes the topology: apps point at &lt;strong&gt;one&lt;/strong&gt; stable endpoint with &lt;strong&gt;one&lt;/strong&gt; key shape, and the gateway routes, authenticates, and tracks. LiteLLM is the pragmatic pick because it speaks OpenAI in &lt;em&gt;and&lt;/em&gt; OpenAI/Anthropic out, so vLLM sits behind it unchanged.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;The whole idea fits in one diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3dlpqblhsvzpzt215ikl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3dlpqblhsvzpzt215ikl.png" alt=" " width="799" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apps never learn that &lt;code&gt;llm-backend-01&lt;/code&gt; exists. They call &lt;code&gt;gateway.lan:4000/v1/...&lt;/code&gt; with a model &lt;strong&gt;alias&lt;/strong&gt; (say, &lt;code&gt;coder&lt;/code&gt;), and LiteLLM maps that alias to a real backend + the real upstream model name + the upstream key. Add a node, rename a model, rotate a backend key — you change the gateway's config, and not one client config moves. That decoupling is the entire value, and it's free. Everything below makes it real.&lt;/p&gt;

&lt;h2&gt;
  
  
  The minimal working build
&lt;/h2&gt;

&lt;p&gt;This gets you a working gateway in front of one or two vLLM backends. You need: a host that can route to your vLLM box(es) on &lt;code&gt;:8000&lt;/code&gt;, Docker + Compose, and vLLM already serving (with &lt;code&gt;--served-model-name&lt;/code&gt; and &lt;code&gt;--api-key&lt;/code&gt; set — note the exact served name; it matters a lot in Trap 2).&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;code&gt;config.yaml&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the whole brain of the gateway. &lt;code&gt;model_name&lt;/code&gt; is the &lt;strong&gt;alias&lt;/strong&gt; your apps call. The string after &lt;code&gt;openai/&lt;/code&gt; in &lt;code&gt;litellm_params.model&lt;/code&gt; is what LiteLLM sends upstream and &lt;strong&gt;must equal vLLM's &lt;code&gt;--served-model-name&lt;/code&gt;&lt;/strong&gt;. The &lt;code&gt;openai/&lt;/code&gt; prefix just tells LiteLLM "this backend speaks OpenAI-compatible," which vLLM does.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# config.yaml&lt;/span&gt;
&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# --- backend 1: a general/reasoning model, aliased to "default" ---&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;                       &lt;span class="c1"&gt;# what apps call&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-oss-120b&lt;/span&gt;              &lt;span class="c1"&gt;# MUST match --served-model-name on the backend&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://llm-backend-01:8000/v1&lt;/span&gt; &lt;span class="c1"&gt;# routable address of the vLLM host (NOT localhost — see Trap 1)&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/BACKEND_01_KEY&lt;/span&gt;      &lt;span class="c1"&gt;# vLLM's --api-key for this box&lt;/span&gt;

  &lt;span class="c1"&gt;# --- backend 2: a coding model, aliased to "coder" ---&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coder&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai/qwen3-coder-30b&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://llm-backend-02:8000/v1&lt;/span&gt;
      &lt;span class="na"&gt;api_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/BACKEND_02_KEY&lt;/span&gt;

&lt;span class="na"&gt;litellm_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;drop_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;          &lt;span class="c1"&gt;# silently drop params a given backend doesn't support&lt;/span&gt;
  &lt;span class="na"&gt;num_retries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;             &lt;span class="c1"&gt;# cheap resilience; full fallback chains are a bigger topic&lt;/span&gt;

&lt;span class="na"&gt;general_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;master_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/LITELLM_MASTER_KEY&lt;/span&gt;   &lt;span class="c1"&gt;# the key YOUR apps present to the gateway&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;code&gt;docker-compose.yml&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;litellm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io/berriai/litellm:main-stable&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;litellm&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4000:4000"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./config.yaml:/app/config.yaml:ro&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;LITELLM_MASTER_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${LITELLM_MASTER_KEY}&lt;/span&gt;
      &lt;span class="na"&gt;BACKEND_01_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${BACKEND_01_KEY}&lt;/span&gt;
      &lt;span class="na"&gt;BACKEND_02_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${BACKEND_02_KEY}&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--config"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/app/config.yaml"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--port"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4000"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="c1"&gt;# If vLLM runs on the SAME host as this container, see Trap 1 before you&lt;/span&gt;
    &lt;span class="c1"&gt;# touch api_base — localhost will not mean what you think it means.&lt;/span&gt;
    &lt;span class="c1"&gt;# extra_hosts:&lt;/span&gt;
    &lt;span class="c1"&gt;#   - "host.docker.internal:host-gateway"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;code&gt;.env&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# .env  (never commit this)&lt;/span&gt;
&lt;span class="nv"&gt;LITELLM_MASTER_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-gateway-CHANGE-ME
&lt;span class="nv"&gt;BACKEND_01_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-backend01-CHANGE-ME       &lt;span class="c"&gt;# == vLLM --api-key on llm-backend-01&lt;/span&gt;
&lt;span class="nv"&gt;BACKEND_02_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-backend02-CHANGE-ME       &lt;span class="c"&gt;# == vLLM --api-key on llm-backend-02&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Bring it up and prove a request routes through
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
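&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
docker compose logs &lt;span class="nt"&gt;-f&lt;/span&gt; litellm    &lt;span class="c"&gt;# watch for the router loading both deployments (Ctrl-C to stop following)&lt;/span&gt;

&lt;span class="c"&gt;# load the keys from .env into this shell; the curl calls below read $LITELLM_MASTER_KEY from it&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; ./.env&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt; +a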

&lt;span class="c"&gt;# (a) ask the GATEWAY what models it exposes — you should see your aliases, not the upstream names&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:4000/v1/models &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$LITELLM_MASTER_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="s1"&gt;'.data[].id'&lt;/span&gt;
&lt;span class="c"&gt;# -&amp;gt; "default"&lt;/span&gt;
&lt;span class="c"&gt;# -&amp;gt; "coder"&lt;/span&gt;

&lt;span class="c"&gt;# (b) prove a request routes alias -&amp;gt; backend -&amp;gt; back&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:4000/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$LITELLM_MASTER_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model":"coder","messages":[{"role":"user","content":"reply with one word: hi"}]}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.choices[0].message.content'&lt;/span&gt;
&lt;span class="c"&gt;# -&amp;gt; hi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;(a)&lt;/code&gt; lists your aliases and &lt;code&gt;(b)&lt;/code&gt; returns a word, you have a working gateway. Every app now points at &lt;code&gt;http://gateway.lan:4000/v1&lt;/code&gt;, presents &lt;code&gt;LITELLM_MASTER_KEY&lt;/code&gt;, and asks for &lt;code&gt;default&lt;/code&gt; or &lt;code&gt;coder&lt;/code&gt;. The backends, their real model names, and their real keys are now implementation details you can change without telling anyone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The traps
&lt;/h2&gt;

&lt;p&gt;This is the part the quickstarts skip. All four cost me real time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trap 1 — &lt;code&gt;api_base: http://localhost:8000&lt;/code&gt; from inside the container
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; LiteLLM logs &lt;code&gt;Connection refused&lt;/code&gt; / &lt;code&gt;Cannot connect to host localhost:8000&lt;/code&gt;, but &lt;code&gt;curl http://localhost:8000/v1/models&lt;/code&gt; from the host shell works fine. Maddening, because "it's literally right there."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; &lt;code&gt;localhost&lt;/code&gt; inside the LiteLLM &lt;strong&gt;container&lt;/strong&gt; is the container, not your host or your vLLM box. The proxy is looking for vLLM inside its own namespace and finding nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; give &lt;code&gt;api_base&lt;/code&gt; an address that resolves &lt;em&gt;from inside the container&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vLLM on a &lt;strong&gt;different host&lt;/strong&gt; → use that host's name/IP: &lt;code&gt;http://llm-backend-01:8000/v1&lt;/code&gt; (make sure the container can resolve it — DNS or an &lt;code&gt;extra_hosts&lt;/code&gt; entry).&lt;/li&gt;
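&lt;li&gt;vLLM on the &lt;strong&gt;same host&lt;/strong&gt; as the container → &lt;code&gt;http://host.docker.internal:8000/v1&lt;/code&gt; with &lt;code&gt;extra_hosts: ["host.docker.internal:host-gateway"]&lt;/code&gt;, or hit the Docker bridge gateway directly, e.g. &lt;code&gt;http://172.17.0.1:8000/v1&lt;/code&gt; on the default bridge (a Compose project network usually gets its own subnet, so confirm the gateway with &lt;code&gt;docker network inspect&lt;/code&gt;). Either way, vLLM has to listen on an interface the container can reach, not only &lt;code&gt;127.0.0.1&lt;/code&gt;.&lt;/li&gt;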
&lt;/ul&gt;
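
&lt;p&gt;For the same-host case, the two settings from the second bullet could be wired together roughly as below. Treat it as a sketch rather than a tested drop-in: it reuses the &lt;code&gt;default&lt;/code&gt; alias, the &lt;code&gt;gpt-oss-120b&lt;/code&gt; served name, and &lt;code&gt;BACKEND_01_KEY&lt;/code&gt; from the build above, and it assumes vLLM is reachable on the host's port &lt;code&gt;8000&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# docker-compose.yml (fragment): map host.docker.internal to the host's gateway IP
services:
  litellm:
    extra_hosts:
      - "host.docker.internal:host-gateway"

# config.yaml (fragment): point the alias at the host instead of localhost
model_list:
  - model_name: default
    litellm_params:
      model: openai/gpt-oss-120b                      # assumed served name, as in the build above
      api_base: http://host.docker.internal:8000/v1   # resolves from inside the container
      api_key: os.environ/BACKEND_01_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;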

&lt;p&gt;I hit the identical class of bug wiring a sidecar container to another service: the working URL was the bridge/container address (&lt;code&gt;172.17.0.x&lt;/code&gt;), never &lt;code&gt;localhost&lt;/code&gt;. When a container can't reach a service that's plainly running, suspect the network namespace first.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trap 2 — &lt;code&gt;served-model-name&lt;/code&gt; ≠ the upstream model in &lt;code&gt;litellm_params&lt;/code&gt; → silent 404
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; the app works perfectly against vLLM directly, but through the gateway you get &lt;code&gt;404 model not found&lt;/code&gt; (or an empty/400 that gives nothing away). It feels intermittent because &lt;em&gt;some&lt;/em&gt; of your apps were already calling the right string.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; there are &lt;strong&gt;two&lt;/strong&gt; names in play and people conflate them. &lt;code&gt;model_name&lt;/code&gt; is the alias apps call. The part after &lt;code&gt;openai/&lt;/code&gt; in &lt;code&gt;litellm_params.model&lt;/code&gt; is sent &lt;strong&gt;verbatim&lt;/strong&gt; as the &lt;code&gt;model&lt;/code&gt; field to vLLM, and vLLM only answers to its exact &lt;code&gt;--served-model-name&lt;/code&gt;. Rename the model on the backend and forget to update the gateway, and every routed call 404s while a direct call (with the new name) succeeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; pin the chain explicitly and verify both ends.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# what does the backend actually serve itself as?&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://llm-backend-01:8000/v1/models &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$BACKEND_01_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.data[].id'&lt;/span&gt;
&lt;span class="c"&gt;# -&amp;gt; gpt-oss-120b      &amp;lt;-- this string must appear after openai/ in config.yaml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--served-model-name gpt-oss-120b&lt;/code&gt; on vLLM, &lt;code&gt;model: openai/gpt-oss-120b&lt;/code&gt; in config, &lt;code&gt;model_name: default&lt;/code&gt; (or whatever your apps prefer) as the alias. This is also &lt;em&gt;why&lt;/em&gt; the gateway is worth it — when I swapped a backend model, the served name changed and would have broken every client at once; behind the gateway it was a one-line &lt;code&gt;config.yaml&lt;/code&gt; edit.&lt;/p&gt;
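
&lt;p&gt;To make that one-line edit concrete, here is a sketch that assumes the backend behind &lt;code&gt;default&lt;/code&gt; was renamed from &lt;code&gt;qwen3.6-35b&lt;/code&gt; to &lt;code&gt;gpt-oss-120b&lt;/code&gt; (the rename used as an example earlier). Only the &lt;code&gt;model&lt;/code&gt; line changes. The alias, and therefore every client config, stays put.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# config.yaml (fragment), after the backend rename
model_list:
  - model_name: default                      # unchanged: apps keep calling "default"
    litellm_params:
      # model: openai/qwen3.6-35b            # old --served-model-name (illustrative)
      model: openai/gpt-oss-120b             # new --served-model-name, copied from the backend's /v1/models
      api_base: http://llm-backend-01:8000/v1
      api_key: os.environ/BACKEND_01_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;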

&lt;h3&gt;
  
  
  Trap 3 — you fix the key/route, and LiteLLM keeps serving the broken one
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; you correct an &lt;code&gt;api_key&lt;/code&gt; or &lt;code&gt;api_base&lt;/code&gt;, restart vLLM, re-test… still &lt;code&gt;401&lt;/code&gt; (or still routing to the old backend). The change visibly "didn't take."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; LiteLLM caches its resolved deployments in the router. This bites hardest in DB-backed mode (&lt;code&gt;store_model_in_db: true&lt;/code&gt;, the common production setup) — editing a model in the DB or UI does &lt;strong&gt;not&lt;/strong&gt; hot-reload the live router. The classic version: you add an &lt;code&gt;--api-key&lt;/code&gt; to a backend that used to be keyless, update the gateway, and the gateway &lt;em&gt;keeps serving the cached keyless deployment&lt;/em&gt;, which now returns &lt;code&gt;401&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; force the router to reload.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker restart litellm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



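&lt;p&gt;This got me twice, once after rotating a backend key and once after a model swap. Bake it into the runbook: &lt;strong&gt;any&lt;/strong&gt; change to a model's key, base URL, or params is followed by a &lt;code&gt;docker restart litellm&lt;/code&gt;, or you're debugging a ghost. One caveat: a restart reuses the container's existing environment, so if the value you changed lives in &lt;code&gt;.env&lt;/code&gt;, run &lt;code&gt;docker compose up -d&lt;/code&gt; instead, which recreates the container with the new values.&lt;/p&gt;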

&lt;h3&gt;
  
  
  Trap 4 — "healthy" ≠ "ready": the backend is still loading while the gateway is green
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; &lt;code&gt;systemctl is-active vllm&lt;/code&gt; says &lt;code&gt;active&lt;/code&gt;, the gateway's &lt;code&gt;/health&lt;/code&gt; is green, but requests routed through LiteLLM &lt;code&gt;500&lt;/code&gt; or connection-reset for the first 30–90 seconds after a backend restart — or forever, on a bad boot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cause:&lt;/strong&gt; a vLLM systemd unit reports &lt;code&gt;active&lt;/code&gt; the instant the process starts, long before the model is servable. The backend isn't ready until it has loaded tens of GB of weights &lt;strong&gt;and&lt;/strong&gt;, on first boot, JIT-compiled its sampler kernel. On Blackwell + FlashInfer that JIT step needs &lt;code&gt;ninja&lt;/code&gt; on the unit's &lt;code&gt;PATH&lt;/code&gt; and a writable &lt;code&gt;HOME&lt;/code&gt; for its cache — get either wrong and the port may open but never serve, while systemd still cheerfully says &lt;code&gt;active&lt;/code&gt; until it crash-loops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; never gate readiness on &lt;code&gt;systemctl&lt;/code&gt; or process liveness. Poll the &lt;strong&gt;real&lt;/strong&gt; endpoint until it answers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# readiness gate: succeeds only when the model is actually loaded and serving&lt;/span&gt;
&lt;span class="k"&gt;until &lt;/span&gt;curl &lt;span class="nt"&gt;-fsS&lt;/span&gt; http://llm-backend-01:8000/v1/models &lt;span class="se"&gt;\&lt;/span&gt;
        &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$BACKEND_01_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/dev/null 2&amp;gt;&amp;amp;1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"backend warming up..."&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;sleep &lt;/span&gt;3
&lt;span class="k"&gt;done
&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"backend ready"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the vLLM unit, make sure &lt;code&gt;PATH&lt;/code&gt; includes the venv's &lt;code&gt;bin&lt;/code&gt; (so &lt;code&gt;ninja&lt;/code&gt; resolves for the kernel JIT) and &lt;code&gt;HOME&lt;/code&gt; points at a writable directory (so the compiled-kernel cache warms once and survives restarts instead of recompiling every boot). Watch &lt;code&gt;journalctl -u vllm -f&lt;/code&gt; for &lt;code&gt;Application startup complete&lt;/code&gt;; &lt;em&gt;that's&lt;/em&gt; your real "up," not &lt;code&gt;is-active&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Honorable mention (a general LiteLLM trap, not from my own scars): LiteLLM's token-counting/cost path can try to fetch &lt;code&gt;tiktoken&lt;/code&gt; encoding files at runtime, which fails on an air-gapped or offline host. If you run isolated, pre-seed the encodings and set the offline/cache env vars before you go dark.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What this deliberately doesn't cover
&lt;/h2&gt;

&lt;p&gt;The build above is the honest minimum: one stable endpoint, model aliasing, upstream keys, two backends, and the four traps that make it actually stay up. It is &lt;strong&gt;not&lt;/strong&gt; the production system. Everything that turns "it works on my LAN" into "it survives a team, a budget, and a 2 am page" is out of scope here on purpose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-team / per-app &lt;strong&gt;virtual keys&lt;/strong&gt;, budgets, and spend caps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limiting&lt;/strong&gt; (RPM/TPM) per key and per model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fallback + retry chains&lt;/strong&gt; and automatic failover across backends&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load balancing&lt;/strong&gt; across multiple vLLM replicas of the same model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; — Prometheus + Grafana metrics, Langfuse request tracing, real cost/usage accounting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth in front of the gateway&lt;/strong&gt; — SSO/proxy, TLS termination, network hardening&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DB-backed config&lt;/strong&gt; (postgres), the admin API/UI workflow, and the cache-reload gotchas that come with it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG and tool/web-search wiring&lt;/strong&gt; through the gateway&lt;/li&gt;
&lt;li&gt;The full &lt;strong&gt;2 am failure catalogue&lt;/strong&gt; — KV-cache OOM, crash-loops, driver/kernel drift after patching, silent context truncation, and how to tell which one you're looking at&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm assembling all of that (copy-ready configs, the routing/fallback/observability stack, and the complete failure catalogue with symptom→fix for each) into a &lt;strong&gt;production playbook&lt;/strong&gt; for running a private, self-hosted LLM gateway.&lt;/p&gt;

&lt;p&gt;It's in progress now. If you want it when it lands, you can &lt;strong&gt;pre-order here&lt;/strong&gt;: &lt;a href="https://meyr.gumroad.com/l/vllm-playbook" rel="noopener noreferrer"&gt;https://meyr.gumroad.com/l/vllm-playbook&lt;/a&gt;. Early buyers get it at a discount off the launch price. No spam, no drip sequence; you'll get one email when it's ready.&lt;/p&gt;

&lt;p&gt;If you've hit a failure mode that isn't in the four above, drop it in the comments — I'm actively cataloguing them, and the gnarly ones will go in the playbook (credited if you want).&lt;/p&gt;

</description>
      <category>vllm</category>
      <category>litellm</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
