<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hanlin Xiang</title>
    <description>The latest articles on DEV Community by Hanlin Xiang (@ai-gateway-veteran).</description>
    <link>https://dev.to/ai-gateway-veteran</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3988994%2F78c4bdc4-22db-4b3f-83ca-994d4b4d8f20.png</url>
      <title>DEV Community: Hanlin Xiang</title>
      <link>https://dev.to/ai-gateway-veteran</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ai-gateway-veteran"/>
    <language>en</language>
    <item>
      <title>The #3 Production Killer in Your LiteLLM Setup: Key Cache Invalidation (and How to Fix It)</title>
      <dc:creator>Hanlin Xiang</dc:creator>
      <pubDate>Fri, 19 Jun 2026 14:18:28 +0000</pubDate>
      <link>https://dev.to/ai-gateway-veteran/the-3-production-killer-in-your-litellm-setup-key-cache-invalidation-and-how-to-fix-it-5af5</link>
      <guid>https://dev.to/ai-gateway-veteran/the-3-production-killer-in-your-litellm-setup-key-cache-invalidation-and-how-to-fix-it-5af5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is the pitfall that cost me 3 hours at 2 AM. If you're running LiteLLM Proxy in production, it will hit you too — usually at the worst possible time.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Happened
&lt;/h2&gt;

&lt;p&gt;I run LiteLLM Proxy + New API in front of 18 provider channels. One night, I rotated an API key for a provider that had been flagged for unusual spending.&lt;/p&gt;

&lt;p&gt;Standard procedure:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate new key in provider dashboard&lt;/li&gt;
&lt;li&gt;Update &lt;code&gt;config.yaml&lt;/code&gt; with new key&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;litellm --config config.yaml --reload&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The reload succeeded. No errors. The config showed the new key. I went to sleep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The next morning, the old key was still being used.&lt;/strong&gt; Every single request was still authenticating with the rotated-out key. The provider's dashboard showed traffic from both keys — the new one (from config validation) and the old one (from actual API calls).&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Happens
&lt;/h2&gt;

&lt;p&gt;LiteLLM caches API keys in-memory for performance. When you run &lt;code&gt;--reload&lt;/code&gt;, the &lt;strong&gt;config&lt;/strong&gt; is reloaded, but the &lt;strong&gt;key store&lt;/strong&gt; is not purged. The worker process holds the old keys in a dictionary that persists across config reloads.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;config.yaml&lt;/code&gt; shows the new key ✅&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;litellm --model_cost_map&lt;/code&gt; shows the new key ✅&lt;/li&gt;
&lt;li&gt;The actual HTTP requests use the old key ❌&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You won't notice until the old key expires or is revoked — at which point every request to that provider starts returning &lt;code&gt;401&lt;/code&gt;, and your fallback chain kicks in, routing traffic to your most expensive model.&lt;/p&gt;
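
&lt;p&gt;To make the mechanism concrete, here is a toy model of it in Python. It is only a sketch of the behaviour described above, not LiteLLM's actual code: &lt;code&gt;ToyProxy&lt;/code&gt;, &lt;code&gt;key_cache&lt;/code&gt;, and &lt;code&gt;purge_cache&lt;/code&gt; are hypothetical names invented for illustration.&lt;/p&gt;

```python
# Toy model of the failure mode. This is NOT LiteLLM's implementation;
# every name here is hypothetical and exists only for illustration.
class ToyProxy:
    def __init__(self, config):
        self.config = dict(config)
        self.key_cache = {}  # provider name mapped to a cached key, filled lazily

    def reload(self, new_config):
        # Mirrors the behaviour described above: config is replaced, cache is untouched.
        self.config = dict(new_config)

    def purge_cache(self):
        self.key_cache.clear()

    def key_for(self, provider):
        # The first lookup copies the key from config; later lookups reuse the copy.
        if provider not in self.key_cache:
            self.key_cache[provider] = self.config[provider]
        return self.key_cache[provider]


proxy = ToyProxy({"openai": "sk-old"})
before_reload = proxy.key_for("openai")
proxy.reload({"openai": "sk-new"})
after_reload = proxy.key_for("openai")   # still the old key: the cache was never cleared
proxy.purge_cache()
after_purge = proxy.key_for("openai")    # now read fresh from the reloaded config
print(before_reload, after_reload, after_purge)  # sk-old sk-old sk-new
```

&lt;p&gt;The stale read after the reload is the whole bug: nothing fails until the old key stops working upstream.&lt;/p&gt;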

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Purge the cache manually (no downtime)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:4000/cache/purge &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$LITELLM_MASTER_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This clears the in-memory key cache. The next request will pull the key from the freshly reloaded config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 2: Use Redis for shared key state (recommended for multi-worker)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Set &lt;code&gt;REDIS_HOST&lt;/code&gt; in your environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;REDIS_HOST=redis&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;REDIS_PORT=6379&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;REDIS_CONNECTION_POOL_SIZE=5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Redis, keys are stored externally. A config reload triggers a Redis key update, and all workers pick it up immediately. No stale keys.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 3: Restart the worker (downtime: 2-5 seconds)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker restart litellm-proxy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Brute force, but guaranteed to work. Use this if you're in a hurry and can afford a brief blip.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Detect It Before Users Do
&lt;/h2&gt;

&lt;p&gt;Add this to your monitoring — a simple script that checks whether the key in config matches the key actually being used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Send a 1-token probe through the proxy; a 401 once the old key is revoked means a stale key is still in use&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:4000/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &lt;/span&gt;&lt;span class="nv"&gt;$LITELLM_API_KEY&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "openai/gpt-4o", "messages": [{"role": "user", "content": "test"}], "max_tokens": 1}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="s1"&gt;'.usage'&lt;/span&gt;

&lt;span class="c"&gt;# Compare with the key in config&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"api_key:"&lt;/span&gt; config.yaml | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the provider's response includes an &lt;code&gt;x-api-key-id&lt;/code&gt; header, you can verify which key was used without guessing. Check your provider's documentation first, since not every provider exposes one.&lt;/p&gt;
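
&lt;p&gt;If you can instrument the worker (for example from a custom logger), comparing short fingerprints avoids ever printing a raw key. A minimal Python sketch: &lt;code&gt;key_fingerprint&lt;/code&gt; and &lt;code&gt;is_stale&lt;/code&gt; are hypothetical helpers, and how you obtain the key a worker is actually sending depends on your setup.&lt;/p&gt;

```python
import hashlib


def key_fingerprint(api_key):
    # Short, non-reversible identifier that is safe to log; never log the raw key.
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()[:12]


def is_stale(config_key, key_in_use):
    # True when the key a worker is sending differs from the key in config.
    return key_fingerprint(config_key) != key_fingerprint(key_in_use)


# Placeholder values for illustration only.
print(is_stale("sk-new-example", "sk-old-example"))  # True
print(is_stale("sk-new-example", "sk-new-example"))  # False
```

&lt;p&gt;Alerting on a fingerprint mismatch right after a rotation catches the problem while the old key still works.&lt;/p&gt;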

&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;Key cache invalidation is &lt;strong&gt;Pitfall #3&lt;/strong&gt; in my production survival map. There are 4 more deployment pitfalls and 3 hidden cost traps that I documented after 6 months of running this stack:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;503 on every request after adding a provider&lt;/strong&gt; — model name mismatch&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Costs 3× higher than expected&lt;/strong&gt; — fallback chain hits expensive models by default&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keys rotated but old ones still work&lt;/strong&gt; ← &lt;em&gt;this one&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming responses cut off mid-token&lt;/strong&gt; — Nginx/Cloudflare buffering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New API channels show "insufficient quota" with balance &amp;gt; 0&lt;/strong&gt; — weight = 0 by default&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these took me 1-2 hours to diagnose in production. The full one-page reference card with all 5 pitfalls, 3 cost traps, a failure decision tree, and a pre-launch security checklist is available here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;👉 &lt;a href="https://payhip.com/b/S96bB" rel="noopener noreferrer"&gt;AI API Gateway Pitfall Map — $9&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's the page you print and pin next to your monitor — because when your gateway goes down at 2 AM, you won't be reading a 40-page guide.&lt;/p&gt;

</description>
      <category>api</category>
      <category>backend</category>
      <category>devops</category>
      <category>llm</category>
    </item>
    <item>
      <title>5 Pitfalls I Hit Running LiteLLM Proxy in Production (with a 1-page survival map)</title>
      <dc:creator>Hanlin Xiang</dc:creator>
      <pubDate>Wed, 17 Jun 2026 12:09:44 +0000</pubDate>
      <link>https://dev.to/ai-gateway-veteran/5-pitfalls-i-hit-running-litellm-proxy-in-production-with-a-1-page-survival-map-4k1h</link>
      <guid>https://dev.to/ai-gateway-veteran/5-pitfalls-i-hit-running-litellm-proxy-in-production-with-a-1-page-survival-map-4k1h</guid>
      <description>&lt;p&gt;I've spent the last 6 months running an 18-channel LLM gateway in production — LiteLLM Proxy backed by Redis and PostgreSQL, routing traffic across OpenAI, Anthropic, Google, DeepSeek, and several smaller providers. What started as a weekend project turned into a 24/7 operation serving multiple AI agents and internal tools.&lt;/p&gt;

&lt;p&gt;This post covers the 5 pitfalls that hit me hardest, with real error examples and the fixes that worked. If you're running LiteLLM Proxy (or considering it), these are the things I wish someone had told me before I went to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfall #1: Silent OOM (Memory Leak + No systemd MemoryMax)
&lt;/h2&gt;

&lt;p&gt;LiteLLM has a known memory leak under high concurrency. Without a hard memory limit, the process will eat all available RAM until the kernel's OOM-killer takes it down — usually at 3 AM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The symptom: requests start timing out intermittently
# Check dmesg for the kill signal
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dmesg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-T&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;capture_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;oom&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;litellm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FOUND: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;line&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The fix: systemd unit with MemoryMax
# /etc/systemd/system/litellm.service
# [Service]
# ExecStart=/usr/bin/litellm --config /opt/litellm/config.yaml --port 4000
# MemoryMax=4G
# Restart=always
# RestartSec=10
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



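&lt;p&gt;The fix is simple but easy to miss: set &lt;code&gt;MemoryMax=4G&lt;/code&gt; (or whatever your server can spare) in your systemd unit. When the proxy exceeds the limit, the kernel kills only that unit's processes, and &lt;code&gt;Restart=always&lt;/code&gt; brings it back after &lt;code&gt;RestartSec&lt;/code&gt;, instead of the leak starving the whole host until the system-wide OOM-killer steps in.&lt;/p&gt;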

&lt;h2&gt;
  
  
  Pitfall #2: Stale Key Cache (Reload Does Not Purge In-Memory Keys)
&lt;/h2&gt;

&lt;p&gt;This was the single most painful bug I encountered. I rotated a provider API key through the config and ran &lt;code&gt;litellm --reload&lt;/code&gt;. The config file updated, but LiteLLM caches keys in-memory. The old key kept getting used for hours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What happened: silent 401s that looked like "provider outage"
# The config was correct, but the in-memory cache wasn't purged
&lt;/span&gt;
&lt;span class="c1"&gt;# WRONG: this only reloads config, not the key store
# litellm --reload
&lt;/span&gt;
&lt;span class="c1"&gt;# RIGHT: purge the cache explicitly
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:4000/cache/purge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer YOUR_MASTER_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Or just restart the worker entirely
# If using REDIS_HOST for shared state, flush that too:
# redis-cli -h $REDIS_HOST FLUSHDB
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key insight: &lt;code&gt;--reload&lt;/code&gt; refreshes the config file but does NOT purge the in-memory key cache. You need to either hit &lt;code&gt;/cache/purge&lt;/code&gt; or restart the worker. If you're using Redis for shared key state, flush that too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pitfall #3: Retry Storm (4xx Retries Cause Rate-Limit Avalanche)
&lt;/h2&gt;

&lt;p&gt;LiteLLM retries failed calls &lt;code&gt;num_retries=3&lt;/code&gt; times by default, so a single failing call can turn into three extra upstream requests. Worse: on 4xx errors (which should NOT be retried), the retry logic can trigger rate-limit cascades.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The problem: default config retries everything
# config.yaml (BAD)
# litellm_settings:
#   num_retries: 3  # This retries even 4xx errors!
&lt;/span&gt;
&lt;span class="c1"&gt;# The fix: retry only 5xx, use fallbacks for 4xx
# config.yaml (GOOD)
&lt;/span&gt;&lt;span class="n"&gt;litellm_settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;num_retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="n"&gt;retry_policy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;InternalServerError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;RateLimitError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;  &lt;span class="c1"&gt;# Don't retry rate limits — use fallbacks instead
&lt;/span&gt;
&lt;span class="n"&gt;model_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gpt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;
    &lt;span class="n"&gt;litellm_params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;gpt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;
      &lt;span class="n"&gt;fallbacks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-3.5-sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set &lt;code&gt;num_retries: 1&lt;/code&gt; for non-critical paths. Use fallbacks (cheapest-first) instead of retries for cost control. A retry storm on a rate-limited provider can 3x your spend in 5 minutes.&lt;/p&gt;
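
&lt;p&gt;The difference between retrying everything and retrying only server errors can be sketched in a few lines of Python. This is an illustration, not LiteLLM's retry implementation: &lt;code&gt;call_with_retries&lt;/code&gt; and &lt;code&gt;always&lt;/code&gt; are hypothetical helpers, and bare status codes stand in for real responses.&lt;/p&gt;

```python
# Sketch of a retry loop that retries 5xx responses only.
# Hypothetical helper names; this is not a LiteLLM API.
def call_with_retries(send, num_retries):
    attempts = 0
    status = None
    for _ in range(num_retries + 1):
        attempts += 1
        status = send()
        if status // 100 != 5:
            break  # success or a 4xx: retrying will not change the outcome
    return status, attempts


def always(status):
    # Fake upstream that returns the same status code on every call.
    return lambda: status


print(call_with_retries(always(401), num_retries=3))  # (401, 1)
print(call_with_retries(always(503), num_retries=3))  # (503, 4)
print(call_with_retries(always(200), num_retries=3))  # (200, 1)
```

&lt;p&gt;A 4xx costs one upstream call here; only the 5xx case uses the full retry budget.&lt;/p&gt;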

&lt;h2&gt;
  
  
  Pitfall #4: Cost Unobserved (Multi-Provider Routing Weights)
&lt;/h2&gt;

&lt;p&gt;When you route across multiple providers, LiteLLM's default &lt;code&gt;fallbacks&lt;/code&gt; are sequential — not cost-sorted. One upstream failure can route all traffic to your most expensive model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The problem: the fallback chain can end on your most expensive model
# config.yaml (BAD)
&lt;/span&gt;&lt;span class="n"&gt;model_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gpt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;
    &lt;span class="n"&gt;litellm_params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;gpt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;
      &lt;span class="n"&gt;fallbacks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-3.5-sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="c1"&gt;# If gpt-4o fails, it tries mini (cheap) then sonnet (expensive)
&lt;/span&gt;      &lt;span class="c1"&gt;# But if mini also fails, ALL traffic goes to sonnet
&lt;/span&gt;
&lt;span class="c1"&gt;# The fix: fall back only to cheaper models + set a daily max_budget
# config.yaml (GOOD)
&lt;/span&gt;&lt;span class="n"&gt;litellm_settings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;max_budget&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;100.0&lt;/span&gt;  &lt;span class="c1"&gt;# Daily budget cap in USD
&lt;/span&gt;  &lt;span class="n"&gt;budget_duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;model_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gpt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;
    &lt;span class="n"&gt;litellm_params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;gpt&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;
      &lt;span class="n"&gt;allowed_fails&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
      &lt;span class="n"&gt;fallbacks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Only fall back to cheaper models
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Monitor per-provider spend daily. The LiteLLM UI shows cost breakdowns, but only if you've configured &lt;code&gt;master_key&lt;/code&gt; and &lt;code&gt;database_url&lt;/code&gt; properly.&lt;/p&gt;
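
&lt;p&gt;One way to keep a fallback chain from drifting toward expensive models is to generate it from a price table instead of writing it by hand. A minimal Python sketch, assuming made-up prices (the &lt;code&gt;PRICE_PER_MTOK&lt;/code&gt; values are placeholders, not real rates) and a hypothetical helper &lt;code&gt;cheaper_fallbacks&lt;/code&gt;:&lt;/p&gt;

```python
# Hypothetical prices per million tokens, for illustration only.
# Substitute the real rates from your providers.
PRICE_PER_MTOK = {
    "openai/gpt-4o": 10.0,
    "openai/gpt-4o-mini": 0.6,
    "anthropic/claude-3.5-sonnet": 15.0,
}


def cheaper_fallbacks(primary, candidates, prices):
    # Keep only candidates priced at or below the primary, cheapest first.
    ceiling = prices[primary]
    allowed = [model for model in candidates if ceiling >= prices[model]]
    return sorted(allowed, key=prices.get)


chain = cheaper_fallbacks(
    "openai/gpt-4o",
    ["anthropic/claude-3.5-sonnet", "openai/gpt-4o-mini"],
    PRICE_PER_MTOK,
)
print(chain)  # ['openai/gpt-4o-mini']
```

&lt;p&gt;With these placeholder prices the more expensive model is dropped from the chain, so an upstream failure cannot silently move traffic onto it.&lt;/p&gt;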

&lt;h2&gt;
  
  
  Pitfall #5: Metric Blindness (Incomplete Prometheus Metrics)
&lt;/h2&gt;

&lt;p&gt;LiteLLM's built-in Prometheus metrics don't cover per-provider latency percentiles or cost attribution. You're flying blind on the most important signals for production operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What LiteLLM exposes by default:
# - litellm_requests_total
# - litellm_request_duration_seconds (aggregate, not per-provider)
# - litellm_spend_total (only if database is configured)
&lt;/span&gt;
&lt;span class="c1"&gt;# What's MISSING:
# - Per-provider P95/P99 latency
# - Per-provider error rate
# - Per-team cost breakdown
# - Cache hit/miss ratio
&lt;/span&gt;
&lt;span class="c1"&gt;# The fix: add a custom logger callback to emit per-provider metrics
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;litellm.integrations.custom_logger&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CustomLogger&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;prometheus_client&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;prom&lt;/span&gt;

&lt;span class="n"&gt;provider_latency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;litellm_provider_latency_seconds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Latency by provider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PerProviderMetrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CustomLogger&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;log_success_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_obj&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;litellm_params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;custom_llm_provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;latency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;total_seconds&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;provider_latency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;observe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without per-provider metrics, you can't tell if DeepSeek is slow today or if OpenAI is throttling you. Add a custom logger to fill the gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: A 1-Page Survival Map
&lt;/h2&gt;

&lt;p&gt;After hitting all 5 of these pitfalls in production (and losing too many weekends to debugging), I compiled everything into a single-page survival map. It covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All 5 deployment pitfalls with symptoms, root causes, and fixes&lt;/li&gt;
&lt;li&gt;3 hidden cost traps (retry amplification, embedding tax, idle-connection keep-alive)&lt;/li&gt;
&lt;li&gt;A failure decision tree for any error code you'll see&lt;/li&gt;
&lt;li&gt;A pre-launch security checklist&lt;/li&gt;
&lt;li&gt;Copy-paste diagnostic commands&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full map is available here: &lt;a href="https://payhip.com/b/S96bB" rel="noopener noreferrer"&gt;https://payhip.com/b/S96bB&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's $9, no email signup, no affiliate. Just the thing I wish I had when I started.&lt;/p&gt;




&lt;p&gt;Have you hit any of these pitfalls? Or did I miss something that's bitten you? Drop a comment — I'll be responding to everyone. You can find me at &lt;a class="mentioned-user" href="https://dev.to/ai-gateway-veteran"&gt;@ai-gateway-veteran&lt;/a&gt; on Reddit and X.&lt;/p&gt;

</description>
      <category>python</category>
      <category>litellm</category>
      <category>devops</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
