Hanlin Xiang

5 Pitfalls I Hit Running LiteLLM Proxy in Production (with a 1-page survival map)

I've spent the last 6 months running an 18-channel LLM gateway in production — LiteLLM Proxy backed by Redis and PostgreSQL, routing traffic across OpenAI, Anthropic, Google, DeepSeek, and several smaller providers. What started as a weekend project turned into a 24/7 operation serving multiple AI agents and internal tools.

This post covers the 5 pitfalls that hit me hardest, with real error examples and the fixes that worked. If you're running LiteLLM Proxy (or considering it), these are the things I wish someone had told me before I went to production.

Pitfall #1: Silent OOM (Memory Leak + No systemd MemoryMax)

LiteLLM has a known memory leak under high concurrency. Without a hard memory limit, the process will eat all available RAM until the kernel's OOM-killer takes it down — usually at 3 AM.

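# The symptom: requests start timing out intermittently.
# Scan the kernel log for OOM-killer entries that mention litellm
# (reading dmesg may require root on some systems).
import subprocess

result = subprocess.run(["dmesg", "-T"], capture_output=True, text=True)
for line in result.stdout.splitlines():
    lowered = line.lower()
    # Kernel messages use both "oom-kill" and "Out of memory: Killed process ..."
    if ("oom" in lowered or "out of memory" in lowered) and "litellm" in lowered:
        print(f"FOUND: {line}")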

# The fix: systemd unit with MemoryMax
# /etc/systemd/system/litellm.service
# [Service]
# ExecStart=/usr/bin/litellm --config /opt/litellm/config.yaml --port 4000
# MemoryMax=4G
# Restart=always
# RestartSec=10

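The fix is simple but easy to miss: set MemoryMax=4G (or whatever your server can spare) in your systemd unit. The process still gets killed when it crosses the limit, but the kill is confined to the service's own cgroup and Restart=always brings the proxy back after RestartSec, instead of the whole host running out of memory first.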
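To see the limit coming before systemd enforces it, a small watchdog can compare the proxy's resident memory with the configured cap. The sketch below is a minimal, Linux-only illustration: the function names, the 4 GiB limit, and the 80% warning threshold are assumptions for the example, and VmRSS for a single process only approximates what the cgroup accounts against MemoryMax.

```python
from pathlib import Path


def rss_bytes_from_status(status_text: str) -> int:
    """Extract resident set size (VmRSS) in bytes from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line format: "VmRSS:     123456 kB"
            return int(line.split()[1]) * 1024
    raise ValueError("VmRSS not found in status text")


def fraction_of_limit(rss_bytes: int, limit_bytes: int) -> float:
    """Return memory usage as a fraction of the configured limit."""
    return rss_bytes / limit_bytes


def is_near_limit(pid: int, limit_bytes: int, warn_at: float = 0.8) -> bool:
    """Return True when the process uses at least warn_at of the limit (Linux only)."""
    status = Path(f"/proc/{pid}/status").read_text()
    return fraction_of_limit(rss_bytes_from_status(status), limit_bytes) >= warn_at
```

Calling is_near_limit(pid, 4 * 1024**3) from a cron job or sidecar gives an early warning while there is still time to restart the proxy gracefully.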

Pitfall #2: Key Cache Miss (OpenAI 8min Cache vs 24h Cache Key)

This was the single most painful bug I encountered. I rotated a provider API key through the config and ran litellm --reload. The config file updated, but LiteLLM caches keys in-memory. The old key kept getting used for hours.

# What happened: silent 401s that looked like "provider outage"
# The config was correct, but the in-memory cache wasn't purged

# WRONG: this only reloads config, not the key store
# litellm --reload

# RIGHT: purge the cache explicitly
import requests
requests.post(
    "http://localhost:4000/cache/purge",
    headers={"Authorization": "Bearer YOUR_MASTER_KEY"}
)

# Or just restart the worker entirely
# If using REDIS_HOST for shared state, flush that too:
# redis-cli -h $REDIS_HOST FLUSHDB

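The key insight: --reload refreshes the config file but does NOT purge the in-memory key cache. You need to either hit /cache/purge or restart the worker. If you're using Redis for shared key state, flush that too, but keep in mind that FLUSHDB deletes every key in the selected Redis database, not just cached credentials.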
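The failure mode is easier to reason about with a toy model. The sketch below is not LiteLLM's implementation; TTLKeyCache, its loader argument, and the one-hour TTL are hypothetical names and values chosen only to show why changing the source of truth does nothing until the cached entry expires or is purged.

```python
import time


class TTLKeyCache:
    """Toy in-memory cache with a time-to-live (illustration only)."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # callable that reads the key from config
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # name -> (value, expires_at)

    def get(self, name):
        entry = self._entries.get(name)
        now = self._clock()
        if entry is not None and entry[1] > now:
            return entry[0]        # cache hit: the config is never consulted
        value = self._loader(name)
        self._entries[name] = (value, now + self._ttl)
        return value

    def purge(self):
        """Drop every cached entry so the next get() reloads from config."""
        self._entries.clear()


config = {"openai": "sk-old"}
cache = TTLKeyCache(loader=config.__getitem__, ttl_seconds=3600)

first = cache.get("openai")    # loads and caches the old key
config["openai"] = "sk-new"    # the "reload": config now holds the new key
stale = cache.get("openai")    # still the old key, served from memory
cache.purge()
fresh = cache.get("openai")    # the new key is picked up only after the purge
```

The same reasoning applies to any shared cache layer: every layer that holds a copy of the key needs its own invalidation step.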

Pitfall #3: Retry Storm (4xx Retries Cause Rate-Limit Avalanche)

LiteLLM retries failed calls, with num_retries=3 by default, so a single failing call can fan out into three extra upstream requests and the token spend that comes with them. Worse: on 4xx errors (which should NOT be retried), the retry logic can trigger rate-limit cascades.

# The problem: default config retries everything
# config.yaml (BAD)
# litellm_settings:
#   num_retries: 3  # This retries even 4xx errors!

# The fix: retry only 5xx, use fallbacks for 4xx
# config.yaml (GOOD)
litellm_settings:
  num_retries: 1
  retry_policy:
    InternalServerError:
      retries: 1
    RateLimitError:
      retries: 0  # Don't retry rate limits — use fallbacks instead

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      fallbacks: ["anthropic/claude-3.5-sonnet"]

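Set num_retries: 1 for non-critical paths. Use fallbacks (cheapest-first) instead of retries for cost control. A retry storm on a rate-limited provider can triple your spend in 5 minutes. Check the exact retry and fallback key names against the LiteLLM docs for the version you run before relying on them.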
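The retry rule itself fits in a few lines. This is a minimal sketch of the policy described here, not LiteLLM's retry logic; should_retry and worst_case_requests are hypothetical helper names, and treating every 5xx as retryable is a simplifying assumption.

```python
def should_retry(status_code: int, retries_so_far: int, max_retries: int) -> bool:
    """Retry only server-side failures (5xx), and only while retries remain."""
    if retries_so_far >= max_retries:
        return False
    return 500 <= status_code <= 599


def worst_case_requests(calls: int, num_retries: int) -> int:
    """Upper bound on upstream requests if every call fails and is fully retried."""
    return calls * (1 + num_retries)


# 1,000 failing calls with num_retries=3 can become 4,000 upstream requests;
# with num_retries=1 the ceiling drops to 2,000.
storm = worst_case_requests(1000, 3)
capped = worst_case_requests(1000, 1)
```

A 429 or 401 returns False here regardless of the retry budget, which is the point: those errors go to a fallback or back to the caller instead of hitting the same provider again.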

Pitfall #4: Cost Unobserved (Multi-Provider Routing Weights)

When you route across multiple providers, LiteLLM tries fallbacks in the order you list them, not sorted by cost. If the cheaper entries in the chain fail too, all traffic ends up on your most expensive model.

# The problem: the fallback chain can end at the most expensive model
# config.yaml (BAD)
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      fallbacks: ["openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet"]
      # If gpt-4o fails, it tries mini (cheap) then sonnet (expensive)
      # But if mini also fails, ALL traffic goes to sonnet

# The fix: only fall back to cheaper models + set a budget cap
# config.yaml (GOOD)
litellm_settings:
  max_budget: 100.0  # Daily budget cap in USD
  budget_duration: "1d"

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      allowed_fails: 3
      fallbacks: ["openai/gpt-4o-mini"]  # Only fall back to cheaper models

Monitor per-provider spend daily. The LiteLLM UI shows cost breakdowns, but only if you've configured master_key and database_url properly.
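One way to keep a fallback list honest is to generate it from a price table rather than writing it by hand. The sketch below assumes a hypothetical PRICE_PER_MTOK mapping with made-up model names and prices; substitute your providers' real rates. It keeps only candidates that cost no more than the primary model and orders them cheapest-first.

```python
# Hypothetical prices in USD per 1M input tokens (illustrative values only).
PRICE_PER_MTOK = {
    "primary-large": 5.00,
    "alt-large": 6.00,
    "small": 0.30,
    "tiny": 0.10,
}


def cheaper_fallbacks(primary, candidates, prices):
    """Return candidates priced at or below the primary, cheapest first."""
    ceiling = prices[primary]
    allowed = [model for model in candidates if prices[model] <= ceiling]
    return sorted(allowed, key=lambda model: prices[model])


fallbacks = cheaper_fallbacks(
    "primary-large", ["alt-large", "small", "tiny"], PRICE_PER_MTOK
)
# "alt-large" is dropped because it costs more than the primary model.
```

Running this as part of config generation or a CI check means a price change or a new model cannot silently put an expensive entry at the end of the chain.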

Pitfall #5: Metric Blindness (Incomplete Prometheus Metrics)

LiteLLM's built-in Prometheus metrics don't cover per-provider latency percentiles or cost attribution. You're flying blind on the most important signals for production operations.

# What LiteLLM exposes by default:
# - litellm_requests_total
# - litellm_request_duration_seconds (aggregate, not per-provider)
# - litellm_spend_total (only if database is configured)

# What's MISSING:
# - Per-provider P95/P99 latency
# - Per-provider error rate
# - Per-team cost breakdown
# - Cache hit/miss ratio

# The fix: add a custom logger callback to emit per-provider metrics
from litellm.integrations.custom_logger import CustomLogger
import prometheus_client as prom

provider_latency = prom.Histogram(
    'litellm_provider_latency_seconds',
    'Latency by provider',
    ['provider', 'model']
)

class PerProviderMetrics(CustomLogger):
    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        # The provider is recorded under litellm_params; the model name is a top-level kwarg
        provider = kwargs.get("litellm_params", {}).get("custom_llm_provider", "unknown")
        model = kwargs.get("model", "unknown")
        # start_time and end_time are datetime objects
        latency = (end_time - start_time).total_seconds()
        provider_latency.labels(provider=provider, model=model).observe(latency)

Without per-provider metrics, you can't tell if DeepSeek is slow today or if OpenAI is throttling you. Add a custom logger to fill the gap.
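For a quick offline check, for example against latencies pulled from request logs, the per-provider percentile can be computed directly. This is a minimal sketch using the nearest-rank method; the (provider, latency_seconds) event shape and the function names are assumptions for illustration. In a live setup you would normally derive percentiles from the Prometheus histogram instead.

```python
import math
from collections import defaultdict


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p95_by_provider(events):
    """Group (provider, latency_seconds) pairs and return the P95 per provider."""
    grouped = defaultdict(list)
    for provider, latency in events:
        grouped[provider].append(latency)
    return {provider: percentile(values, 95) for provider, values in grouped.items()}
```

Comparing the resulting numbers across providers for the same time window is usually enough to tell a slow upstream apart from a problem in the gateway itself.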

The Solution: A 1-Page Survival Map

After hitting all 5 of these pitfalls in production (and losing too many weekends to debugging), I compiled everything into a single-page survival map. It covers:

  • All 5 deployment pitfalls with symptoms, root causes, and fixes
  • 3 hidden cost traps (retry amplification, embedding tax, idle-connection keep-alive)
  • A failure decision tree for any error code you'll see
  • A pre-launch security checklist
  • Copy-paste diagnostic commands

The full map is available here: https://payhip.com/b/S96bB

It's $9, no email signup, no affiliate. Just the thing I wish I had when I started.


Have you hit any of these pitfalls? Or did I miss something that's bitten you? Drop a comment — I'll be responding to everyone. You can find me at @ai-gateway-veteran on Reddit and X.