DEV Community: Hanlin Xiang

OpenClaw: My AI Home Base That Runs Everywhere

Hanlin Xiang — Thu, 25 Jun 2026 02:13:38 +0000

Six weeks ago, my AI workflow looked like this: ChatGPT for coding questions, Claude for writing, a Telegram bot for quick mobile queries, a Discord bot I half-built for my server, and a WhatsApp number I'd set up with some random API wrapper that broke every Tuesday. Five apps. Five different conversation histories. Five places where I'd explain who I am, what I'm working on, and what "that project" refers to.

I'd ask Claude to refactor a function, then switch to ChatGPT on my phone to ask a follow-up — and it had zero context. I'd paste the same code snippet into three different windows. My AI assistants knew nothing about each other. They were five strangers sharing one brain.

The breaking point came on a Saturday morning. I was at a coffee shop, phone only. I needed to check if a deployment went through, ask about a weird log line, and draft a Slack message to my team. That's three AI conversations across three apps, on a 6-inch screen, while my coffee got cold.

I thought: There has to be a single thing I can message from anywhere, and it just knows me.

That thing turned out to be OpenClaw.

What OpenClaw Actually Is

OpenClaw is a self-hosted AI agent gateway. It's not another chat app. It's not another AI model. It's the plumbing that connects every messaging platform you already use to a single AI agent that runs on your own hardware.

Here's the mental model: imagine a central hub sitting on your machine (or a VPS, or a Raspberry Pi — it doesn't care). That hub maintains live connections to WhatsApp, Telegram, Discord, Slack, Signal, iMessage, and a dozen other platforms simultaneously. When you send a message from any of those apps, it routes to one agent. One brain. One memory.

You (WhatsApp)  ──┐
You (Telegram)  ──┤
You (Discord)   ──┼──→  OpenClaw Gateway  ──→  AI Agent (your config, your tools, your memory)
You (Slack)     ──┤
You (iMessage)  ──┘

The Gateway is the control plane. It owns all the messaging sessions, routes inbound messages to the right agent, manages tool execution, handles memory, and streams responses back to whichever app you're using. You talk from WhatsApp, it answers on WhatsApp. You switch to Telegram mid-thought, it picks up the thread.

It's built by Peter Steinberger and the community, it's MIT-licensed, and it's one of the fastest-growing open-source AI agent projects on GitHub — which tells you two things: people want this, and it works.

The 5-Minute Install

I'm allergic to complex setups. If a tool needs a 40-minute tutorial to get running, I'm out. OpenClaw surprised me.

What you need:

Node.js 24 (recommended) or Node 22.19+
An API key from any model provider (OpenAI, Anthropic, Google, etc.)
5 minutes

Install and run:

npm install -g openclaw@latest
openclaw onboard --install-daemon

That's it. The onboarding wizard walks you through picking a model provider, pasting your API key, and configuring the Gateway. It installs as a daemon (launchd on macOS, systemd on Linux) so it stays running across reboots.

Verify it's up:

openclaw gateway status

Open the built-in dashboard:

openclaw dashboard

You're looking at a browser-based control UI. Type a message. Get a response. You now have a working AI agent.

The whole thing took me about 4 minutes, including the time I spent squinting at my API key to make sure I copied it right.

Connecting Your First Channel

The dashboard is fine for testing, but the real magic is talking to your agent from apps you already have open. I started with Telegram because it's the fastest to set up — you just need a bot token from @BotFather.

# In your OpenClaw config (openclaw onboard handles this, but here's the raw version)
channels:
  telegram:
    token: "your-bot-token-here"

Restart the gateway, send a Telegram message to your bot. It replies. Instantly. With the same context, same memory, same personality as the dashboard.

Then I connected Discord. Then WhatsApp. Each one took under 5 minutes. The docs have step-by-step guides for every channel:

Platform	Setup Time	What You Need
Telegram	3-5 min	Bot token from @BotFather
Discord	5 min	Bot token from Discord Developer Portal
Slack	5 min	Slack app + bot token
WhatsApp	5 min	Phone number (via Baileys)
Signal	5 min	Signal CLI linked device
iMessage	10 min	macOS with BlueBubbles or similar
WebChat	0 min	Built into the dashboard

The full list includes Matrix, Google Chat, Microsoft Teams, IRC, LINE, Mattermost, WeChat, QQ, Feishu, Nostr, Twitch, and more. I haven't tried them all, but the ones I have just work.

One Agent, Every Platform

Here's what changed after a week of using OpenClaw: I stopped thinking about which app to open. I'd be on my laptop, hit a question, and just open Telegram (which was already in my dock). Or I'd be on my phone and fire off a WhatsApp message. The agent didn't care. It answered wherever I asked.

But the real payoff was context persistence. On Monday, I asked the agent about a database migration via the web dashboard. On Wednesday, I referenced "that migration" from Telegram. It knew exactly what I meant. Same agent. Same memory. Different app.

This sounds small. It's not. It's the difference between having a personal assistant and having five strangers who each know one slice of your life.

The way OpenClaw handles this is elegant: the Gateway maintains per-session, per-sender isolation. Your WhatsApp conversation is separate from your Discord conversation, but they share the same agent's long-term memory. The agent remembers facts, preferences, and context across all channels — but each conversation thread stays clean.
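One way to picture that split, as a minimal sketch and not OpenClaw's actual implementation: each (channel, sender) pair gets its own conversation thread, while a single long-term store is visible from all of them. Every name here (`AgentState`, `session_for`, `remember`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Hypothetical model: isolated sessions, one shared long-term memory."""
    long_term: list[str] = field(default_factory=list)
    sessions: dict[tuple[str, str], list[str]] = field(default_factory=dict)

    def session_for(self, channel: str, sender: str) -> list[str]:
        # Each (channel, sender) pair gets its own conversation thread
        return self.sessions.setdefault((channel, sender), [])

    def remember(self, fact: str) -> None:
        # Long-term facts are shared by every session; skip exact repeats
        if fact not in self.long_term:
            self.long_term.append(fact)


state = AgentState()
state.session_for("whatsapp", "me").append("How did the migration go?")
state.session_for("discord", "me").append("Review this PR")
state.remember("Prefers TypeScript over JavaScript")
```

The WhatsApp and Discord threads never see each other's messages, but both can draw on the same remembered preference.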

The Memory System

This is where OpenClaw went from "useful tool" to "I can't go back."

The agent has two layers of memory:

Session memory — the conversation you're having right now. It knows what you said 10 messages ago in this thread.

Long-term memory — facts, preferences, and context that persist across sessions and channels. Tell it once that you prefer TypeScript over JavaScript, and it remembers. Mention your team's deployment schedule, and it brings it up when relevant.

The memory system stores everything locally on your machine. Your data doesn't leave your machine unless you explicitly configure it to.

In practice, this means:

Week 1: I told the agent about my project structure, tech stack, and coding preferences.
Week 2: I asked it to review a PR. It already knew our conventions. No re-explaining.
Week 3: I asked for help debugging a race condition. It remembered the async patterns we'd discussed before and suggested a fix that matched our style.

It felt like onboarding a new team member who actually paid attention.

You can also manually manage memories:

# List stored memories
openclaw memory list

# Search memories
openclaw memory search "deployment schedule"

# Add a specific memory
openclaw memory add "Production deploys happen Tuesdays at 2pm UTC"

The agent also does memory maintenance during heartbeats — periodic check-ins where it reviews recent conversations, distills important facts, and updates its long-term memory. It's like watching someone review their journal and update their mental model. Daily files are raw notes; long-term memory is curated wisdom.

What surprised me most: the agent started noticing patterns I hadn't. After a few weeks, it pointed out that I always asked about deployment status on Monday mornings and suggested it could proactively check and report. I said yes. Now every Monday at 9 AM, I get a Telegram message with the staging and production status. I didn't build that workflow. The agent suggested it because it remembered.

Cron Jobs and Automation

One feature I didn't expect to use as much as I do: scheduled tasks. OpenClaw has a built-in cron system that lets the agent run tasks on a schedule.

# Check for new GitHub issues every morning at 8am
openclaw cron add --schedule "0 8 * * *" --message "Check for new GitHub issues in openclaw/openclaw and summarize any critical ones"

# Remind me about standup every weekday at 9:45am
openclaw cron add --schedule "45 9 * * 1-5" --message "Standup in 15 minutes. Here's what you worked on yesterday and what's planned for today."

The cron system integrates with the agent's full tool suite. A cron job can search the web, check APIs, run shell commands, and then deliver the result to whatever channel you want. It's like having a cron job that can think about what it finds.
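If the five-field schedule strings are unfamiliar, the sketch below shows how an expression such as "45 9 * * 1-5" can be checked against a timestamp. It is a simplified illustration, not OpenClaw's scheduler: it handles only `*`, single numbers, ranges, and comma lists, and it requires every field to match, whereas real cron implementations also support step values and treat day-of-month and day-of-week specially. The function names are my own.

```python
from datetime import datetime


def field_matches(spec: str, value: int) -> bool:
    """Match one cron field: '*', a number, a range 'a-b', or a comma list."""
    if spec == "*":
        return True
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            if int(low) <= value <= int(high):
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expr: str, when: datetime) -> bool:
    """Check a five-field expression: minute hour day-of-month month weekday."""
    minute, hour, day, month, weekday = expr.split()
    cron_weekday = when.isoweekday() % 7  # cron counts 0 = Sunday ... 6 = Saturday
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(day, when.day)
        and field_matches(month, when.month)
        and field_matches(weekday, cron_weekday)
    )
```

With this, `cron_matches("45 9 * * 1-5", datetime(2026, 6, 25, 9, 45))` is true because that date is a Thursday, while the same time on a Saturday does not match.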

I have about a dozen cron jobs running. Some are simple reminders. Others are complex — one checks three different monitoring dashboards, cross-references the results, and only messages me if something looks off. Building that as a traditional script would have taken an afternoon. I described it to the agent in two sentences.

Voice: The Underrated Feature

If you're on macOS or have an iOS/Android node paired, OpenClaw supports voice input and text-to-speech output. Speak your question, get a spoken answer. It integrates with TTS providers for high-quality voice output, and falls back to system TTS if you don't have a premium voice key.

I use this while cooking. Hands covered in flour, need to know if I can substitute baking powder for baking soda. "Hey, can I swap baking powder for baking soda in this recipe?" — spoken answer, no screen required.

It sounds like a small thing. It's not. Voice changes the relationship with the agent from "tool I use" to "assistant I talk to."

Tools: Giving the Agent Hands

An AI that just talks is a chatbot. An AI that does things is an agent. OpenClaw is firmly in the second category.

The agent ships with a built-in tool suite:

Shell execution — run commands, manage files, deploy code
Browser automation — browse websites, fill forms, take screenshots
Web search — search the web for current information
File operations — read, write, edit files on your system
Cron jobs — schedule recurring tasks
Canvas — a live, agent-driven visual workspace
Sub-agent orchestration — spawn specialized agents for complex tasks

Here's a real example from my workflow. I asked the agent (via Telegram, while walking to get lunch):

"Check if the staging deploy went through, and if there are any errors in the last hour of logs."

It ran the commands, parsed the output, and replied:

"Staging deploy completed at 14:32 UTC. Two warnings in the last hour — both are the known Redis connection timeout on startup. No new errors. Looks clean."

I didn't open a terminal. I didn't SSH into anything. I sent a message from my phone and got an answer.

Skills: Pluggable Capabilities

OpenClaw has a skill system through ClawHub — a registry of community-built capabilities. Skills are like plugins that teach the agent new tricks: GitHub integration, email management, calendar access, data analysis, and hundreds more.

Installing a skill:

openclaw skills install github

The agent now knows how to interact with GitHub — create issues, review PRs, check CI status — all through natural language.

Multi-Agent Routing

This one's for the power users. OpenClaw supports multi-agent routing — different channels or senders can be routed to different agents, each with their own workspace, personality, and tool access.

Why would you want this? A few scenarios:

Work vs. personal: Route your Slack messages to a "work agent" that knows your team's context. Route WhatsApp to a "personal agent" that manages your side projects.
Different expertise: Have a "coding agent" with full shell access for your DMs, and a "content agent" with writing tools for your Telegram channel.
Team sharing: If you expose the gateway (with proper security), different team members can hit different agents with isolated sessions.

Configuration is straightforward:

agents:
  routing:
    - match:
        channel: slack
        workspace: work
      agent: work-agent
    - match:
        channel: whatsapp
      agent: personal-agent

Each agent gets its own workspace directory, memory namespace, and tool permissions. They're completely isolated.
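To make the matching idea concrete, here is a small sketch of how rules shaped like the config above could be resolved. The first-match-wins order and the `"main"` default agent are assumptions on my part, and `Rule` and `route` are hypothetical names, not OpenClaw APIs.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    match: dict[str, str]  # every listed field must equal the message's field
    agent: str


def route(rules: list[Rule], message: dict[str, str], default: str = "main") -> str:
    """Return the agent of the first rule whose match fields all agree."""
    for rule in rules:
        if all(message.get(key) == value for key, value in rule.match.items()):
            return rule.agent
    return default  # assumed fallback when no rule applies


rules = [
    Rule({"channel": "slack", "workspace": "work"}, "work-agent"),
    Rule({"channel": "whatsapp"}, "personal-agent"),
]
```

A Slack message from the `work` workspace resolves to `work-agent`, any WhatsApp message resolves to `personal-agent`, and everything else lands on the default.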

How It Compares

I've tried most of the alternatives. Here's my honest take:

vs. ChatGPT/Claude apps: These are great for one-off conversations. But they're siloed. No cross-platform continuity. No tool execution on your machine. No self-hosting. You're renting an assistant; with OpenClaw, you own it.

vs. Coze / Dify: These are workflow builders — visual drag-and-drop platforms for building AI pipelines. They're powerful but heavy. They're designed for building products, not for having a personal assistant. OpenClaw is opinionated: it's a single-user agent, not an app builder.

vs. n8n / Zapier: Automation platforms. Great for connecting APIs, but they're not conversational. You build workflows, not relationships. OpenClaw is an agent you talk to that happens to be able to run workflows.

vs. Building your own: You could wire up a Telegram bot, a Discord bot, and a WhatsApp integration yourself. I've done it. It takes weeks and you end up maintaining three codebases. OpenClaw is the "just works" version of that project.

vs. Open WebUI: Open WebUI is a polished frontend for talking to LLMs. It's excellent at what it does. But it's a UI layer — it doesn't execute tasks, manage memory across platforms, or connect to messaging apps. OpenClaw has an Open WebUI integration if you want both: OpenClaw as the agent brain, Open WebUI as the chat interface.

The key differentiator for me: OpenClaw is local-first. Your data stays on your machine. Your agent runs on your hardware. You're not sending your conversation history through someone else's servers (except to the AI model provider, which is unavoidable). For a personal assistant, that matters.

The Canvas: A Surprise Hit

Canvas is an experimental feature, but it's already useful. I didn't expect to use it. A visual workspace the agent can render to? It sounded gimmicky.

Then I asked it to diagram a system architecture. Instead of generating a wall of text, it rendered an interactive diagram in the Canvas. I could see it, manipulate it, and ask the agent to modify specific parts. It was like pair-designing with someone who could draw.

The Canvas is served from the Gateway's HTTP server, so it works in any browser. The agent can push HTML, CSS, and JavaScript to it, and you can interact with the rendered output. It's particularly good for:

System diagrams
Data visualizations
Interactive prototypes
Dashboards the agent builds on the fly

I've since used it for quick data visualizations where I'd normally reach for a Jupyter notebook. Faster, more conversational, and the output is immediately visible.

The Canvas also supports A2UI (Agent-to-UI) — a protocol where the agent can push structured UI components, not just raw HTML. Think of it as the agent building a mini-app in real time, tailored to exactly what you asked for. I used it to build a live monitoring dashboard for a project in about 30 seconds of conversation. No code from me. Just "show me the health of these services" and the agent rendered it.

Security: The Uncomfortable Conversation

Let's be real: giving an AI agent shell access to your machine is a big deal. OpenClaw takes this seriously, but you need to understand the trust model.

Default behavior: The agent runs with full access to your host for the main session. This is intentional — it's a personal assistant, and you're the only user.

Sandboxing: For non-main sessions (group chats, shared channels), you can enable Docker-based sandboxing:

agents:
  defaults:
    sandbox:
      mode: "non-main"

The sandbox restricts tool access — by default it allows file operations and session management but blocks browser, canvas, cron, and gateway commands.

Pairing: By default, unknown senders get a pairing code. The agent won't process their messages until you approve them. This prevents random people from messaging your WhatsApp bot and getting your agent to do things.
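The pairing flow can be thought of as an allowlist with a one-time code. The sketch below is a hypothetical model of that idea (the class and method names are mine, and the real code format and storage are not described in this post):

```python
import secrets
from typing import Optional


class PairingGate:
    """Hypothetical allowlist gate: unknown senders must be approved first."""

    def __init__(self) -> None:
        self.approved: set[str] = set()
        self.pending: dict[str, str] = {}  # pairing code -> sender

    def check(self, sender: str) -> Optional[str]:
        """Return None if the sender may reach the agent, else a pairing code."""
        if sender in self.approved:
            return None
        for code, waiting in self.pending.items():
            if waiting == sender:
                return code  # reuse the code already issued to this sender
        code = secrets.token_hex(4)
        self.pending[code] = sender
        return code

    def approve(self, code: str) -> bool:
        """Owner approves a code; that sender is allowed from then on."""
        sender = self.pending.pop(code, None)
        if sender is None:
            return False
        self.approved.add(sender)
        return True
```

Until `approve` is called with the issued code, `check` keeps returning that code and the message never reaches the agent.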

Remote access: If you expose the gateway beyond localhost, use Tailscale or an SSH tunnel. The docs have a detailed exposure runbook. Don't skip it.

My setup: the gateway runs on a VPS with Tailscale. I access it from my laptop and phone over the Tailnet. No public ports. No Docker (I'm the only user). It's been rock-solid for six weeks.

What I'd Change

No tool is perfect. Here's what I'd improve:

Initial channel setup could be smoother. The onboarding wizard is great for the gateway, but connecting channels still requires digging into platform-specific developer portals (Discord, Slack, etc.). A guided flow for each channel would help.
Mobile companion apps are maturing. iOS and Android "nodes" exist and work, but they're newer than the desktop experience. Voice wake on macOS is polished; mobile voice is catching up.
Documentation is extensive but sometimes scattered. The docs site is comprehensive, but I occasionally found myself jumping between three pages to answer one question. A "cookbook" section with common recipes would be useful.

None of these are dealbreakers. The core product is solid, and the community moves fast. I've seen features land in weeks that would take months in corporate projects.

Getting Started Today

If you've read this far, you probably relate to the "five apps, five strangers" problem. Here's my recommendation:

Install OpenClaw — npm install -g openclaw@latest && openclaw onboard --install-daemon
Connect one channel — Telegram is the fastest (3-5 minutes, one bot token)
Use it for a week — ask it things, let it learn your preferences, give it small tasks
Add more channels — WhatsApp, Discord, whatever you actually use
Explore skills — browse ClawHub and install what fits your workflow

The agent gets better as it learns about you. The first day is just a chatbot. By the end of the first week, it's an assistant. By the end of the first month, it's the one you reach for first.

The Bigger Picture

We're in a weird moment in AI. The models are incredible, but the experience of using them is fragmented. Every platform wants you in their app, their ecosystem, their walled garden. OpenClaw is a bet on a different model: you own the agent, you own the data, you choose the channels.

It's not trying to replace ChatGPT or Claude. It's the layer that makes them work together, on your terms, from wherever you happen to be.

Six weeks in, I've deleted three AI apps from my phone. I have one number, one bot, one agent — and it knows me across every screen I own.

That Saturday morning at the coffee shop? I opened WhatsApp, sent three messages, and had my answers before the barista called my name. One app. One brain. Coffee still hot.

And the best part? I didn't explain anything. The agent already knew.

OpenClaw is open source (MIT) and available at github.com/openclaw/openclaw. Docs at docs.openclaw.ai. Community on Discord.

Have you tried self-hosting an AI agent? I'd love to hear about your setup in the comments.

The $2,300 Weekend: When Fallback Routing Goes Wrong in AI Gateways

Hanlin Xiang — Tue, 23 Jun 2026 02:42:14 +0000

A single misconfigured fallback line turned a $40/month API bill into $2,300 in 48 hours. Here's what happened, why it's the most common LiteLLM mistake, and how to fix it before it happens to you.

What Happened

Last month, I set up LiteLLM Proxy to route traffic across multiple providers. My primary model was DeepSeek-V3 at $0.14/M tokens — cheap, fast, good enough for 90% of my traffic. As a fallback, I configured GPT-4o "just in case DeepSeek goes down."

Sounds reasonable, right? That's what I thought.

Friday night, DeepSeek started rate-limiting (429s). My fallback chain kicked in. Every single request that got a 429 was rerouted to GPT-4o at $2.50/M input + $10/M output: roughly 18x more expensive on input tokens alone, and over 70x on output.

By Saturday morning, 5% of my traffic had fallen back to GPT-4o. By Sunday night, I had a $2,300 OpenAI bill.

The worst part? The gateway was working perfectly. No errors, no alerts, no downtime. The fallback did exactly what I configured it to do. The problem was the configuration itself.
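The multipliers are easy to reproduce from the prices quoted above. This sketch also shows how a small fallback share moves the blended price; it assumes the single $0.14/M DeepSeek figure applies to both input and output tokens, which is a simplification.

```python
# Prices in USD per million tokens, as quoted in this post
DEEPSEEK_PER_M = 0.14
GPT4O_INPUT_PER_M = 2.50
GPT4O_OUTPUT_PER_M = 10.00

input_multiplier = GPT4O_INPUT_PER_M / DEEPSEEK_PER_M    # about 17.9x
output_multiplier = GPT4O_OUTPUT_PER_M / DEEPSEEK_PER_M  # about 71.4x


def blended_cost_per_m(fallback_share: float, primary: float, fallback: float) -> float:
    """Average price per million tokens when a share of traffic falls back."""
    return (1 - fallback_share) * primary + fallback_share * fallback


# With 5% of output tokens served by GPT-4o, the blended price is about
# $0.63/M, roughly 4.5 times the DeepSeek-only price
blended = blended_cost_per_m(0.05, DEEPSEEK_PER_M, GPT4O_OUTPUT_PER_M)
```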

Why It Happens

The anti-pattern is capability-based fallback — routing traffic to whatever model is "better" when the primary fails. It feels intuitive: if DeepSeek goes down, fall back to GPT-4o, which is more capable anyway.

But this creates a financial time bomb:

Cheap models fail more often — they're on shared infrastructure with tighter rate limits
Expensive models are always available — providers prioritize premium tiers
Every fallback = 10-20x cost increase — and there's no alert when it happens
429s come in bursts — when a provider rate limits, it rate limits everything

The result: your cheapest, highest-volume tier fails, and a tsunami of traffic hits your most expensive model. You won't notice until the billing email arrives Monday morning.

The Fix: Price-Tiered Fallback

The principle is simple: fallbacks go sideways, never upward.

Cheap model fails → fall back to another cheap model
Mid-tier model fails → fall back to another mid-tier model
Never fall up from cheap to expensive

Here's the production config I run now:

# config.yaml
model_list:
  # Tier 1: cheap models ($0.10-0.30/M tokens)
  - model_name: deepseek-chat
    litellm_params:
      model: deepseek/deepseek-chat
      api_key: os.environ/DEEPSEEK_API_KEY
      max_budget: 50        # $50/day hard limit per key

  - model_name: gemini-flash
    litellm_params:
      model: gemini/gemini-1.5-flash
      api_key: os.environ/GEMINI_API_KEY
      max_budget: 50

  - model_name: claude-haiku
    litellm_params:
      model: anthropic/claude-3-haiku-20240307
      api_key: os.environ/ANTHROPIC_API_KEY
      max_budget: 50

  # Tier 2: mid-tier models ($1-3/M tokens)
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      max_budget: 100       # $100/day hard limit

  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY
      max_budget: 100

  - model_name: gemini-pro
    litellm_params:
      model: gemini/gemini-1.5-pro
      api_key: os.environ/GEMINI_API_KEY
      max_budget: 100

litellm_settings:
  num_retries: 2
  allowed_fails: 3        # circuit breaker: cool down a deployment after more than 3 failures in a minute
  cooldown_time: 60       # keep a cooled-down deployment out of rotation for 60s
  fallbacks:              # fallbacks use model_name from model_list above
    # Tier 1: cheap → cheap (NEVER cheap → expensive)
    - "deepseek-chat": ["gemini-flash", "claude-haiku"]
    - "gemini-flash": ["deepseek-chat", "claude-haiku"]

    # Tier 2: mid → mid
    - "gpt-4o": ["claude-sonnet", "gemini-pro"]
    - "claude-sonnet": ["gpt-4o", "gemini-pro"]

    # NEVER do this:
    # ✗ - "deepseek-chat": ["gpt-4o"]

  cache: true
  cache_params:
    type: "redis"
    host: "redis"
    port: 6379
    ttl: 3600              # 1h cache — catches ~30% of duplicate traffic

Key settings explained:

Setting	What it does	Why it matters
`allowed_fails: 3`	Circuit breaker: puts a deployment into cooldown once it exceeds 3 failures within a minute	Prevents retry storms from amplifying costs
`cooldown_time: 60`	Keeps a cooled-down deployment out of rotation for 60s	Gives rate-limited providers time to recover
`num_retries: 2`	Retries on same model before fallback	Reduces unnecessary fallback triggers
`max_budget`	Per-key daily spending cap	Hard stop — even if everything goes wrong
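The "sideways, never upward" rule is mechanical enough to check before a config ships, for example in CI. Below is a minimal sketch under two assumptions: the tier numbers are assigned by hand to mirror the config above, and the fallback list has the same shape as the YAML (a list of single-key mappings). The function name is my own.

```python
# Tier assignments mirror the config above; a lower number means a cheaper tier
TIERS = {
    "deepseek-chat": 1, "gemini-flash": 1, "claude-haiku": 1,
    "gpt-4o": 2, "claude-sonnet": 2, "gemini-pro": 2,
}


def upward_fallbacks(fallbacks: list[dict[str, list[str]]]) -> list[tuple[str, str]]:
    """Return every (primary, fallback) pair where the fallback is in a pricier tier."""
    violations = []
    for entry in fallbacks:
        for primary, targets in entry.items():
            for target in targets:
                if TIERS[target] > TIERS[primary]:
                    violations.append((primary, target))
    return violations


good = [
    {"deepseek-chat": ["gemini-flash", "claude-haiku"]},
    {"gpt-4o": ["claude-sonnet", "gemini-pro"]},
]
bad = [{"deepseek-chat": ["gpt-4o"]}]
```

`upward_fallbacks(good)` comes back empty, while `upward_fallbacks(bad)` reports the `deepseek-chat` to `gpt-4o` pair that caused the weekend bill.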

How to Detect Fallback Abuse Before the Bill Arrives

You can't manage what you don't measure. Add this simple check to your monitoring:

Option 1: Log-based (no extra infra)

# Example: check fallback rate from LiteLLM logs (adjust fields to your version)
curl -s http://localhost:4000/logs \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  | jq '[.[] | select(.metadata.fallback_model != null)] | length' \
  | awk '{if($1 > 50) print "WARNING: High fallback rate: "$1" requests"}'

Option 2: Prometheus + Grafana (production)

# Fallback rate as % of total requests (metric names may vary by LiteLLM version)
sum(rate(litellm_fallback_count_total[5m]))
  /
sum(rate(litellm_total_requests[5m]))

Set an alert at 5% — if more than 1 in 20 requests is falling back, something's wrong with your primary provider.

Option 3: The $0 solution — daily budget alerts

Set max_budget on every key. LiteLLM will return a 429 when the budget is hit. Better to serve errors for 100 requests than to serve 10,000 requests on GPT-4o.

The Bigger Picture

Fallback routing is Pitfall #4 in my production survival map. After 6 months of running LiteLLM Proxy in production, I documented 5 deployment pitfalls and 3 cost traps:

503 on every request after adding a provider — model name mismatch
Costs 3× higher than expected — default fallback chain routes to expensive models
Keys rotated but old ones still work — key cache invalidation
Fallback routing bleeds your wallet dry ← this one
Streaming responses cut off mid-token — Nginx/Cloudflare buffering

Each of these cost me real money or real downtime. The full one-page reference card with all 5 pitfalls, 3 cost traps, a failure decision tree, and config templates is here:

👉 AI API Gateway Pitfall Map — $9

It's the page you print and pin next to your monitor — because when your gateway goes down at 2 AM, you won't be reading a 40-page guide.

Free resource: I also put together a Pre-Deployment Checklist — a 1-page PDF covering the 15 things to check before going live. Free download, no email required.

Independent reference. Not affiliated with LiteLLM, OpenAI, DeepSeek, or any provider named. 7-day money-back guarantee — if the map doesn't save you at least 1 hour of debugging, email for a full refund.

Tags: #litellm #ai #production #devops

The 5 Cost Traps That Will Quietly Bleed Your AI API Gateway Dry (And How to Fix Them)

Hanlin Xiang — Mon, 22 Jun 2026 01:47:21 +0000

In my last post, we talked about key cache invalidation — the silent production killer that turns your gateway into a 502 factory. Today I want to talk about something equally dangerous but far more insidious: cost traps.

These aren't bugs. They're not crashes. Your gateway runs fine. Your users are happy. Then finance sends you a Slack message: "Why did our OpenAI bill jump 4x last month?"

I've been running LiteLLM Proxy in production for multiple teams across three companies. Here are the five cost traps I've personally been burned by — each with the config that would have saved me thousands of dollars.

Trap #1: The Retry Spiral — When `num_retries=3` Actually Means 15

The Problem

LiteLLM's retry logic is smart. Too smart. When a request fails, it retries. When the request falls through to a fallback model and that fails too, it retries again. If you've configured a fallback chain of 5 models with 3 retries each, a single user request can trigger 15 retries on top of 5 initial attempts: up to 20 upstream API calls. You pay for every single one, including the ones that errored out after consuming tokens.

Why It Happens

The default num_retries in LiteLLM is 3. Most teams set it and forget it. But retries multiply across your fallback chain. Here's the math:

Request → Model A fails → Retry A (1) → Retry A (2) → Retry A (3)
       → Fallback to Model B → Fails → Retry B (1) → Retry B (2) → Retry B (3)
       → Fallback to Model C → Fails → Retry C (1) → Retry C (2) → Retry C (3)

Total upstream calls: 9 retries + 3 initial = 12 billable calls for 1 user request

If each of those 12 calls consumes 2K input tokens before timing out, that's 24K input tokens for a single failed request. If Model C is GPT-4o, the last four of those calls are billed at its rates.
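The same worst-case arithmetic as a small helper, in case you want to plug in your own chain. It assumes every attempt fails only after the prompt has been sent, which is the pessimistic case; the function names are mine.

```python
def upstream_calls(models_in_chain: int, retries_per_model: int) -> int:
    """Worst case: every model gets 1 initial attempt plus all of its retries."""
    return models_in_chain * (1 + retries_per_model)


def wasted_input_tokens(models_in_chain: int, retries_per_model: int,
                        tokens_per_call: int) -> int:
    """Input tokens consumed if every attempt fails after sending the prompt."""
    return upstream_calls(models_in_chain, retries_per_model) * tokens_per_call
```

For the 3-model chain above, `upstream_calls(3, 3)` gives 12 and `wasted_input_tokens(3, 3, 2000)` gives 24,000. A 5-model chain with the same settings reaches 20 calls: 15 retries plus 5 initial attempts.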

The Fix

Cap total attempts across the entire chain, not just per-model:

# litellm_config.yaml
litellm_settings:
  num_retries: 2           # per-model retries
  max_fallbacks: 2         # hard cap on fallback chain depth
  retry_after: 5           # respect 429 Retry-After headers
  allowed_fails: 3         # circuit breaker: after 3 fails, stop entirely

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      max_retries: 2       # override: fewer retries on expensive models

  - model_name: gpt-4o-fallback
    litellm_params:
      model: gpt-4o-mini   # cheap fallback, not another expensive model

The key insight: your fallback should never be more expensive than your primary. If GPT-4o fails, fall back to GPT-4o-mini, not to Claude Opus.

Trap #2: Fallback Chains That Funnel Money Into Premium Models

The Problem

This is the trap that cost me $2,300 in a single weekend. A well-meaning engineer configured a fallback chain that looked like this:

# THE EXPENSIVE WAY — do not do this
router_settings:
  fallbacks:
    - "gpt-4o-mini": ["gpt-4o"]
    - "gpt-4o": ["claude-3-5-sonnet"]
    - "claude-3-5-sonnet": ["claude-3-opus"]

The logic seemed sound: "If the cheap model fails, try the better one." But here's what actually happened: GPT-4o-mini was rate-limited during a traffic spike (429s everywhere), so every single request fell through to GPT-4o and then to Claude 3.5 Sonnet. For 6 hours, we were running 100% of our traffic on the most expensive models in the chain.

Why It Happens

Rate limits are per-model, not per-gateway. When you hit OpenAI's TPM limit on gpt-4o-mini, LiteLLM dutifully falls back. But if the traffic spike is caused by overall volume (not a model-specific outage), the fallback model gets the same volume that caused the 429 in the first place. You're not solving the problem — you're just paying 10x more to have it on a different model.

The Fix

Structure fallbacks by cost tier, not by capability tier:

# THE SMART WAY — fallback within price tier, not up
router_settings:
  fallbacks:
    # Tier 1: Cheap models (fallback to other cheap models)
    - "gpt-4o-mini": ["gemini-1.5-flash", "claude-3-haiku"]

    # Tier 2: Mid-tier models (fallback to other mid-tier)
    - "gpt-4o": ["claude-3-5-sonnet", "gemini-1.5-pro"]

    # NEVER fall up from cheap to expensive
    # If all cheap models fail, return an error, don't escalate

  # Add a cooldown so the same model isn't retried immediately
  cooldown_time: 60

Also add alerting. If your fallback rate exceeds 5% of total traffic, something is structurally wrong:

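# Custom LiteLLM callback that tracks what share of calls are fallbacks
from litellm.integrations.custom_logger import CustomLogger
import litellm

class FallbackAlertLogger(CustomLogger):
    def __init__(self, alert_webhook, threshold=0.05):
        super().__init__()
        self.alert_webhook = alert_webhook  # any object exposing .send(text)
        self.threshold = threshold          # alert above this fallback share
        self.total_calls = 0
        self.fallback_calls = 0

    def log_pre_api_call(self, model, messages, kwargs):
        self.total_calls += 1
        # "fallback_idx" is the metadata key assumed in this post; check which
        # key your LiteLLM version actually sets before relying on it
        if (kwargs.get("metadata") or {}).get("fallback_idx", 0) > 0:
            # This is a fallback call, not the primary
            self.fallback_calls += 1
            # Alert if fallback rate > 5%
            if self.fallback_calls / self.total_calls > self.threshold:
                self.alert_webhook.send(
                    "⚠️ Fallback rate >5%: check rate limits on primary models"
                )

# Register with: litellm.callbacks = [FallbackAlertLogger(alert_webhook=...)]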

Trap #3: Zero Caching = Paying for the Same Answer 1,000 Times

The Problem

Most teams don't enable LiteLLM's built-in caching because "our prompts are dynamic." But in practice, a huge percentage of your traffic is near-identical: system prompts are the same, the first 500 tokens of user messages are often boilerplate, and many users ask the exact same questions.

I audited one team's traffic and found that 34% of their requests were exact duplicates of requests made in the last hour. They were paying OpenAI ~$400/day for identical completions.
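An audit like this needs little more than a request log and a hash. The sketch below estimates the hit rate an exact-match cache with a fixed TTL would have achieved. The log shape (`ts` in epoch seconds, `model`, `messages`) is an assumption for illustration, and the function name is mine.

```python
import hashlib
import json


def duplicate_rate(requests: list[dict], window_seconds: int = 3600) -> float:
    """Share of requests an exact-match cache with this TTL would have served.

    Each request is a dict with 'ts' (epoch seconds), 'model', and 'messages'.
    """
    cached_at: dict[str, float] = {}
    hits = 0
    for req in sorted(requests, key=lambda r: r["ts"]):
        payload = json.dumps([req["model"], req["messages"]], sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        stored = cached_at.get(key)
        if stored is not None and req["ts"] - stored <= window_seconds:
            hits += 1                    # identical request still within the TTL
        else:
            cached_at[key] = req["ts"]   # miss: (re)store with a fresh TTL
    return hits / len(requests) if requests else 0.0
```

Including the model name in the hash keeps answers from different models apart, so the estimate is not inflated by identical prompts sent to different models.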

Why It Happens

LiteLLM has Redis caching built in. But it's disabled by default, and the documentation buries it under "Advanced Settings." Most engineers set up the proxy, test it, ship it, and never circle back.

The Fix

Enable Redis caching with a sensible TTL. This is a 30-second config change that can cut your bill by 30-50%:

litellm_settings:
  cache: true
  cache_params:
    type: "redis"
    host: "your-redis-host"
    port: 6379
    namespace: "litellm_cache"

    # Cache settings
    ttl: 3600              # 1 hour for exact matches
    # For semantic caching (similar but not identical prompts):
    # semantic_cache: true
    # similarity_threshold: 0.8

  # Include the model in the cache key so responses are not shared across models
  cache_key_include_models: true

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      cache: true           # enable per-model
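To make the moving parts concrete, here's a toy in-memory version of an exact-match cache with a TTL. This is only an illustration of the idea, not how LiteLLM or Redis implement it. `ExactMatchCache` and the injected `clock` are hypothetical.

```python
import hashlib
import json
import time


class ExactMatchCache:
    """Toy in-memory stand-in for a Redis exact-match cache (illustration only)."""

    def __init__(self, ttl_seconds=3600, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}  # key -> (expires_at, response)

    @staticmethod
    def key_for(model, messages):
        # Including the model keeps gpt-4o answers from being served for gpt-4o-mini.
        raw = json.dumps([model, messages], sort_keys=True)
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, model, messages):
        entry = self.store.get(self.key_for(model, messages))
        if entry is None or entry[0] <= self.clock():
            return None  # miss or expired
        return entry[1]

    def set(self, model, messages, response):
        self.store[self.key_for(model, messages)] = (self.clock() + self.ttl, response)


now = [0.0]  # fake clock so the expiry can be shown without sleeping
cache = ExactMatchCache(ttl_seconds=3600, clock=lambda: now[0])
msgs = [{"role": "user", "content": "hello"}]
cache.set("gpt-4o", msgs, "hi there")
hit = cache.get("gpt-4o", msgs)
other_model = cache.get("gpt-4o-mini", msgs)
now[0] = 3601.0
expired = cache.get("gpt-4o", msgs)
```

The same request hits within the hour, misses for a different model, and misses again once the TTL has passed.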

For even bigger savings, use prompt caching with providers that support it (Claude, GPT-4o). LiteLLM supports this natively:

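import litellm

user_input = "Summarize yesterday's error logs."  # placeholder user message

# Enable prompt caching for Claude: mark the long, stable prefix as cacheable
response = litellm.completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=[
        {"role": "system", "content": [
            {"type": "text", "text": "<long_system_prompt>", "cache_control": {"type": "ephemeral"}},
        ]},
        {"role": "user", "content": user_input},
    ]
)
# Anthropic bills cache reads at roughly 10% of the normal input-token price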

Real numbers from my audit: After enabling Redis cache with a 1-hour TTL, that team went from $400/day to $180/day. A 55% reduction for a config change that took less than a minute.

Trap #4: No Per-Key Budget Limits — One Runaway Loop Can Bankrupt You

The Problem

An intern pushes a while True loop to a staging environment. It doesn't crash — it just calls your gateway 4,000 times per minute with a 4K-token prompt. By the time PagerDuty fires, you've spent $847 in 12 minutes.

This isn't hypothetical. This is a Tuesday.

Why It Happens

LiteLLM's default configuration has no budget enforcement. The max_budget field exists but most teams never configure it because they're focused on getting the gateway working, not on constraining it.

The Fix

Set budgets at three levels: per-key, per-team, and global:

# Global limits in config.yaml: your emergency brake
# Per-key budgets are set separately when creating keys via /key/generate

litellm_settings:
  # Global budget — emergency brake
  max_budget: 500          # $500/day global cap
  budget_duration: "1d"

  # Rate limiting
  rpm_limit: 1000          # requests per minute, global

general_settings:
  master_key: sk-1234
  database_url: "postgresql://..."

  # Enable Slack alerting
  alerting: ["slack"]
  alerting_threshold: 300  # seconds; flags slow or hanging requests, not budget usage

When creating virtual keys for teams or individual developers:

# Create a key with a $50 daily budget and 100 RPM
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-1234" \
  -H "Content-Type: application/json" \
  -d '{
    "max_budget": 50,
    "budget_duration": "1d",
    "rpm_limit": 100,
    "tpm_limit": 50000,
    "models": ["gpt-4o-mini", "gpt-4o"],
    "metadata": {"team": "frontend"}
  }'
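Conceptually, the gateway runs two checks on every request made with that key: is the key out of budget, and has it used up this minute's request allowance? Here's a simplified sketch of that logic. `KeyLimits` is a hypothetical class for illustration and is not LiteLLM's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class KeyLimits:
    """Hypothetical per-key state: a daily budget plus a requests-per-minute cap."""
    max_budget: float
    rpm_limit: int
    spend: float = 0.0
    minute_counts: dict = field(default_factory=dict)  # minute index -> requests

    def check(self, now_seconds):
        """Return (allowed, reason) for a request arriving at `now_seconds`."""
        if self.spend >= self.max_budget:
            return False, "budget_exceeded"
        minute = int(now_seconds // 60)
        if self.minute_counts.get(minute, 0) >= self.rpm_limit:
            return False, "rate_limited"
        self.minute_counts[minute] = self.minute_counts.get(minute, 0) + 1
        return True, "ok"

    def record_cost(self, cost):
        self.spend += cost


key = KeyLimits(max_budget=50.0, rpm_limit=100)
results = [key.check(now_seconds=1.0)[1] for _ in range(101)]
key.record_cost(50.0)
after_budget = key.check(now_seconds=61.0)
```

The 101st request in the same minute is rejected as rate limited, and once recorded spend reaches the budget every later request is rejected regardless of rate.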

And set up a webhook to catch budget breaches:

# In your LiteLLM proxy config
litellm_settings:
  proxy_budget_respecting_alerting:
    - webhook_url: "https://hooks.slack.com/services/..."
      # This fires BEFORE the request is sent when a key is over budget
      # LiteLLM will return a 429 to the client, not forward to the provider

The intern's loop? With a $50/day key budget and a 100 RPM limit, it would have been throttled to 100 calls per minute and blocked entirely once spend hit $50. Worst-case damage: $50 for the day, instead of $847 in 12 minutes.

Trap #5: The Streaming Tax — You're Paying for Tokens You Never See

The Problem

Streaming mode (stream=True) is great for UX. Users see tokens appear in real-time. But here's what most teams don't realize: when a streaming request is interrupted mid-stream, you still pay for the entire generation.

User starts a request → GPT-4o begins streaming a 2,000-token response → user navigates away after 50 tokens → the client connection drops → but the upstream API call completes fully → you pay for all 2,000 tokens.

At scale, this is devastating. I've seen teams where 23% of their token spend was on tokens that no user ever saw because the client disconnected early.

Why It Happens

LiteLLM (and most API gateways) doesn't automatically cancel the upstream request when the client disconnects during streaming. The gateway is acting as a proxy — it's happily receiving tokens from OpenAI and trying to forward them, even though nobody's listening.
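The behaviour you want from the proxy is roughly this: stop pulling from the upstream stream the moment the client goes away. Here's a minimal asyncio sketch using a fake upstream generator. `upstream_tokens` and `relay` are hypothetical names, and the sketch assumes the provider stops generating (and billing) when its stream is closed, which you should verify for each provider.

```python
import asyncio


async def upstream_tokens(total, generated):
    """Stand-in for a provider stream; `generated` counts tokens actually produced."""
    for i in range(total):
        await asyncio.sleep(0)
        generated.append(i)
        yield f"tok{i}"


async def relay(stream, client_reads):
    """Forward tokens until the client stops reading, then close the upstream."""
    delivered = 0
    try:
        async for _ in stream:
            delivered += 1
            if delivered >= client_reads:  # simulate the client disconnecting
                break
    finally:
        await stream.aclose()  # stop the generator instead of draining it
    return delivered


async def main():
    generated = []
    delivered = await relay(upstream_tokens(2000, generated), client_reads=50)
    return delivered, len(generated)


delivered, generated_count = asyncio.run(main())
```

Because the relay closes the stream after the 50th token, the fake upstream never produces the remaining 1,950.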

The Fix

Enable client disconnect detection and upstream cancellation:

litellm_settings:
  # Cancel upstream request when client disconnects during streaming
  stream_options:
    include_usage: true     # get token counts in the final chunk

  # Custom callback to track abandoned streams
  callbacks: stream_cost_logger

router_settings:
  # Close upstream connection when client disconnects
  streaming_client_disconnect: true   # LiteLLM 1.40+

If you're on an older version or need more visibility, add a custom middleware that records how many tokens were paid for on abandoned streams:

import time

from litellm.proxy.custom_proxy_admin_logic import CustomProxyAdminLogic

class StreamCancellationMiddleware(CustomProxyAdminLogic):
    async def async_pre_call(self, user_api_key_dict, cache, data, call_type):
        if data.get("stream"):
            # Mark the start time
            data["metadata"] = data.get("metadata", {})
            data["metadata"]["stream_start_time"] = time.time()
        return data

    async def async_log_stream_event(self, logging_obj, response, start_time, end_time):
        # Log how many tokens were actually consumed vs delivered
        if hasattr(response, 'usage'):
            total_tokens = response.usage.get('completion_tokens', 0)
            # If stream ended early (client disconnect), log it
            if logging_obj.stream_connection_broken:
                self.metrics.abandoned_stream_tokens.inc(total_tokens)
                self.alert(
                    f"Abandoned stream: {total_tokens} tokens paid but undelivered"
                )

Also, consider setting max_tokens conservatively for streaming endpoints:

model_list:
  - model_name: gpt-4o-stream
    litellm_params:
      model: gpt-4o
      max_tokens: 1000        # cap generation length
      stream: true
      stream_options:
        include_usage: true

After implementing stream cancellation, that 23% wasted spend dropped to under 2%.

The Pattern Behind All Five Traps

Notice the theme: every one of these traps is a sensible default that becomes dangerous at scale. Retries are good — until they multiply across fallbacks. Fallbacks are good — until they funnel traffic to premium models. Caching is optional — until it's costing you 30% of your bill.

The fix is never "disable the feature." It's always "add constraints." Budgets, caps, cooldowns, TTLs. The gateway works for you, not the other way around.

If you're deploying LiteLLM or any AI API gateway, do a quick audit:

What's your effective retry multiplier? (num_retries × fallback_depth)
Does your fallback chain ever fall UP in price?
Is caching enabled? (If you can't answer this in 5 seconds, it's probably not)
Does every API key have a budget cap?
What percentage of your streaming tokens are actually delivered?
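For the first question, the arithmetic is short enough to script. This sketch uses the formula from the checklist and adds a stricter upper bound that also counts each model's first attempt. It assumes `fallback_depth` counts every model in the chain including the primary, and that retries apply per model. Both are my assumptions, so check them against how your gateway is configured.

```python
def effective_retry_multiplier(num_retries, fallback_depth):
    """Checklist formula: num_retries x fallback_depth."""
    return num_retries * fallback_depth


def worst_case_attempts(num_retries, fallback_depth):
    """Upper bound when every model in the chain also gets its first attempt."""
    return (1 + num_retries) * fallback_depth


multiplier = effective_retry_multiplier(num_retries=3, fallback_depth=5)
attempts = worst_case_attempts(num_retries=3, fallback_depth=5)
print(f"retry multiplier: {multiplier}, worst-case attempts: {attempts}")
```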

If any of these questions made you nervous, you might want to check out the AI API Gateway Pitfall Map — a one-page production survival guide I put together that covers these traps (and a few more) in a format you can print and pin above your desk. It's the checklist I wish I'd had before I learned these lessons the expensive way.

Have you hit any of these traps in production? Or found others I missed? Drop a comment — I'm collecting war stories for a follow-up post.

Tags: #litellm #ai #devops #costoptimization

🎁 Free: AI API Gateway Pre-Deployment Checklist

Before you ship, run through this 43-point checklist covering auth, cost control, caching, fallbacks, security, monitoring, and production readiness. It's free — grab it here:

👉 Free Pre-Deployment Checklist (PDF)

And if you want the full pitfall map with detailed fixes for each trap above, that's here: AI API Gateway Pitfall Map ($9)

The #3 Production Killer in Your LiteLLM Setup: Key Cache Invalidation (and How to Fix It)

Hanlin Xiang — Fri, 19 Jun 2026 14:18:28 +0000

This is the pitfall that cost me 3 hours at 2 AM. If you're running LiteLLM Proxy in production, it will hit you too — usually at the worst possible time.

What Happened

I run LiteLLM Proxy + New API in front of 18 provider channels. One night, I rotated an API key for a provider that had been flagged for unusual spending.

Standard procedure:

Generate new key in provider dashboard
Update config.yaml with new key
Run litellm --config config.yaml --reload

The reload succeeded. No errors. The config showed the new key. I went to sleep.

The next morning, the old key was still being used. Every single request was still authenticating with the rotated-out key. The provider's dashboard showed traffic from both keys — the new one (from config validation) and the old one (from actual API calls).

Why It Happens

LiteLLM caches API keys in-memory for performance. When you --reload, the config is reloaded, but the key store is not purged. The worker process holds the old keys in a dictionary that persists across config reloads.

This means:

config.yaml shows the new key ✅
litellm --model_cost_map shows the new key ✅
The actual HTTP requests use the old key ❌

You won't notice until the old key expires or is revoked — at which point every request to that provider starts returning 401, and your fallback chain kicks in, routing traffic to your most expensive model.
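If the failure mode sounds abstract, this toy model reproduces it in a few lines. To be clear, `ToyProxy` is a hypothetical class I wrote to illustrate the pattern. It is not LiteLLM's actual internals.

```python
class ToyProxy:
    """Hypothetical model of the failure mode; not LiteLLM's real internals."""

    def __init__(self, config):
        self.config = dict(config)
        self.key_cache = {}  # provider -> key, filled lazily on first request

    def reload(self, new_config):
        self.config = dict(new_config)  # config is replaced, cache is left alone

    def purge_cache(self):
        self.key_cache.clear()

    def key_for_request(self, provider):
        if provider not in self.key_cache:
            self.key_cache[provider] = self.config[provider]
        return self.key_cache[provider]


proxy = ToyProxy({"openai": "sk-old"})
first = proxy.key_for_request("openai")   # warms the cache with the old key
proxy.reload({"openai": "sk-new"})
stale = proxy.key_for_request("openai")   # still served from the cache
proxy.purge_cache()
fresh = proxy.key_for_request("openai")   # re-read from the reloaded config
```

The config is correct the whole time after the reload. Only the lookup path is stale, which is why inspecting the config tells you nothing.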

The Fix

Option 1: Purge the cache manually (no downtime)

curl -X POST http://localhost:4000/cache/purge \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY"

This clears the in-memory key cache. The next request will pull the key from the freshly reloaded config.

Option 2: Use Redis for shared key state (recommended for multi-worker)

Set REDIS_HOST in your environment:

# docker-compose.yml
environment:
  - REDIS_HOST=redis          # hostname only, not a redis:// URL
  - REDIS_PORT=6379
  - REDIS_CONNECTION_POOL_SIZE=5

With Redis, keys are stored externally. A config reload triggers a Redis key update, and all workers pick it up immediately. No stale keys.

Option 3: Restart the worker (downtime: 2-5 seconds)

docker restart litellm-proxy

Brute force, but guaranteed to work. Use this if you're in a hurry and can afford a brief blip.

How to Detect It Before Users Do

Add this to your monitoring: a simple probe that sends a one-token request through the proxy, so you can check on the provider's dashboard which key the traffic arrived on and compare it with the key in config:

# Send a minimal request for a specific model
curl -s http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $LITELLM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "openai/gpt-4o", "messages": [{"role": "user", "content": "test"}], "max_tokens": 1}' \
  | jq '.usage'

# Compare with the key in config
grep "api_key:" config.yaml | head -1

If the provider's response includes a x-api-key-id header (OpenAI does), you can verify which key was used without guessing.
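Dashboards usually show only the tail of a key, so it helps to print the configured key in the same masked form before comparing by eye. Here's a small sketch. `first_api_key` and `mask` are hypothetical helpers, the regex only handles a literal `api_key:` value (not an environment-variable reference), and the key shown is a made-up example.

```python
import re


def first_api_key(config_text):
    """Return the first `api_key:` value found in YAML-like text, or None."""
    match = re.search(r"^\s*api_key:\s*[\"']?([^\"'\s#]+)", config_text, re.MULTILINE)
    return match.group(1) if match else None


def mask(key, visible=4):
    """Hide everything except the last few characters of the key."""
    return "*" * max(len(key) - visible, 0) + key[-visible:]


config_text = """
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: "sk-example-abcd1234"
"""
configured = first_api_key(config_text)
masked = mask(configured)
print(masked)
```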

The Bigger Picture

Key cache invalidation is Pitfall #3 in my production survival map. There are 4 more deployment pitfalls and 3 hidden cost traps that I documented after 6 months of running this stack:

1. 503 on every request after adding a provider: model name mismatch
2. Costs 3× higher than expected: fallback chain hits expensive models by default
3. Keys rotated but old ones still work ← this one
4. Streaming responses cut off mid-token: Nginx/Cloudflare buffering
5. New API channels show "insufficient quota" with balance > 0: weight = 0 by default

Each of these took me 1-2 hours to diagnose in production. The full one-page reference card with all 5 pitfalls, 3 cost traps, a failure decision tree, and a pre-launch security checklist is available here:

👉 AI API Gateway Pitfall Map — $9

It's the page you print and pin next to your monitor — because when your gateway goes down at 2 AM, you won't be reading a 40-page guide.

5 Pitfalls I Hit Running LiteLLM Proxy in Production (with a 1-page survival map)

Hanlin Xiang — Wed, 17 Jun 2026 12:09:44 +0000

I've spent the last 6 months running an 18-channel LLM gateway in production — LiteLLM Proxy backed by Redis and PostgreSQL, routing traffic across OpenAI, Anthropic, Google, DeepSeek, and several smaller providers. What started as a weekend project turned into a 24/7 operation serving multiple AI agents and internal tools.

This post covers the 5 pitfalls that hit me hardest, with real error examples and the fixes that worked. If you're running LiteLLM Proxy (or considering it), these are the things I wish someone had told me before I went to production.

Pitfall #1: Silent OOM (Memory Leak + No systemd MemoryMax)

LiteLLM has a known memory leak under high concurrency. Without a hard memory limit, the process will eat all available RAM until the kernel's OOM-killer takes it down — usually at 3 AM.

# The symptom: requests start timing out intermittently
# Check dmesg for the kill signal
import subprocess
result = subprocess.run(["dmesg", "-T"], capture_output=True, text=True)
for line in result.stdout.split("\n"):
    if "oom" in line.lower() and "litellm" in line.lower():
        print(f"FOUND: {line}")

# The fix: systemd unit with MemoryMax
# /etc/systemd/system/litellm.service
# [Service]
# ExecStart=/usr/bin/litellm --config /opt/litellm/config.yaml --port 4000
# MemoryMax=4G
# Restart=always
# RestartSec=10

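The fix is simple but easy to miss: set MemoryMax=4G (or whatever your server can spare) in your systemd unit. When the proxy crosses the limit, only its own cgroup is OOM-killed and Restart=always brings it back after RestartSec, instead of the leak starving the whole host.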

Pitfall #2: Key Cache Miss (OpenAI 8min Cache vs 24h Cache Key)

This was the single most painful bug I encountered. I rotated a provider API key through the config and ran litellm --reload. The config file updated, but LiteLLM caches keys in-memory. The old key kept getting used for hours.

# What happened: silent 401s that looked like "provider outage"
# The config was correct, but the in-memory cache wasn't purged

# WRONG: this only reloads config, not the key store
# litellm --reload

# RIGHT: purge the cache explicitly
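import os
import requests

resp = requests.post(
    "http://localhost:4000/cache/purge",
    headers={"Authorization": f"Bearer {os.environ['LITELLM_MASTER_KEY']}"},
    timeout=10,
)
resp.raise_for_status()  # fail loudly if the purge did not go through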

# Or just restart the worker entirely
# If using REDIS_HOST for shared state, flush that too:
# redis-cli -h $REDIS_HOST FLUSHDB

The key insight: --reload refreshes the config file but does NOT purge the in-memory key cache. You need to either hit /cache/purge or restart the worker. If you're using Redis for shared key state, flush that too.

Pitfall #3: Retry Storm (4xx Retries Cause Rate-Limit Avalanche)

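LiteLLM retries with num_retries=3 by default, so a single failing call can be attempted up to four times (the original plus three retries), multiplying the spend on every attempt that gets billed. Worse: on 4xx errors (which should NOT be retried), the retry logic can trigger rate-limit cascades.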

# The problem: default config retries everything
# config.yaml (BAD)
# litellm_settings:
#   num_retries: 3  # This retries even 4xx errors!

# The fix: retry only 5xx, use fallbacks for 4xx
# config.yaml (GOOD)
litellm_settings:
  num_retries: 1
  retry_policy:
    InternalServerError:
      retries: 1
    RateLimitError:
      retries: 0  # Don't retry rate limits — use fallbacks instead

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      fallbacks: ["anthropic/claude-3.5-sonnet"]

Set num_retries: 1 for non-critical paths. Use fallbacks (cheapest-first) instead of retries for cost control. A retry storm on a rate-limited provider can 3x your spend in 5 minutes.
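If you handle retries in your own client code as well, the same rule is easy to express: retry only server-side failures, at most once, with jittered backoff. This is a generic sketch. The status set, function names, and defaults are my assumptions, not LiteLLM settings.

```python
import random


RETRYABLE_STATUS = {500, 502, 503, 504}  # assumption: only server-side failures


def should_retry(status_code, attempt, max_retries=1):
    """Retry only 5xx responses, and only while retries remain."""
    return status_code in RETRYABLE_STATUS and attempt < max_retries


def backoff_seconds(attempt, base=0.5, cap=8.0, rng=random.random):
    """Exponential backoff with full jitter, bounded by `cap` seconds."""
    return rng() * min(cap, base * (2 ** attempt))


decisions = {
    429: should_retry(429, attempt=0),  # rate limited: hand off to a fallback instead
    401: should_retry(401, attempt=0),  # bad key: retrying cannot help
    503: should_retry(503, attempt=0),  # transient upstream error: retry once
    "503_again": should_retry(503, attempt=1),  # retry budget already spent
}
delay = backoff_seconds(attempt=3, rng=lambda: 1.0)  # upper bound for attempt 3
```

The jitter matters when many workers fail at once: without it they all retry at the same instant and recreate the spike that caused the failure.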

Pitfall #4: Cost Unobserved (Multi-Provider Routing Weights)

When you route across multiple providers, LiteLLM's default fallbacks are sequential — not cost-sorted. One upstream failure can route all traffic to your most expensive model.

# The problem: the fallback chain ends at the most expensive model
# config.yaml (BAD)
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      fallbacks: ["openai/gpt-4o-mini", "anthropic/claude-3.5-sonnet"]
      # If gpt-4o fails, it tries mini (cheap) then sonnet (expensive)
      # But if mini also fails, ALL traffic goes to sonnet

# The fix: only fall back to cheaper models + set a global max_budget
# config.yaml (GOOD)
litellm_settings:
  max_budget: 100.0  # Daily budget cap in USD
  budget_duration: "1d"

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      allowed_fails: 3
      fallbacks: ["openai/gpt-4o-mini"]  # Only fall back to cheaper models

Monitor per-provider spend daily. The LiteLLM UI shows cost breakdowns, but only if you've configured master_key and database_url properly.
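One way to keep a fallback chain from ever falling up in price is to generate it from a price table instead of writing it by hand. Here's a minimal sketch. The prices below are hypothetical placeholders (plug in your providers' current rates), and `cheaper_fallbacks` is a helper name I made up.

```python
# Hypothetical prices in USD per 1M input tokens; replace with real rates.
PRICE_PER_MTOK = {
    "openai/gpt-4o": 2.50,
    "openai/gpt-4o-mini": 0.15,
    "anthropic/claude-3.5-sonnet": 3.00,
    "deepseek/deepseek-chat": 0.27,
}


def cheaper_fallbacks(primary, candidates, prices):
    """Keep only candidates priced at or below the primary, cheapest first."""
    ceiling = prices[primary]
    allowed = [m for m in candidates if prices[m] <= ceiling]
    return sorted(allowed, key=lambda m: prices[m])


chain = cheaper_fallbacks(
    "openai/gpt-4o",
    ["anthropic/claude-3.5-sonnet", "deepseek/deepseek-chat", "openai/gpt-4o-mini"],
    PRICE_PER_MTOK,
)
print(chain)
```

With these placeholder prices the pricier model is dropped from the chain and the remaining two are ordered cheapest first.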

Pitfall #5: Metric Blindness (Incomplete Prometheus Metrics)

LiteLLM's built-in Prometheus metrics don't cover per-provider latency percentiles or cost attribution. You're flying blind on the most important signals for production operations.

# What LiteLLM exposes by default:
# - litellm_requests_total
# - litellm_request_duration_seconds (aggregate, not per-provider)
# - litellm_spend_total (only if database is configured)

# What's MISSING:
# - Per-provider P95/P99 latency
# - Per-provider error rate
# - Per-team cost breakdown
# - Cache hit/miss ratio

# The fix: add a custom logger to emit per-provider metrics
from litellm.integrations.custom_logger import CustomLogger
import prometheus_client as prom

provider_latency = prom.Histogram(
    'litellm_provider_latency_seconds',
    'Latency by provider',
    ['provider', 'model']
)

class PerProviderMetrics(CustomLogger):
    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        provider = kwargs.get("litellm_params", {}).get("custom_llm_provider", "unknown")
        model = kwargs.get("litellm_params", {}).get("model", "unknown")
        latency = (end_time - start_time).total_seconds()
        provider_latency.labels(provider=provider, model=model).observe(latency)

Without per-provider metrics, you can't tell if DeepSeek is slow today or if OpenAI is throttling you. Add a custom logger to fill the gap.
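If you just want a quick answer from logs before wiring up dashboards, per-provider P95 can be computed offline with the standard library. This is a sketch: the `(provider, latency_seconds)` sample format and the `p95_by_provider` name are assumptions for illustration.

```python
from collections import defaultdict
from statistics import quantiles


def p95_by_provider(samples):
    """Group (provider, latency_seconds) samples and return each provider's P95."""
    grouped = defaultdict(list)
    for provider, latency in samples:
        grouped[provider].append(latency)
    result = {}
    for provider, values in grouped.items():
        if len(values) < 2:
            result[provider] = values[0]  # quantiles() needs at least two points
        else:
            # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile
            result[provider] = quantiles(values, n=100, method="inclusive")[94]
    return result


samples = [("openai", 0.1 * i) for i in range(1, 101)] + [("deepseek", 2.0)] * 5
p95 = p95_by_provider(samples)
```

Comparing the per-provider values side by side is usually enough to tell a slow provider from a slow gateway.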

The Solution: A 1-Page Survival Map

After hitting all 5 of these pitfalls in production (and losing too many weekends to debugging), I compiled everything into a single-page survival map. It covers:

All 5 deployment pitfalls with symptoms, root causes, and fixes
3 hidden cost traps (retry amplification, embedding tax, idle-connection keep-alive)
A failure decision tree for any error code you'll see
A pre-launch security checklist
Copy-paste diagnostic commands

The full map is available here: https://payhip.com/b/S96bB

It's $9, no email signup, no affiliate. Just the thing I wish I had when I started.

Have you hit any of these pitfalls? Or did I miss something that's bitten you? Drop a comment — I'll be responding to everyone. You can find me at @ai-gateway-veteran on Reddit and X.


Trap #1: The Retry Spiral — When `num_retries=3` Actually Means 15