You know that feeling when you've just finished migrating your entire codebase from GPT-4 to Claude — rewriting API calls, fixing response parsing, updating streaming logic — and then Google drops a new Gemini with benchmarks that make everything else look like a calculator from 1995?
Yeah. I've been there. Multiple times.
After the third migration in six months, I sat down and thought: why am I rewriting integration code when the only thing that changes is which HTTP endpoint I'm hitting? The request is JSON. The response is JSON. The models all take messages in, text out. Why does switching providers feel like changing the engine on a moving car?
That's how I ended up looking at LM-Proxy — and it changed the way I architect LLM-powered applications. Let me walk you through what it is, why it matters, and how to set it up in under 5 minutes.
The Problem (or: Why Your LLM Integration Code Is a Ticking Time Bomb)
Here's the typical evolution of an LLM-powered project:
Month 1: "We'll just use OpenAI. One provider, one SDK, simple."
Month 3: "Claude is actually better for our summarization task. Let's add Anthropic too." You now have two different SDKs, two authentication flows, two response formats, and an if/else that haunts your dreams.
Month 5: "Gemini Flash is way cheaper for simple queries. Let's route the easy stuff there." Your routing logic now lives in three different files, you have provider-specific error handling everywhere, and your new developer spent two days just understanding which API key goes where.
Month 7: Someone asks "can we add a local model for sensitive data?" and you start updating your LinkedIn.
The root cause is simple: every LLM provider decided to invent their own API format, even though they all do fundamentally the same thing. OpenAI's format became the de facto standard, but Anthropic, Google, and others each have their own quirks, authentication schemes, and streaming protocols.
What you actually need is a reverse proxy — a single endpoint that speaks OpenAI's API format on the outside, but can talk to any provider on the inside.
Enter LM-Proxy
LM-Proxy is a lightweight, OpenAI-compatible HTTP proxy/gateway built with Python and FastAPI. Here's the pitch in one sentence:
Your app talks to one API (OpenAI format). LM-Proxy talks to everyone else.
It supports OpenAI, Anthropic, Google (AI Studio and Vertex AI), local PyTorch models, and anything that speaks OpenAI's API format — all through a single /v1/chat/completions endpoint. The only hard requirement is Python 3.11+.
But "another LLM gateway" isn't what makes it interesting. Here's what does:
1. It's Truly Lightweight
No Kubernetes. No Redis. No Kafka. No 47-microservice architecture. It's a single Python package:
pip install lm-proxy
One TOML config file. One command to start:
lm-proxy
That's it. You can have it running on a $5/month VPS, inside a Docker container, or even embedded directly into your Python application as a library (since it's built on FastAPI, you can import and mount it as a sub-app). The entire core is minimal by design — no bloat, no enterprise-grade complexity for a problem that doesn't need it.
2. Configuration That Doesn't Require a PhD
Here's a complete, working config that routes GPT requests to OpenAI, Claude to Anthropic, and Gemini to Google:
host = "0.0.0.0"
port = 8000
[connections.openai]
api_type = "open_ai"
api_base = "https://api.openai.com/v1/"
api_key = "env:OPENAI_API_KEY"
[connections.anthropic]
api_type = "anthropic"
api_key = "env:ANTHROPIC_API_KEY"
[connections.google]
api_type = "google_ai_studio"
api_key = "env:GOOGLE_API_KEY"
[routing]
"gpt*" = "openai.*"
"claude*" = "anthropic.*"
"gemini*" = "google.*"
"*" = "openai.gpt-4o-mini" # fallback
[groups.default]
api_keys = ["my-team-api-key-1", "my-team-api-key-2"]
Read that config top to bottom. You understood it in 30 seconds, didn't you? No YAML indentation nightmares, no 200-line JSON blobs — just clean TOML with obvious semantics. (YAML, JSON, and Python config formats are also supported, by the way.)
The env: prefix pulls secrets from environment variables (or .env files), so your API keys never touch version control.
3. Pattern-Based Routing (The Killer Feature)
The [routing] section is where the magic happens. Keys are glob patterns that match against the model name your client sends. The .* suffix means "pass the model name as-is to the provider." So when your client asks for claude-sonnet-4-5-20250929, LM-Proxy forwards exactly that to Anthropic's API. No mapping tables, no model ID translation files — it just works.
You can also pin patterns to specific models:
[routing]
"custom*" = "local.llama-7b" # Any "custom*" request → local Llama
"gpt-3.5*" = "openai.gpt-3.5-turbo" # Pin to a specific model
"*" = "openai.gpt-4o-mini" # Everything else → cheap fallback
This means your client code never changes. Want to try a new model? Update one line in the config. Want to A/B test two providers? Add a routing rule. Want to deprecate a model? Redirect the pattern to something else. Zero code changes.
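To make that concrete, here is a minimal sketch of what a raw request to the proxy looks like. It assumes the proxy is running locally on port 8000 and uses one of the virtual keys from the [groups.default] section above; only the model field decides which provider handles the request.
import requests

# Plain HTTP against the proxy's OpenAI-compatible endpoint. Host, port, and the
# virtual key are assumptions taken from the example config above.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer my-team-api-key-1"},
    json={
        # "claude*" matches the anthropic.* rule, so this model name is forwarded as-is
        "model": "claude-sonnet-4-5-20250929",
        "messages": [{"role": "user", "content": "Hello from behind the proxy!"}],
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
Swap the model name for a gpt* or gemini* one and the same request lands on a different provider, with no other change on the client side.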
4. Virtual API Keys and Access Control
This is the feature that makes LM-Proxy production-ready rather than just a toy.
LM-Proxy maintains two layers of API keys:
- Virtual (Client) API Keys — what your users/services use to authenticate with the proxy
- Provider (Upstream) API Keys — the real API keys for OpenAI, Anthropic, etc., which stay hidden
# Premium users get everything
[groups.premium]
api_keys = ["premium-key-1", "premium-key-2"]
allowed_connections = "*"
# Free tier gets OpenAI only
[groups.free]
api_keys = ["free-key-1"]
allowed_connections = "openai"
# Internal tools get local models only
[groups.internal]
api_keys = ["internal-key-1"]
allowed_connections = "local"
Your upstream API keys are never exposed to clients. You can rotate them without updating any client configuration. You can create granular access tiers — premium users get Claude Opus, free users get GPT-4o-mini, internal tools use local models. All managed in one config file.
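Here is a client-side sketch of those tiers in action. The proxy address is assumed, the keys come from the group config above, the routing rules from the earlier config are assumed to still apply, and the exact status code and error body of a rejected request are up to LM-Proxy, so they are not shown.
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumed local proxy
payload = {
    "model": "claude-sonnet-4-5-20250929",  # "claude*" routes to the anthropic connection
    "messages": [{"role": "user", "content": "ping"}],
}

# Premium key: allowed_connections = "*", so the Anthropic-bound request goes through.
allowed = requests.post(ENDPOINT, json=payload,
                        headers={"Authorization": "Bearer premium-key-1"}, timeout=60)

# Free key: only the "openai" connection is allowed, so the proxy should refuse this
# request (the exact status code and error format depend on LM-Proxy).
denied = requests.post(ENDPOINT, json=payload,
                       headers={"Authorization": "Bearer free-key-1"}, timeout=60)

print(allowed.status_code, denied.status_code)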
It even supports external authentication — you can validate virtual API keys against Keycloak, Auth0, or any OIDC provider:
[api_key_check]
class = "lm_proxy.api_key_check.CheckAPIKeyWithRequest"
method = "POST"
url = "http://keycloak:8080/realms/master/protocol/openid-connect/userinfo"
response_as_user_info = true
use_cache = true
cache_ttl = 60
[api_key_check.headers]
Authorization = "Bearer {api_key}"
Your existing OAuth tokens become LLM API keys automatically. And if the built-in validators don't fit, you can write a custom one — just a Python function that takes an API key string and returns a group name.
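For illustration, a custom validator might look like the sketch below. The function shape (API key in, group name out) follows the description above; the key-to-group mapping is made up, and how you point LM-Proxy at the function, for example from a Python-format config, is an assumption rather than documented API.
import hmac

# Hypothetical mapping of virtual keys to group names; in practice this could be
# a database lookup or a call to your own auth service.
KNOWN_KEYS = {
    "premium-key-1": "premium",
    "free-key-1": "free",
}

def check_api_key(api_key: str) -> str | None:
    """Return the group name for a valid key, or None to reject the request."""
    for known_key, group in KNOWN_KEYS.items():
        # Constant-time comparison avoids leaking key prefixes via timing.
        if hmac.compare_digest(api_key, known_key):
            return group
    return None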
5. Full Streaming Support
SSE streaming works out of the box. Your clients get real-time token-by-token responses regardless of which provider is actually generating them. The proxy handles the format translation transparently — Anthropic's streaming format becomes OpenAI-compatible SSE events before they reach your client.
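From the client side, streaming is just the standard OpenAI SDK with stream=True pointed at the proxy; the base URL and virtual key below are assumptions carried over from the example config.
from openai import OpenAI

client = OpenAI(
    api_key="my-team-api-key-1",          # virtual key from the example config
    base_url="http://localhost:8000/v1",  # assumed local proxy address
)

stream = client.chat.completions.create(
    model="claude-sonnet-4-5-20250929",   # served by Anthropic, streamed back as OpenAI-style SSE
    messages=[{"role": "user", "content": "Write a haiku about proxies."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)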
6. Use It as a Library, Not Just a Server
LM-Proxy isn't just a standalone service — it's also an importable Python package. Since it's built on FastAPI, you can mount it inside your existing application, use it in integration tests, or compose it with other ASGI middleware. No need to run a separate process if you don't want to.
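As a rough sketch of the embedded use case: mounting an ASGI sub-app is plain FastAPI, but the import path and factory name for LM-Proxy below are hypothetical, so check the project docs for the real entry point.
from fastapi import FastAPI

# Hypothetical entry point; treat this as a placeholder for however LM-Proxy
# actually exposes its FastAPI app.
try:
    from lm_proxy import create_app  # hypothetical factory name
    proxy_app = create_app("config.toml")
except ImportError:
    proxy_app = FastAPI()  # stand-in so the sketch stays runnable without the real API

app = FastAPI(title="my-saas-backend")

# Standard FastAPI sub-app mounting: the proxy is now served under /llm,
# so clients would call http://your-host/llm/v1/chat/completions.
app.mount("/llm", proxy_app)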
Real-World Setup: 5 Minutes to Production
Let me walk through a concrete scenario. You're building a SaaS product that uses LLMs. You want GPT-4o for complex reasoning, Claude Sonnet for long-document processing, and Gemini Flash for cheap classification.
Step 1: Install
pip install lm-proxy
Step 2: Create .env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AI...
Step 3: Create config.toml
host = "0.0.0.0"
port = 8000
[connections.openai]
api_type = "open_ai"
api_base = "https://api.openai.com/v1/"
api_key = "env:OPENAI_API_KEY"
[connections.anthropic]
api_type = "anthropic"
api_key = "env:ANTHROPIC_API_KEY"
[connections.google]
api_type = "google_ai_studio"
api_key = "env:GOOGLE_API_KEY"
[routing]
"gpt*" = "openai.*"
"claude*" = "anthropic.*"
"gemini*" = "google.*"
"*" = "google.gemini-2.0-flash"
[groups.backend]
api_keys = ["backend-service-key"]
allowed_connections = "*"
[groups.frontend]
api_keys = ["frontend-widget-key"]
allowed_connections = "google" # cheap models only
Step 4: Run
lm-proxy
Step 5: Use from your app
from openai import OpenAI

client = OpenAI(
    api_key="backend-service-key",
    base_url="http://localhost:8000/v1"
)

# Complex reasoning → GPT-4o
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Analyze this contract..."}]
)

# Long document → Claude Sonnet
response = client.chat.completions.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Summarize this 100-page report..."}]
)

# Quick classification → Gemini Flash (cheap)
response = client.chat.completions.create(
    model="gemini-2.0-flash",
    messages=[{"role": "user", "content": "Is this email spam? ..."}]
)
One client. One base URL. Three providers. And tomorrow, when someone releases a model that's 50% cheaper, you change one line in config.toml and restart the server. Your application code stays the same.
How It Compares to LiteLLM
The obvious question: "How is this different from LiteLLM?"
Let's be honest — LiteLLM is the 800-pound gorilla in this space. With 33k+ GitHub stars, 2000+ supported LLMs, an AWS Marketplace listing, and features like spend tracking dashboards, guardrails, caching, rate limiting, SSO, and MCP integration, it's a full-blown enterprise platform. It also offers both a Python SDK and an HTTP proxy server, so architecturally it covers similar ground.
So why would you choose LM-Proxy instead?
The same reason you'd choose Flask over Django, or SQLite over PostgreSQL. Not everything needs the full enterprise stack.
| | LM-Proxy | LiteLLM Proxy |
|---|---|---|
| Philosophy | Minimal core, extend as needed | Batteries included |
| Setup complexity | `pip install lm-proxy` + one TOML file | `pip install 'litellm[proxy]'` + YAML + optional DB |
| Dependencies | FastAPI + MicroCore (lightweight) | Heavier dependency tree |
| Config format | TOML / YAML / JSON / Python | YAML |
| Embeddable as library | First-class use case | Supported but discouraged by docs |
| Virtual keys + groups | Built-in | Built-in (more advanced) |
| OIDC / Keycloak auth | Built-in | Built-in (more advanced: JWT validation, role mapping, SSO UI) |
| Spend tracking, guardrails, caching | Not built-in (extensible via add-ons) | Built-in |
| Admin UI | No | Yes |
| Supported providers | All major providers + local models / embedded inference | All major providers |
Choose LM-Proxy when: you want a lightweight, easy-to-embed proxy with a tiny footprint, Python-config extensibility, and you don't need 90% of the enterprise features. It's ideal for small teams, personal projects, or when you want a gateway you can fully understand by reading the source in an afternoon.
Choose LiteLLM when: you need enterprise-grade spend tracking, a management UI, dozens of integrations, guardrails, caching, and support for 100+ providers and integrations out of the box.
Other alternatives like Portkey lean even more enterprise. LM-Proxy intentionally occupies the "just enough gateway" niche — powerful enough for production, simple enough that your config file is the documentation.
What's Already There (and What I'd Love to See Next)
Credit where it's due — LM-Proxy already covers more ground than you might expect from a "lightweight" tool:
- Structured logging — a pluggable logger system with `JsonLogWriter` and `LogEntryTransformer` (tracks tokens, duration, group, connection, remote address), plus the `lm-proxy-db-connector` add-on for writing logs to PostgreSQL, MySQL, SQLite, and other databases via SQLAlchemy
- Load balancing — there's an example configuration that distributes requests randomly across multiple LLM servers using the Python config format
- Request handlers — a middleware-like system for intercepting requests before they reach upstream providers, enabling cross-cutting concerns like auditing and header manipulation
- Vertex AI support — Google Cloud's Vertex AI is supported alongside the simpler AI Studio API, with a dedicated config example
That said, the project is still evolving (latest release: v3.0.0), and a few things remain on my personal wishlist:
- Usage analytics dashboard — the logging and DB infrastructure is solid, but a built-in UI for visualizing spend and usage would be the cherry on top
- Wildcard model expansion — the `expand_wildcards` mode for `/v1/models` is planned but not yet implemented, so for now you need to list models explicitly in the routing config
- Automatic provider failover — if OpenAI returns 5xx, automatically reroute to Anthropic. Load balancing across instances of the same provider is there, but cross-provider failover would complete the picture
The extensibility-by-design philosophy means most of these can be added as add-ons without touching the core — and the DB connector already demonstrates this pattern nicely. The codebase is MIT licensed and small enough to read end-to-end in an afternoon.
TL;DR
If you're working with multiple LLM providers (or think you might in the future), stop writing provider-specific integration code. Set up LM-Proxy as a gateway, point all your services at it, and never think about API format differences again.
pip install lm-proxy
Have you tried LM-Proxy or a similar LLM gateway? What's your approach to multi-provider LLM integration? Share your experience in the comments.