Let's say you're building an app that uses AI. You start with OpenAI. Then someone shows you Claude's coding abilities. Then DeepSeek releases a model that's 10x cheaper. Then Qwen drops something even better for your use case.
Suddenly you're managing 4 different SDKs, 4 billing dashboards, and 4 different API key rotation schedules. Sound familiar?
Here's how to build a dead-simple model router that lets you call any AI model through a single endpoint — in about 50 lines of code.
The Problem
Most AI-powered apps look like this after a few months:
if task == "coding":
    response = anthropic.messages.create(model="claude-sonnet-4-20250514", ...)
elif task == "cheap_summary":
    response = openai.chat.completions.create(model="gpt-4o-mini", ...)
elif task == "complex_reasoning":
    response = deepseek.chat.completions.create(model="deepseek-v4", ...)
else:
    response = openai.chat.completions.create(model="gpt-4o", ...)
This works until:
- A model goes down (no fallback)
- You want to A/B test models (need to rewrite routing)
- A new model launches that's better and cheaper (more if/else spaghetti)
The Solution: A Model Router
The key insight: most AI providers now expose an OpenAI-compatible chat completions endpoint. Even Anthropic (through its compatibility layer). Even DeepSeek. Even Qwen.
So why write provider-specific code at all?
import os
import httpx

# Each model is reached through an OpenAI-compatible /chat/completions endpoint.
MODELS = {
    "gpt-4o": {"base_url": "https://api.openai.com/v1", "key_env": "OPENAI_API_KEY"},
    "claude-sonnet-4-20250514": {"base_url": "https://api.anthropic.com/v1", "key_env": "ANTHROPIC_API_KEY"},
    "deepseek-v4": {"base_url": "https://api.deepseek.com/v1", "key_env": "DEEPSEEK_API_KEY"},
    "qwen-3.7-max": {"base_url": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1", "key_env": "QWEN_API_KEY"},
}

def chat_completion(model, messages, fallback_models=None, **kwargs):
    """Try `model`, then each fallback in order; return the first successful response."""
    models_to_try = [model] + (fallback_models or [])
    errors = []
    for m in models_to_try:
        config = MODELS.get(m)
        if config is None:
            errors.append(f"{m}: not configured")
            continue
        api_key = os.getenv(config["key_env"])
        if not api_key:
            errors.append(f"{m}: {config['key_env']} is not set")
            continue
        try:
            response = httpx.post(
                f"{config['base_url']}/chat/completions",
                json={"model": m, "messages": messages, **kwargs},
                headers={"Authorization": f"Bearer {api_key}"},
                timeout=30.0,
            )
            response.raise_for_status()
            return response.json()
        except (httpx.HTTPError, ValueError) as e:
            # Record why this model failed, then move on to the next one.
            errors.append(f"{m}: {e}")
    raise RuntimeError(f"All models failed: {'; '.join(errors)}")
Well under 50 lines. Now you can call any model, with fallbacks:
result = chat_completion(
    model="claude-sonnet-4-20250514",
    fallback_models=["gpt-4o", "deepseek-v4"],
    messages=[{"role": "user", "content": "Explain quicksort"}],
)
print(result["choices"][0]["message"]["content"])
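The A/B testing pain point from earlier also gets easier once every model goes through one function. Here's a minimal sketch, assuming a hypothetical `pick_model` helper and an illustrative 90/10 split (neither is part of the router above). It hashes a stable user ID into a bucket so each user keeps seeing the same model:

```python
import hashlib

# Hypothetical experiment config: share of traffic per model (should sum to 1.0).
AB_SPLIT = {"gpt-4o": 0.9, "deepseek-v4": 0.1}

def pick_model(user_id: str, split: dict[str, float] = AB_SPLIT) -> str:
    """Deterministically assign a user to a model according to the split."""
    # Map the user ID to a number in [0, 1) so the assignment is stable across calls.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for model, share in split.items():
        cumulative += share
        if bucket < cumulative:
            return model
    # Guard against floating-point rounding leaving a tiny gap at the top.
    return list(split)[-1]
```

You would then pass `pick_model(user_id)` as the `model` argument to `chat_completion`. Changing the experiment means editing one dict, not rewriting routing code.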
Add Cost Tracking
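import time

# USD per 1M tokens. Check each provider's pricing page before relying on these.
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "deepseek-v4": {"input": 0.20, "output": 0.80},
    "qwen-3.7-max": {"input": 0.10, "output": 0.40},
}

def chat_completion_with_cost(model, messages, **kwargs):
    result = chat_completion(model, messages, **kwargs)
    # With fallbacks, the model that answered may differ from the one requested,
    # so prefer the response's "model" field when we have a price for it.
    used = result.get("model")
    if used not in PRICING:
        used = model
    price = PRICING.get(used)
    usage = result.get("usage") or {}
    if price:
        cost = (usage.get("prompt_tokens", 0) * price["input"] +
                usage.get("completion_tokens", 0) * price["output"]) / 1_000_000
        with open("api_costs.log", "a") as f:
            f.write(f"{time.time()},{used},{cost:.6f}\n")
    return result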
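To make the formula concrete: a gpt-4o call with 1,000 prompt tokens and 500 completion tokens costs (1,000 × 2.50 + 500 × 10.00) / 1,000,000 = $0.0075 at the prices listed above. A log is only useful if you read it, so here's a rough sketch of a per-model total. It assumes the `timestamp,model,cost` row format written above; `summarize_costs` is a hypothetical helper name:

```python
import csv
from collections import defaultdict

def summarize_costs(lines):
    """Sum logged cost per model from 'timestamp,model,cost' rows."""
    totals = defaultdict(float)
    for row in csv.reader(lines):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        _timestamp, model, cost = row
        try:
            value = float(cost)
        except ValueError:
            continue  # skip rows whose cost is not a number
        totals[model] += value
    return dict(totals)
```

It accepts any iterable of lines, so you can pass an open file handle for `api_costs.log` directly.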
Going Further
- Rate limiting: Don't let one user burn through your quota
- Response streaming: SSE for real-time output
- Caching: Skip API for identical prompts
- Model benchmarking: Track latency and quality per model
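Caching is probably the quickest of these to add. Below is a minimal in-memory sketch, assuming hypothetical helpers `cache_key` and `cached_completion` (not part of the router above). It only makes sense when you're fine with identical prompts returning identical answers, and the dict grows without bound, so treat it as a starting point:

```python
import hashlib
import json

_cache = {}  # in-memory only; lost on restart and never evicted

def cache_key(model, messages, **kwargs):
    """Build a stable key from everything that affects the response."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": kwargs},
        sort_keys=True,  # same params in a different order give the same key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_completion(call, model, messages, **kwargs):
    """Return the cached response if present; otherwise invoke `call` and store it.

    `call` is any function with the router's signature, e.g. chat_completion.
    """
    key = cache_key(model, messages, **kwargs)
    if key not in _cache:
        _cache[key] = call(model, messages, **kwargs)
    return _cache[key]
```

Call it as `cached_completion(chat_completion, "gpt-4o", messages)`. Passing the backend in as `call` keeps the cache independent of the router and easy to test with a fake.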
For a managed solution with Stripe billing, team management, and a dashboard — check out FastAnchor. It's open-source (18k+ GitHub stars), so you're never locked in.
But if you're just starting? The 50-line router above works great. Ship first, optimize later.
Key Takeaways
- The OpenAI-compatible chat completions API is the de facto common protocol across providers
- Fallback gives you resilience with zero extra infra
- Log costs from day one
- Don't over-engineer
What does your multi-model stack look like? Drop a comment!