Let's say you're building an app that uses AI. You start with OpenAI. Then someone shows you Claude's coding abilities. Then DeepSeek releases a model that's 10x cheaper. Then Qwen drops something even better for your use case.
Suddenly you're managing 4 different SDKs, 4 billing dashboards, and 4 different API key rotation schedules. Sound familiar?
Here's how to build a dead-simple model router that lets you call any AI model through a single endpoint — in about 50 lines of code.
The Problem
Most AI-powered apps look like this after a few months:
if task == "coding":
response = anthropic.messages.create(model="claude-sonnet-4-20250514", ...)
elif task == "cheap_summary":
response = openai.chat.completions.create(model="gpt-4o-mini", ...)
elif task == "complex_reasoning":
response = deepseek.chat.completions.create(model="deepseek-v4", ...)
else:
response = openai.chat.completions.create(model="gpt-4o", ...)
This works until:
- A model goes down (no fallback)
- You want to A/B test models (need to rewrite routing)
- A new model launches that's better and cheaper (more if/else spaghetti)
The Solution: A Model Router
The key insight: most AI providers now support OpenAI-compatible APIs. Even Anthropic. Even DeepSeek. Even Qwen.
So why write provider-specific code at all?
import os
import httpx
from typing import Optional
MODELS = {
"gpt-4o": {"base_url": "https://api.openai.com/v1", "key_env": "OPENAI_API_KEY"},
"claude-sonnet-4-20250514": {"base_url": "https://api.anthropic.com/v1", "key_env": "ANTHROPIC_API_KEY"},
"deepseek-v4": {"base_url": "https://api.deepseek.com/v1", "key_env": "DEEPSEEK_API_KEY"},
"qwen-3.7-max": {"base_url": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1", "key_env": "QWEN_API_KEY"},
}
def chat_completion(model, messages, fallback_models=None, **kwargs):
models_to_try = [model] + (fallback_models or [])
for m in models_to_try:
if m not in MODELS: continue
config = MODELS[m]
api_key = os.getenv(config["key_env"])
if not api_key: continue
try:
response = httpx.post(
f"{config['base_url']}/chat/completions",
json={"model": m, "messages": messages, **kwargs},
headers={"Authorization": f"Bearer {api_key}"},
timeout=30.0
)
response.raise_for_status()
return response.json()
except Exception as e:
continue
raise Exception(f"All models failed: {models_to_try}")
~50 lines. Now you can call any model:
result = chat_completion(
model="claude-sonnet-4-20250514",
fallback_models=["gpt-4o", "deepseek-v4"],
messages=[{"role": "user", "content": "Explain quicksort"}]
)
print(result["choices"][0]["message"]["content"])
Add Cost Tracking
import time
PRICING = {
"gpt-4o": {"input": 2.50, "output": 10.00},
"claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
"deepseek-v4": {"input": 0.20, "output": 0.80},
"qwen-3.7-max": {"input": 0.10, "output": 0.40},
}
def chat_completion_with_cost(model, messages, **kwargs):
result = chat_completion(model, messages, **kwargs)
usage = result.get("usage", {})
cost = (usage.get("prompt_tokens",0) * PRICING[model]["input"] +
usage.get("completion_tokens",0) * PRICING[model]["output"]) / 1_000_000
with open("api_costs.log", "a") as f:
f.write(f"{time.time()},{model},{cost:.6f}\n")
return result
Going Further
- Rate limiting: Don't let one user burn through your quota
- Response streaming: SSE for real-time output
- Caching: Skip API for identical prompts
- Model benchmarking: Track latency and quality per model
For a managed solution with Stripe billing, team management, and a dashboard — check out FastAnchor. It's open-source (18k+ GitHub stars), so you're never locked in.
But if you're just starting? The 50-line router above works great. Ship first, optimize later.
Key Takeaways
- OpenAI-compatible is the universal protocol now
- Fallback gives you resilience with zero extra infra
- Log costs from day one
- Don't over-engineer
What's your multi-model stack look like? Drop a comment!
Top comments (1)
This minimal 50-line implementation is fantastic for getting a multi-model routing proof-of-concept up and running fast, but it only covers the happy-path routing logic — none of the observability guardrails we’ve been discussing show up here.
A tiny router snippet like this lacks critical production safeguards: no per-model cost tagging, no baseline recalibration triggers for config changes, no counters to track redundant calls blocked by routing fallbacks, and zero enforcement for blast-radius alert tiers. At small scale it works fine, but once multiple teams start requesting custom routing exceptions for their preferred models, the simple rule set quickly unravels into unwritten shadow policy.
Also, with dozens of distinct models under one router, keeping meta-evaluators version-locked to each model becomes nearly impossible in this stripped-down setup. Evaluator drift will skew all your unified quality signals without paired release controls baked in.
Have you extended this minimal base with any lightweight observability hooks to track per-model spend and avoid silent cost drift from routing weight adjustments?