Imagine being able to switch from GPT-4o to DeepSeek-V3 to Claude 3.5 Sonnet by changing a single string in your code. No new SDK. No new API key. No new integration.
This isn't a hypothetical — it's what a unified AI gateway gives you. And in this guide, I'll show you exactly how to build a production-ready model routing system.
What Is Model Routing?
Model routing is the practice of directing different types of requests to different LLM providers based on:
- Cost: Route simple queries to cheaper models
- Quality: Route complex queries to more capable models
- Latency: Route time-sensitive queries to faster models
- Availability: Fall back to alternative models during outages
- Compliance: Route data to models in specific regions
Without a unified gateway, implementing this requires maintaining separate integrations for each provider. With a gateway, it's a configuration change.
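One way to keep these criteria manageable is to describe each model as data and filter on it. The sketch below is a minimal illustration, not any gateway's API: `ModelProfile`, `CATALOG`, and `pick_model` are hypothetical names, and the quality tiers, latencies, and regions are placeholder values. Only the model IDs and input prices come from this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelProfile:
    model_id: str
    input_cost_per_m: float   # USD per million input tokens (from this article)
    quality_tier: int         # hypothetical: 1 = basic, 3 = most capable
    typical_latency_s: float  # hypothetical placeholder
    regions: frozenset        # hypothetical placeholder


CATALOG = [
    ModelProfile("deepseek-ai/DeepSeek-V3", 0.27, 1, 1.0, frozenset({"us", "asia"})),
    ModelProfile("deepseek-ai/DeepSeek-R1", 0.55, 2, 4.0, frozenset({"us", "asia"})),
    ModelProfile("openai/gpt-4o", 2.50, 3, 2.0, frozenset({"us", "eu"})),
]


def pick_model(
    min_quality: int = 1,
    max_latency_s: float = float("inf"),
    region: Optional[str] = None,
    unavailable=frozenset(),
) -> Optional[str]:
    """Return the cheapest model ID that satisfies every constraint, or None."""
    candidates = [
        m for m in CATALOG
        if m.quality_tier >= min_quality                 # quality
        and m.typical_latency_s <= max_latency_s         # latency
        and (region is None or region in m.regions)      # compliance
        and m.model_id not in unavailable                # availability
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m.input_cost_per_m).model_id  # cost
```

Calling `pick_model(min_quality=2, max_latency_s=3.0)` with these placeholder values skips the cheaper but slower reasoning model and returns the GPT-4o ID.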
Architecture Overview
Here's the high-level architecture of a model routing system:
              ┌─────────────────┐
              │    Your App     │
              │  (OpenAI SDK)   │
              └────────┬────────┘
                       │
                       ▼
              ┌─────────────────┐
              │   AI Gateway    │
              │  (Single API)   │
              └────────┬────────┘
                       │
    ┌────────────┬─────┴──────┬───────────┐
    ▼            ▼            ▼           ▼
┌────────┐ ┌───────────┐ ┌──────────┐ ┌────────┐
│ OpenAI │ │ Anthropic │ │ DeepSeek │ │ Qwen/  │
│ GPT-4o │ │  Claude   │ │  V3/R1   │ │ Llama  │
└────────┘ └───────────┘ └──────────┘ └────────┘
Your app talks to one endpoint. The gateway handles the rest.
Setting Up: The Foundation
Step 1: Choose Your Gateway
For this guide, I'm using AI Token Hub because:
- 200+ models including all major providers
- OpenAI-compatible API — works with existing SDKs
- Transparent pricing — pay-as-you-go, no monthly fees
- Interactive playground at aitoken.surge.sh/playground.html for testing
Get your API key at aitoken.surge.sh/register.html.
Step 2: Configure Your Client
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_AI_TOKEN_HUB_KEY",
    base_url="https://aitoken.surge.sh/v1",
)
# That's it. You now have access to 200+ models.
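Hardcoding the key is fine for a first test, but in a deployed service you will probably want to read it from the environment. Here is a minimal sketch; the helper `gateway_settings` and the variable names `AI_TOKEN_HUB_KEY` and `AI_GATEWAY_BASE_URL` are my own assumptions, not names the gateway defines.

```python
import os


def gateway_settings(env=os.environ) -> dict:
    """Build keyword arguments for OpenAI(...) from environment variables."""
    api_key = env.get("AI_TOKEN_HUB_KEY")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("Set AI_TOKEN_HUB_KEY before creating the client")
    return {
        "api_key": api_key,
        # Fall back to the base URL used in this guide if no override is set
        "base_url": env.get("AI_GATEWAY_BASE_URL", "https://aitoken.surge.sh/v1"),
    }


# Usage (requires the openai package):
# client = OpenAI(**gateway_settings())
```

Failing fast on a missing key makes a misconfigured deployment obvious at startup rather than on the first request.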
Step 3: Verify Available Models
# List available models
models = client.models.list()
for model in models.data:
    print(f"- {model.id}")
# Output includes:
# - openai/gpt-4o
# - anthropic/claude-3-5-sonnet
# - deepseek-ai/DeepSeek-V3
# - deepseek-ai/DeepSeek-R1
# - Qwen/Qwen3-32B
# - meta-llama/Llama-3.3-70B-Instruct
# - google/gemini-2.0-flash
# ... and 200+ more
Building the Router
Pattern 1: Simple Rule-Based Routing
The simplest approach — route based on query type:
ROUTING_RULES = {
    "faq": {
        "model": "deepseek-ai/DeepSeek-V3",
        "max_tokens": 256,
        "temperature": 0.3,
    },
    "code": {
        "model": "deepseek-ai/DeepSeek-R1",
        "max_tokens": 2048,
        "temperature": 0.2,
    },
    "creative": {
        "model": "openai/gpt-4o",
        "max_tokens": 1024,
        "temperature": 0.8,
    },
    "analysis": {
        "model": "anthropic/claude-3-5-sonnet",
        "max_tokens": 4096,
        "temperature": 0.4,
    },
}

def route_query(query_type: str, prompt: str) -> str:
    # Unknown query types fall back to the cheap "faq" configuration
    config = ROUTING_RULES.get(query_type, ROUTING_RULES["faq"])
    response = client.chat.completions.create(
        model=config["model"],
        messages=[{"role": "user", "content": prompt}],
        max_tokens=config["max_tokens"],
        temperature=config["temperature"],
    )
    return response.choices[0].message.content
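`route_query` assumes the caller already knows the query type. If you don't have that label, a keyword classifier is a reasonable first pass. This is a minimal sketch: `classify_query` and the keyword lists are hypothetical and should be tuned against your own traffic.

```python
import re

# Hypothetical keyword patterns, checked in order; the first match wins.
QUERY_PATTERNS = [
    ("code", re.compile(r"\b(code|function|bug|error|python|javascript|sql)\b")),
    ("analysis", re.compile(r"\b(analy[sz]e|compare|evaluate|summari[sz]e|report)\b")),
    ("creative", re.compile(r"\b(story|poem|slogan|brainstorm|tagline)\b")),
]


def classify_query(prompt: str) -> str:
    """Map a prompt to one of the routing keys, defaulting to "faq"."""
    text = prompt.lower()
    for query_type, pattern in QUERY_PATTERNS:
        if pattern.search(text):
            return query_type
    return "faq"
```

The two pieces then compose as `route_query(classify_query(prompt), prompt)`. Keyword matching misroutes ambiguous prompts, so log the chosen type and review it periodically.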
Pattern 2: Complexity-Based Routing
Route based on estimated query complexity:
import re

def estimate_complexity(text: str) -> float:
    """Estimate query complexity (0.0 = simple, 1.0 = complex)."""
    # Simple heuristics: length, number of questions, technical vocabulary
    word_count = len(text.split())
    question_count = text.count('?')
    technical_terms = len(re.findall(
        r'\b(algorithm|optimize|architecture|implement|debug|refactor)\b',
        text.lower(),
    ))
    # Weighted sum, capped at 1.0
    complexity = min(1.0, (
        (word_count / 100) * 0.3 +
        (technical_terms / 5) * 0.4 +
        (question_count / 3) * 0.3
    ))
    return complexity
def get_model_by_complexity(complexity: float) -> str:
    if complexity < 0.3:
        return "deepseek-ai/DeepSeek-V3"  # $0.27/M input
    elif complexity < 0.6:
        return "Qwen/Qwen3-32B"  # $0.50/M input
    elif complexity < 0.8:
        return "deepseek-ai/DeepSeek-R1"  # $0.55/M input
    else:
        return "openai/gpt-4o"  # $2.50/M input

def smart_route(prompt: str) -> str:
    complexity = estimate_complexity(prompt)
    model = get_model_by_complexity(complexity)
    print(f"Complexity: {complexity:.2f} → Model: {model}")
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )
    return response.choices[0].message.content
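As the number of tiers grows, the threshold ladder can be kept as data instead of `if`/`elif` branches. The sketch below reproduces the same cutoffs with the standard-library `bisect` module; `THRESHOLDS`, `TIER_MODELS`, and `model_for_complexity` are names I made up for illustration.

```python
from bisect import bisect_right

# Upper bounds (exclusive) of each tier; the last tier has no upper bound.
THRESHOLDS = [0.3, 0.6, 0.8]
TIER_MODELS = [
    "deepseek-ai/DeepSeek-V3",   # complexity < 0.3
    "Qwen/Qwen3-32B",            # 0.3 <= complexity < 0.6
    "deepseek-ai/DeepSeek-R1",   # 0.6 <= complexity < 0.8
    "openai/gpt-4o",             # complexity >= 0.8
]


def model_for_complexity(complexity: float) -> str:
    """Look up the tier whose range contains the complexity score."""
    return TIER_MODELS[bisect_right(THRESHOLDS, complexity)]
```

For a sense of scale: a 20-word prompt with two technical terms and one question mark scores 0.2 × 0.3 + 0.4 × 0.4 + (1/3) × 0.3 = 0.32 under the weights above, which lands in the second tier.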
Pattern 3: Multi-Model Ensemble
For critical tasks, send the same prompt to several models and compare the answers. This function collects every response; choosing the best one is left to the caller:
from typing import Optional

def ensemble_query(prompt: str, models: Optional[list[str]] = None) -> dict[str, str]:
    """Query multiple models and return all responses, keyed by model ID."""
    if models is None:
        models = [
            "deepseek-ai/DeepSeek-V3",
            "deepseek-ai/DeepSeek-R1",
            "openai/gpt-4o",
        ]
    responses = {}
    for model in models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=512,
                temperature=0.7,
            )
            responses[model] = response.choices[0].message.content
        except Exception as e:
            # Record the failure instead of aborting the whole ensemble
            responses[model] = f"Error: {e}"
    return responses

# Usage
results = ensemble_query("Explain the CAP theorem in one paragraph.")
for model, response in results.items():
    print(f"\n=== {model} ===")
    print(response[:200])
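How to pick among the ensemble's answers depends on the task. One cheap heuristic is to prefer the answer that agrees most with the others. The sketch below scores agreement with word-level Jaccard overlap; `pick_consensus` is a hypothetical helper, and word overlap is a crude stand-in for semantic agreement (an LLM judge or embedding similarity would be the heavier alternatives).

```python
def _tokens(text: str) -> set:
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings (0.0 to 1.0)."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pick_consensus(responses: dict) -> tuple:
    """Return (model, response) for the answer that overlaps most with the rest.

    Entries that are empty or start with "Error:" (the convention used by
    ensemble_query above) are ignored.
    """
    valid = {m: r for m, r in responses.items() if r and not r.startswith("Error:")}
    if not valid:
        raise ValueError("no successful responses to choose from")

    def agreement(model: str) -> float:
        return sum(jaccard(valid[model], r) for m, r in valid.items() if m != model)

    best = max(valid, key=agreement)
    return best, valid[best]
```

Usage would look like `model, answer = pick_consensus(ensemble_query(prompt))`. Ties resolve to whichever model appears first in the dictionary.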
Pattern 4: Automatic Fallback
Handle provider outages gracefully:
import time

FALLBACK_CHAIN = [
    "deepseek-ai/DeepSeek-V3",  # Primary (cheap, fast)
    "Qwen/Qwen3-32B",           # Secondary
    "openai/gpt-4o",            # Tertiary (expensive, reliable)
]

def query_with_fallback(prompt: str, max_retries: int = 3) -> tuple[str, str]:
    """Try models in order until one succeeds.

    max_retries caps how many models from FALLBACK_CHAIN are tried.
    Returns (content, model_used).
    """
    candidates = FALLBACK_CHAIN[:max_retries]
    for attempt, model in enumerate(candidates, start=1):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=1024,
                timeout=30,
            )
            content = response.choices[0].message.content
            return content, model
        except Exception as e:
            print(f"Model {model} failed (attempt {attempt}): {e}")
            if attempt < len(candidates):
                time.sleep(1)  # Brief pause before trying the next model
    raise RuntimeError("All models failed")

# Usage
try:
    response, model_used = query_with_fallback("What is the meaning of life?")
    print(f"Got response from {model_used}: {response[:100]}...")
except RuntimeError as e:
    print(f"All models unavailable: {e}")
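The fallback chain moves to the next model after a single failure. For transient errors such as rate limits or brief network blips, it can be worth retrying the same model a few times first, with exponential backoff and jitter. Here is a minimal sketch; `call_with_backoff` and its parameters are hypothetical, and the broad `except Exception` should be narrowed to the transient error types your client raises.

```python
import random
import time


def call_with_backoff(fn, attempts=3, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn() until it succeeds or the attempts run out.

    Waits a random time in [0, min(max_delay, base_delay * 2**n)] after the
    n-th failure ("full jitter"), so concurrent clients do not retry in lockstep.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # assumption: every error is retryable
            last_error = exc
            if attempt < attempts - 1:
                delay = min(max_delay, base_delay * 2 ** attempt)
                sleep(random.uniform(0, delay))
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

To combine the two ideas, wrap each model's request in `call_with_backoff(lambda: ...)` inside the fallback loop, so a model is only abandoned after its retries are exhausted.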
Pattern 5: Cost-Optimized Batch Processing
For batch jobs, optimize for cost while meeting deadlines:
from concurrent.futures import ThreadPoolExecutor

def batch_process(prompts: list[str], budget_per_query: float = 0.001) -> list[str]:
    """Process a batch of prompts within budget constraints.

    Results are returned in the same order as the input prompts.
    """
    # Input cost in USD per 1K tokens for each model
    model_costs = {
        "deepseek-ai/DeepSeek-V3": 0.00027,  # $0.27/M tokens
        "Qwen/Qwen3-32B": 0.00050,           # $0.50/M tokens
        "deepseek-ai/DeepSeek-R1": 0.00055,  # $0.55/M tokens
        "openai/gpt-4o": 0.00250,            # $2.50/M tokens
    }
    # Choose the priciest model that fits the budget; if none fits,
    # fall back to the cheapest model
    affordable = [m for m, cost in model_costs.items() if cost <= budget_per_query]
    if affordable:
        model = max(affordable, key=model_costs.get)
    else:
        model = min(model_costs, key=model_costs.get)

    def process_single(prompt: str) -> str:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512,
        )
        return response.choices[0].message.content

    # Process in parallel; executor.map preserves the order of prompts
    with ThreadPoolExecutor(max_workers=10) as executor:
        return list(executor.map(process_single, prompts))
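The budget check above compares a per-query budget against a per-1K-token price, which only lines up when every prompt is about 1K tokens long. A token-aware estimate is a small step up. The sketch below is illustrative only: the helper names are hypothetical, the four-characters-per-token rule is a rough heuristic for English text, and only input prices (as quoted in this article) are counted, so output tokens are not included.

```python
from typing import Optional

INPUT_PRICE_PER_M = {  # USD per million input tokens, as quoted in this article
    "deepseek-ai/DeepSeek-V3": 0.27,
    "Qwen/Qwen3-32B": 0.50,
    "deepseek-ai/DeepSeek-R1": 0.55,
    "openai/gpt-4o": 2.50,
}


def estimate_tokens(text: str) -> int:
    """Very rough heuristic: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def estimated_input_cost(model: str, prompt: str) -> float:
    """Estimated USD cost of the prompt's input tokens on the given model."""
    return estimate_tokens(prompt) * INPUT_PRICE_PER_M[model] / 1_000_000


def priciest_model_within(prompt: str, budget_usd: float) -> Optional[str]:
    """Return the most expensive model whose estimated input cost fits the budget."""
    affordable = [
        m for m in INPUT_PRICE_PER_M
        if estimated_input_cost(m, prompt) <= budget_usd
    ]
    return max(affordable, key=INPUT_PRICE_PER_M.get) if affordable else None
```

For accurate numbers, replace `estimate_tokens` with the tokenizer for the target model and add the output-token price once you know the expected response length.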
Production Considerations
1. Rate Limiting
import threading
import time

class RateLimiter:
    """Sliding-window limiter: at most max_requests per time_window seconds."""

    def __init__(self, max_requests: int, time_window: float):
        self.max_requests = max_requests
        self.time_window = time_window
        self.requests = []
        self.lock = threading.Lock()

    def _prune(self, now: float) -> None:
        # Drop timestamps that have left the window
        self.requests = [t for t in self.requests if now - t < self.time_window]

    def wait_if_needed(self):
        with self.lock:
            now = time.time()
            self._prune(now)
            if len(self.requests) >= self.max_requests:
                # Sleep until the oldest request leaves the window
                sleep_time = self.time_window - (now - self.requests[0])
                if sleep_time > 0:
                    time.sleep(sleep_time)
                now = time.time()
                self._prune(now)
            # Record when the request is actually sent, not when we started waiting
            self.requests.append(now)

# Usage: 100 requests per minute
limiter = RateLimiter(100, 60)

def rate_limited_query(prompt: str) -> str:
    limiter.wait_if_needed()
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
2. Caching
import hashlib
import time

def cache_key(model: str, prompt: str, max_tokens: int) -> str:
    data = f"{model}:{prompt}:{max_tokens}"
    return hashlib.sha256(data.encode()).hexdigest()

# Simple in-memory cache: key -> (timestamp, response)
_cache = {}

def cached_query(model: str, prompt: str, max_tokens: int = 1024, ttl: int = 3600) -> str:
    key = cache_key(model, prompt, max_tokens)
    if key in _cache:
        cached_time, cached_response = _cache[key]
        # Serve from cache only while the entry is younger than ttl seconds
        if time.time() - cached_time < ttl:
            return cached_response
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    result = response.choices[0].message.content
    _cache[key] = (time.time(), result)
    return result
3. Logging and Cost Tracking
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_router")

def tracked_query(model: str, prompt: str, **kwargs) -> dict:
    start_time = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )
    latency_ms = (time.perf_counter() - start_time) * 1000
    usage = response.usage
    result = {
        "model": model,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
        "latency_ms": latency_ms,
        "content": response.choices[0].message.content,
    }
    logger.info("Query to %s: %d tokens, %.0fms", model, usage.total_tokens, latency_ms)
    return result
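`tracked_query` records token counts but stops short of turning them into dollars. A small accumulator can close that gap. This is a sketch under assumptions: `CostTracker` is a hypothetical class, and the price table you pass in (input and output USD per million tokens) must come from your gateway's published rates, since this article only lists input prices.

```python
from collections import defaultdict


class CostTracker:
    """Accumulate estimated spend per model from token usage."""

    def __init__(self, price_per_m: dict):
        # model ID -> (input USD per million tokens, output USD per million tokens)
        self.price_per_m = price_per_m
        self.totals = defaultdict(float)

    def record(self, model: str, prompt_tokens: int, completion_tokens: int) -> float:
        """Add one request's cost to the running total and return that cost."""
        in_price, out_price = self.price_per_m[model]
        cost = (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000
        self.totals[model] += cost
        return cost

    def total(self) -> float:
        """Total estimated spend across all models."""
        return sum(self.totals.values())
```

After each call you could run `tracker.record(result["model"], result["prompt_tokens"], result["completion_tokens"])` on the dictionary returned by `tracked_query`.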
Model Selection Guide
Here's a quick reference for choosing models:
| Task Type | Recommended Model | Cost (Input/M tokens) | Why |
|---|---|---|---|
| Simple Q&A | DeepSeek-V3 | $0.27 | Fast, cheap, accurate enough |
| Code generation | DeepSeek-R1 | $0.55 | Strong reasoning, good at code |
| Creative writing | GPT-4o | $2.50 | Best creativity and nuance |
| Long documents | Claude 3.5 Sonnet | $3.00 | 200K context window |
| Multilingual | Qwen3-32B | $0.50 | Excellent CJK support |
| Open-source | Llama 3.3 70B | $0.50 | Self-hostable option |
| Structured output | DeepSeek-V3 | $0.27 | Good at JSON/formatting |
| Complex reasoning | DeepSeek-R1 | $0.55 | Chain-of-thought specialist |
The Playground Advantage
Before committing to a model, test it in the AI Token Hub Playground. You can:
- Compare responses from multiple models side-by-side
- Test different prompts and parameters
- See real-time cost estimates
- Do all of this without an API key
This saved me hours of trial-and-error integration. Test first, integrate second.
Cost Calculator
Use the AI Token Hub Cost Calculator to estimate your savings before switching. Input your current usage and it shows:
- Current spend with your existing provider
- Projected spend with intelligent routing
- Breakdown by query type and model
- Monthly and annual projections
Conclusion
Model routing isn't about finding the "best" model — it's about finding the right model for each task. A unified gateway makes this trivial:
- One API key instead of N
- One integration instead of N
- One dashboard for all your AI spend
- Instant model switching — just change a string
Start with simple rule-based routing. Add complexity-based routing as you gather data. Implement fallbacks for reliability. Add caching for cost savings.
The result? Lower costs, better reliability, and the flexibility to adopt new models as they launch.
What routing patterns have you found most effective? Share your experiences in the comments! And if you're building a model routing system, check out AI Token Hub — the playground is perfect for testing your routing logic before going to production.
Happy routing! 🎯