How to Build a Multi-Model AI Router in 50 Lines of Code

#tutorial

Let's say you're building an app that uses AI. You start with OpenAI. Then someone shows you Claude's coding abilities. Then DeepSeek releases a model that's 10x cheaper. Then Qwen drops something even better for your use case.

Suddenly you're managing 4 different SDKs, 4 billing dashboards, and 4 different API key rotation schedules. Sound familiar?

Here's how to build a dead-simple model router that lets you call any AI model through a single function, in about 50 lines of code.

The Problem

Most AI-powered apps look like this after a few months:

if task == "coding":
    response = anthropic.messages.create(model="claude-sonnet-4-20250514", ...)
elif task == "cheap_summary":
    response = openai.chat.completions.create(model="gpt-4o-mini", ...)
elif task == "complex_reasoning":
    response = deepseek.chat.completions.create(model="deepseek-v4", ...)
else:
    response = openai.chat.completions.create(model="gpt-4o", ...)

This works until:

A model goes down (no fallback)
You want to A/B test models (need to rewrite routing)
A new model launches that's better and cheaper (more if/else spaghetti)

The Solution: A Model Router

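The key insight: most AI providers now expose OpenAI-compatible chat completions endpoints, including Anthropic, DeepSeek, and Qwen. Compatibility is not always complete, so check each provider's docs for parameters it ignores or rejects.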

So why write provider-specific code at all?

import os
import httpx

MODELS = {
    "gpt-4o": {"base_url": "https://api.openai.com/v1", "key_env": "OPENAI_API_KEY"},
    "claude-sonnet-4-20250514": {"base_url": "https://api.anthropic.com/v1", "key_env": "ANTHROPIC_API_KEY"},
    "deepseek-v4": {"base_url": "https://api.deepseek.com/v1", "key_env": "DEEPSEEK_API_KEY"},
    "qwen-3.7-max": {"base_url": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1", "key_env": "QWEN_API_KEY"},
}

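def chat_completion(model, messages, fallback_models=None, **kwargs):
    """Try `model`, then each fallback in order; return the first successful response."""
    models_to_try = [model] + (fallback_models or [])
    errors = {}  # model name -> why it was skipped or failed
    for m in models_to_try:
        config = MODELS.get(m)
        if config is None:
            errors[m] = "unknown model"
            continue
        api_key = os.getenv(config["key_env"])
        if not api_key:
            errors[m] = f"{config['key_env']} is not set"
            continue
        try:
            response = httpx.post(
                f"{config['base_url']}/chat/completions",
                json={"model": m, "messages": messages, **kwargs},
                headers={"Authorization": f"Bearer {api_key}"},
                timeout=30.0,
            )
            response.raise_for_status()
            return response.json()
        except (httpx.HTTPError, ValueError) as e:
            # Network error, non-2xx status, or invalid JSON: record it and try the next model.
            errors[m] = repr(e)
    raise RuntimeError(f"All models failed: {errors}")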

That's well under 50 lines. Now you can call any model:

result = chat_completion(
    model="claude-sonnet-4-20250514",
    fallback_models=["gpt-4o", "deepseek-v4"],
    messages=[{"role": "user", "content": "Explain quicksort"}]
)
print(result["choices"][0]["message"]["content"])
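
With the router in place, the if/else chain from The Problem can shrink to a lookup table. Here is a minimal sketch of that idea. The `ROUTES` table, its `"default"` key, and the `pick_models` helper are hypothetical names introduced for illustration, and the model choices are placeholders drawn from the `MODELS` table, not recommendations.

```python
# Hypothetical task-to-model table: the first entry is the primary model,
# the remaining entries are fallbacks tried in order.
ROUTES = {
    "coding": ["claude-sonnet-4-20250514", "gpt-4o"],
    "cheap_summary": ["qwen-3.7-max", "deepseek-v4"],
    "complex_reasoning": ["deepseek-v4", "gpt-4o"],
    "default": ["gpt-4o"],
}

def pick_models(task):
    """Return (primary_model, fallback_models) for a task name."""
    primary, *fallbacks = ROUTES.get(task, ROUTES["default"])
    return primary, fallbacks

primary, fallbacks = pick_models("coding")
# Then: chat_completion(primary, messages, fallback_models=fallbacks)
```

Adding a new model or trying a different one for a task becomes a one-line change to the table instead of another branch.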

Add Cost Tracking

import time

# USD per 1M tokens. Prices change often, so check each provider's pricing page.
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "deepseek-v4": {"input": 0.20, "output": 0.80},
    "qwen-3.7-max": {"input": 0.10, "output": 0.40},
}

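def chat_completion_with_cost(model, messages, **kwargs):
    result = chat_completion(model, messages, **kwargs)
    usage = result.get("usage", {})
    # A fallback may have answered, so prefer the model named in the response.
    # Providers can return a versioned name that is not a PRICING key; in that
    # case the requested model's price is used as an approximation.
    served = result.get("model", model)
    price = PRICING.get(served) or PRICING.get(model)
    if price is not None:
        cost = (usage.get("prompt_tokens", 0) * price["input"]
                + usage.get("completion_tokens", 0) * price["output"]) / 1_000_000
        with open("api_costs.log", "a") as f:
            f.write(f"{time.time()},{served},{cost:.6f}\n")
    return result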
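
To sanity-check the formula, here is a small worked example, followed by a sketch of how the log could be totaled per model. The `estimate_cost` and `total_by_model` helpers are hypothetical names, and the second one assumes the `timestamp,model,cost` line format written above.

```python
from collections import defaultdict

def estimate_cost(prompt_tokens, completion_tokens, input_price, output_price):
    """Prices are USD per 1M tokens; returns the cost in USD."""
    return (prompt_tokens * input_price + completion_tokens * output_price) / 1_000_000

# 1,200 prompt tokens and 300 completion tokens at the gpt-4o prices above:
# (1200 * 2.50 + 300 * 10.00) / 1_000_000 = 6000 / 1_000_000
cost = estimate_cost(1200, 300, input_price=2.50, output_price=10.00)
print(f"{cost:.6f}")  # 0.006000

def total_by_model(lines):
    """Sum the cost column of `timestamp,model,cost` log lines per model."""
    totals = defaultdict(float)
    for line in lines:
        _timestamp, model, line_cost = line.strip().split(",")
        totals[model] += float(line_cost)
    return dict(totals)

# Usage, assuming the log file exists:
# with open("api_costs.log") as f:
#     print(total_by_model(f))
```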

Going Further

Rate limiting: Don't let one user burn through your quota
Response streaming: SSE for real-time output
Caching: Skip the API call for identical prompts
Model benchmarking: Track latency and quality per model
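
As one example from the list above, caching can start as an in-memory dictionary keyed by a hash of the request. This is only a sketch: `cache_key`, `cached_completion`, and the module-level `_cache` are hypothetical names, the cache never expires or shrinks, and it only makes sense when you are happy to get the same answer for the same prompt.

```python
import hashlib
import json

_cache = {}  # request hash -> response; in-memory only, lost on restart

def cache_key(model, messages, **kwargs):
    """Stable hash of everything that affects the response."""
    payload = json.dumps(
        {"model": model, "messages": messages, "kwargs": kwargs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_completion(call, model, messages, **kwargs):
    """Return the stored response if this exact request was seen before.

    `call` is any function with the router's signature, such as chat_completion.
    """
    key = cache_key(model, messages, **kwargs)
    if key not in _cache:
        _cache[key] = call(model, messages, **kwargs)
    return _cache[key]

# Usage, assuming the router defined earlier:
# result = cached_completion(chat_completion, "gpt-4o", messages, temperature=0)
```

Passing the completion function in as `call` keeps the cache independent of the router, so it is easy to test with a stub.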

For a managed solution with Stripe billing, team management, and a dashboard — check out FastAnchor. It's open-source (18k+ GitHub stars), so you're never locked in.

But if you're just starting? The 50-line router above works great. Ship first, optimize later.

Key Takeaways

The OpenAI-compatible chat completions API is the closest thing to a common protocol across providers
Fallback gives you resilience with zero extra infra
Log costs from day one
Don't over-engineer

What does your multi-model stack look like? Drop a comment!

DEV Community