How to Build a Multi-Model AI Router in 50 Lines of Code


Let's say you're building an app that uses AI. You start with OpenAI. Then someone shows you Claude's coding abilities. Then DeepSeek releases a model that's 10x cheaper. Then Qwen drops something even better for your use case.

Suddenly you're managing 4 different SDKs, 4 billing dashboards, and 4 different API key rotation schedules. Sound familiar?

Here's how to build a dead-simple model router that lets you call any AI model through a single endpoint — in about 50 lines of code.

The Problem

Most AI-powered apps look like this after a few months:

if task == "coding":
    response = anthropic.messages.create(model="claude-sonnet-4-20250514", ...)
elif task == "cheap_summary":
    response = openai.chat.completions.create(model="gpt-4o-mini", ...)
elif task == "complex_reasoning":
    response = deepseek.chat.completions.create(model="deepseek-v4", ...)
else:
    response = openai.chat.completions.create(model="gpt-4o", ...)

This works until:

A model goes down (no fallback)
You want to A/B test models (need to rewrite routing)
A new model launches that's better and cheaper (more if/else spaghetti)

The Solution: A Model Router

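The key insight: most AI providers now expose an OpenAI-compatible chat completions endpoint. Even Anthropic. Even DeepSeek. Even Qwen. (Anthropic's is a compatibility layer rather than its native Messages API, so check which features it covers before you depend on it.)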

So why write provider-specific code at all?

import os
import httpx

MODELS = {
    "gpt-4o": {"base_url": "https://api.openai.com/v1", "key_env": "OPENAI_API_KEY"},
    "claude-sonnet-4-20250514": {"base_url": "https://api.anthropic.com/v1", "key_env": "ANTHROPIC_API_KEY"},
    "deepseek-v4": {"base_url": "https://api.deepseek.com/v1", "key_env": "DEEPSEEK_API_KEY"},
    "qwen-3.7-max": {"base_url": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1", "key_env": "QWEN_API_KEY"},
}

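def chat_completion(model, messages, fallback_models=None, **kwargs):
    """Try `model`, then each fallback in order; return the first successful JSON response."""
    models_to_try = [model] + (fallback_models or [])
    errors = {}  # model name -> why it was skipped or failed
    for m in models_to_try:
        config = MODELS.get(m)
        if config is None:
            errors[m] = "unknown model"
            continue
        api_key = os.getenv(config["key_env"])
        if not api_key:
            errors[m] = f"{config['key_env']} is not set"
            continue
        try:
            response = httpx.post(
                f"{config['base_url']}/chat/completions",
                json={"model": m, "messages": messages, **kwargs},
                headers={"Authorization": f"Bearer {api_key}"},
                timeout=30.0,
            )
            response.raise_for_status()
            return response.json()
        except (httpx.HTTPError, ValueError) as e:
            # Network error, timeout, non-2xx status, or invalid JSON:
            # record the reason and move on to the next model.
            errors[m] = repr(e)
    raise RuntimeError(f"All models failed: {errors}")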

That comes in well under 50 lines. Now you can call any configured model:

result = chat_completion(
    model="claude-sonnet-4-20250514",
    fallback_models=["gpt-4o", "deepseek-v4"],
    messages=[{"role": "user", "content": "Explain quicksort"}]
)
print(result["choices"][0]["message"]["content"])
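
With the router in place, the if/else chain from earlier can shrink to a lookup table. Here's a minimal sketch of that idea. The names ROUTES, DEFAULT_ROUTE, route_for and complete_task are made up for illustration, and the model pairings are assumptions, not recommendations. The `complete` argument is whatever function does the actual call, for example the chat_completion function above.

```python
# Hypothetical routing table: task name -> (primary model, fallback models).
# Task names mirror the if/else example; the pairings are illustrative only.
ROUTES = {
    "coding": ("claude-sonnet-4-20250514", ["gpt-4o"]),
    "complex_reasoning": ("deepseek-v4", ["gpt-4o"]),
}
DEFAULT_ROUTE = ("gpt-4o", ["deepseek-v4"])


def route_for(task):
    """Return (model, fallback_models) for a task, or the default route."""
    return ROUTES.get(task, DEFAULT_ROUTE)


def complete_task(task, messages, complete, **kwargs):
    """Look up the route for `task` and delegate the call to `complete`."""
    model, fallbacks = route_for(task)
    return complete(model, messages, fallback_models=fallbacks, **kwargs)
```

Swapping a model for a task, or adding a new task, is then a one-line change to ROUTES instead of another branch.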

Add Cost Tracking

import time

PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "deepseek-v4": {"input": 0.20, "output": 0.80},
    "qwen-3.7-max": {"input": 0.10, "output": 0.40},
}

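def chat_completion_with_cost(model, messages, **kwargs):
    result = chat_completion(model, messages, **kwargs)
    usage = result.get("usage", {})
    # Caveat: if a fallback answered, `model` is still the requested name, so
    # this price may not match the model that actually served the call.
    price = PRICING.get(model, {"input": 0.0, "output": 0.0})
    # Prices are per 1M tokens, hence the division.
    cost = (usage.get("prompt_tokens", 0) * price["input"] +
            usage.get("completion_tokens", 0) * price["output"]) / 1_000_000
    with open("api_costs.log", "a") as f:
        # Row: unix timestamp, requested model, model reported by the API, cost
        f.write(f"{time.time()},{model},{result.get('model', model)},{cost:.6f}\n")
    return result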
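
A log is only useful if you read it. Here's a rough sketch of totalling spend per model from that file. It assumes each row is comma-separated, with the model name in the second field and the cost in the last field. The function name spend_by_model is hypothetical.

```python
import csv
from collections import defaultdict


def spend_by_model(lines):
    """Sum cost per model from rows where field 2 is the model and the last field is the cost."""
    totals = defaultdict(float)
    for row in csv.reader(lines):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        totals[row[1]] += float(row[-1])
    return dict(totals)


# Works on any iterable of lines, including an open file:
#     with open("api_costs.log") as f:
#         print(spend_by_model(f))
```

Because it takes any iterable of lines, you can try it on a few hand-written rows before pointing it at the real log.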

Going Further

Rate limiting: Don't let one user burn through your quota
Response streaming: SSE for real-time output
Caching: Skip the API call for identical requests
Model benchmarking: Track latency and quality per model
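
To give a feel for the caching item, here's a minimal in-memory sketch. Treat it as a starting point, not a finished design: the names cache_key and cached_completion are made up, the cache is an unbounded per-process dict with no expiry, and it assumes you only want a hit when the model, messages, and all parameters match exactly. As before, `complete` is whatever function makes the real call.

```python
import hashlib
import json

_cache = {}  # request hash -> stored response (in-memory, no eviction)


def cache_key(model, messages, **kwargs):
    """Stable hash of everything that affects the response."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": kwargs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_completion(complete, model, messages, **kwargs):
    """Return the stored response for an identical request, else call `complete` and store it."""
    key = cache_key(model, messages, **kwargs)
    if key not in _cache:
        _cache[key] = complete(model, messages, **kwargs)
    return _cache[key]
```

Caching like this fits best when you want repeatable answers, for example with temperature set to 0. With sampling enabled, every cache hit replays one particular sample.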

For a managed solution with Stripe billing, team management, and a dashboard — check out FastAnchor. It's open-source (18k+ GitHub stars), so you're never locked in.

But if you're just starting? The 50-line router above works great. Ship first, optimize later.

Key Takeaways

The OpenAI-compatible chat completions format is the closest thing to a common protocol right now
Fallback gives you resilience with zero extra infra
Log costs from day one
Don't over-engineer

What does your multi-model stack look like? Drop a comment!

Top comments (1)

FastAnchor_io • Jun 17

This minimal 50-line implementation is fantastic for getting a multi-model routing proof-of-concept up and running fast, but it only covers the happy-path routing logic — none of the observability guardrails we’ve been discussing show up here.
A tiny router snippet like this lacks critical production safeguards: no per-model cost tagging, no baseline recalibration triggers for config changes, no counters to track redundant calls blocked by routing fallbacks, and zero enforcement for blast-radius alert tiers. At small scale it works fine, but once multiple teams start requesting custom routing exceptions for their preferred models, the simple rule set quickly unravels into unwritten shadow policy.
Also, with dozens of distinct models under one router, keeping meta-evaluators version-locked to each model becomes nearly impossible in this stripped-down setup. Evaluator drift will skew all your unified quality signals without paired release controls baked in.
Have you extended this minimal base with any lightweight observability hooks to track per-model spend and avoid silent cost drift from routing weight adjustments?