DEV Community

LYX19951121

How to Build a Multi-Model AI Router in 50 Lines of Code

Let's say you're building an app that uses AI. You start with OpenAI. Then someone shows you Claude's coding abilities. Then DeepSeek releases a model that's 10x cheaper. Then Qwen drops something even better for your use case.

Suddenly you're managing 4 different SDKs, 4 billing dashboards, and 4 different API key rotation schedules. Sound familiar?

Here's how to build a dead-simple model router that lets you call any OpenAI-compatible model through a single function, in about 50 lines of code.


The Problem

Most AI-powered apps look like this after a few months:

if task == "coding":
    response = anthropic.messages.create(model="claude-sonnet-4-20250514", ...)
elif task == "cheap_summary":
    response = openai.chat.completions.create(model="gpt-4o-mini", ...)
elif task == "complex_reasoning":
    response = deepseek.chat.completions.create(model="deepseek-v4", ...)
else:
    response = openai.chat.completions.create(model="gpt-4o", ...)

This works until:

  • A model goes down (no fallback)
  • You want to A/B test models (need to rewrite routing)
  • A new model launches that's better and cheaper (more if/else spaghetti)

The Solution: A Model Router

The key insight: many major AI providers now offer OpenAI-compatible APIs. Even Anthropic (through an OpenAI SDK compatibility layer). Even DeepSeek. Even Qwen.
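Concretely, "OpenAI-compatible" means each provider accepts the same request: a POST to `{base_url}/chat/completions` with a bearer token and a JSON body holding `model` and `messages`. The sketch below uses a hypothetical `build_request` helper (not part of any SDK) and placeholder keys to show that only the base URL, key, and model name change between providers; the base URLs are the ones this post uses.

```python
import json


def build_request(base_url: str, api_key: str, model: str, messages: list, **params) -> dict:
    """Hypothetical helper: describe the request shape shared by OpenAI-compatible APIs."""
    return {
        "method": "POST",
        # Same path for every provider; only the base URL differs.
        "url": f"{base_url.rstrip('/')}/chat/completions",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        # Same JSON body for every provider; extra params (temperature, max_tokens) pass through.
        "body": json.dumps({"model": model, "messages": messages, **params}),
    }


messages = [{"role": "user", "content": "Explain quicksort"}]
openai_req = build_request("https://api.openai.com/v1", "sk-openai-placeholder", "gpt-4o", messages)
qwen_req = build_request(
    "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    "sk-qwen-placeholder",
    "qwen-3.7-max",
    messages,
)
print(openai_req["url"])
print(qwen_req["url"])
```

Everything except the URL, the key, and the `model` field is identical, which is what lets one function serve every provider.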

So why write provider-specific code at all?

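import os
from typing import Optional

import httpx

# Each model maps to its provider's OpenAI-compatible base URL and the env var holding its API key.
MODELS = {
    "gpt-4o": {"base_url": "https://api.openai.com/v1", "key_env": "OPENAI_API_KEY"},
    "claude-sonnet-4-20250514": {"base_url": "https://api.anthropic.com/v1", "key_env": "ANTHROPIC_API_KEY"},
    "deepseek-v4": {"base_url": "https://api.deepseek.com/v1", "key_env": "DEEPSEEK_API_KEY"},
    "qwen-3.7-max": {"base_url": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1", "key_env": "QWEN_API_KEY"},
}

def chat_completion(model: str, messages: list, fallback_models: Optional[list] = None, **kwargs) -> dict:
    """Try `model`, then each fallback in order; return the first successful response."""
    models_to_try = [model] + (fallback_models or [])
    errors = {}  # model -> why it was skipped or failed
    for m in models_to_try:
        config = MODELS.get(m)
        if config is None:
            errors[m] = "not configured in MODELS"
            continue
        api_key = os.getenv(config["key_env"])
        if not api_key:
            errors[m] = f"{config['key_env']} is not set"
            continue
        try:
            response = httpx.post(
                f"{config['base_url']}/chat/completions",
                json={"model": m, "messages": messages, **kwargs},
                headers={"Authorization": f"Bearer {api_key}"},
                timeout=30.0,
            )
            response.raise_for_status()  # treat 4xx/5xx responses as failures
            return response.json()
        except (httpx.HTTPError, ValueError) as e:
            # Network error, timeout, bad status, or invalid JSON: record it and try the next model.
            errors[m] = repr(e)
    raise RuntimeError(f"All models failed: {errors}")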

That's the whole router, about 30 lines. Now you can call any configured model and fall back automatically:

result = chat_completion(
    model="claude-sonnet-4-20250514",
    fallback_models=["gpt-4o", "deepseek-v4"],
    messages=[{"role": "user", "content": "Explain quicksort"}]
)
print(result["choices"][0]["message"]["content"])
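The router above handles fallback, but the if/else chain from the opening example can also become data. A minimal sketch, assuming a hypothetical `ROUTES` table that maps each task to weighted primary models plus a fallback list; the weights give a simple A/B split, and adding a new model becomes a one-line config change. The task names and weights are illustrative only.

```python
import random
from typing import Optional

# Hypothetical routing table: task -> (weighted primaries, fallbacks).
# The weights implement a simple A/B split between candidate primaries.
ROUTES = {
    "coding": ({"claude-sonnet-4-20250514": 0.8, "deepseek-v4": 0.2}, ["gpt-4o"]),
    "cheap_summary": ({"qwen-3.7-max": 0.5, "deepseek-v4": 0.5}, ["gpt-4o"]),
    "default": ({"gpt-4o": 1.0}, ["deepseek-v4"]),
}


def pick_models(task: str, rng: Optional[random.Random] = None) -> tuple:
    """Return (primary, fallbacks) for a task, choosing the primary by weight."""
    rng = rng or random.Random()
    weights, fallbacks = ROUTES.get(task, ROUTES["default"])
    primary = rng.choices(list(weights), weights=list(weights.values()), k=1)[0]
    # Primaries that lost the draw are tried first, then the configured fallbacks.
    others = [m for m in weights if m != primary]
    return primary, others + [m for m in fallbacks if m != primary and m not in others]


primary, fallbacks = pick_models("coding", random.Random(42))
print(primary, fallbacks)
```

You would then pass the pair straight through, e.g. `chat_completion(model=primary, fallback_models=fallbacks, messages=...)`, and log which model served each request so the two arms can be compared.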

Add Cost Tracking

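import time

# Prices in USD per 1M tokens. They change often; check each provider's pricing page.
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "deepseek-v4": {"input": 0.20, "output": 0.80},
    "qwen-3.7-max": {"input": 0.10, "output": 0.40},
}

def chat_completion_with_cost(model, messages, fallback_models=None, **kwargs):
    # Walk the fallback chain here (one model per call) so we price the model that actually answered.
    candidates = [model] + (fallback_models or [])
    for m in candidates:
        try:
            result = chat_completion(m, messages, **kwargs)
        except Exception:
            continue  # this model failed; try the next one
        usage = result.get("usage", {})
        price = PRICING.get(m, {"input": 0.0, "output": 0.0})  # unpriced models are logged as $0
        cost = (usage.get("prompt_tokens", 0) * price["input"]
                + usage.get("completion_tokens", 0) * price["output"]) / 1_000_000
        with open("api_costs.log", "a") as f:
            f.write(f"{time.time()},{m},{cost:.6f}\n")  # timestamp,model,cost
        return result
    raise RuntimeError(f"All models failed: {candidates}")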
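A log only pays off if you read it, so total it per model. As a worked example of the cost formula, 1,200 prompt tokens and 400 completion tokens on gpt-4o cost (1,200 × 2.50 + 400 × 10.00) / 1,000,000 = $0.007. A minimal sketch, assuming the `timestamp,model,cost` line format written above; `summarize_costs` is a hypothetical helper, and the demo runs on a few made-up sample lines in a temporary file.

```python
import csv
import os
import tempfile
from collections import defaultdict


def summarize_costs(path: str = "api_costs.log") -> dict:
    """Hypothetical helper: total call count and spend per model from the cost log."""
    totals = defaultdict(lambda: {"calls": 0, "cost": 0.0})
    with open(path, newline="") as f:
        # Each line is "timestamp,model,cost" with no header row.
        for _timestamp, model, cost in csv.reader(f):
            totals[model]["calls"] += 1
            totals[model]["cost"] += float(cost)
    return dict(totals)


# Demo on a throwaway log with made-up sample lines in the same format.
demo_path = os.path.join(tempfile.mkdtemp(), "api_costs.log")
with open(demo_path, "w") as f:
    f.write("1718000000.0,gpt-4o,0.007000\n")
    f.write("1718000001.0,deepseek-v4,0.000560\n")
    f.write("1718000002.0,gpt-4o,0.003000\n")

summary = summarize_costs(demo_path)
for model, stats in sorted(summary.items()):
    print(f"{model}: {stats['calls']} calls, ${stats['cost']:.6f}")
```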

Going Further

  • Rate limiting: Don't let one user burn through your quota
  • Response streaming: SSE for real-time output
  • Caching: Skip the API call for identical prompts
  • Model benchmarking: Track latency and quality per model
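Caching and rate limiting can both be thin wrappers around the router. A minimal in-process sketch, assuming a single Python process (multiple workers would need a shared store such as Redis): `cached_call` keys responses on a hash of the full request, and `RateLimiter` is a simple per-user sliding-window limiter. Both names and the limits in the demo are hypothetical, and the `call` parameter stands in for `chat_completion` so the demo runs offline.

```python
import hashlib
import json
import time
from collections import defaultdict, deque
from typing import Callable, Optional

_cache: dict = {}  # unbounded for simplicity; add a TTL or size cap in practice


def cached_call(call: Callable[..., dict], model: str, messages: list, **kwargs) -> dict:
    """Return a cached response for an identical (model, messages, kwargs) request."""
    # sort_keys makes the key independent of dict ordering.
    payload = json.dumps({"model": model, "messages": messages, **kwargs}, sort_keys=True)
    key = hashlib.sha256(payload.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call(model, messages, **kwargs)
    return _cache[key]


class RateLimiter:
    """Allow at most `limit` calls per user in any rolling `window` of seconds."""

    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.calls = defaultdict(deque)  # user_id -> timestamps of recent calls

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        recent = self.calls[user_id]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


# Offline demo: a fake backend that records how often it is really called.
backend_calls = []

def fake_chat_completion(model, messages, **kwargs):
    backend_calls.append(model)
    return {"choices": [{"message": {"content": f"answer from {model}"}}]}

msgs = [{"role": "user", "content": "Explain quicksort"}]
first = cached_call(fake_chat_completion, "gpt-4o", msgs)
second = cached_call(fake_chat_completion, "gpt-4o", msgs)
print(first is second, len(backend_calls))

limiter = RateLimiter(limit=2, window=60.0)
print([limiter.allow("alice", now=t) for t in (0.0, 1.0, 2.0, 61.5)])
```

Because `temperature` and other kwargs are part of the cache key, two requests only share a cached answer when they are truly identical; whether serving a cached answer is acceptable for non-deterministic sampling is a product decision.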

For a managed solution with Stripe billing, team management, and a dashboard, check out FastAnchor. It's open-source (18k+ GitHub stars), so you're never locked in.

But if you're just starting? The 50-line router above works great. Ship first, optimize later.


Key Takeaways

  1. OpenAI-compatible is the closest thing to a universal protocol now
  2. Fallback gives you resilience with zero extra infra
  3. Log costs from day one
  4. Don't over-engineer

What does your multi-model stack look like? Drop a comment!
