aitoken-hub

Posted on Jul 2

The Complete Guide to LLM Model Routing: From OpenAI to DeepSeek in One Line

#llm #api #ai

Imagine being able to switch from GPT-4o to DeepSeek-V3 to Claude 3.5 Sonnet by changing a single string in your code. No new SDK. No new API key. No new integration.

This isn't hypothetical; it's what a unified AI gateway gives you. In this guide, I'll walk through five routing patterns and the production concerns that go with them: rate limiting, caching, and logging.

What Is Model Routing?

Model routing is the practice of directing different types of requests to different LLM providers based on:

Cost: Route simple queries to cheaper models
Quality: Route complex queries to more capable models
Latency: Route time-sensitive queries to faster models
Availability: Fall back to alternative models during outages
Compliance: Route data to models in specific regions

Without a unified gateway, implementing this requires maintaining separate integrations for each provider. With a gateway, it's a configuration change.
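
To make "it's a configuration change" concrete, here is a minimal sketch of routing as plain configuration. The `ROUTES` table, `model_for`, and `build_request` are names I made up for illustration, and which model serves which goal is an assumption on my part, not something the gateway prescribes.

```python
# Hypothetical routing table: one entry per routing goal.
# The goal-to-model mapping is an assumption for illustration.
ROUTES = {
    "cost": "deepseek-ai/DeepSeek-V3",
    "quality": "openai/gpt-4o",
    "latency": "google/gemini-2.0-flash",
}
DEFAULT_GOAL = "cost"


def model_for(goal: str) -> str:
    """Return the model ID configured for a goal, or the default goal's model."""
    return ROUTES.get(goal, ROUTES[DEFAULT_GOAL])


def build_request(goal: str, prompt: str) -> dict:
    """Build chat-completion keyword arguments; only `model` varies by goal."""
    return {
        "model": model_for(goal),
        "messages": [{"role": "user", "content": prompt}],
    }
```

Changing where "quality" traffic goes means editing one string in `ROUTES`; the calling code stays the same.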

Architecture Overview

Here's the high-level architecture of a model routing system:

┌─────────────────┐
│  Your App       │
│  (OpenAI SDK)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  AI Gateway     │
│  (Single API)   │
└────────┬────────┘
         │
    ┌────┴──────┬───────────┬───────────┐
    ▼           ▼           ▼           ▼
┌────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ OpenAI │ │Anthropic│ │DeepSeek │ │ Qwen/   │
│ GPT-4o │ │ Claude  │ │ V3/R1   │ │ Llama   │
└────────┘ └─────────┘ └─────────┘ └─────────┘

Your app talks to one endpoint. The gateway handles the rest.
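
I don't know how the gateway dispatches requests internally, but the model IDs used in this guide follow a `provider/model` shape, so one plausible mental model is "split on the first slash". The helper below is a hypothetical sketch of that idea, not the gateway's actual code.

```python
def split_model_id(model_id: str) -> tuple[str, str]:
    """Split 'provider/model' into (provider, model).

    Assumes the namespaced ID format used in this guide.
    """
    provider, sep, name = model_id.partition("/")
    if not sep or not provider or not name:
        raise ValueError(f"expected 'provider/model', got {model_id!r}")
    return provider, name
```

It is also handy on the client side, for example to group logs or spend by provider.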

Setting Up: The Foundation

Step 1: Choose Your Gateway

For this guide, I'm using AI Token Hub because it offers:

200+ models including all major providers
OpenAI-compatible API — works with existing SDKs
Transparent pricing — pay-as-you-go, no monthly fees
Interactive playground at aitoken.surge.sh/playground.html for testing

Get your API key at aitoken.surge.sh/register.html.

Step 2: Configure Your Client

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_AI_TOKEN_HUB_KEY",
    base_url="https://aitoken.surge.sh/v1"
)

# That's it. You now have access to 200+ models.
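
In real projects I'd avoid hard-coding the key. Here is a small sketch that reads the settings from the environment instead. The variable names `AI_TOKEN_HUB_KEY` and `AI_TOKEN_HUB_BASE_URL` are my own invention, so pick whatever fits your deployment.

```python
import os

DEFAULT_BASE_URL = "https://aitoken.surge.sh/v1"  # base URL used in this guide


def gateway_settings(env=None) -> dict:
    """Return keyword arguments for OpenAI(...), read from environment variables.

    The variable names are hypothetical; nothing in the gateway requires them.
    """
    env = os.environ if env is None else env
    api_key = env.get("AI_TOKEN_HUB_KEY")
    if not api_key:
        raise RuntimeError("Set AI_TOKEN_HUB_KEY before creating the client")
    return {
        "api_key": api_key,
        "base_url": env.get("AI_TOKEN_HUB_BASE_URL", DEFAULT_BASE_URL),
    }


# Usage (requires the openai package):
# client = OpenAI(**gateway_settings())
```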

Step 3: Verify Available Models

# List available models
models = client.models.list()
for model in models.data:
    print(f"- {model.id}")

# Output includes:
# - openai/gpt-4o
# - anthropic/claude-3-5-sonnet
# - deepseek-ai/DeepSeek-V3
# - deepseek-ai/DeepSeek-R1
# - Qwen/Qwen3-32B
# - meta-llama/Llama-3.3-70B-Instruct
# - google/gemini-2.0-flash
# ... and many more (200+ in total)

Building the Router

Pattern 1: Simple Rule-Based Routing

The simplest approach — route based on query type:

ROUTING_RULES = {
    "faq": {
        "model": "deepseek-ai/DeepSeek-V3",
        "max_tokens": 256,
        "temperature": 0.3,
    },
    "code": {
        "model": "deepseek-ai/DeepSeek-R1",
        "max_tokens": 2048,
        "temperature": 0.2,
    },
    "creative": {
        "model": "openai/gpt-4o",
        "max_tokens": 1024,
        "temperature": 0.8,
    },
    "analysis": {
        "model": "anthropic/claude-3-5-sonnet",
        "max_tokens": 4096,
        "temperature": 0.4,
    },
}

def route_query(query_type: str, prompt: str) -> str:
    config = ROUTING_RULES.get(query_type, ROUTING_RULES["faq"])

    response = client.chat.completions.create(
        model=config["model"],
        messages=[{"role": "user", "content": prompt}],
        max_tokens=config["max_tokens"],
        temperature=config["temperature"],
    )

    return response.choices[0].message.content
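
`route_query` expects the caller to already know the query type. If you don't have that label, a keyword classifier is one crude way to get it. This is a sketch under my own assumptions: the keyword lists are hypothetical, and plain substring matching will misfire on some prompts.

```python
# Hypothetical keyword lists; tune them against your own traffic.
KEYWORDS = {
    "code": ("function", "bug", "error", "python", "compile", "stack trace"),
    "creative": ("story", "poem", "slogan", "brainstorm"),
    "analysis": ("compare", "analyze", "summarize", "report", "trade-off"),
}


def classify_query(prompt: str) -> str:
    """Return the query type with the most keyword hits, or 'faq' if none match."""
    text = prompt.lower()
    scores = {
        query_type: sum(1 for keyword in keywords if keyword in text)
        for query_type, keywords in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)  # ties go to the first type listed
    return best if scores[best] > 0 else "faq"
```

The returned label matches the keys of `ROUTING_RULES`, so it can be passed straight to `route_query(classify_query(prompt), prompt)`.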

Pattern 2: Complexity-Based Routing

Route based on estimated query complexity:

import re

def estimate_complexity(text: str) -> float:
    """Estimate query complexity (0.0 = simple, 1.0 = complex)."""

    # Simple heuristics: prompt length, number of questions, technical vocabulary
    word_count = len(text.split())
    question_count = text.count('?')
    technical_terms = len(re.findall(r'\b(algorithm|optimize|architecture|implement|debug|refactor)\b', text.lower()))

    # Normalize
    complexity = min(1.0, (
        (word_count / 100) * 0.3 +
        (technical_terms / 5) * 0.4 +
        (question_count / 3) * 0.3
    ))

    return complexity

def get_model_by_complexity(complexity: float) -> str:
    if complexity < 0.3:
        return "deepseek-ai/DeepSeek-V3"    # $0.27/M input
    elif complexity < 0.6:
        return "Qwen/Qwen3-32B"             # $0.50/M input
    elif complexity < 0.8:
        return "deepseek-ai/DeepSeek-R1"    # $0.55/M input
    else:
        return "openai/gpt-4o"              # $2.50/M input

def smart_route(prompt: str) -> str:
    complexity = estimate_complexity(prompt)
    model = get_model_by_complexity(complexity)

    print(f"Complexity: {complexity:.2f} → Model: {model}")

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )

    return response.choices[0].message.content
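
It helps to run the weights by hand before trusting the thresholds. The sketch below isolates the arithmetic from `estimate_complexity` into a function that takes the three counts directly; `complexity_score` is my name for it, not part of the original code.

```python
def complexity_score(word_count: int, technical_terms: int, question_count: int) -> float:
    """Weighted sum used by estimate_complexity, capped at 1.0."""
    raw = (
        (word_count / 100) * 0.3
        + (technical_terms / 5) * 0.4
        + (question_count / 3) * 0.3
    )
    return min(1.0, raw)


# "How do I optimize this algorithm?" has 6 words, 2 technical terms, 1 question mark:
# 0.018 + 0.160 + 0.100 = 0.278, just under the 0.3 cutoff for the cheapest tier.
short_score = complexity_score(word_count=6, technical_terms=2, question_count=1)

# A 120-word prompt with 4 technical terms and 2 question marks:
# 0.36 + 0.32 + 0.20 = 0.88, which lands in the top tier (>= 0.8).
long_score = complexity_score(word_count=120, technical_terms=4, question_count=2)
```

Note that the short prompt is routed to the cheapest model even though it asks about algorithm optimization. The heuristic rewards length more than difficulty, so treat the thresholds as a starting point and adjust them against real traffic.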

Pattern 3: Multi-Model Ensemble

For critical tasks, collect responses from several models so you can compare them and pick the best:

def ensemble_query(prompt: str, models: list[str] = None) -> dict:
    """Query multiple models and return all responses."""

    if models is None:
        models = [
            "deepseek-ai/DeepSeek-V3",
            "deepseek-ai/DeepSeek-R1",
            "openai/gpt-4o",
        ]

    responses = {}
    for model in models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=512,
                temperature=0.7,
            )
            responses[model] = response.choices[0].message.content
        except Exception as e:
            responses[model] = f"Error: {e}"

    return responses

# Usage
results = ensemble_query("Explain the CAP theorem in one paragraph.")
for model, response in results.items():
    print(f"\n=== {model} ===")
    print(response[:200])
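
`ensemble_query` returns every response but leaves the "pick the best" step to you. Here is one possible sketch of that step. It relies on the `"Error: ..."` prefix that `ensemble_query` writes for failed calls, and it defaults to scoring by length, which is a placeholder assumption; in practice you would pass your own `score` function (a rubric, a judge model, a validator).

```python
from typing import Callable, Optional


def pick_best(
    responses: dict[str, str],
    score: Callable[[str], float] = len,
) -> Optional[tuple[str, str]]:
    """Return (model, response) with the highest score, skipping failed calls.

    Returns None when every model failed.
    """
    # ensemble_query stores failures as strings starting with "Error:"
    succeeded = {m: r for m, r in responses.items() if not r.startswith("Error:")}
    if not succeeded:
        return None
    best_model = max(succeeded, key=lambda m: score(succeeded[m]))
    return best_model, succeeded[best_model]
```

One caveat: a legitimate answer that happens to begin with "Error:" would be filtered out too. Returning a structured result from `ensemble_query` instead of a string would remove that ambiguity.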

Pattern 4: Automatic Fallback

Handle provider outages gracefully:

import time

FALLBACK_CHAIN = [
    "deepseek-ai/DeepSeek-V3",     # Primary (cheap, fast)
    "Qwen/Qwen3-32B",              # Secondary
    "openai/gpt-4o",               # Tertiary (expensive, reliable)
]

def query_with_fallback(prompt: str, max_retries: int = 3) -> tuple[str, str]:
    """Try up to max_retries models from FALLBACK_CHAIN, in order, until one succeeds.

    Returns (content, model_used).
    """
    candidates = FALLBACK_CHAIN[:max_retries]

    for attempt, model in enumerate(candidates, start=1):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=1024,
                timeout=30,  # per-request timeout in seconds
            )
            return response.choices[0].message.content, model

        except Exception as e:
            print(f"Model {model} failed (attempt {attempt}): {e}")
            if attempt < len(candidates):
                time.sleep(1)  # Brief pause before trying the next model

    raise RuntimeError("All models failed")

# Usage
try:
    response, model_used = query_with_fallback("What is the meaning of life?")
    print(f"Got response from {model_used}: {response[:100]}...")
except RuntimeError as e:
    print(f"All models unavailable: {e}")
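
The version above moves to the next model after a single failure. If you would rather retry each model a couple of times first, here is a sketch with exponential backoff. The function and parameter names are my own, and `call` is any function that takes a model ID and returns text, which also makes the logic testable without a network.

```python
import time
from typing import Callable, Sequence


def call_with_fallback(
    call: Callable[[str], str],
    models: Sequence[str],
    attempts_per_model: int = 2,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Try each model up to attempts_per_model times; return (content, model)."""
    last_error = None
    for model in models:
        for attempt in range(attempts_per_model):
            try:
                return call(model), model
            except Exception as exc:
                last_error = exc
                # Back off 0.5s, 1s, 2s, ... between retries of the same model,
                # but move on to the next model immediately after the last attempt.
                if attempt < attempts_per_model - 1:
                    sleep(base_delay * 2 ** attempt)
    raise RuntimeError("All models failed") from last_error
```

With the client from Step 2, `call` could be a small lambda that wraps `client.chat.completions.create` and returns the message content.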

Pattern 5: Cost-Optimized Batch Processing

For batch jobs, pick a model that fits your per-query budget and process the prompts in parallel to keep wall-clock time down:

from concurrent.futures import ThreadPoolExecutor

# Input cost in USD per 1K tokens for each model ($0.27/M tokens = $0.00027/1K)
MODEL_COSTS = {
    "deepseek-ai/DeepSeek-V3": 0.00027,   # $0.27/M tokens
    "Qwen/Qwen3-32B": 0.00050,            # $0.50/M tokens
    "deepseek-ai/DeepSeek-R1": 0.00055,   # $0.55/M tokens
    "openai/gpt-4o": 0.00250,             # $2.50/M tokens
}

def batch_process(prompts: list[str], budget_per_query: float = 0.001) -> list[str]:
    """Process a batch of prompts with one model that fits the per-query budget.

    budget_per_query is compared against each model's cost per 1K input tokens.
    Results are returned in the same order as prompts.
    """

    # Pick the most expensive model whose cost fits the budget;
    # fall back to the cheapest model if none fits.
    affordable = [m for m, cost in MODEL_COSTS.items() if cost <= budget_per_query]
    if affordable:
        model = max(affordable, key=MODEL_COSTS.get)
    else:
        model = min(MODEL_COSTS, key=MODEL_COSTS.get)

    def process_single(prompt: str) -> str:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512,
        )
        return response.choices[0].message.content

    # Process in parallel; executor.map preserves input order
    # (as_completed would return results in completion order instead)
    with ThreadPoolExecutor(max_workers=10) as executor:
        return list(executor.map(process_single, prompts))
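
The budget check above compares against a per-1K-token price, which ignores how long each prompt actually is. A slightly more honest sketch estimates the dollar cost from a token count. The prices are the input prices quoted in this guide; output tokens are not included because the guide doesn't list output prices, and the function names are hypothetical.

```python
from typing import Optional

# Input prices in USD per 1M tokens, as quoted in this guide.
PRICE_PER_M_INPUT = {
    "deepseek-ai/DeepSeek-V3": 0.27,
    "Qwen/Qwen3-32B": 0.50,
    "deepseek-ai/DeepSeek-R1": 0.55,
    "openai/gpt-4o": 2.50,
}


def estimate_input_cost(model: str, input_tokens: int) -> float:
    """Estimated USD cost of the input tokens only."""
    return PRICE_PER_M_INPUT[model] * input_tokens / 1_000_000


def best_model_within_budget(input_tokens: int, budget_usd: float) -> Optional[str]:
    """Return the priciest model whose estimated input cost fits, or None."""
    affordable = [
        m for m in PRICE_PER_M_INPUT
        if estimate_input_cost(m, input_tokens) <= budget_usd
    ]
    if not affordable:
        return None
    return max(affordable, key=PRICE_PER_M_INPUT.get)
```

Using price as a stand-in for quality is itself an assumption; swap in your own ranking if you have evaluation data.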

Production Considerations

1. Rate Limiting

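import threading
import time

class RateLimiter:
    """Sliding-window limiter: at most max_requests per time_window seconds."""

    def __init__(self, max_requests: int, time_window: float):
        self.max_requests = max_requests
        self.time_window = time_window
        self.requests = []  # timestamps of recent requests, oldest first
        self.lock = threading.Lock()

    def wait_if_needed(self):
        with self.lock:
            now = time.time()
            # Remove requests that have left the window
            self.requests = [t for t in self.requests if now - t < self.time_window]

            if len(self.requests) >= self.max_requests:
                # Sleep until the oldest request leaves the window
                sleep_time = self.time_window - (now - self.requests[0])
                if sleep_time > 0:
                    time.sleep(sleep_time)
                now = time.time()
                self.requests = [t for t in self.requests if now - t < self.time_window]

            # Record when the request is actually sent, not when the wait began
            self.requests.append(now)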

# Usage: 100 requests per minute
limiter = RateLimiter(100, 60)

def rate_limited_query(prompt: str) -> str:
    limiter.wait_if_needed()
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

2. Caching

import hashlib
import time

def cache_key(model: str, prompt: str, max_tokens: int) -> str:
    data = f"{model}:{prompt}:{max_tokens}"
    return hashlib.sha256(data.encode()).hexdigest()

# Simple in-memory cache
_cache = {}

def cached_query(model: str, prompt: str, max_tokens: int = 1024, ttl: int = 3600) -> str:
    key = cache_key(model, prompt, max_tokens)

    if key in _cache:
        cached_time, cached_response = _cache[key]
        if time.time() - cached_time < ttl:
            return cached_response

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    result = response.choices[0].message.content

    _cache[key] = (time.time(), result)
    return result
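
One thing to watch with the dictionary cache above: expired entries are never removed, so it grows for as long as the process runs. Here is a sketch of a bounded alternative with TTL expiry and least-recently-used eviction. The `TTLCache` class is my own illustration, and the injectable `clock` exists only to make it easy to test.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Small in-memory cache with a size cap and per-entry time-to-live."""

    def __init__(self, max_entries: int = 1000, ttl: float = 3600.0, clock=time.monotonic):
        self._data = OrderedDict()  # key -> (stored_at, value), least recent first
        self._max_entries = max_entries
        self._ttl = ttl
        self._clock = clock

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self._clock() - stored_at >= self._ttl:
            del self._data[key]  # expired: drop it so it stops using memory
            return None
        self._data.move_to_end(key)  # mark as recently used
        return value

    def put(self, key, value) -> None:
        self._data[key] = (self._clock(), value)
        self._data.move_to_end(key)
        while len(self._data) > self._max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry
```

This is not thread-safe as written; wrap `get` and `put` in a lock if you call it from the thread pool in Pattern 5.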

3. Logging and Cost Tracking

import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_router")

def tracked_query(model: str, prompt: str, **kwargs) -> dict:
    start_time = time.time()

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )

    elapsed = time.time() - start_time
    usage = response.usage

    result = {
        "model": model,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
        "latency_ms": elapsed * 1000,
        "content": response.choices[0].message.content,
    }

    logger.info(f"Query to {model}: {usage.total_tokens} tokens, {elapsed*1000:.0f}ms")

    return result
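
Individual log lines are only useful once you aggregate them. As a rough sketch, the helper below rolls up a list of the dictionaries returned by `tracked_query` into per-model totals. `summarize` is a hypothetical name, and it reports token counts rather than dollars, so multiply by your own prices to get spend.

```python
from collections import defaultdict


def summarize(records: list[dict]) -> dict[str, dict]:
    """Aggregate tracked_query results into per-model request, token, and latency stats."""
    totals = defaultdict(lambda: {"requests": 0, "total_tokens": 0, "latency_ms": 0.0})
    for record in records:
        entry = totals[record["model"]]
        entry["requests"] += 1
        entry["total_tokens"] += record["total_tokens"]
        entry["latency_ms"] += record["latency_ms"]

    return {
        model: {
            "requests": entry["requests"],
            "total_tokens": entry["total_tokens"],
            "avg_latency_ms": entry["latency_ms"] / entry["requests"],
        }
        for model, entry in totals.items()
    }
```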

Model Selection Guide

Here's a quick reference for choosing models:

Task Type	Recommended Model	Input Cost (USD per 1M tokens)	Why
Simple Q&A	DeepSeek-V3	$0.27	Fast, cheap, accurate enough
Code generation	DeepSeek-R1	$0.55	Strong reasoning, good at code
Creative writing	GPT-4o	$2.50	Best creativity and nuance
Long documents	Claude 3.5 Sonnet	$3.00	200K context window
Multilingual	Qwen3-32B	$0.50	Excellent CJK support
Open-source	Llama 3.3 70B	$0.50	Self-hostable option
Structured output	DeepSeek-V3	$0.27	Good at JSON/formatting
Complex reasoning	DeepSeek-R1	$0.55	Chain-of-thought specialist

The Playground Advantage

Before committing to a model, test it in the AI Token Hub Playground. You can:

Compare responses from multiple models side-by-side
Test different prompts and parameters
See real-time cost estimates
Try models without an API key

This saved me hours of trial-and-error integration. Test first, integrate second.

Cost Calculator

Use the AI Token Hub Cost Calculator to estimate your savings before switching. Input your current usage and it shows:

Current spend with your existing provider
Projected spend with intelligent routing
Breakdown by query type and model
Monthly and annual projections
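
I can't show the calculator's internals, but the underlying arithmetic is easy to sketch. The example below uses the input prices quoted in this guide and a made-up traffic mix (70% DeepSeek-V3, 20% DeepSeek-R1, 10% GPT-4o); both the mix and the `projected_spend` helper are assumptions for illustration, and output tokens are left out.

```python
def projected_spend(
    monthly_input_tokens_m: float,
    mix: dict[str, float],
    price_per_m: dict[str, float],
) -> float:
    """USD per month for a volume (in millions of input tokens) split across models."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1.0")
    return sum(
        monthly_input_tokens_m * share * price_per_m[model]
        for model, share in mix.items()
    )


# Input prices in USD per 1M tokens, as quoted in this guide.
PRICES = {
    "deepseek-ai/DeepSeek-V3": 0.27,
    "deepseek-ai/DeepSeek-R1": 0.55,
    "openai/gpt-4o": 2.50,
}

# 100M input tokens per month, everything on GPT-4o: 100 * 2.50 = 250.0
baseline = projected_spend(100, {"openai/gpt-4o": 1.0}, PRICES)

# Same volume with the hypothetical mix: 18.9 + 11.0 + 25.0 = 54.9
routed = projected_spend(
    100,
    {"deepseek-ai/DeepSeek-V3": 0.7, "deepseek-ai/DeepSeek-R1": 0.2, "openai/gpt-4o": 0.1},
    PRICES,
)
```

Under these assumptions the routed bill is roughly a fifth of the baseline. Your real savings depend on how much of your traffic can actually be served by the cheaper models.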

Conclusion

Model routing isn't about finding the "best" model — it's about finding the right model for each task. A unified gateway makes this trivial:

One API key instead of N
One integration instead of N
One dashboard for all your AI spend
Instant model switching — just change a string

Start with simple rule-based routing. Add complexity-based routing as you gather data. Implement fallbacks for reliability. Add caching for cost savings.

The result? Lower costs, better reliability, and the flexibility to adopt new models as they launch.

What routing patterns have you found most effective? Share your experiences in the comments! And if you're building a model routing system, check out AI Token Hub — the playground is perfect for testing your routing logic before going to production.

Happy routing! 🎯

DEV Community