DEV Community

aitoken-hub

The Complete Guide to LLM Model Routing: From OpenAI to DeepSeek in One Line

Imagine being able to switch from GPT-4o to DeepSeek-V3 to Claude 3.5 Sonnet by changing a single string in your code. No new SDK. No new API key. No new integration.

This isn't a hypothetical — it's what a unified AI gateway gives you. And in this guide, I'll show you exactly how to build a production-ready model routing system.

What Is Model Routing?

Model routing is the practice of directing different types of requests to different LLM providers based on:

  • Cost: Route simple queries to cheaper models
  • Quality: Route complex queries to more capable models
  • Latency: Route time-sensitive queries to faster models
  • Availability: Fall back to alternative models during outages
  • Compliance: Route data to models in specific regions

Without a unified gateway, implementing this requires maintaining separate integrations for each provider. With a gateway, it's a configuration change.
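
To make "a configuration change" concrete, here is a minimal sketch that keeps logical names separate from model IDs, so swapping a model means editing one mapping. The alias names and the `LLM_MODEL_OVERRIDE` environment variable are hypothetical; the model IDs are the ones used later in this guide.

```python
import os

# Hypothetical aliases: application code refers to these, never to raw model IDs.
MODEL_ALIASES = {
    "default": "deepseek-ai/DeepSeek-V3",
    "reasoning": "deepseek-ai/DeepSeek-R1",
    "premium": "openai/gpt-4o",
}

def resolve_model(alias: str, env=None) -> str:
    """Map a logical alias to a model ID, honoring an optional override."""
    env = os.environ if env is None else env
    # Assumed convention: LLM_MODEL_OVERRIDE forces a single model everywhere.
    override = env.get("LLM_MODEL_OVERRIDE")
    if override:
        return override
    # Unknown aliases fall back to the default model.
    return MODEL_ALIASES.get(alias, MODEL_ALIASES["default"])
```

Passing `model=resolve_model("premium")` to the client would then let you change the model behind "premium" without touching any call site.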

Architecture Overview

Here's the high-level architecture of a model routing system:

┌─────────────────┐
│  Your App       │
│  (OpenAI SDK)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  AI Gateway     │
│  (Single API)   │
└────────┬────────┘
         │
    ┌────┴────┬────────────┬────────────┐
    ▼         ▼            ▼            ▼
┌───────┐ ┌─────────┐ ┌─────────┐ ┌──────────┐
│OpenAI │ │Anthropic│ │DeepSeek │ │ Qwen/    │
│GPT-4o │ │Claude   │ │V3/R1    │ │ Llama    │
└───────┘ └─────────┘ └─────────┘ └──────────┘

Your app talks to one endpoint. The gateway handles the rest.

Setting Up: The Foundation

Step 1: Choose Your Gateway

For this guide, I'm using AI Token Hub because:

  • 200+ models including all major providers
  • OpenAI-compatible API — works with existing SDKs
  • Transparent pricing — pay-as-you-go, no monthly fees
  • Interactive playground at aitoken.surge.sh/playground.html for testing

Get your API key at aitoken.surge.sh/register.html.

Step 2: Configure Your Client

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_AI_TOKEN_HUB_KEY",
    base_url="https://aitoken.surge.sh/v1"
)

# That's it. You now have access to 200+ models.
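
In anything beyond a quick test you probably don't want the key hardcoded. Here is a minimal sketch that reads the settings from the environment instead. The variable names `AI_TOKEN_HUB_KEY` and `AI_TOKEN_HUB_BASE_URL` are my own invention, not something the gateway defines.

```python
import os

def gateway_settings(env=None) -> dict:
    """Build keyword arguments for OpenAI(...) from environment variables."""
    env = os.environ if env is None else env
    api_key = env.get("AI_TOKEN_HUB_KEY")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("Set AI_TOKEN_HUB_KEY before creating the client")
    return {
        "api_key": api_key,
        # Base URL from this guide; the override variable is an assumption.
        "base_url": env.get("AI_TOKEN_HUB_BASE_URL", "https://aitoken.surge.sh/v1"),
    }

# Usage (requires the openai package):
#   from openai import OpenAI
#   client = OpenAI(**gateway_settings())
```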

Step 3: Verify Available Models

# List available models
models = client.models.list()
for model in models.data:
    print(f"- {model.id}")

# Output includes:
# - openai/gpt-4o
# - anthropic/claude-3-5-sonnet
# - deepseek-ai/DeepSeek-V3
# - deepseek-ai/DeepSeek-R1
# - Qwen/Qwen3-32B
# - meta-llama/Llama-3.3-70B-Instruct
# - google/gemini-2.0-flash
# ... and many more

Building the Router

Pattern 1: Simple Rule-Based Routing

The simplest approach — route based on query type:

ROUTING_RULES = {
    "faq": {
        "model": "deepseek-ai/DeepSeek-V3",
        "max_tokens": 256,
        "temperature": 0.3,
    },
    "code": {
        "model": "deepseek-ai/DeepSeek-R1",
        "max_tokens": 2048,
        "temperature": 0.2,
    },
    "creative": {
        "model": "openai/gpt-4o",
        "max_tokens": 1024,
        "temperature": 0.8,
    },
    "analysis": {
        "model": "anthropic/claude-3-5-sonnet",
        "max_tokens": 4096,
        "temperature": 0.4,
    },
}

def route_query(query_type: str, prompt: str) -> str:
    config = ROUTING_RULES.get(query_type, ROUTING_RULES["faq"])

    response = client.chat.completions.create(
        model=config["model"],
        messages=[{"role": "user", "content": prompt}],
        max_tokens=config["max_tokens"],
        temperature=config["temperature"],
    )

    return response.choices[0].message.content
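
`route_query` expects the caller to already know the query type. If you don't have that label, one cheap way to get it is keyword matching. This is a rough sketch, not a real classifier: the keyword lists are made up for illustration, and only the four type names come from `ROUTING_RULES`.

```python
import re

# Hypothetical keyword lists; tune them against your own traffic.
KEYWORDS = {
    "code": ("function", "bug", "error", "python", "stack trace", "compile"),
    "creative": ("story", "poem", "slogan", "brainstorm"),
    "analysis": ("compare", "analyze", "summarize", "report", "trade-off"),
}

def classify_query(prompt: str) -> str:
    """Return the query type with the most keyword hits, or "faq" if none match."""
    text = prompt.lower()
    best_type, best_hits = "faq", 0
    for query_type, words in KEYWORDS.items():
        # Count how many keywords appear as whole words or phrases.
        hits = sum(
            1 for w in words
            if re.search(r"\b" + re.escape(w) + r"\b", text)
        )
        if hits > best_hits:
            best_type, best_hits = query_type, hits
    return best_type
```

You could then call `route_query(classify_query(prompt), prompt)`. Anything that matches no keyword falls through to the cheap "faq" route.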

Pattern 2: Complexity-Based Routing

Route based on estimated query complexity:

import re

def estimate_complexity(text: str) -> float:
    """Estimate query complexity (0.0 = simple, 1.0 = complex)."""

    # Simple heuristics: length, question marks, and technical vocabulary
    word_count = len(text.split())
    question_count = text.count('?')
    technical_terms = len(re.findall(r'\b(algorithm|optimize|architecture|implement|debug|refactor)\b', text.lower()))

    # Normalize
    complexity = min(1.0, (
        (word_count / 100) * 0.3 +
        (technical_terms / 5) * 0.4 +
        (question_count / 3) * 0.3
    ))

    return complexity

def get_model_by_complexity(complexity: float) -> str:
    if complexity < 0.3:
        return "deepseek-ai/DeepSeek-V3"    # $0.27/M input
    elif complexity < 0.6:
        return "Qwen/Qwen3-32B"             # $0.50/M input
    elif complexity < 0.8:
        return "deepseek-ai/DeepSeek-R1"    # $0.55/M input
    else:
        return "openai/gpt-4o"              # $2.50/M input

def smart_route(prompt: str) -> str:
    complexity = estimate_complexity(prompt)
    model = get_model_by_complexity(complexity)

    print(f"Complexity: {complexity:.2f} → Model: {model}")

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )

    return response.choices[0].message.content
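
It helps to work one prompt through the formula by hand before trusting the thresholds. The sketch below uses the same weights and term list as `estimate_complexity`, but returns each term separately. The helper name `complexity_terms` and the sample prompt are mine.

```python
import re

TECH_TERMS = r"\b(algorithm|optimize|architecture|implement|debug|refactor)\b"

def complexity_terms(text: str) -> dict:
    """Return each weighted term of the complexity score separately."""
    words = len(text.split())
    tech = len(re.findall(TECH_TERMS, text.lower()))
    questions = text.count("?")
    return {
        "length": (words / 100) * 0.3,
        "technical": (tech / 5) * 0.4,
        "questions": (questions / 3) * 0.3,
    }

terms = complexity_terms("How do I optimize this algorithm?")
score = min(1.0, sum(terms.values()))
# length    = 6/100 * 0.3 = 0.018
# technical = 2/5   * 0.4 = 0.160
# questions = 1/3   * 0.3 = 0.100
# score     = 0.278, just under the 0.3 cutoff, so this goes to DeepSeek-V3
```

Notice that a short but clearly technical question still lands in the cheapest tier. Treat the weights and cutoffs as starting points and adjust them against real traffic.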

Pattern 3: Multi-Model Ensemble

For critical tasks, collect responses from several models so you can compare them and pick the best:

def ensemble_query(prompt: str, models: list[str] = None) -> dict:
    """Query multiple models and return all responses."""

    if models is None:
        models = [
            "deepseek-ai/DeepSeek-V3",
            "deepseek-ai/DeepSeek-R1",
            "openai/gpt-4o",
        ]

    responses = {}
    for model in models:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=512,
                temperature=0.7,
            )
            responses[model] = response.choices[0].message.content
        except Exception as e:
            responses[model] = f"Error: {e}"

    return responses

# Usage
results = ensemble_query("Explain the CAP theorem in one paragraph.")
for model, response in results.items():
    print(f"\n=== {model} ===")
    print(response[:200])
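
`ensemble_query` returns every response and leaves the choice to you. One simple way to automate it is to keep the answer that agrees most with the others. The sketch below does that with word-overlap (Jaccard) similarity. That metric is my own choice for illustration; a judge model or a task-specific check may work better for your use case.

```python
def _tokens(text: str) -> set:
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())

def pick_by_agreement(responses: dict) -> str:
    """Return the model whose answer overlaps most with the other answers."""
    # Skip entries that ensemble_query recorded as failures.
    ok = {m: r for m, r in responses.items() if not r.startswith("Error:")}
    if not ok:
        raise ValueError("no successful responses to choose from")

    def agreement(model: str) -> float:
        mine = _tokens(ok[model])
        others = [_tokens(r) for m, r in ok.items() if m != model]
        if not others:
            return 0.0
        # Jaccard similarity: shared words / all words, averaged over peers.
        total = sum(len(mine & o) / len(mine | o) for o in others if mine | o)
        return total / len(others)

    return max(ok, key=agreement)
```

With the results from above, `results[pick_by_agreement(results)]` would give you the consensus answer. Word overlap rewards similar wording, not correctness, so it is best suited to short factual answers.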

Pattern 4: Automatic Fallback

Handle provider outages gracefully:

import time

FALLBACK_CHAIN = [
    "deepseek-ai/DeepSeek-V3",     # Primary (cheap, fast)
    "Qwen/Qwen3-32B",              # Secondary
    "openai/gpt-4o",               # Tertiary (expensive, reliable)
]

def query_with_fallback(prompt: str, max_models: int = 3) -> tuple[str, str]:
    """Try models in order until one succeeds; return (content, model)."""
    candidates = FALLBACK_CHAIN[:max_models]
    last_error = None

    for attempt, model in enumerate(candidates, start=1):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=1024,
                timeout=30,  # per-request timeout in seconds
            )
            return response.choices[0].message.content, model

        except Exception as e:
            last_error = e
            print(f"Model {model} failed (attempt {attempt}): {e}")
            if attempt < len(candidates):
                time.sleep(1)  # Brief pause before trying the next model

    raise RuntimeError("All models failed") from last_error

# Usage
try:
    response, model_used = query_with_fallback("What is the meaning of life?")
    print(f"Got response from {model_used}: {response[:100]}...")
except RuntimeError as e:
    print(f"All models unavailable: {e}")
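
Catching every `Exception` also retries errors that no other model can fix, such as a malformed request. Here is a sketch of a stricter variant that only falls back on errors you mark as retryable and backs off exponentially between candidates. The `first_success` helper and its parameters are hypothetical. With the OpenAI SDK you would likely pass classes such as `openai.APITimeoutError`, `openai.APIConnectionError`, and `openai.RateLimitError` as `retryable`.

```python
import time
from typing import Callable, Sequence

def first_success(
    models: Sequence[str],
    call: Callable[[str], str],
    retryable: tuple = (TimeoutError, ConnectionError),
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple:
    """Try each model in order; return (content, model) from the first that works."""
    errors = []
    for i, model in enumerate(models):
        try:
            return call(model), model
        except retryable as exc:
            # Non-retryable exceptions are not caught and propagate immediately.
            errors.append(f"{model}: {exc}")
            if i < len(models) - 1:
                # Exponential backoff: 0.5s, 1s, 2s, ... between candidates.
                sleep(base_delay * (2 ** i))
    raise RuntimeError("All models failed: " + "; ".join(errors))
```

The `call` argument is any function that takes a model ID and returns text, which also makes the fallback logic easy to unit test without hitting the network.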

Pattern 5: Cost-Optimized Batch Processing

For batch jobs, pick the most capable model that fits your per-query budget and process the prompts in parallel:

from concurrent.futures import ThreadPoolExecutor

def batch_process(prompts: list[str], budget_per_query: float = 0.001) -> list[str]:
    """Process a batch of prompts within budget constraints.

    budget_per_query is compared against each model's input price per 1K
    tokens, so it assumes roughly 1K input tokens per query.
    """

    # Input cost in USD per 1K tokens for each model
    model_costs = {
        "deepseek-ai/DeepSeek-V3": 0.00027,   # $0.27/M tokens
        "Qwen/Qwen3-32B": 0.00050,            # $0.50/M tokens
        "deepseek-ai/DeepSeek-R1": 0.00055,   # $0.55/M tokens
        "openai/gpt-4o": 0.00250,             # $2.50/M tokens
    }

    # Choose the priciest model within budget; fall back to the cheapest one
    affordable = [m for m, cost in model_costs.items() if cost <= budget_per_query]
    if not affordable:
        affordable = [min(model_costs, key=model_costs.get)]
    model = max(affordable, key=model_costs.get)

    def process_single(prompt: str) -> str:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=512,
        )
        return response.choices[0].message.content

    # executor.map preserves input order, so results[i] matches prompts[i]
    with ThreadPoolExecutor(max_workers=10) as executor:
        return list(executor.map(process_single, prompts))
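
A fixed per-query budget ignores how long each prompt actually is. Here is a sketch that estimates the input cost per prompt and picks the most expensive model that still fits. The prices are the input prices quoted in this guide. The 4-characters-per-token rule is a rough assumption for English text, and output tokens are ignored entirely, so treat the result as a lower bound.

```python
# Input prices in USD per 1M tokens, as quoted in this guide.
INPUT_PRICE_PER_M = {
    "deepseek-ai/DeepSeek-V3": 0.27,
    "Qwen/Qwen3-32B": 0.50,
    "deepseek-ai/DeepSeek-R1": 0.55,
    "openai/gpt-4o": 2.50,
}

def estimate_input_tokens(prompt: str) -> int:
    """Rough assumption: about 4 characters per token for English text."""
    return max(1, len(prompt) // 4)

def priciest_model_within(budget_usd: float, prompt: str) -> str:
    """Pick the most expensive model whose estimated input cost fits the budget."""
    tokens = estimate_input_tokens(prompt)
    affordable = [
        model for model, price in INPUT_PRICE_PER_M.items()
        if tokens * price / 1_000_000 <= budget_usd
    ]
    if not affordable:
        raise ValueError("budget too small for any configured model")
    return max(affordable, key=INPUT_PRICE_PER_M.get)
```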

Production Considerations

1. Rate Limiting

import threading
import time

class RateLimiter:
    def __init__(self, max_requests: int, time_window: float):
        self.max_requests = max_requests
        self.time_window = time_window
        self.requests = []
        self.lock = threading.Lock()

    def wait_if_needed(self):
        with self.lock:
            now = time.time()
            # Remove old requests
            self.requests = [t for t in self.requests if now - t < self.time_window]

            if len(self.requests) >= self.max_requests:
                sleep_time = self.time_window - (now - self.requests[0])
                if sleep_time > 0:
                    time.sleep(sleep_time)

            # Record the actual send time, which is later than `now` if we slept
            self.requests.append(time.time())

# Usage: 100 requests per minute
limiter = RateLimiter(100, 60)

def rate_limited_query(prompt: str) -> str:
    limiter.wait_if_needed()
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
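
Rate limiters are awkward to test because they sleep on the real clock. One way around that, sketched below, is to inject the clock and the sleep function so a test can use fakes. `SlidingWindowLimiter` and `FakeClock` are hypothetical names, and this version is not thread-safe; it only illustrates the sliding-window logic.

```python
from collections import deque

class SlidingWindowLimiter:
    """Sliding-window limiter with an injectable clock, for deterministic tests."""

    def __init__(self, max_requests: int, window: float, clock, sleep):
        self.max_requests = max_requests
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.stamps = deque()

    def acquire(self) -> None:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()
        if len(self.stamps) >= self.max_requests:
            # Wait until the oldest request leaves the window, then drop it.
            self.sleep(self.window - (now - self.stamps[0]))
            now = self.clock()
            self.stamps.popleft()
        self.stamps.append(now)

class FakeClock:
    """Test double: sleeping just advances the fake time."""

    def __init__(self):
        self.t = 0.0
        self.slept = []

    def now(self) -> float:
        return self.t

    def sleep(self, seconds: float) -> None:
        self.slept.append(seconds)
        self.t += seconds
```

In production you would pass `time.monotonic` and `time.sleep`; in tests, `FakeClock().now` and `FakeClock().sleep`.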

2. Caching

import hashlib
import time

def cache_key(model: str, prompt: str, max_tokens: int) -> str:
    data = f"{model}:{prompt}:{max_tokens}"
    return hashlib.sha256(data.encode()).hexdigest()

# Simple in-memory cache
_cache = {}

def cached_query(model: str, prompt: str, max_tokens: int = 1024, ttl: int = 3600) -> str:
    key = cache_key(model, prompt, max_tokens)

    if key in _cache:
        cached_time, cached_response = _cache[key]
        if time.time() - cached_time < ttl:
            return cached_response

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    result = response.choices[0].message.content

    _cache[key] = (time.time(), result)
    return result
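
The key above covers the model, prompt, and `max_tokens`. If you also vary `temperature` or send multi-turn messages, two different requests could share a cache entry. Here is a sketch of a key that hashes the whole request. The function name `request_cache_key` is my own.

```python
import hashlib
import json

def request_cache_key(model: str, messages: list, **params) -> str:
    """Hash every input that can change the response, not just the prompt."""
    payload = {"model": model, "messages": messages, "params": params}
    # sort_keys makes the key independent of keyword-argument order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Keep in mind that caching only makes sense when a repeated answer is acceptable. With a high temperature, users may expect a fresh response each time.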

3. Logging and Cost Tracking

import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_router")

def tracked_query(model: str, prompt: str, **kwargs) -> dict:
    start_time = time.time()

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )

    elapsed = time.time() - start_time
    usage = response.usage

    result = {
        "model": model,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "total_tokens": usage.total_tokens,
        "latency_ms": elapsed * 1000,
        "content": response.choices[0].message.content,
    }

    logger.info(f"Query to {model}: {usage.total_tokens} tokens, {elapsed*1000:.0f}ms")

    return result
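
`tracked_query` logs tokens and latency but stops short of a dollar figure. Here is a sketch that turns the token counts into a cost. The input prices are the ones quoted in this guide. Output prices are not listed here, so the function takes one as a parameter; look up the real value for your model.

```python
# Input prices in USD per 1M tokens, as quoted in this guide.
INPUT_PRICE_PER_M = {
    "deepseek-ai/DeepSeek-V3": 0.27,
    "Qwen/Qwen3-32B": 0.50,
    "deepseek-ai/DeepSeek-R1": 0.55,
    "openai/gpt-4o": 2.50,
}

def query_cost_usd(
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    *,
    output_price_per_m: float,
) -> float:
    """Cost of one call in USD: input tokens plus output tokens."""
    input_cost = prompt_tokens * INPUT_PRICE_PER_M[model] / 1_000_000
    # The caller supplies the output price, since this guide does not list it.
    output_cost = completion_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost
```

You could add the result to the dict that `tracked_query` returns, for example under a `cost_usd` key, and sum it per model in your logs.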

Model Selection Guide

Here's a quick reference for choosing models:

| Task type         | Recommended model | Input cost (per 1M tokens) | Why                           |
|-------------------|-------------------|----------------------------|-------------------------------|
| Simple Q&A        | DeepSeek-V3       | $0.27                      | Fast, cheap, accurate enough  |
| Code generation   | DeepSeek-R1       | $0.55                      | Strong reasoning, good at code |
| Creative writing  | GPT-4o            | $2.50                      | Best creativity and nuance    |
| Long documents    | Claude 3.5 Sonnet | $3.00                      | 200K context window           |
| Multilingual      | Qwen3-32B         | $0.50                      | Excellent CJK support         |
| Open-source       | Llama 3.3 70B     | $0.50                      | Self-hostable option          |
| Structured output | DeepSeek-V3       | $0.27                      | Good at JSON/formatting       |
| Complex reasoning | DeepSeek-R1       | $0.55                      | Chain-of-thought specialist   |

The Playground Advantage

Before committing to a model, test it in the AI Token Hub Playground. You can:

  • Compare responses from multiple models side-by-side
  • Test different prompts and parameters
  • See real-time cost estimates
  • Try all of this without an API key

This saved me hours of trial-and-error integration. Test first, integrate second.

Cost Calculator

Use the AI Token Hub Cost Calculator to estimate your savings before switching. Input your current usage and it shows:

  • Current spend with your existing provider
  • Projected spend with intelligent routing
  • Breakdown by query type and model
  • Monthly and annual projections
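
If you'd rather see the arithmetic than trust a calculator, the projection is easy to sketch yourself. The example below uses the input prices quoted in this guide and a made-up workload: 10M input tokens per month, with 70% routed to DeepSeek-V3, 20% to DeepSeek-R1, and 10% kept on GPT-4o. It counts input tokens only, so real bills will be higher.

```python
# Input prices in USD per 1M tokens, as quoted in this guide.
PRICE_PER_M = {
    "deepseek-ai/DeepSeek-V3": 0.27,
    "deepseek-ai/DeepSeek-R1": 0.55,
    "openai/gpt-4o": 2.50,
}

def monthly_input_cost(million_tokens: float, mix: dict) -> float:
    """Cost of `million_tokens` of input split across models by share (0..1)."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(million_tokens * share * PRICE_PER_M[m] for m, share in mix.items())

# Hypothetical workload: 10M input tokens per month.
baseline = monthly_input_cost(10, {"openai/gpt-4o": 1.0})
routed = monthly_input_cost(10, {
    "deepseek-ai/DeepSeek-V3": 0.7,
    "deepseek-ai/DeepSeek-R1": 0.2,
    "openai/gpt-4o": 0.1,
})
savings_pct = 100 * (baseline - routed) / baseline
# baseline = 10 * 2.50                      = $25.00
# routed   = 7 * 0.27 + 2 * 0.55 + 1 * 2.50 = $5.49
# savings  = 19.51 / 25.00                  = about 78%
```

Multiply by 12 for an annual figure. The savings depend entirely on how much of your traffic the cheaper models can handle, so measure your own mix first.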

Conclusion

Model routing isn't about finding the "best" model — it's about finding the right model for each task. A unified gateway makes this trivial:

  1. One API key instead of N
  2. One integration instead of N
  3. One dashboard for all your AI spend
  4. Instant model switching — just change a string

Start with simple rule-based routing. Add complexity-based routing as you gather data. Implement fallbacks for reliability. Add caching for cost savings.

The result? Lower costs, better reliability, and the flexibility to adopt new models as they launch.


What routing patterns have you found most effective? Share your experiences in the comments! And if you're building a model routing system, check out AI Token Hub — the playground is perfect for testing your routing logic before going to production.

Happy routing! 🎯
