DEV Community

Mattias chaw

Cutting Your LLM API Costs by 80% With Smart Model Routing

Most developers default to GPT-4 or Claude for every single request. That's like using a sledgehammer to hang a picture frame. Here's how to route intelligently and save thousands.

The Problem: One-Model-Fits-All Is Expensive

If you're sending every query to the most expensive model, you're burning money. A simple "What's 2+2?" doesn't need a frontier model. But "Analyze this legal contract for liability clauses" absolutely does.

The trick? Smart model routing: automatically picking the right model for each request based on complexity.

Why Chinese Models Changed the Game in 2026

Chinese LLMs like DeepSeek-V3, GLM-4, and Qwen have reached parity with Western frontier models on most benchmarks, at 5-10x lower cost. The catch used to be access: you needed a Chinese phone number, Alipay account, and patience for the Great Firewall.

Not anymore. Services like aiwave.live now provide OpenAI-compatible API access to 50+ Chinese AI models with a single API key. No VPN, no Chinese phone number, no Alipay. Just an API key and standard REST calls.

Building a Cost-Aware Router in Python

Let's build a practical router that picks models based on task complexity:

import openai
import re

client = openai.OpenAI(
    api_key="your-api-key",
    base_url="https://aiwave.live/v1"
)

# Cost per 1M tokens (in USD)
MODEL_COSTS = {
    "deepseek-chat":      {"input": 0.14, "output": 0.28},
    "glm-4-flash":       {"input": 0.07, "output": 0.14},
    "deepseek-reasoner": {"input": 0.55, "output": 2.19},
    "qwen-max":          {"input": 0.40, "output": 1.20},
}

def classify_complexity(prompt: str) -> str:
    """Classify prompt complexity to pick the right model."""
    prompt_lower = prompt.lower()

    # Simple tasks: math, basic QA, formatting
    # A leading \b keeps "format" from matching inside "information".
    simple_patterns = [
        r'^\s*\d+\s*[+\-*/]',  # arithmetic, e.g. "2+2" or "12 * 7"
        r'\b(translate|summarize|format)',
        r"\b(what is|what's|who is|when did)",
    ]

    # Complex tasks: reasoning, coding, analysis
    complex_patterns = [
        r'\b(analyze|design|architect|implement)',
        r'\b(code|function|debug|refactor)',
        r'\b(legal|contract|compliance)',
        r'\b(compare|evaluate|assess)',
    ]

    # Check complex signals first, so a prompt such as
    # "What is the liability exposure in this contract?" is not
    # downgraded just because it opens with "what is".
    for pattern in complex_patterns:
        if re.search(pattern, prompt_lower):
            return "complex"

    # Longer prompts usually need more capability
    if len(prompt) > 2000:
        return "complex"

    for pattern in simple_patterns:
        if re.search(pattern, prompt_lower):
            return "simple"

    return "medium"

def route_model(complexity: str) -> str:
    """Route to the most cost-effective model for the complexity level."""
    routing = {
        "simple":  "glm-4-flash",       # Cheapest
        "medium":  "deepseek-chat",     # Balanced
        "complex": "deepseek-reasoner",  # Best reasoning
    }
    return routing.get(complexity, "deepseek-chat")

def smart_completion(prompt: str, **kwargs) -> str:
    """Get a completion using the optimal model for the task."""
    complexity = classify_complexity(prompt)
    model = route_model(complexity)
    cost = MODEL_COSTS[model]

    print(f"Complexity: {complexity} | Model: {model} | ${cost['input']}/1M tokens")

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs
    )
    return response.choices[0].message.content

# Usage
print(smart_completion("What is the capital of France?"))
# Routes to glm-4-flash ($0.07/1M) instead of a $5/1M frontier model

print(smart_completion("Debug this async Python function with a race condition..."))
# Routes to deepseek-reasoner for advanced reasoning

The Math: Real Cost Comparison

Let's say your app processes 10,000 requests/day with an average of 1,000 input + 500 output tokens:

| Approach | Daily Cost | Monthly Cost |
|---|---|---|
| GPT-4o (all requests) | $105.00 | $3,150 |
| Smart routing (our system) | $12.80 | $384 |
| Savings | $92.20/day | $2,766/month |

That's a cost reduction of roughly 88% ($92.20 saved out of $105.00 per day) with negligible quality loss for simple queries.
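
To make this kind of arithmetic easy to check against your own traffic, here is a minimal sketch that computes daily spend from the per-million-token prices in MODEL_COSTS. The `daily_cost` and `routed_daily_cost` helpers and the 65/25/10 traffic split are assumptions for illustration, not measured values. The table above does not state the GPT-4o rates or the traffic mix behind its figures, so this sketch is not expected to reproduce them.

```python
# Prices per 1M tokens (USD), copied from MODEL_COSTS earlier in the article.
MODEL_COSTS = {
    "glm-4-flash":       {"input": 0.07, "output": 0.14},
    "deepseek-chat":     {"input": 0.14, "output": 0.28},
    "deepseek-reasoner": {"input": 0.55, "output": 2.19},
}

def daily_cost(model, requests, input_tokens=1_000, output_tokens=500):
    """USD cost of `requests` calls to `model` at the given tokens per call."""
    price = MODEL_COSTS[model]
    per_request = (
        input_tokens * price["input"] + output_tokens * price["output"]
    ) / 1_000_000
    return requests * per_request

# ASSUMED traffic mix (hypothetical); measure your own from routing logs.
ASSUMED_MIX = {
    "glm-4-flash": 0.65,        # "simple" bucket
    "deepseek-chat": 0.25,      # "medium" bucket
    "deepseek-reasoner": 0.10,  # "complex" bucket
}

def routed_daily_cost(total_requests, mix=ASSUMED_MIX):
    """Daily USD cost when traffic is split across models according to `mix`."""
    return sum(
        daily_cost(model, total_requests * share) for model, share in mix.items()
    )

if __name__ == "__main__":
    for model in MODEL_COSTS:
        print(f"{model:18s} all traffic: ${daily_cost(model, 10_000):.2f}/day")
    print(f"routed (assumed mix): ${routed_daily_cost(10_000):.2f}/day")
```

Swapping in your real bucket shares and token counts shows how sensitive the total is to the fraction of requests that reach the reasoning model, which is by far the most expensive entry in the table.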

Adding Fallback Logic

Production systems need fallbacks. Here's how to handle rate limits and downtime:

import time

FALLBACK_CHAIN = {
    "glm-4-flash":       ["deepseek-chat", "qwen-max"],
    "deepseek-chat":     ["qwen-max", "deepseek-reasoner"],
    "deepseek-reasoner": ["qwen-max"],
}

def completion_with_fallback(prompt: str, **kwargs):
    complexity = classify_complexity(prompt)
    primary_model = route_model(complexity)
    chain = [primary_model] + FALLBACK_CHAIN.get(primary_model, [])

    for model in chain:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                **kwargs
            )
            return response.choices[0].message.content
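        except (openai.RateLimitError, openai.APIConnectionError,
                openai.InternalServerError) as e:
            # Fall through only on transient failures (rate limits, network
            # errors, 5xx). Auth or bad-request errors should surface at once.
            print(f"Warning: {model} failed: {e}, trying next...")
            time.sleep(0.5)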

    raise RuntimeError("All models in fallback chain failed")
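
The fixed 0.5-second pause is the simplest option. A common alternative is exponential backoff with jitter, sketched below. The `backoff_delay` helper and its default values are hypothetical and not part of the setup described in this post.

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0, rng=random) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Uses "full jitter": a uniform random delay between 0 and
    min(cap, base * 2**attempt), so concurrent clients that failed
    together do not all retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)

if __name__ == "__main__":
    for attempt in range(5):
        ceiling = min(8.0, 0.5 * 2 ** attempt)
        print(f"attempt {attempt}: up to {ceiling:.1f}s, drew {backoff_delay(attempt):.2f}s")
```

Inside completion_with_fallback, `time.sleep(backoff_delay(i))` could stand in for `time.sleep(0.5)`, with `i` being the model's position in the chain (for example via `enumerate(chain)`).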

Deploying This in Production

A few tips from running this setup in production:

  1. Log your routing decisions. Track which complexity bucket each request falls into. You'll likely find 60-70% of your traffic is "simple", and that is where most of the savings come from.

  2. Cache aggressively. Simple Q&A responses rarely change. A Redis cache layer can eliminate 30-40% of API calls entirely.

  3. Monitor quality. Sample 1% of "simple" responses and compare them against what a frontier model would produce. Adjust your classifier if needed.

  4. Let users override. Power users sometimes want the big guns. A ?model=deepseek-reasoner query param gives them control.
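
The four tips above can be sketched as one wrapper around the completion call. This is a minimal illustration under stated assumptions, not the setup we run: `handle_request`, `cache_key`, and `ALLOWED_OVERRIDES` are hypothetical names, an in-memory dict stands in for Redis, and the classify, route, and complete functions are passed in so the block runs on its own.

```python
import hashlib
import logging
import random

logger = logging.getLogger("router")

# Hypothetical allow-list so a query param cannot request arbitrary models.
ALLOWED_OVERRIDES = {"glm-4-flash", "deepseek-chat", "deepseek-reasoner", "qwen-max"}
_cache = {}  # in-memory stand-in for Redis; swap for a real cache client

def cache_key(model: str, prompt: str) -> str:
    """Stable key: the same model + prompt always hashes to the same value."""
    return hashlib.sha256(f"{model}\n{prompt}".encode("utf-8")).hexdigest()

def handle_request(prompt, classify, route, complete, override=None, sample_rate=0.01):
    """Wrap one completion call with logging, caching, sampling, and override.

    In this post's code, classify/route would be classify_complexity and
    route_model, and complete(model, prompt) would wrap
    client.chat.completions.create.
    """
    complexity = classify(prompt)

    # Tip 4: honor a user override only if it names a known model.
    overridden = override in ALLOWED_OVERRIDES
    model = override if overridden else route(complexity)

    # Tip 1: log every routing decision.
    logger.info("complexity=%s model=%s overridden=%s", complexity, model, overridden)

    # Tip 2: cache only "simple" answers, which rarely change.
    key = cache_key(model, prompt)
    if complexity == "simple" and key in _cache:
        return _cache[key]

    answer = complete(model, prompt)
    if complexity == "simple":
        _cache[key] = answer
        # Tip 3: flag a small random sample for offline quality review.
        if random.random() < sample_rate:
            logger.info("quality_sample model=%s prompt_hash=%s", model, key[:12])
    return answer
```

Keying the cache on both model and prompt means an override never returns an answer produced by a different model. A real Redis layer would also want an expiry time on each entry.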

Measuring ROI

After deploying smart routing with aiwave.live as our API gateway, we tracked these results over 30 days:

  • Monthly API spend: dropped from $3,150 to approximately $380
  • Average response latency: improved 40% (Chinese models are fast)
  • User satisfaction: unchanged (A/B tested for 2 weeks)

The biggest insight? Over 65% of our traffic was classified as "simple" and could be handled by models costing less than $0.10 per million tokens.
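
A figure like that share comes straight out of the routing logs. As a rough sketch, assuming the logs can be reduced to a list of bucket labels, a hypothetical `bucket_shares` helper could compute it like this (the sample data is made up, not ours):

```python
from collections import Counter

def bucket_shares(complexities):
    """Return each complexity bucket's share of traffic as a fraction of 1."""
    counts = Counter(complexities)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bucket: n / total for bucket, n in counts.items()}

if __name__ == "__main__":
    # Made-up sample for illustration only.
    logged = ["simple"] * 13 + ["medium"] * 5 + ["complex"] * 2
    for bucket, share in sorted(bucket_shares(logged).items()):
        print(f"{bucket:8s} {share:.0%}")
```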

Conclusion

Smart model routing isn't rocket science, but it requires three things:

  1. Access to models at different price and performance tiers: aiwave.live gives you 50+ Chinese models behind a single OpenAI-compatible API
  2. A classifier that picks the right model per request
  3. Fallback logic for production reliability

The 80% cost reduction isn't theoretical; it's what happens when you stop sending "Hello, world!" to a frontier model.


Found this useful? Check out aiwave.live: OpenAI-compatible API access to 50+ Chinese AI models (DeepSeek, GLM, Qwen, and more) with no Chinese phone number required. New accounts get $5 in free credits.
