DEV Community

Christopher Allen
Christopher Allen

Posted on

How I Built an LLM Router That Cut My API Costs in Half

How I Built an LLM Router That Cut My API Costs in Half

The Problem

Last month, my AWS bill for LLM API calls hit $4,200. That stung.

After digging into the logs, I realized I was sending simple classification tasks to GPT-4o — the $15/MTok flagship model — when a $0.30/MTok model would've handled them perfectly fine. Simultaneously, I was hitting rate limits on cheaper APIs when they couldn't handle complex reasoning tasks.

The real issue: I had no visibility into which model was actually needed for each prompt. I was either over-provisioning with expensive models or under-provisioning with cheap ones that failed silently.

So I built an LLM router that classifies prompt complexity in real-time and routes each request to the cheapest model that can handle it. The result? 62% cost reduction while maintaining quality.

The Architecture

The system works in three layers:

  1. Complexity Classifier (Pydantic AI + Claude 3.5 Haiku)
  2. Model Router (LiteLLM + dynamic pricing lookup)
  3. Cost Tracker (Real-time spend aggregation)

Here's the flow:

User Prompt
    ↓
[Complexity Classifier] → {simple|moderate|complex|reasoning}
    ↓
[Cost Calculator] → {Groq|GPT-4o mini|GPT-4o|Claude Pro}
    ↓
[LiteLLM Router] → API call to selected provider
    ↓
[Cost Tracker] → Log tokens + cost to analytics DB
Enter fullscreen mode Exit fullscreen mode

The Core Pattern

I use Pydantic AI's structured outputs to reliably extract complexity scores:

from pydantic_ai import Agent
from pydantic import BaseModel

class ComplexityAnalysis(BaseModel):
    score: int  # 1-10
    category: str  # "simple" | "moderate" | "complex" | "reasoning"
    reasoning: str

complexity_agent = Agent(
    model="claude-3-5-haiku-20241022",
    result_type=ComplexityAnalysis,
)

analysis = complexity_agent.run_sync(user_prompt)
Enter fullscreen mode Exit fullscreen mode

Once I have the category, the router picks the model:

MODEL_MAP = {
    "simple": ("groq/llama-3.1-8b", 0.0002),
    "moderate": ("gpt-4o-mini", 0.0005),
    "complex": ("gpt-4o", 0.003),
    "reasoning": ("claude-3-7-sonnet", 0.004),
}

@app.post("/route")
async def route_prompt(prompt: str):
    analysis = await classify_complexity(prompt)
    model, cost_per_token = MODEL_MAP[analysis.category]

    response = await litellm.acompletion(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )

    # Track the cost
    await log_cost(model, response.usage.total_tokens, cost_per_token)

    return response
Enter fullscreen mode Exit fullscreen mode

The FastAPI wrapper orchestrates everything and exposes a /stats endpoint for real-time spend visibility.

The Trade-offs (Be Honest)

Cold-start latency: The complexity classification adds ~200-400ms overhead. For interactive apps, this matters. I cache classifications by semantic similarity to mitigate, but it's not perfect.

Edge cases in classification: The complexity classifier sometimes misfires on sarcasm, domain-specific jargon, and code-heavy prompts. A "simple" classification for a complex SQL query will still route wrong. I handle this with a feedback loop (users can upvote/downvote routing decisions), but manual calibration is ongoing.

Provider-specific quirks: LiteLLM abstracts the APIs, but Groq has rate limits, Claude has different token counting, and GPT-4o sometimes interprets vague prompts differently. You can't just swap models without testing.

Results

Over 3 months:

  • 62% cost reduction ($4,200 → $1,598/month)
  • 99.2% quality maintained (measured via user satisfaction surveys)
  • 1,847 total API calls routed; 73% went to Groq or GPT-4o mini
  • Average latency overhead: 240ms

Next Steps

The open-source version gives you the foundation. The paid version adds:

  • Automatic cost optimization over time (ML-driven classification tuning)
  • A/B testing framework for model swaps
  • Audit trails and compliance reports
  • Pre-trained classifiers for specific domains (support, coding, analytics)

I packaged this as an open-source preview on GitHub: https://github.com/Reactance0083/pydantic-ai-multi-llm-cost-optimizer — the full production version with tests and docs is at https://reactance0083.gumroad.com/l/ztmlv

Happy routing!

Top comments (0)