How I Built an LLM Router That Cut My API Costs in Half
The Problem
Last month, my AWS bill for LLM API calls hit $4,200. That stung.
After digging into the logs, I realized I was sending simple classification tasks to GPT-4o, the $15/MTok flagship model, when a $0.30/MTok model would have handled them perfectly well. At the same time, I was running into rate limits on cheaper APIs and sending them complex reasoning tasks they couldn't handle.
The real issue: I had no visibility into which model was actually needed for each prompt. I was either over-provisioning with expensive models or under-provisioning with cheap ones that failed silently.
So I built an LLM router that classifies prompt complexity in real time and routes each request to the cheapest model that can handle it. The result? A 62% cost reduction while maintaining quality.
The Architecture
The system works in three layers:
- Complexity Classifier (Pydantic AI + Claude 3.5 Haiku)
- Model Router (LiteLLM + dynamic pricing lookup)
- Cost Tracker (Real-time spend aggregation)
Here's the flow:
User Prompt
↓
[Complexity Classifier] → {simple|moderate|complex|reasoning}
↓
[Cost Calculator] → {Llama 3.1 8B on Groq|GPT-4o mini|GPT-4o|Claude 3.7 Sonnet}
↓
[LiteLLM Router] → API call to selected provider
↓
[Cost Tracker] → Log tokens + cost to analytics DB
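The Cost Calculator step in the flow above boils down to "pick the cheapest model trusted with this category." Here is a minimal sketch of that selection, assuming each model is tagged with the hardest tier it can handle. The `ModelOption` class, the `cheapest_capable` function, the model names, and the prices are all illustrative placeholders, not the values from my production config.

```python
from dataclasses import dataclass

# Tiers ordered from least to most demanding, matching the classifier's categories.
TIERS = ["simple", "moderate", "complex", "reasoning"]


@dataclass(frozen=True)
class ModelOption:
    name: str            # illustrative model identifier
    max_tier: str        # hardest category this model is trusted with (assumption)
    usd_per_mtok: float  # placeholder blended price, not a real quote


CATALOG = [
    ModelOption("small-fast-model", "simple", 0.30),
    ModelOption("mid-model", "moderate", 0.60),
    ModelOption("flagship-model", "complex", 15.00),
    ModelOption("reasoning-model", "reasoning", 18.00),
]


def cheapest_capable(category: str, catalog=CATALOG) -> ModelOption:
    """Return the lowest-priced model whose max_tier covers the category."""
    needed = TIERS.index(category)  # raises ValueError for an unknown category
    capable = [m for m in catalog if TIERS.index(m.max_tier) >= needed]
    if not capable:
        raise LookupError(f"no model in the catalog can handle {category!r}")
    return min(capable, key=lambda m: m.usd_per_mtok)


print(cheapest_capable("moderate").name)
```

Because the lookup compares prices at call time, updating a price in the catalog changes routing without touching the classifier.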
The Core Pattern
I use Pydantic AI's structured outputs to reliably extract complexity scores:
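from pydantic import BaseModel
from pydantic_ai import Agent


class ComplexityAnalysis(BaseModel):
    score: int      # 1-10
    category: str   # "simple" | "moderate" | "complex" | "reasoning"
    reasoning: str  # short justification from the classifier


complexity_agent = Agent(
    model="claude-3-5-haiku-20241022",
    result_type=ComplexityAnalysis,  # renamed to output_type in newer Pydantic AI releases
)

# run_sync returns a result wrapper; the parsed ComplexityAnalysis is on .data
# (.output in newer Pydantic AI releases).
result = complexity_agent.run_sync(user_prompt)
analysis = result.data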
Once I have the category, the router picks the model:
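import litellm
from fastapi import FastAPI

app = FastAPI()

# category -> (LiteLLM model name, blended price). The prices are read here as
# USD per 1K tokens (assumption); as per-token rates they would exceed $200/MTok.
MODEL_MAP = {
    "simple": ("groq/llama-3.1-8b-instant", 0.0002),
    "moderate": ("gpt-4o-mini", 0.0005),
    "complex": ("gpt-4o", 0.003),
    "reasoning": ("anthropic/claude-3-7-sonnet-20250219", 0.004),
}


async def classify_complexity(prompt: str) -> ComplexityAnalysis:
    # Async wrapper around the classifier agent defined above.
    result = await complexity_agent.run(prompt)
    return result.data


@app.post("/route")
async def route_prompt(prompt: str):
    analysis = await classify_complexity(prompt)
    model, cost_per_1k_tokens = MODEL_MAP[analysis.category]
    response = await litellm.acompletion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # Track the cost (log_cost is the cost-tracker helper, not shown here)
    await log_cost(model, response.usage.total_tokens, cost_per_1k_tokens)
    return response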
The FastAPI wrapper orchestrates everything and exposes a /stats endpoint for real-time spend visibility.
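To give a feel for what sits behind the cost logging and the /stats endpoint, here is a rough in-memory sketch. It assumes the prices in MODEL_MAP are USD per 1,000 tokens and uses a single blended rate for input and output tokens, which is a simplification. The `CostTracker` class and its method names are hypothetical, and a real deployment would write to a database rather than a dict.

```python
from collections import defaultdict


class CostTracker:
    """Aggregates calls, tokens, and spend per model in memory (illustrative only)."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.tokens = defaultdict(int)
        self.spend_usd = defaultdict(float)

    def log(self, model: str, total_tokens: int, usd_per_1k_tokens: float) -> float:
        """Record one API call and return its estimated cost in USD."""
        cost = total_tokens / 1000 * usd_per_1k_tokens
        self.calls[model] += 1
        self.tokens[model] += total_tokens
        self.spend_usd[model] += cost
        return cost

    def stats(self) -> dict:
        """Snapshot suitable for returning from a stats endpoint as JSON."""
        return {
            "total_usd": sum(self.spend_usd.values()),
            "by_model": {
                model: {
                    "calls": self.calls[model],
                    "tokens": self.tokens[model],
                    "usd": self.spend_usd[model],
                }
                for model in self.calls
            },
        }


tracker = CostTracker()
tracker.log("gpt-4o-mini", total_tokens=2000, usd_per_1k_tokens=0.0005)
tracker.log("gpt-4o", total_tokens=1000, usd_per_1k_tokens=0.003)
print(tracker.stats())
```

Providers usually price input and output tokens differently, so a more faithful tracker would take `prompt_tokens` and `completion_tokens` separately.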
The Trade-offs (Be Honest)
Cold-start latency: The complexity classification adds ~200-400ms of overhead to any request that isn't already cached. For interactive apps, this matters. I cache classifications by semantic similarity to mitigate this, but it's not perfect.
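For anyone curious what caching by semantic similarity looks like, here is a deliberately tiny sketch. It swaps in a bag-of-words vector (`toy_embed`, a hypothetical stand-in) where a real setup would call an embedding model, and it scans entries linearly where a vector index would be the usual choice. The `SemanticCache` class and the 0.9 threshold are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Optional


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: lowercase bag-of-words counts (not a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Maps previously classified prompts to their category by similarity."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries = []  # list of (vector, category) pairs

    def get(self, prompt: str) -> Optional[str]:
        """Return the category of the most similar cached prompt, or None on a miss."""
        vec = toy_embed(prompt)
        best_score, best_category = 0.0, None
        for cached_vec, category in self.entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_category = score, category
        return best_category if best_score >= self.threshold else None

    def put(self, prompt: str, category: str) -> None:
        self.entries.append((toy_embed(prompt), category))


cache = SemanticCache()
cache.put("classify this ticket as billing or technical", "simple")
print(cache.get("Classify this ticket as billing or technical"))
```

On a hit the router can skip the classifier call entirely. The threshold is the knob to watch: set it too low and a hard prompt inherits a "simple" label from a superficially similar one.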
Edge cases in classification: The complexity classifier sometimes misfires on sarcasm, domain-specific jargon, and code-heavy prompts. A complex SQL query that gets classified as "simple" is still routed to a model that can't handle it. I handle this with a feedback loop (users can upvote/downvote routing decisions), but manual calibration is ongoing.
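The feedback loop can start very simply: tally votes per category and flag the categories where downvotes pile up. The sketch below is one hedged way to do that. The `RoutingFeedback` class, the minimum vote count, and the 20% downvote threshold are hypothetical choices, not the calibration I actually run.

```python
from collections import defaultdict


class RoutingFeedback:
    """Tallies user votes on routing decisions, grouped by classifier category."""

    def __init__(self):
        self.votes = defaultdict(lambda: {"up": 0, "down": 0})

    def record(self, category: str, upvote: bool) -> None:
        self.votes[category]["up" if upvote else "down"] += 1

    def needs_review(self, min_votes: int = 20, max_down_rate: float = 0.2) -> list:
        """Return categories with enough votes and a downvote rate above the limit."""
        flagged = []
        for category, tally in self.votes.items():
            total = tally["up"] + tally["down"]
            if total >= min_votes and tally["down"] / total > max_down_rate:
                flagged.append(category)
        return sorted(flagged)


feedback = RoutingFeedback()
for i in range(30):
    feedback.record("simple", upvote=(i >= 10))  # 10 downvotes, 20 upvotes
print(feedback.needs_review())
```

The minimum vote count keeps a handful of angry clicks from flagging a category that is otherwise working.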
Provider-specific quirks: LiteLLM abstracts the APIs, but Groq has rate limits, Claude has different token counting, and GPT-4o sometimes interprets vague prompts differently. You can't just swap models without testing.
Results
Over 3 months:
- 62% cost reduction ($4,200 → $1,598/month)
- 99.2% quality maintained (measured via user satisfaction surveys)
- 1,847 total API calls routed; 73% went to Groq or GPT-4o mini
- Average latency overhead: 240ms
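As a quick sanity check on the numbers above, the percentage and the call split can be recomputed from the raw figures. Nothing here is new data; the variable names are just for illustration.

```python
before_usd = 4200  # monthly spend before routing
after_usd = 1598   # monthly spend after routing

monthly_savings = before_usd - after_usd
reduction = monthly_savings / before_usd

total_calls = 1847
cheap_share = 0.73  # share routed to Groq or GPT-4o mini
cheap_calls = round(total_calls * cheap_share)

print(f"Savings: ${monthly_savings}/month ({reduction:.1%} reduction)")
print(f"Calls on the two cheapest models: about {cheap_calls} of {total_calls}")
```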
Next Steps
The open-source version gives you the foundation. The paid version adds:
- Automatic cost optimization over time (ML-driven classification tuning)
- A/B testing framework for model swaps
- Audit trails and compliance reports
- Pre-trained classifiers for specific domains (support, coding, analytics)
I packaged this as an open-source preview on GitHub: https://github.com/Reactance0083/pydantic-ai-multi-llm-cost-optimizer — the full production version with tests and docs is at https://reactance0083.gumroad.com/l/ztmlv
Happy routing!