Wade Allen

Posted on May 25 • Edited on May 27

Build an LLM Router with pydantic-ai: Route Prompts to the Cheapest Model

#python #fastapi #ai #llm

Why LLM Routing Matters

Every LLM-powered application has the same hidden problem: you're using one model for every task, even though tasks vary wildly in complexity.

A simple "classify this as spam or not spam" prompt doesn't need Claude Sonnet or GPT-4o. A $0.04/MTok model handles it at 99% accuracy. But a complex multi-step reasoning task absolutely needs the flagship model, and going cheap there is just a slow failure.

The result: you're either wasting money on over-provisioning, or getting silent failures from under-provisioning. Usually both at the same time, on different parts of your pipeline.

LLM routing solves this by classifying each prompt's complexity before routing it to the cheapest model that can actually handle it.

The Architecture

The multi-LLM cost optimizer I built uses three layers:

Complexity Classifier (Pydantic AI + Claude Haiku)
Model Router (LiteLLM + dynamic pricing lookup)
Cost Tracker (Real-time spend logging)

The key insight: the classifier runs on a cheap, fast model, so it pays for itself whenever it keeps a simple task off an expensive model.
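
To make "pays for itself" concrete, here is a minimal sketch of the break-even arithmetic. The function names and the per-call costs in the usage lines are illustrative assumptions, not measured prices. It also assumes every prompt is either "simple" (goes to the cheap model) or not (goes to the flagship), and that every call pays the classifier cost.

```python
def routing_savings(n_calls, simple_fraction, cost_flagship, cost_cheap, cost_classifier):
    """Spend saved by routing, compared with sending every call to the flagship model."""
    baseline = n_calls * cost_flagship
    routed = n_calls * (
        cost_classifier                            # every call is classified first
        + simple_fraction * cost_cheap             # simple prompts go to the cheap model
        + (1 - simple_fraction) * cost_flagship    # the rest still go to the flagship
    )
    return baseline - routed


def break_even_fraction(cost_flagship, cost_cheap, cost_classifier):
    """Share of simple prompts above which routing is cheaper than no routing."""
    return cost_classifier / (cost_flagship - cost_cheap)


# Illustrative per-call costs in USD (assumptions, not real quotes).
threshold = break_even_fraction(cost_flagship=0.01, cost_cheap=0.0005, cost_classifier=0.0002)
saved = routing_savings(1000, 0.5, 0.01, 0.0005, 0.0002)
print(f"break-even simple share: {threshold:.3f}, saved on 1000 calls: ${saved:.2f}")
```

The break-even share is just the classifier cost divided by the price gap between the two models, so the wider that gap, the fewer simple prompts you need before routing wins.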

The Core Pattern

Pydantic AI structured outputs are what make the classification reliable.
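
A minimal sketch of the "typed object or exception" contract, using only the standard library as a stand-in. In a pydantic-ai setup, a Pydantic model declared as the agent's structured output would play this role. The category names, the `confidence` field, and `parse_classification` are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from enum import Enum


class Complexity(str, Enum):
    # Hypothetical categories; the real template may use different ones.
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"


@dataclass(frozen=True)
class Classification:
    complexity: Complexity
    confidence: float  # expected to lie in [0.0, 1.0]


def parse_classification(raw: str) -> Classification:
    """Return a typed Classification or raise ValueError; never a half-valid result."""
    try:
        data = json.loads(raw)
        complexity = Complexity(data["complexity"])  # rejects unknown categories
        confidence = float(data["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"invalid classifier output: {raw!r}") from exc
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return Classification(complexity, confidence)
```

The caller only ever sees a `Classification` or a `ValueError`, so the router never has to guess what a free-text reply meant.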

Without structured outputs, you are back to parsing free text, and the classifier becomes another source of bugs. With Pydantic AI, you get a typed object back or an exception - no ambiguity.

The router then picks the model based on the classified category.
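
A minimal sketch of that selection step, assuming three ordered categories and a small in-memory catalog. The model names are placeholders and the prices are illustrative. The actual template is described as using LiteLLM with a dynamic pricing lookup, which this sketch does not reproduce.

```python
from dataclasses import dataclass

# Ordered from least to most demanding; category names are assumptions.
TIERS = ["simple", "moderate", "complex"]


@dataclass(frozen=True)
class ModelOption:
    name: str            # placeholder identifier, not a real model ID
    max_tier: str        # hardest category this model is trusted with
    usd_per_mtok: float  # illustrative price, not a real quote


CATALOG = [
    ModelOption("small-model", "simple", 0.04),
    ModelOption("mid-model", "moderate", 0.50),
    ModelOption("flagship-model", "complex", 3.00),
]


def pick_model(category: str, catalog=CATALOG) -> ModelOption:
    """Return the cheapest model whose tier covers the classified category."""
    need = TIERS.index(category)  # raises ValueError on an unknown category
    capable = [m for m in catalog if TIERS.index(m.max_tier) >= need]
    if not capable:
        raise LookupError(f"no model can handle {category!r}")
    return min(capable, key=lambda m: m.usd_per_mtok)


for category in TIERS:
    print(category, "->", pick_model(category).name)
```

Keeping capability ("can it handle this tier?") separate from price means a price change only touches the catalog, not the routing logic.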

The Real Trade-offs

Classification latency adds overhead. The complexity classifier runs before every routed call and adds around 200-400ms, depending on the model. For interactive apps, cache classifications by semantic similarity so repeated similar prompts skip the classifier.
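
A rough sketch of such a cache. To stay dependency-free it uses token-overlap (Jaccard) similarity as a stand-in for real semantic similarity. A production version would more likely compare embeddings. The class name, the 0.8 threshold, and the linear scan are all assumptions.

```python
import re


def _tokens(prompt: str) -> frozenset:
    """Lowercased alphanumeric tokens of the prompt."""
    return frozenset(re.findall(r"[a-z0-9]+", prompt.lower()))


class ClassificationCache:
    """Reuse a prior classification when a new prompt is similar enough."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self._entries = []  # list of (token set, category)

    def get(self, prompt: str):
        """Return the cached category of the most similar prompt, or None."""
        query = _tokens(prompt)
        best_score, best_category = 0.0, None
        for tokens, category in self._entries:
            union = query | tokens
            score = len(query & tokens) / len(union) if union else 0.0
            if score > best_score:
                best_score, best_category = score, category
        return best_category if best_score >= self.threshold else None

    def put(self, prompt: str, category: str) -> None:
        self._entries.append((_tokens(prompt), category))


cache = ClassificationCache()
cache.put("Classify this email as spam or not spam", "simple")
print(cache.get("classify this email as spam or not spam!"))  # hit: same tokens
print(cache.get("Design a distributed consensus protocol"))   # miss: no overlap
```

On a miss, the caller runs the classifier as usual and stores the result with `put`, so only the first prompt of each kind pays the classification latency.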

Edge cases are real. Code-heavy prompts, domain-specific jargon, and ambiguous short prompts are where classifiers misfire. Build a feedback loop to log misclassifications so you can tune the routing thresholds over time.
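
One possible shape for that feedback loop, as a sketch. It assumes you can obtain a corrected category after the fact, for example from human review or from a downstream validation failure. `MisclassificationLog` and its methods are hypothetical names.

```python
from collections import Counter, defaultdict


class MisclassificationLog:
    """Record predicted vs. corrected categories and report error rates."""

    def __init__(self):
        self._totals = Counter()              # predictions seen per category
        self._errors = defaultdict(Counter)   # predicted -> corrected -> count

    def record(self, predicted: str, actual: str) -> None:
        self._totals[predicted] += 1
        if predicted != actual:
            self._errors[predicted][actual] += 1

    def error_rate(self, predicted: str) -> float:
        """Share of predictions for this category that were later corrected."""
        total = self._totals[predicted]
        if total == 0:
            return 0.0
        return sum(self._errors[predicted].values()) / total

    def top_confusions(self, n: int = 3):
        """Most frequent (predicted, corrected) pairs, most common first."""
        pairs = Counter()
        for predicted, actuals in self._errors.items():
            for actual, count in actuals.items():
                pairs[(predicted, actual)] += count
        return pairs.most_common(n)
```

A high error rate for "simple" that is mostly corrected to "complex" is the expensive kind of mistake, since those prompts went to a model that could not handle them. That is the pair to watch when tuning thresholds.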

Cheap models fail silently. A cheap model handed a task it cannot handle won't throw an error - it will just give you a worse answer. Add output validation downstream, not just routing logic upstream.
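
A minimal sketch of downstream validation with escalation: try models cheapest-first and accept the first output that passes a check. The model callables here are stubs standing in for real provider calls, and `call_with_escalation` and `is_valid` are hypothetical names.

```python
from typing import Callable, Sequence, Tuple


def call_with_escalation(
    prompt: str,
    models: Sequence[Tuple[str, Callable[[str], str]]],
    is_valid: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try models in the given (cheapest-first) order; return (name, output) of the first valid one."""
    if not models:
        raise ValueError("no models configured")
    for name, call in models:
        output = call(prompt)
        if is_valid(output):
            return name, output
    raise RuntimeError("no model produced a valid output")


# Stub callables for illustration only; real ones would call a provider.
def cheap_stub(prompt: str) -> str:
    return "maybe?"  # not one of the allowed labels


def flagship_stub(prompt: str) -> str:
    return "spam"


name, output = call_with_escalation(
    "Classify this as spam or not spam: WIN A FREE CRUISE",
    [("cheap", cheap_stub), ("flagship", flagship_stub)],
    is_valid=lambda out: out in {"spam", "not spam"},
)
print(name, output)
```

Each escalation costs an extra call, so this works best when the validity check is cheap and strict, such as a schema or an allowed-label set.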

Cold-start cost. LiteLLM manages provider connections, and the first call to a new provider carries connection overhead. Warm up your most-used routes at startup.

When to Use This Pattern

This pattern is high-value when:

You have mixed workloads: classification, summarization, generation, reasoning
Your API costs are already meaningful and growing
You have multiple providers available (Anthropic, OpenAI, and Groq are all supported)
You want a single FastAPI endpoint that handles routing transparently

It adds complexity, so a single-model setup is fine when workloads are homogeneous or costs are still low.

The Template

I packaged this as a drop-in FastAPI + pydantic-ai template that you can have running in under 10 minutes. It includes the complexity classifier, LiteLLM router, cost tracker, and a /stats endpoint for real-time spend visibility.

Get it at: https://reactance0083.gumroad.com/l/ztmlv

If you have questions about the routing logic or want to adapt it to a specific use case, open an issue on the GitHub repo: https://github.com/Reactance0083/pydantic-ai-multi-llm-cost-optimizer

Top comments (2)

Harjot Singh • May 31

Building the router yourself with pydantic-ai is the right call - the structured-output guarantee is exactly what makes routing safe, because a cheap model's answer is only usable if you can validate it conforms to a schema before you trust it. Routing + typed outputs is a great pairing: route to save money, validate to stay safe. Half the cost is a strong result for something you control end to end.

The detail that decides how well a router like this holds up: what does it route ON? Routing on prompt length or keyword heuristics is easy but leaky; routing on actual task difficulty (even a tiny classifier scoring "hard/easy" first) is what keeps quality steady as the prompt mix shifts. This is the same architecture under Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - a routing layer in front of multiple tiers, with typed/validated handoffs between agents, which is what holds a full build at ~$3 flat. Great hands-on writeup, and pydantic-ai is a smart base for it. What signal does your router use to pick the model - heuristic, classifier, or a cheap model self-assessing difficulty first?

Wade Allen • Jun 1

yeah exactly, you nailed it - the validation layer is what actually lets you get away with aggressive cost cutting without betting everything on model quality. the classifier-first approach is what i've been iterating on too, running a tiny 3b model just to bucket tasks before routing to gpt4-mini or claude-haiku. it's an extra inference call but honestly the cost delta is so small compared to sending everything to the expensive model that it basically prints money. curious what difficulty signals you've found work best in practice, been torn between actual reasoning depth vs token count predictions?