The Problem: Every Prompt Costs Money, But Not Every Prompt Needs GPT-4
You're running an AI system in production. Some requests need Claude's reasoning depth. Others are simple classification tasks that Groq can handle in milliseconds for a fraction of the cost. The trap most teams fall into: they pick one model and stick with it.
Here's the math that breaks you. Your engineering team sends a 50-token prompt asking "Is this email spam?" to GPT-4o. Cost: ~$0.015. Groq does the same thing for ~$0.0001. Scale that to 100k prompts per day across your product, and at those per-prompt figures you're hemorrhaging roughly $1,490 per day (about $45,000 per month) on decisions that don't need advanced reasoning.
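To make that arithmetic easy to rerun with your own numbers, here is a minimal sketch. The per-prompt costs are the figures quoted above, taken at face value rather than measured, and the 30-day month is an assumption.

```python
# Per-prompt costs as quoted in the text (assumed, not measured).
COST_LARGE_MODEL = 0.015    # USD per prompt on the expensive model
COST_SMALL_MODEL = 0.0001   # USD per prompt on the cheap model
PROMPTS_PER_DAY = 100_000
DAYS_PER_MONTH = 30         # assumption for the monthly figure


def daily_overspend(prompts_per_day: int, expensive: float, cheap: float) -> float:
    """Extra USD spent per day by sending every prompt to the expensive model."""
    return prompts_per_day * (expensive - cheap)


per_day = daily_overspend(PROMPTS_PER_DAY, COST_LARGE_MODEL, COST_SMALL_MODEL)
per_month = per_day * DAYS_PER_MONTH
print(f"overspend: ${per_day:,.0f}/day, ${per_month:,.0f}/month")
```

Swap in your real per-prompt costs and volume before drawing conclusions; the gap scales linearly with both.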
The other trap: building classification logic by hand. You write heuristics to detect "simple" vs. "complex" prompts. You maintain these rules. Six months later, edge cases pile up. You're debugging prompts that should have routed to Claude but hit GPT-4o instead. Now you're paying for wrong answers.
The real problem isn't picking models. It's routing each individual prompt to the minimum sufficient model without building fragile classification logic. You need infrastructure that learns what "complex" means from actual patterns in your data, then makes routing decisions automatically.
The Approach: Structured Classification + Intelligent Routing
The solution uses three layers working together:
Layer 1: Complexity Classification - Instead of writing rules, you use a lightweight model (Claude Haiku) to analyze each incoming prompt and assign a complexity score. Haiku is cheap enough that the classification cost is negligible compared to the routing savings. It outputs structured JSON describing the prompt's requirements: reasoning depth needed, external knowledge, multi-step logic, etc.
Layer 2: Model Selection - Based on the classification output, you select the cheapest model that meets the requirements. This is a simple lookup table initially, but it becomes data-driven. You track which models succeed at which complexity levels, then optimize the routing rules over time.
Layer 3: Execution + Tracking - You execute the prompt on the selected model, track the cost, track the latency, and feed that back into your cost database. This becomes your ground truth for future optimizations.
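One way the "lookup table that becomes data-driven" idea from Layer 2 could look is sketched below. The model names, the 90% success threshold, and the minimum sample count are all hypothetical, and how you decide that a response "succeeded" is left to your application.

```python
from collections import defaultdict

# Hypothetical model names, ordered cheapest first within each level.
CANDIDATES = {
    "simple": ["small-model", "mid-model"],
    "complex": ["mid-model", "large-model"],
}


class RoutingStats:
    """Tracks outcomes per (complexity, model) and picks the cheapest adequate model."""

    def __init__(self, min_success_rate: float = 0.9, min_samples: int = 20):
        self.min_success_rate = min_success_rate
        self.min_samples = min_samples
        self._ok = defaultdict(int)
        self._total = defaultdict(int)

    def record(self, complexity: str, model: str, success: bool) -> None:
        """Store one observed outcome from the tracking layer."""
        key = (complexity, model)
        self._total[key] += 1
        self._ok[key] += int(success)

    def select(self, complexity: str) -> str:
        """Return the cheapest candidate whose observed success rate is high enough."""
        candidates = CANDIDATES[complexity]
        for model in candidates:
            key = (complexity, model)
            total = self._total[key]
            if total < self.min_samples:
                return model  # too little data: fall back to the static ordering
            if self._ok[key] / total >= self.min_success_rate:
                return model
        return candidates[-1]  # nothing qualifies: use the most capable candidate
```

With no recorded data this behaves exactly like the static table; it only starts skipping a cheap model once enough failures have been logged against it.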
Why pydantic-ai over alternatives? LangChain is flexible but opaque. Building with plain requests/openai libraries means you're implementing the structured-output parsing and routing logic yourself. pydantic-ai's structured outputs are type-safe, they integrate cleanly with Pydantic models for validation, and the agent-based pattern maps well to this workflow: an agent that evaluates complexity, followed by a routing step and an execution step.
The key design decision: make the complexity classification itself cheap and fast. If it costs more to classify a prompt than to just run it on the cheapest model, the system fails. By using Haiku for classification and caching classification patterns, you ensure that the overhead is minimal.
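A minimal sketch of what caching classifications might look like, assuming exact-match caching on a normalized prompt is good enough for your traffic. The class name, the normalization rule, and the size limit are illustrative choices, not part of the original template.

```python
import hashlib
from collections import OrderedDict


class ClassificationCache:
    """Small LRU cache keyed on a hash of the normalized prompt text."""

    def __init__(self, max_entries: int = 10_000):
        self.max_entries = max_entries
        self._entries = OrderedDict()  # key -> complexity label, oldest first

    @staticmethod
    def _key(prompt: str) -> str:
        # Lowercase and collapse whitespace so trivial variations share a key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        """Return the cached complexity label, or None on a miss."""
        key = self._key(prompt)
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return self._entries[key]

    def put(self, prompt: str, complexity: str) -> None:
        """Store a label and evict the least recently used entry if over capacity."""
        key = self._key(prompt)
        self._entries[key] = complexity
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)
```

Check the cache before calling the classifier and write to it afterwards. This only helps when the same prompts recur; for templated prompts you would probably key on the template rather than the full text.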
The Central Code Pattern: Classification-Driven Routing
Here's the pattern that makes this work:
    from enum import Enum

    import litellm
    from pydantic import BaseModel
    from pydantic_ai import Agent

    # log_routing_decision, estimate_cost, log_execution and calculate_actual_cost
    # are application-specific helpers (database writes, pricing lookups) that you supply.


    class ComplexityLevel(str, Enum):
        SIMPLE = "simple"
        MODERATE = "moderate"
        COMPLEX = "complex"


    class PromptAnalysis(BaseModel):
        complexity: ComplexityLevel
        reasoning_required: bool
        external_knowledge_needed: bool
        estimated_tokens: int
        reasoning: str


    # Agent 1: Classify complexity
    # Note: newer pydantic-ai releases rename result_type to output_type
    # and result.data to result.output.
    classifier_agent = Agent(
        'anthropic:claude-3-5-haiku-20241022',
        result_type=PromptAnalysis,
    )


    @classifier_agent.system_prompt
    def classifier_system():
        return """You are a prompt complexity analyzer. Classify the incoming prompt.
    Return structured JSON with complexity level and reasoning."""


    # Routing map: complexity -> list of models (ordered by cost)
    # Model IDs use litellm naming; verify them against your providers' current catalogs.
    MODEL_ROUTING = {
        ComplexityLevel.SIMPLE: ['groq/llama-3.1-8b-instant', 'gpt-4o-mini', 'claude-3-5-haiku-20241022'],
        ComplexityLevel.MODERATE: ['gpt-4o-mini', 'claude-3-5-sonnet-20241022'],
        ComplexityLevel.COMPLEX: ['claude-3-5-sonnet-20241022', 'gpt-4-turbo'],
    }


    async def classify_and_route(user_prompt: str) -> tuple[PromptAnalysis, str]:
        """Classify prompt, select model, return both."""
        # Step 1: Classify
        analysis = await classifier_agent.run(user_prompt)

        # Step 2: Select model from routing map
        candidate_models = MODEL_ROUTING[analysis.data.complexity]
        selected_model = candidate_models[0]  # Cheapest first

        # Step 3: Log for cost tracking
        log_routing_decision(
            prompt=user_prompt,
            complexity=analysis.data.complexity,
            selected_model=selected_model,
            estimated_cost=estimate_cost(selected_model, analysis.data.estimated_tokens),
        )
        return analysis.data, selected_model


    async def execute_with_routing(user_prompt: str) -> dict:
        """Full pipeline: classify, route, execute."""
        analysis, model = await classify_and_route(user_prompt)

        # Execute on selected model using litellm for unified interface.
        # acompletion is the async variant, so the event loop is not blocked.
        response = await litellm.acompletion(
            model=model,
            messages=[{"role": "user", "content": user_prompt}],
        )

        # Track actual cost
        log_execution(
            model=model,
            prompt_tokens=response.usage.prompt_tokens,
            completion_tokens=response.usage.completion_tokens,
            actual_cost=calculate_actual_cost(model, response.usage),
        )
        return {
            "analysis": analysis,
            "selected_model": model,
            "response": response.choices[0].message.content,
        }
What each part does: The classifier agent uses Haiku (cheap) to produce structured PromptAnalysis output. The routing map is the decision tree: given complexity, pick the cheapest model that can handle it. The execution step uses litellm as a unified interface to multiple providers (Claude, GPT-4o, Groq all speak the same API through litellm). The logging is critical: without it, you can't optimize the routing over time.
Why this pattern works: it separates concerns cleanly. Classification logic is isolated from routing logic. Model selection is deterministic and auditable. Cost tracking is built in from the start, not bolted on later.
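The pattern calls estimate_cost and calculate_actual_cost without defining them. A minimal sketch of what they could look like follows; the price table, the model names in it, and the Usage stand-in are hypothetical, and the prices are placeholders rather than real provider rates.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens: (input, output).
# Replace with your providers' current published pricing.
PRICES_PER_MILLION = {
    "small-model": (0.05, 0.08),
    "mid-model": (0.15, 0.60),
    "large-model": (3.00, 15.00),
}


@dataclass
class Usage:
    """Stand-in for a response's usage object (same two attribute names)."""
    prompt_tokens: int
    completion_tokens: int


def calculate_actual_cost(model: str, usage) -> float:
    """Cost in USD from the token counts reported after execution."""
    input_price, output_price = PRICES_PER_MILLION[model]
    total = usage.prompt_tokens * input_price + usage.completion_tokens * output_price
    return total / 1_000_000


def estimate_cost(model: str, estimated_tokens: int, output_ratio: float = 1.0) -> float:
    """Pre-execution estimate; assumes completion length = output_ratio * prompt length."""
    assumed = Usage(estimated_tokens, int(estimated_tokens * output_ratio))
    return calculate_actual_cost(model, assumed)
```

Keeping both functions on one price table means the estimate and the actual figure stay comparable in your logs. litellm also ships its own cost helper (completion_cost); check its documentation before maintaining a table by hand.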
Integration: Making It Real with FastAPI
Here's where this lives in production:
    from datetime import datetime

    from fastapi import FastAPI, HTTPException
    from fastapi.responses import JSONResponse

    app = FastAPI()


    @app.post("/ask")
    async def ask_with_routing(request: dict) -> JSONResponse:
        """Main endpoint: user sends a prompt, gets routed response."""
        user_prompt = request.get("prompt")
        if not user_prompt:
            # Raised outside the try block so the 400 is not converted into a 500
            raise HTTPException(status_code=400, detail="prompt required")

        try:
            result = await execute_with_routing(user_prompt)
        except Exception as e:
            log_error(e, user_prompt)
            raise HTTPException(status_code=500, detail=str(e))

        # Return response + metadata for frontend
        return JSONResponse({
            "response": result["response"],
            "metadata": {
                "complexity": result["analysis"].complexity.value,
                "model_used": result["selected_model"],
                # Flat placeholder rate; approximate
                "estimated_cost": result["analysis"].estimated_tokens * 0.00001,
            }
        })


    @app.get("/costs/today")
    async def get_daily_costs() -> dict:
        """Real-time cost tracking endpoint."""
        costs = fetch_costs_from_db(date=datetime.now().date())
        return {
            "total_spend": costs['total'],
            "by_model": costs['by_model'],
            "by_complexity": costs['by_complexity'],
            # Baseline (every prompt on GPT-4o at the assumed $0.015) minus actual spend
            "savings_vs_always_gpt4o": (costs['total_prompts'] * 0.015) - costs['total']
        }
The data flow: user sends a prompt to /ask. FastAPI receives it and calls execute_with_routing, which classifies the prompt, picks a model, executes on it, and logs everything to a database (PostgreSQL or similar). The endpoint then returns the response with metadata. A background worker periodically aggregates costs. The /costs/today endpoint gives real-time visibility into spending patterns.
One gotcha worth knowing: every provider enforces its own rate limits, and litellm doesn't hide them from you. If you route 1000 simple prompts to Groq simultaneously, you'll hit their API limits. You need a queue (Celery, Bull if you're on Node, or just asyncio.Queue) in front of execution to throttle by provider, not just by request count. Without it, your routing optimizer works great until you scale, then you're retrying failed requests that now route to expensive fallback models.
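A lightweight way to throttle by provider inside a single process is one asyncio.Semaphore per provider. The sketch below assumes litellm-style model IDs; the concurrency caps and the provider_of helper are hypothetical. Note that a concurrency cap is not the same thing as a requests-per-minute limit, so treat this as a first line of defense.

```python
import asyncio

# Assumed per-provider concurrency caps; tune to your actual account limits.
PROVIDER_LIMITS = {"groq": 5, "openai": 20, "anthropic": 10}
DEFAULT_LIMIT = 5
_semaphores = {}  # provider name -> asyncio.Semaphore, created lazily


def provider_of(model: str) -> str:
    """Hypothetical helper: derive the provider from a litellm-style model ID."""
    if "/" in model:
        return model.split("/", 1)[0]
    if model.startswith("claude"):
        return "anthropic"
    return "openai"


async def throttled(model: str, call):
    """Run the zero-argument coroutine function `call` under its provider's cap."""
    provider = provider_of(model)
    if provider not in _semaphores:
        limit = PROVIDER_LIMITS.get(provider, DEFAULT_LIMIT)
        _semaphores[provider] = asyncio.Semaphore(limit)
    async with _semaphores[provider]:
        return await call()
```

In the pipeline you would wrap the execution call, for example by passing a small async function that awaits the completion request for the selected model. Across multiple workers you would need a shared limiter (Redis or a task queue) instead.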
Tradeoffs and When NOT to Use This
This approach adds complexity. You're running at least one extra LLM call per user prompt (the classifier), which adds 200-500ms latency. For real-time chat applications, this might be unacceptable.
You also need to maintain the routing map. As new models launch or pricing changes, you update MODEL_ROUTING. This is manageable but not automatic.
The classifier itself can be wrong. A prompt might be marked "simple" when it actually needs complex reasoning, and you'll get a wrong answer. You need monitoring and a fallback system (if the simple model fails, retry on a more capable model).
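The fallback idea can be sketched as a loop over an escalation chain. Everything here is illustrative: the model names are hypothetical, call_model stands in for whatever async function sends a prompt to a model, and is_acceptable is a check you define (schema validation, a length check, a confidence marker).

```python
# Hypothetical escalation chain, cheapest first.
ESCALATION_CHAIN = ["small-model", "mid-model", "large-model"]


async def run_with_fallback(prompt, call_model, is_acceptable, chain=ESCALATION_CHAIN):
    """Try each model in order; return (model, answer) for the first acceptable answer.

    call_model: async function (model, prompt) -> answer text
    is_acceptable: function (answer) -> bool
    """
    last_error = None
    for model in chain:
        try:
            answer = await call_model(model, prompt)
        except Exception as exc:  # provider error or timeout: escalate
            last_error = exc
            continue
        if is_acceptable(answer):
            return model, answer
    raise RuntimeError("no model in the chain produced an acceptable answer") from last_error
```

Log every escalation: a prompt that had to climb the chain is exactly the signal that the classifier's label was too optimistic. Each retry also adds latency and cost, so keep the chain short.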
When to choose something simpler: If your prompts are homogeneous (they're all the same type), just pick one good model and move on. If you have fewer than 10k prompts per month, the optimization overhead costs more than the savings. If latency is critical and sub-200ms matters, this adds too much latency.
When this approach wins: you have diverse prompt types, cost is a serious constraint, and you can afford 200-500ms classification overhead. You're running well over 10k prompts per month across different complexity levels.
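To check where the volume threshold falls for your own system, a rough break-even model helps. The sketch below is a simplification under stated assumptions: a fixed monthly overhead for maintaining the router, a constant classifier cost per prompt, and a known share of prompts that can safely move to the cheaper model. All parameter names and example values are hypothetical.

```python
import math


def per_prompt_saving(share_downgradable: float, cost_large: float,
                      cost_small: float, classifier_cost: float) -> float:
    """Average USD saved per prompt after paying for classification."""
    return share_downgradable * (cost_large - cost_small) - classifier_cost


def break_even_volume(share_downgradable: float, cost_large: float, cost_small: float,
                      classifier_cost: float, fixed_monthly_overhead: float):
    """Prompts per month needed to cover the fixed overhead, or None if routing never pays."""
    saving = per_prompt_saving(share_downgradable, cost_large, cost_small, classifier_cost)
    if saving <= 0:
        return None  # the classifier costs more than the routing saves
    return math.ceil(fixed_monthly_overhead / saving)


# Example inputs (assumptions): half of all prompts can be downgraded,
# classification costs $0.0005 per prompt, upkeep costs $100 per month.
volume = break_even_volume(0.5, 0.015, 0.0001, 0.0005, 100.0)
print(f"break-even at about {volume:,} prompts per month")
```

The result is very sensitive to the overhead figure, which in practice is mostly engineering time, so plug in an honest estimate before committing to the extra moving parts.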
Get the Full Implementation
I packaged this as an open-source template on GitHub: https://github.com/Reactance0083/pydantic-ai-multi-llm-cost-optimizer
The scaffold gives you the architecture and core patterns. It's a good foundation but lacks production details: error handling for rate limits, caching of classifications, database schema design, monitoring dashboards.
The full production version with tests, retry logic, cached classifications, and example dashboards is on Gumroad: https://reactance0083.gumroad.com/l/ztmlv
What specific part of multi-model routing is causing pain in your system? Are you currently over-provisioning to certain models, or is latency from classification a blocker for your use case? Drop a comment and let's talk through your constraints.