Most AI integrations talk to one model. The production ones run fleets.
Here's what I've learned building multi-model pipelines — where to use expensive models, where cheap ones outperform them, and how to wire it together without losing your mind.
The Mental Model Shift
Stop thinking "what's the best model" and start thinking "what's the right model for this job."
A frontier model like Claude Opus or GPT-4o is extraordinary at reasoning, nuanced writing, and complex decisions. It's also 50-100x more expensive per token than smaller models. Running everything through it is like hiring a senior engineer to do data entry.
The flip side: cheap models have gotten genuinely good at well-defined, structured tasks. Classification, extraction, templated generation, routing decisions — Haiku and GPT-4o-mini handle these reliably at a fraction of the cost.
Multi-model pipelines exploit this gap intentionally.
The Classic Split
Use expensive models for:
- Complex reasoning and analysis
- Nuanced judgment calls
- Creative generation that needs quality
- Anything where "good enough" isn't good enough
- Strategic decisions with downstream consequences
Use cheap models for:
- Classification and routing
- Structured extraction (JSON from text)
- Templated content generation
- Verification and simple checks
- High-volume preprocessing
The pattern: expensive model handles the hard thinking, cheap model does the mechanical work around it.
Pipeline Pattern 1: The Router
Cheapest possible model evaluates incoming requests and routes them.
```
[Incoming request]
        ↓
[Haiku/Mini: classify intent + complexity]
        ↓
[Low complexity]   → [Cheap model: direct response]
[High complexity]  → [Expensive model: full reasoning]
[Specialized task] → [Fine-tuned/specific model]
```
A simple routing prompt for Haiku costs basically nothing. It reads the request, assigns a complexity score (1-5), and flags it for the right handler. You only spend money on expensive models when the task actually warrants it.
Practical savings: 60-80% cost reduction on high-volume workloads.
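The routing decision itself can be a tiny pure function, kept separate from the API calls so it's easy to test. A minimal sketch — the `Classification` shape and the 1-5 complexity scale mirror the routing prompt described above; the handler names are placeholders for your real model-calling functions:

```typescript
// Output shape of the cheap routing call (hypothetical; adjust to your prompt).
interface Classification {
  complexity: number;   // 1-5, as assigned by the routing prompt
  specialized: boolean; // true when a fine-tuned model should handle it
}

type Handler = "cheap" | "expensive" | "specialized";

// Pure routing decision: no API calls here, so the logic is trivially testable.
function route(c: Classification): Handler {
  if (c.specialized) return "specialized";
  return c.complexity <= 2 ? "cheap" : "expensive";
}
```

Keeping the decision pure also means you can log and replay routing choices when tuning the complexity threshold.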
Pipeline Pattern 2: The Extractor-Reasoner
Split extraction from reasoning.
```
[Raw document/input]
        ↓
[Cheap model: extract structured data]
        ↓
[Structured JSON]
        ↓
[Expensive model: reason over structured data]
```
Processing a long document? Don't feed the whole thing to your expensive model. Feed it to a cheap model first: "extract all entities, dates, decisions, and action items as JSON." Then pass that clean structure to your reasoning model.
The expensive model now has a much smaller, cleaner input and can focus entirely on reasoning rather than parsing.
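A sketch of the extraction step, assuming the entity/date/decision/action-item schema from the example prompt above — the field names and prompt wording are illustrative, not a fixed API:

```typescript
// Hypothetical schema: adjust fields to whatever your reasoning step needs.
interface Extraction {
  entities: string[];
  dates: string[];
  decisions: string[];
  actionItems: string[];
}

// Builds the extraction prompt for the cheap model. Stating the exact JSON
// keys in the prompt makes the structured output far more reliable.
function extractionPrompt(document: string): string {
  return [
    "Extract the following from the document as a JSON object with keys",
    '"entities", "dates", "decisions", "actionItems" (each an array of strings).',
    "Return only the JSON object, no prose.",
    "",
    "Document:",
    document,
  ].join("\n");
}
```

The expensive model then receives the parsed `Extraction` object instead of the raw document.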
Pipeline Pattern 3: Draft-Polish
Generate fast, polish selectively.
```
[Task]
        ↓
[Cheap model: generate draft]
        ↓
[Expensive model: evaluate quality score]
        ↓
[Score < threshold] → [Expensive model: rewrite]
[Score ≥ threshold] → [Ship the draft]
```
For content generation at scale, most drafts are good enough. Your expensive model only kicks in when the cheap model produces something below the bar. In practice, 70-80% of outputs need no polish.
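The draft-polish loop reduces to one threshold decision. A minimal sketch with the model calls injected as functions — the `Model` type, stub parameters, and 1-10 scoring scale are assumptions, not a specific provider's API:

```typescript
// Hypothetical model signature; replace with your real cheap/expensive calls.
type Model = (prompt: string) => Promise<string>;

const QUALITY_THRESHOLD = 7; // 1-10 scale; tune per use case

async function draftAndPolish(
  task: string,
  cheap: Model,
  expensive: Model,
  score: (draft: string) => Promise<number>, // expensive model as judge
): Promise<string> {
  const draft = await cheap(task);
  const quality = await score(draft);
  // Only pay for a rewrite when the draft falls below the bar.
  return quality >= QUALITY_THRESHOLD ? draft : expensive(task);
}
```

Injecting the models as parameters keeps the control flow testable without any network calls.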
Pipeline Pattern 4: Multi-Stage Research
Research is naturally multi-stage. Each stage has different requirements.
```
[Topic]
        ↓
[Cheap: generate search queries] → [Search API calls]
        ↓
[Cheap: extract key claims from each result]
        ↓
[Cheap: deduplicate and structure findings]
        ↓
[Expensive: synthesize into coherent analysis]
```
The expensive synthesis step takes clean, structured inputs. All the messy extraction work was done cheaply.
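The stages above can be wired as a simple sequential composition. A sketch — the `Stage` type and the four stage names are illustrative; each wraps one cheap-model or search API call in practice:

```typescript
// Each stage is an async transform; only the synthesize stage is expensive.
type Stage<I, O> = (input: I) => Promise<O>;

async function runResearch(
  topic: string,
  genQueries: Stage<string, string[]>,        // cheap: search queries
  fetchAndExtract: Stage<string[], string[]>, // search API + cheap extraction
  structure: Stage<string[], string>,         // cheap: dedupe + structure
  synthesize: Stage<string, string>,          // expensive: final synthesis
): Promise<string> {
  const queries = await genQueries(topic);
  const claims = await fetchAndExtract(queries);
  const findings = await structure(claims);
  return synthesize(findings);
}
```

Each stage sees only the cleaned output of the previous one, which is what keeps the expensive model's input small.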
Model Fallback Strategy
Beyond cost optimization, you need reliability. Models have outages, rate limits, and context constraints. Hard-coding to one provider is fragile.
My fallback chain:
- Primary model (cheapest that fits the task)
- Fallback 1 (different provider, similar capability)
- Fallback 2 (more capable, higher cost)
- Human escalation (for truly unrecoverable failures)
The key: make the fallback decision at runtime, not at design time. Check response quality, not just availability.
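That runtime decision can be a small loop over the chain. A sketch, assuming a generic `ModelOption` wrapper around each provider's client — a low-quality response is treated the same as an outage and falls through to the next model:

```typescript
// Hypothetical wrapper; `call` hides each provider's actual SDK.
interface ModelOption {
  name: string;
  call: (prompt: string) => Promise<string>;
}

async function callWithFallback(
  prompt: string,
  chain: ModelOption[], // ordered: cheapest-that-fits first
  acceptable: (response: string) => boolean, // quality check, not just availability
): Promise<string> {
  for (const model of chain) {
    try {
      const response = await model.call(prompt);
      if (acceptable(response)) return response;
      // Quality check failed: fall through to the next model.
    } catch {
      // Outage or rate limit: fall through to the next provider.
    }
  }
  throw new Error("All models failed; escalate to a human");
}
```

The `acceptable` check is where "check response quality, not just availability" lives — it can be as simple as a length check or as involved as a cheap-model verification call.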
Wiring It Up
For lightweight orchestration, you don't need a framework. Direct API calls with a simple routing function work fine for most cases:
```typescript
async function runWithModel(task: Task): Promise<string> {
  const complexity = await classify(task); // cheap model
  if (complexity <= 2) {
    return await callCheapModel(task);
  } else {
    return await callExpensiveModel(task);
  }
}
```
For more complex pipelines, tools like LangGraph or CrewAI add structure. But don't reach for them until you need the complexity — simple chains are easier to debug and maintain.
Tracking Costs
The discipline that makes multi-model pipelines actually work: track model usage per task type, not just aggregate.
If you discover you're spending $0.50 per classification task that should cost $0.001, that's where you optimize. Without per-task tracking, you're flying blind.
Simple approach: log model name, token counts, and task type for every call. Aggregate weekly. You'll immediately see where expensive models are doing cheap work.
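The aggregation step is a few lines. A sketch — the `CallLog` fields follow the logging approach described above; the field names themselves are illustrative:

```typescript
// One record per model call, logged at call time.
interface CallLog {
  model: string;
  taskType: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
}

// Total spend per (taskType, model) pair, so mismatches stand out —
// e.g. an expensive model racking up spend on "classification" tasks.
function costByTask(logs: CallLog[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const log of logs) {
    const key = `${log.taskType}/${log.model}`;
    totals.set(key, (totals.get(key) ?? 0) + log.costUsd);
  }
  return totals;
}
```

Run this over a week of logs and sort by total: the top entries tell you exactly where to re-route work to a cheaper model.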
When This Gets Complicated
Multi-model pipelines add operational complexity. More models = more failure modes = more things to monitor.
Before building one, ask:
- Is my current single-model setup actually too expensive or too slow?
- Do I have enough volume to justify the complexity?
- Can I handle the additional failure modes?
If you're doing under 1000 calls/day, a single good model is probably fine. The economics of splitting only kick in at scale.
For teams building AI agent infrastructure, @webbywisp/create-ai-agent scaffolds a workspace that includes built-in patterns for sub-agent orchestration and model routing — the same structure I use for production deployments.
Multi-model isn't about using everything available. It's about using each model for exactly what it's good at.