The Routing Pattern: How Smart Teams Actually Use Fast vs Capable Models

#llm #agents #ai

Everyone building agents hits the same wall eventually.

You start with the most capable model. It handles everything beautifully. But the costs add up fast. You switch to a faster, cheaper model. Now you are missing edge cases.

The answer is not choosing one. It is building infrastructure that routes intelligently between them.

The Pattern in Practice

Teams shipping agents at scale converged on a three-tier approach:

Tier 1: Fast triage. A lightweight model handles the initial request. It classifies intent, extracts entities, and decides if escalation is needed.

Tier 2: Capable execution. Complex reasoning, code generation, and multi-step planning get routed to the heavyweight.

Tier 3: Human review. Anything neither model can resolve surfaces for manual handling.
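Here is a minimal sketch of how the three tiers could be wired together. Everything in it is an assumption for illustration: the `Triage` fields, the `handle` function, the 0.7 confidence threshold, and the convention that the capable model returns `None` when it cannot resolve a request. The models are passed in as plain callables so the flow runs without any particular provider.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Triage:
    intent: str             # label assigned by the fast model
    confidence: float       # 0.0 to 1.0, self-reported by the fast model
    needs_escalation: bool  # the fast model's own escalation decision


def handle(
    request: str,
    triage: Callable[[str], Triage],
    fast_answer: Callable[[str], str],
    capable_answer: Callable[[str], Optional[str]],
    min_confidence: float = 0.7,  # assumed threshold, tune per workload
) -> Tuple[str, Optional[str]]:
    """Return (tier, answer); answer is None when a human must take over."""
    result = triage(request)  # Tier 1 always runs first

    if not result.needs_escalation and result.confidence >= min_confidence:
        return ("fast", fast_answer(request))

    answer = capable_answer(request)  # Tier 2
    if answer is not None:
        return ("capable", answer)

    return ("human", None)  # Tier 3


# Stand-in models so the sketch runs without any API.
def demo_triage(request: str) -> Triage:
    hard = "refactor" in request.lower()
    return Triage("code" if hard else "faq", 0.9, hard)


print(handle("What are your opening hours?", demo_triage,
             lambda r: "fast reply", lambda r: "capable reply"))
print(handle("Refactor this module", demo_triage,
             lambda r: "fast reply", lambda r: None))
```

Returning the tier name alongside the answer makes it easy to log how much traffic each tier absorbs, which is the number the cost argument depends on.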

Why This Works

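The cost difference is dramatic. A fast model might cost $0.10 per million tokens. A capable model can run 10x or 20x that. If 80 percent of requests can be handled by tier 1, you cut your inference bill by roughly 70 to 76 percent, assuming similar token counts per request. Not quite the full 80, because the fast tier is not free and escalated requests still pay the capable-model price.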
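A quick calculation shows where the saving actually lands. This is a rough sketch under stated assumptions: every request uses the same number of tokens on either model, prices are per million tokens, and the function names and the `triage_all` flag are mine. With `triage_all=True` the fast model sees every request, including the ones it later escalates.

```python
def blended_price(fast_price: float, capable_price: float,
                  fast_share: float, triage_all: bool = True) -> float:
    """Average price per million tokens once routing is in place."""
    escalated_share = 1.0 - fast_share
    # Either the fast model triages everything, or it only sees its own share.
    fast_part = fast_price if triage_all else fast_price * fast_share
    return fast_part + capable_price * escalated_share


def savings(fast_price: float, capable_price: float,
            fast_share: float, triage_all: bool = True) -> float:
    """Fraction saved versus sending every request to the capable model."""
    blended = blended_price(fast_price, capable_price, fast_share, triage_all)
    return 1.0 - blended / capable_price


for multiple in (10, 20):
    fast, capable = 0.10, 0.10 * multiple
    print(f"{multiple}x: {savings(fast, capable, 0.8):.0%} with triage on every request, "
          f"{savings(fast, capable, 0.8, triage_all=False):.0%} without")
```

With an 80 percent fast-tier share this lands between about 70 and 76 percent. The gap to 80 shrinks as the price multiple grows.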

But cost is not the only factor. Speed matters. Users notice latency. A routing system lets you give instant responses for simple queries while reserving the slow model for work that actually requires it.

The Implementation Detail Most Miss

The routing logic itself needs to be cheap. If you burn half a second deciding which model to use, you have defeated the purpose.

The best routers I have seen use simple heuristics: request length, keyword matching, and a threshold on the confidence score the fast model already returned during triage. Not another ML model. A few if statements that run in milliseconds.
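As a rough illustration, those few if statements might look like the sketch below. The keyword list, the length limit, the confidence threshold, and the `route` function are all hypothetical values I picked for the example. Real ones would come from looking at your own traffic.

```python
import re

# Hypothetical values; tune them against real traffic.
ESCALATION_KEYWORDS = {"refactor", "debug", "migrate", "architecture", "prove"}
MAX_FAST_CHARS = 500
MIN_FAST_CONFIDENCE = 0.8


def route(request: str, fast_confidence: float) -> str:
    """Pick "fast" or "capable" using only cheap, local checks."""
    # Long requests tend to carry more context than the fast model handles well.
    if len(request) > MAX_FAST_CHARS:
        return "capable"

    # Match whole words so "plan" would not fire on "explanation".
    words = set(re.findall(r"[a-z]+", request.lower()))
    if words & ESCALATION_KEYWORDS:
        return "capable"

    # Reuse the confidence the fast model already reported during triage.
    if fast_confidence < MIN_FAST_CONFIDENCE:
        return "capable"

    return "fast"


print(route("What time do you open?", fast_confidence=0.95))
print(route("Please debug this stack trace", fast_confidence=0.95))
```

The checks run from cheapest to most specific, and none of them makes a network call, so the router adds essentially nothing to latency.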

Progressive summarization helps too. Instead of feeding the capable model your entire context, summarize down to what matters. The model does less work, responds faster, costs less.
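One way to read this, sketched under my own assumptions: keep the most recent turns verbatim and fold everything older into a single summary line before calling the capable model. The `compress_context` function, the `keep_recent` parameter, and the `summarize` callable are hypothetical. In practice `summarize` would probably be a call to the fast model.

```python
from typing import Callable, List


def compress_context(turns: List[str],
                     summarize: Callable[[str], str],
                     keep_recent: int = 4) -> List[str]:
    """Replace all but the last keep_recent turns with one summary line."""
    if len(turns) <= keep_recent:
        return list(turns)  # nothing old enough to fold away

    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    summary = summarize("\n".join(older))
    return ["Summary of earlier conversation: " + summary] + recent


# Stand-in summarizer so the sketch runs without a model.
def first_sentence(text: str) -> str:
    return text.split(".")[0].strip() + "."


history = [
    "User wants to move billing to a new provider. They listed five constraints.",
    "Assistant asked about data residency.",
    "User said EU only.",
    "Assistant proposed two vendors.",
    "User asked for a migration plan.",
]
for line in compress_context(history, first_sentence, keep_recent=2):
    print(line)
```

Because the output has the same shape as the input, the function can run again as the conversation grows, so the previous summary gets folded into the next one.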