Nate Voss
Model Routing: 3 Things I Learned Sending Tasks to the Cheapest Model That Actually Works

Everyone benchmarks models. Sonnet beats Haiku on reasoning. Opus beats Sonnet. Haiku is fastest. These things are all true.

But benchmarking and deploying are different games. At scale, the difference between Haiku at $0.80/million tokens and Sonnet at $3/million tokens isn't academic. It's $400+ monthly on a mid-size application. The trap is paying for capability you don't actually need because you never measured what you do need.

I built a router to answer one question: which tasks in my actual workflow could run on the cheapest model without failing? The answer surprised me. And I learned that the real value isn't the savings. It's the forcing function. You can't implement routing without auditing exactly where your complexity lives.

3 Things I Learned

1. Your Intuition About Task Complexity Is Backwards

You think something needs Sonnet. Your gut says: "this requires reasoning, obviously expensive model."

So I measured. Content classification? Haiku handles 95% of real requests. Writing summaries? 88%. Extracting structured data? 92%. The edge cases that needed Sonnet were smaller than I'd guessed. And they were always the same types of edge cases.

Here's the pattern I found: obvious cases are really obvious to Haiku. Spam detection, data validation, simple extractions. Haiku nails these. The failures cluster in a small, identifiable category: genuinely ambiguous inputs where even humans would disagree on the answer. That's when you need Sonnet's nuance.

But you don't know your edge case percentage until you try. Guessing leaves money on the table.
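Measuring those percentages is just an aggregation over your logs. Here's a minimal sketch; the record shape (`{ taskType, haikuCorrect }`) is an assumption, so adapt it to whatever your logging layer actually emits.

```javascript
// Compute Haiku's pass rate per task type from logged evaluation records.
// Each record says what kind of task it was and whether Haiku got it right.
function passRates(records) {
  const totals = {};
  for (const { taskType, haikuCorrect } of records) {
    totals[taskType] ??= { passed: 0, total: 0 };
    totals[taskType].total += 1;
    if (haikuCorrect) totals[taskType].passed += 1;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([type, { passed, total }]) => [
      type,
      passed / total,
    ])
  );
}

// Example: Haiku correct on 3 of 4 classifications, 1 of 1 summaries
const rates = passRates([
  { taskType: "classification", haikuCorrect: true },
  { taskType: "classification", haikuCorrect: true },
  { taskType: "classification", haikuCorrect: true },
  { taskType: "classification", haikuCorrect: false },
  { taskType: "summary", haikuCorrect: true },
]);
// rates.classification → 0.75, rates.summary → 1
```

Once this runs over a few thousand real requests, the "95% / 88% / 92%" numbers stop being guesses.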

2. You Need Observability Before Routing Saves Anything

The instinct is to build the router first. "Let's write logic that detects complex requests and routes to Sonnet."

This is backward. You need to measure first. Log every task with both Haiku and Sonnet responses side-by-side. Compare them. Find the patterns.

Real questions to answer:

  • When did Haiku refuse a task that Sonnet handled?
  • How often do their answers differ, and which one was right?
  • Was Haiku just uncertain, or actually wrong?

This requires instrumenting your inference layer. It takes a week. But you can't optimize what you can't see. Most teams skip this and build routers on intuition, which is why their routers are fragile.
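The shadow-comparison step can be a very small harness. This is a sketch under one assumption: the model callers are injected functions that each return `{ classification, confidence }`, so you can wire in any client (or stubs in tests) without coupling the comparison logic to an SDK.

```javascript
// Run one task through both models in shadow mode and record whether
// they agree. callHaiku/callSonnet are injected async functions assumed
// to return { classification, confidence }.
async function shadowCompare(text, callHaiku, callSonnet) {
  const [haiku, sonnet] = await Promise.all([
    callHaiku(text),
    callSonnet(text),
  ]);
  return {
    text: text.slice(0, 50),
    agreed: haiku.classification === sonnet.classification,
    haiku,
    sonnet,
  };
}

// Boil a batch of comparison records down to the number you care about
function disagreementRate(results) {
  const disagreed = results.filter((r) => !r.agreed).length;
  return disagreed / results.length;
}
```

The disagreement rate tells you where to look; the individual disagreeing records tell you why.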

3. Routing Rules Should Be Dumb, Not Smart

The temptation: build a classifier that predicts task complexity. Input length heuristics, keyword matching, embedding similarity. Something sophisticated.

Don't. Use a simple rule: "If the model reports low confidence, escalate to Sonnet."

This separates the decision from the task. Haiku tells you when it's uncertain. That's a signal you can act on immediately, without needing to predict the future.

The dumb rule wins because:

  • It adapts as your tasks change (no retraining)
  • It's testable (you can verify the confidence threshold)
  • It fails safely (escalation costs more but works)

The smart rule loses because the routing logic becomes load-bearing infrastructure: it requires constant tuning and breaks when your data distribution shifts.

How It Works

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const PROMPT = (text) =>
  `Classify this text as: safe, unsafe, or review-needed. Return JSON with {classification, confidence}.

Text: "${text}"`;

async function classify(model, text) {
  const response = await client.messages.create({
    model,
    max_tokens: 100,
    messages: [{ role: "user", content: PROMPT(text) }],
  });
  // Models sometimes wrap JSON in markdown fences; strip them before parsing
  const raw = response.content[0].text.replace(/```(json)?/g, "").trim();
  return {
    result: JSON.parse(raw),
    tokensUsed: response.usage.input_tokens + response.usage.output_tokens,
  };
}

async function classifyWithFallback(text, confidenceThreshold = 0.7) {
  // First pass: try Haiku (cheap, fast)
  const haiku = await classify("claude-3-5-haiku-20241022", text);

  // Log all Haiku decisions (even successes).
  // You're building a dataset of "when does Haiku work?"
  console.log({
    text: text.slice(0, 50),
    model: "haiku",
    classification: haiku.result.classification,
    confidence: haiku.result.confidence,
    tokensUsed: haiku.tokensUsed,
  });

  // If Haiku is unsure, escalate to Sonnet
  if (haiku.result.confidence < confidenceThreshold) {
    const sonnet = await classify("claude-3-5-sonnet-20241022", text);
    console.log({
      text: text.slice(0, 50),
      model: "sonnet",
      escalatedFrom: "haiku",
      classification: sonnet.result.classification,
      tokensUsed: sonnet.tokensUsed,
    });
    return sonnet.result;
  }

  return haiku.result;
}

That's it. Run both models in parallel during development and log the results. In production, start with Haiku, escalate on low confidence. As your logs accumulate, you'll see exactly which tasks need expensive models and which don't.

The Math

Haiku: $0.80 per 1M input tokens
Sonnet: $3 per 1M input tokens

Scenario: 1M requests/month, 200 tokens average

- All Sonnet: 1M × 200 tokens × $3/1M = $600
- 95% Haiku: every request pays the Haiku first pass (1M × 200 × $0.80/1M = $160), and the escalated 5% pay Sonnet on top (50k × 200 × $3/1M = $30), for $190 total
- Savings: $410/month

At enterprise scale (100M requests/month), the same routing saves roughly $41,000/month.

The cost difference compounds. Small routing decisions get multiplied across thousands of requests.
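The arithmetic is simple enough to encode as a sanity check. One assumption made explicit here: escalated requests are charged for both the Haiku pass and the Sonnet retry, since the fallback design always tries Haiku first.

```javascript
// Prices per 1M tokens, mirroring the figures above
const PRICE = { haiku: 0.8, sonnet: 3.0 };

function monthlyCost({ requests, tokensPerRequest, escalationRate }) {
  // Every request tries Haiku first; escalated ones also pay for Sonnet
  const haikuTokens = requests * tokensPerRequest;
  const sonnetTokens = requests * escalationRate * tokensPerRequest;
  return (haikuTokens * PRICE.haiku + sonnetTokens * PRICE.sonnet) / 1_000_000;
}

const routed = monthlyCost({
  requests: 1_000_000,
  tokensPerRequest: 200,
  escalationRate: 0.05,
});
const allSonnet = (1_000_000 * 200 * PRICE.sonnet) / 1_000_000;
// routed → 190, allSonnet → 600
```

Plug in your own volume and escalation rate before deciding whether the router is worth building.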

One Common Pitfall

You'll build a sophisticated router and wonder why it doesn't move the needle. Usually because:

  • You spent three months on routing logic but only a week validating it
  • The escalation threshold is too aggressive ("if anything looks hard, use Sonnet")
  • You're routing on heuristics, not observed behavior

The fix: measure first, always. Log both models' responses in parallel before committing to either one. You'll find that the obvious cases are really obvious, and the edge cases are smaller than you think.

When Routing Actually Works

Build it if:

  • You have >100k requests/month (smaller volume doesn't justify overhead)
  • Your requests fall into clusters (some are cheap tasks, some are hard)
  • You can measure ground truth (compare Haiku vs Sonnet, track which was right)

Don't build it if:

  • <10k requests/month (infrastructure overhead isn't worth it)
  • Every request is unique and complex (no pattern to exploit)
  • You need 99.9% accuracy (can't tolerate Haiku failures)

The Real Win

The cost savings are real. But the bigger win is the audit itself. Building a router forces you to measure exactly where your complexity actually lives. Most teams overthink what they need because they never measure. The router is just the excuse to finally look.
