Stop Paying for Reasoning: A Decision Tree for Choosing the Right Model Across 5 Task Classes
Running GPT-4o on every task is like hiring a senior engineer to sort your inbox. Most ML teams wire all inference calls to the same frontier model and call it "safe." It's not safe — it's a budget leak.
The Cost Reality
On a 1,000-sample extraction task from financial documents:
- Quantized Llama-3 70B (Q4_K_M): F1 = 0.91, ~$0.003/request
- GPT-4o: F1 = 0.94, ~$0.12/request
That's a 40x cost difference for a 3-point F1 gap.
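The point lands harder when you divide cost by quality instead of comparing them separately. A minimal sketch, using F1 as a rough proxy for the per-request chance of a correct answer (an assumption; F1 is not literally accuracy):

```python
def cost_per_correct(cost_per_request: float, quality: float) -> float:
    """Expected dollars spent per correct answer, treating `quality`
    (here, F1) as a rough stand-in for per-request correctness."""
    return cost_per_request / quality

# Numbers from the benchmark above
llama = cost_per_correct(0.003, 0.91)   # quantized Llama-3 70B
gpt4o = cost_per_correct(0.12, 0.94)    # GPT-4o

print(f"Llama: ${llama:.4f}/correct, GPT-4o: ${gpt4o:.4f}/correct")
print(f"Ratio: {gpt4o / llama:.0f}x")
```

On these numbers the frontier model still costs roughly 39x more per correct extraction, which is the gap the routing tree is trying to exploit.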
The 5-Node Decision Tree
Route tasks based on four signals:
- Input token count (< 500?)
- Output determinism (JSON/enum expected?)
- Reasoning depth score (1–5 scale)
- Latency SLA (< 200ms P95?)
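The four signals above can be sketched as a small routing function. The model tiers, threshold order, and branch logic here are illustrative assumptions, not the exact tree from any benchmark:

```python
from dataclasses import dataclass

@dataclass
class TaskSignals:
    input_tokens: int           # prompt length
    deterministic_output: bool  # JSON/enum expected?
    reasoning_depth: int        # 1-5 scale
    latency_sla_ms: float       # P95 budget

def route(sig: TaskSignals) -> str:
    """Sketch of a 4-signal router. Tier names ('frontier',
    'mid-tier', 'small-local') and cutoffs are hypothetical."""
    if sig.reasoning_depth >= 4:
        return "frontier"       # deep multi-step reasoning: pay for it
    if sig.latency_sla_ms < 200:
        return "small-local"    # tight SLA rules out large hosted models
    if sig.input_tokens < 500 and sig.deterministic_output:
        return "small-local"    # short, structured output: cheap model suffices
    if sig.reasoning_depth == 3:
        return "mid-tier"
    return "small-local"

print(route(TaskSignals(300, True, 1, 500)))    # short JSON extraction
print(route(TaskSignals(2000, False, 5, 1000))) # open-ended analysis
```

The ordering matters: reasoning depth is checked first because sending a depth-5 task to a small model fails outright, while the other branches only trade cost against marginal quality.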
Results
Routing each step of a 10-step ReAct loop through the tree cut cost per loop from $1.47 to $0.18, with an accuracy delta under 3%.
Stop optimizing cost-per-token. Optimize cost-per-correct-answer.