Mohit Verma

Posted on • Originally published at aiwithmohit.hashnode.dev

Stop Paying for Reasoning: A Decision Tree for Choosing the Right Model Across 5 Task Classes

Running GPT-4o on every task is like hiring a senior engineer to sort your inbox.

Most ML teams wire all inference calls to the same frontier model and call it "safe." It's not safe. It's a budget leak.

Here's the math that changed how I build pipelines:

A typical customer support system has two dominant task types — classification ("is this billing or technical?") and structured extraction ("pull the order ID"). Together they account for ~60% of inference calls.

Neither needs chain-of-thought reasoning. Neither benefits from a 200B+ parameter model pondering an order number.

Yet both get routed to GPT-4o by default.

I benchmarked this directly. On a 1,000-sample extraction task from financial documents:

  • Quantized Llama-3 70B (Q4_K_M): F1 = 0.91, ~$0.003/request
  • GPT-4o: F1 = 0.94, ~$0.12/request

That's a 40x cost difference for a 3-point F1 gap. In most production systems, 0.91 F1 is more than sufficient.
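The gap is even starker framed as cost per correct answer rather than cost per request. A quick back-of-the-envelope check using the benchmark numbers above:

```python
# Prices and F1 scores are the benchmark figures above.
llama_cost, llama_f1 = 0.003, 0.91
gpt4o_cost, gpt4o_f1 = 0.12, 0.94

cost_ratio = gpt4o_cost / llama_cost   # raw per-request gap
llama_cpca = llama_cost / llama_f1     # cost per correct answer
gpt4o_cpca = gpt4o_cost / gpt4o_f1

print(f"{cost_ratio:.0f}x per request")
print(f"${llama_cpca:.4f} vs ${gpt4o_cpca:.4f} per correct answer")
```

Per correct answer, the quantized model is roughly $0.0033 against $0.1277 for the frontier model.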


The 5-Node Decision Tree Framework

The framework I use now is a 5-node decision tree that routes tasks based on four signals:

  1. Input token count (< 500?)
  2. Output determinism (JSON/enum expected?)
  3. Reasoning depth score (1–5 scale)
  4. Latency SLA (< 200ms P95?)

In code, the routing logic looks like this:

```python
def route_task(prompt: str, output_schema: dict | None, latency_sla_ms: int) -> str:
    """
    Returns the model tier to use for a given task.
    Tiers: 'tier1' | 'tier2' | 'tier3'
    """
    token_count = estimate_tokens(prompt)            # lightweight tokenizer
    reasoning_depth = score_reasoning_depth(prompt)  # keyword + heuristic classifier
    is_structured = output_schema is not None
    is_latency_sensitive = latency_sla_ms < 200

    if token_count < 500 and is_structured and reasoning_depth <= 2:
        return "tier1"  # Haiku / quantized Llama — ~$0.003/request

    if reasoning_depth <= 3 and not is_latency_sensitive:
        return "tier2"  # Mid-tier — ~$0.01–0.03/request

    return "tier3"      # Frontier model only — ~$0.10–0.15/request
```

The 5 Task Classes

Tier 1 — Classification & Tool Execution

Models: Haiku / quantized Llama (Q4_K_M)

  • Binary or multi-class classification
  • Structured extraction (JSON, enums)
  • Tool call routing in agentic pipelines
  • Cost: ~$0.003/request

A sample tier-1 routing record:

```json
{
  "task": "extract_order_id",
  "tier": "tier1",
  "model": "claude-haiku-3",
  "output_schema": {
    "order_id": "string",
    "customer_id": "string",
    "issue_type": "billing | technical | shipping | other"
  }
}
```
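Tier-1 outputs are also cheap to verify. Here's a minimal validation sketch for that schema — `validate_extraction` and the enum handling are my own illustration, not a specific library API; a failed check is the signal to retry or escalate a tier:

```python
import json

ISSUE_TYPES = {"billing", "technical", "shipping", "other"}

def validate_extraction(raw: str) -> dict:
    """Parse and sanity-check a tier-1 extraction response.

    Raises ValueError when the payload doesn't match the schema.
    """
    data = json.loads(raw)
    for field in ("order_id", "customer_id", "issue_type"):
        if not isinstance(data.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    if data["issue_type"] not in ISSUE_TYPES:
        raise ValueError(f"unknown issue_type: {data['issue_type']}")
    return data

# A well-formed response passes straight through.
ok = validate_extraction(
    '{"order_id": "A-1042", "customer_id": "C-77", "issue_type": "billing"}'
)
```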

Tier 2 — Summarization & Transformation

Models: Mid-tier (e.g., GPT-4o-mini, Haiku with larger context)

  • Document summarization
  • Format conversion
  • Translation
  • Cost: ~$0.01–0.03/request

Tier 3 — Multi-step Reasoning

Models: Frontier only (GPT-4o, Claude Sonnet, Gemini 1.5 Pro)

  • Complex analysis requiring chain-of-thought
  • Code generation with debugging
  • Multi-document synthesis
  • Cost: ~$0.10–0.15/request
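One way to wire the three tiers together is a plain config table the router indexes into. The model names and prices below are illustrative placeholders based on the tier descriptions above, not a prescribed mapping:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierConfig:
    model: str              # provider model id (placeholder names)
    approx_cost_usd: float  # rough per-request cost from the tiers above

TIERS = {
    "tier1": TierConfig(model="claude-haiku-3", approx_cost_usd=0.003),
    "tier2": TierConfig(model="gpt-4o-mini",    approx_cost_usd=0.02),
    "tier3": TierConfig(model="gpt-4o",         approx_cost_usd=0.12),
}

def model_for(tier: str) -> str:
    return TIERS[tier].model
```

Keeping this as data rather than branching logic makes it trivial to swap models per tier without touching the router.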

The Routing Classifier

The routing classifier itself runs on a Haiku-class model. Its cost is roughly 0.1% of the savings it generates, so it effectively pays for itself on the first routed request.

The classifier evaluates:

  • Token count of the incoming prompt
  • Presence of structured output schema
  • Keyword signals for reasoning depth
  • Latency requirements from the request metadata

The reasoning-depth heuristic:

```python
REASONING_KEYWORDS = [
    "analyze", "compare", "synthesize", "debug", "explain why",
    "step by step", "chain of thought", "evaluate", "critique"
]

def score_reasoning_depth(prompt: str) -> int:
    """
    Returns a 1–5 reasoning depth score.
    1 = pure classification/extraction
    5 = deep multi-step reasoning required
    """
    prompt_lower = prompt.lower()
    keyword_hits = sum(1 for kw in REASONING_KEYWORDS if kw in prompt_lower)
    token_count = estimate_tokens(prompt)

    base_score = 1
    base_score += min(keyword_hits, 2)           # max +2 from keywords
    base_score += 1 if token_count > 1000 else 0 # long prompts skew complex
    base_score += 1 if token_count > 3000 else 0 # very long = almost certainly tier3
    return min(base_score, 5)
```
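To make the scorer runnable on its own, here is the same logic with a stand-in `estimate_tokens` — the rough 4-characters-per-token heuristic, which you'd swap for a real tokenizer (e.g. tiktoken) in production:

```python
REASONING_KEYWORDS = [
    "analyze", "compare", "synthesize", "debug", "explain why",
    "step by step", "chain of thought", "evaluate", "critique",
]

def estimate_tokens(prompt: str) -> int:
    # Stand-in heuristic: ~4 characters per token for English text.
    return max(1, len(prompt) // 4)

def score_reasoning_depth(prompt: str) -> int:
    prompt_lower = prompt.lower()
    keyword_hits = sum(1 for kw in REASONING_KEYWORDS if kw in prompt_lower)
    token_count = estimate_tokens(prompt)

    base_score = 1
    base_score += min(keyword_hits, 2)
    base_score += 1 if token_count > 1000 else 0
    base_score += 1 if token_count > 3000 else 0
    return min(base_score, 5)

print(score_reasoning_depth("Extract the order ID as JSON."))  # 1 → tier1
print(score_reasoning_depth(
    "Analyze and compare these two contracts step by step."))  # 3 → tier2
```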

Real Production Numbers

One number from our agentic pipeline at QEval: routing a 10-step ReAct loop — frontier model only for planning, Haiku for tool execution — cut cost per loop from $1.47 to $0.18. Accuracy delta was under 3%.

```python
# Before routing: all steps on GPT-4o
# 10 steps × ~$0.147/step = $1.47/loop

# After routing:
# 2 planning steps × $0.12  = $0.24
# 8 tool steps     × $0.003 = $0.024
# 1 routing call   × $0.003 = $0.003
# Total                     = $0.267  → real-world measured: $0.18 with caching
```
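The same arithmetic as a checkable helper. Step prices are the rough tier figures from above; the caching discount that gets the measured number down to $0.18 isn't modeled here:

```python
def routed_loop_cost(planning_steps: int, tool_steps: int,
                     planning_price: float = 0.12,
                     tool_price: float = 0.003,
                     routing_price: float = 0.003) -> float:
    """Per-loop cost: frontier planning + cheap tool execution,
    plus one routing call. Prices are the rough tier figures above."""
    return (planning_steps * planning_price
            + tool_steps * tool_price
            + routing_price)

before = 10 * 0.147                      # all steps on GPT-4o
after = routed_loop_cost(2, 8)

print(f"${before:.2f} -> ${after:.3f}")  # $1.47 -> $0.267
```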

The mental shift that matters: stop optimizing cost-per-token. Optimize cost-per-correct-answer.


Implementation Checklist

  • [ ] Audit your top 5 inference call types by volume
  • [ ] Score each on reasoning depth (1–5)
  • [ ] Identify which are classification/extraction (Tier 1 candidates)
  • [ ] Build a lightweight routing classifier
  • [ ] A/B test Tier 1 model vs frontier on your actual data
  • [ ] Measure F1 delta — if < 5 points, route to Tier 1
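
The last two checklist items need no dependencies. Here's a hand-rolled micro-F1 from TP/FP/FN counts plus the routing decision — in practice you'd likely use scikit-learn's `f1_score`, and the sample counts below are hypothetical:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def should_route_to_tier1(tier1_f1: float, frontier_f1: float,
                          max_delta: float = 0.05) -> bool:
    # Route cheap when the frontier's advantage is under 5 F1 points.
    return (frontier_f1 - tier1_f1) < max_delta

# Illustrative counts from a hypothetical 1,000-sample A/B run.
tier1 = f1(tp=890, fp=70, fn=90)       # ≈ 0.92
frontier = f1(tp=930, fp=50, fn=60)    # ≈ 0.94

print(should_route_to_tier1(tier1, frontier))  # True
```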

References

  1. Stanford HAI 2025 AI Index Report
  2. Sebastian Raschka: State of LLM Reasoning and Inference Scaling
  3. NVIDIA Post-Training Quantization for LLMs
  4. ScaleMindLabs: KV Cache Compression FP8/INT4
  5. VAST Data — 2026: The Year of AI Inference

If you're building routing logic for agentic pipelines or wrestling with inference cost at scale, I'd love to compare notes — find me on LinkedIn. I share production AI/ML architecture insights regularly, and I'm always curious what thresholds and signals others are using in their own routing classifiers.
