Running GPT-4o on every task is like hiring a senior engineer to sort your inbox.
Most ML teams wire all inference calls to the same frontier model and call it "safe." It's not safe. It's a budget leak.
Here's the math that changed how I build pipelines:
A typical customer support system has two dominant task types — classification ("is this billing or technical?") and structured extraction ("pull the order ID"). Together they account for ~60% of inference calls.
Neither needs chain-of-thought reasoning. Neither benefits from a 200B+ parameter model pondering an order number.
Yet both get routed to GPT-4o by default.
I benchmarked this directly. On a 1,000-sample extraction task from financial documents:
- Quantized Llama-3 70B (Q4_K_M): F1 = 0.91, ~$0.003/request
- GPT-4o: F1 = 0.94, ~$0.12/request
That's a 40x cost difference for a 3-point F1 gap. In most production systems, 0.91 F1 is more than sufficient.
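The tradeoff is clearer if you divide cost by accuracy instead of comparing them separately. A minimal sketch, using the benchmark figures above and treating F1 as a rough proxy for the fraction of correct answers (a simplifying assumption):

```python
def cost_per_correct(cost_per_request: float, f1: float) -> float:
    """Approximate cost per correct answer, using F1 as a correctness proxy."""
    return cost_per_request / f1

llama_cpc = cost_per_correct(0.003, 0.91)   # quantized Llama-3 70B figures from the benchmark
gpt4o_cpc = cost_per_correct(0.12, 0.94)    # GPT-4o figures from the benchmark

print(f"Llama:  ${llama_cpc:.4f}/correct answer")
print(f"GPT-4o: ${gpt4o_cpc:.4f}/correct answer")
print(f"Ratio:  {gpt4o_cpc / llama_cpc:.0f}x")  # still ~39x in the cheaper model's favor
```

Even after penalizing the cheaper model for its lower F1, the gap barely moves: roughly $0.0033 vs. $0.1277 per correct answer.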
## The 5-Node Decision Tree Framework
The framework I use now is a 5-node decision tree that routes tasks based on four signals:
- Input token count (< 500?)
- Output determinism (JSON/enum expected?)
- Reasoning depth score (1–5 scale)
- Latency SLA (< 200ms P95?)
```python
def route_task(prompt: str, output_schema: dict | None, latency_sla_ms: int) -> str:
    """
    Returns the model tier to use for a given task.
    Tiers: 'tier1' | 'tier2' | 'tier3'
    """
    token_count = estimate_tokens(prompt)            # lightweight tokenizer
    reasoning_depth = score_reasoning_depth(prompt)  # keyword + heuristic classifier
    is_structured = output_schema is not None
    is_latency_sensitive = latency_sla_ms < 200

    if token_count < 500 and is_structured and reasoning_depth <= 2:
        return "tier1"  # Haiku / quantized Llama — ~$0.003/request
    if reasoning_depth <= 3 and not is_latency_sensitive:
        return "tier2"  # Mid-tier — ~$0.01–0.03/request
    return "tier3"      # Frontier model only — ~$0.10–0.15/request
```
## The 5 Task Classes
### Tier 1 — Classification & Tool Execution
Models: Haiku / quantized Llama (Q4_K_M)
- Binary or multi-class classification
- Structured extraction (JSON, enums)
- Tool call routing in agentic pipelines
- Cost: ~$0.003/request
```json
{
  "task": "extract_order_id",
  "tier": "tier1",
  "model": "claude-haiku-3",
  "output_schema": {
    "order_id": "string",
    "customer_id": "string",
    "issue_type": "billing | technical | shipping | other"
  }
}
```
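A downstream consumer still has to check that Tier 1 output actually conforms. A minimal sketch of a validator for the schema above; note that the `"billing | technical | shipping | other"` notation is this post's informal shorthand rather than JSON Schema, so the enum is hand-expanded here:

```python
ISSUE_TYPES = {"billing", "technical", "shipping", "other"}

def validate_extraction(result: dict) -> bool:
    """Check the required string fields and the issue_type enum."""
    return (
        isinstance(result.get("order_id"), str)
        and isinstance(result.get("customer_id"), str)
        and result.get("issue_type") in ISSUE_TYPES
    )

print(validate_extraction({"order_id": "A-1001", "customer_id": "C-42", "issue_type": "billing"}))  # True
print(validate_extraction({"order_id": "A-1001", "issue_type": "refund"}))  # False
```

Failed validations are a useful routing signal in their own right: a Tier 1 model that repeatedly misses the schema is a candidate for escalation.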
### Tier 2 — Summarization & Transformation
Models: Mid-tier (e.g., GPT-4o-mini, Haiku with larger context)
- Document summarization
- Format conversion
- Translation
- Cost: ~$0.01–0.03/request
### Tier 3 — Multi-step Reasoning
Models: Frontier only (GPT-4o, Claude Sonnet, Gemini 1.5 Pro)
- Complex analysis requiring chain-of-thought
- Code generation with debugging
- Multi-document synthesis
- Cost: ~$0.10–0.15/request
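The three classes can live in a small registry that the router and any cost dashboard share. A sketch using the illustrative figures from this post; the model names and per-request costs are placeholders, not live pricing:

```python
# Tier registry mirroring the task classes above. Tier 2 cost uses the
# midpoint of the $0.01–0.03 range quoted in the post.
TIERS = {
    "tier1": {"models": ["claude-haiku-3", "llama-3-70b-q4_k_m"], "cost_per_request": 0.003},
    "tier2": {"models": ["gpt-4o-mini"], "cost_per_request": 0.02},
    "tier3": {"models": ["gpt-4o", "claude-sonnet", "gemini-1.5-pro"], "cost_per_request": 0.125},
}

def estimated_cost(tier: str, n_requests: int) -> float:
    """Projected spend for n_requests routed to a given tier."""
    return TIERS[tier]["cost_per_request"] * n_requests

print(f"${estimated_cost('tier1', 1000):.2f}")  # $3.00 for 1k Tier 1 calls
```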
## The Routing Classifier
The routing classifier itself runs on a Haiku-class model. Its cost is roughly 0.1% of the savings it generates. It pays for itself on the first routed request.
The classifier evaluates:
- Token count of the incoming prompt
- Presence of structured output schema
- Keyword signals for reasoning depth
- Latency requirements from the request metadata
```python
REASONING_KEYWORDS = [
    "analyze", "compare", "synthesize", "debug", "explain why",
    "step by step", "chain of thought", "evaluate", "critique",
]

def score_reasoning_depth(prompt: str) -> int:
    """
    Returns a 1–5 reasoning depth score.
    1 = pure classification/extraction
    5 = deep multi-step reasoning required
    """
    prompt_lower = prompt.lower()
    keyword_hits = sum(1 for kw in REASONING_KEYWORDS if kw in prompt_lower)
    token_count = estimate_tokens(prompt)

    base_score = 1
    base_score += min(keyword_hits, 2)            # max +2 from keywords
    base_score += 1 if token_count > 1000 else 0  # long prompts skew complex
    base_score += 1 if token_count > 3000 else 0  # very long = almost certainly tier3
    return min(base_score, 5)
```
## Real Production Numbers
One data point from our agentic pipeline at QEval: routing a 10-step ReAct loop — frontier model for planning, Haiku for tool execution — cut the cost per loop from $1.47 to $0.18, with an accuracy delta under 3%.
```
# Before routing: all steps on GPT-4o
#   10 steps × ~$0.147/step = $1.47/loop
# After routing:
#   2 planning steps × $0.12 = $0.24
#   8 tool steps × $0.003    = $0.024
#   1 routing call × $0.003  = $0.003
#   Total = $0.267 → real-world measured: $0.18 with caching
```
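The same arithmetic as a reusable function, with the per-step costs above as defaults (hypothetical figures, not live pricing):

```python
def loop_cost(planning_steps: int, tool_steps: int,
              planning_cost: float = 0.12, tool_cost: float = 0.003,
              routing_cost: float = 0.003) -> float:
    """Projected cost per routed agent loop, before caching effects."""
    return planning_steps * planning_cost + tool_steps * tool_cost + routing_cost

before = 10 * 0.147                              # all steps on the frontier model
after = loop_cost(planning_steps=2, tool_steps=8)
print(f"before=${before:.2f} after=${after:.3f} savings={1 - after / before:.0%}")
# before=$1.47 after=$0.267 savings=82%
```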
The mental shift that matters: stop optimizing cost-per-token. Optimize cost-per-correct-answer.
## Implementation Checklist
- [ ] Audit your top 5 inference call types by volume
- [ ] Score each on reasoning depth (1–5)
- [ ] Identify which are classification/extraction (Tier 1 candidates)
- [ ] Build a lightweight routing classifier
- [ ] A/B test Tier 1 model vs frontier on your actual data
- [ ] Measure F1 delta — if < 5 points, route to Tier 1
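For the last two checklist items, a minimal sketch of the F1-delta gate, assuming you have true-positive/false-positive/false-negative counts from your A/B run (the counts below are illustrative, chosen to reproduce the benchmark F1 scores):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

tier1_f1 = f1(tp=910, fp=90, fn=90)     # illustrative counts -> 0.91
frontier_f1 = f1(tp=940, fp=60, fn=60)  # illustrative counts -> 0.94
delta_points = (frontier_f1 - tier1_f1) * 100

verdict = "route to tier1" if delta_points < 5 else "keep frontier"
print(f"F1 delta: {delta_points:.1f} points -> {verdict}")
```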
If you're building routing logic for agentic pipelines or wrestling with inference cost at scale, I'd love to compare notes — find me on LinkedIn. I share production AI/ML architecture insights regularly, and I'm always curious what thresholds and signals others are using in their own routing classifiers.