Building a Confidence-Based LLM Router (Cheap Model First, Escalate When Unsure)

#ai #architecture #python #showdev

Most "multi-model" tutorials just show you a switch statement mapping task types to models. That's fine, but the real cost savings come from a smarter pattern: try a cheap model first, escalate to an expensive one only when needed.
The Idea
For a given request:

1. Run a cheap, fast model (e.g. MiniMax 2.7 or Qwen3 235B).
2. Estimate confidence in the output.
3. If confidence is high → return it (cheap path).
4. If confidence is low → escalate to GPT-4o (expensive path).

In practice, 70-85% of requests never escalate (78% on the workload in Results below), so the average cost per request drops sharply: you pay for the strong model only on the remainder.
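The saving is simple arithmetic: every request pays for the cheap call, and only the escalated fraction also pays for the strong call. A minimal sketch of that calculation follows; the per-request prices are hypothetical placeholders, not figures from any provider or from this post.

```python
def expected_cost_per_request(cheap_cost: float, strong_cost: float, escalation_rate: float) -> float:
    """Average cost when every request hits the cheap model and a fraction escalates."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    # Escalated requests pay for BOTH calls: the cheap attempt is not refunded.
    return cheap_cost + escalation_rate * strong_cost


# Hypothetical prices in dollars per request (illustration only).
CHEAP = 0.0004
STRONG = 0.0040

for rate in (0.15, 0.22, 0.30):
    routed = expected_cost_per_request(CHEAP, STRONG, rate)
    print(f"escalation {rate:.0%}: ${routed * 1000:.2f} per 1K requests "
          f"vs ${STRONG * 1000:.2f} strong-only")
```

Under this model the router only wins while the escalation rate stays below `1 - cheap_cost / strong_cost`; past that point the wasted cheap calls cost more than they save.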
Why You Need a Gateway First
This pattern requires calling multiple providers from one code path. Maintaining separate SDKs for each makes the router messy. I use NovaStack so every model sits behind one OpenAI-compatible endpoint — the router just changes a string.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.novapai.ai/v1",
    api_key="nv-your-key",
)

CHEAP_MODEL = "minimax-2.7"
STRONG_MODEL = "gpt-4o"
The Router
def route_with_escalation(prompt: str, confidence_threshold: float = 0.7):
    """Return (answer, model_used), escalating when the cheap answer looks unsure."""
    # Step 1: cheap model first, requesting token logprobs for the confidence estimate
    cheap = client.chat.completions.create(
        model=CHEAP_MODEL,
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
    )

    confidence = estimate_confidence(cheap)

    # Step 2: escalate only if unsure
    if confidence >= confidence_threshold:
        return cheap.choices[0].message.content, CHEAP_MODEL

    strong = client.chat.completions.create(
        model=STRONG_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return strong.choices[0].message.content, STRONG_MODEL
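The control flow above can be exercised without network access by injecting the two model calls as plain functions. This is a minimal sketch under that assumption; `route`, `fake_cheap`, and `fake_strong` are hypothetical names for illustration, not part of the router above or of any SDK.

```python
from typing import Callable, Tuple

# call_cheap returns (answer_text, confidence in [0, 1]); call_strong returns answer_text.
CheapCall = Callable[[str], Tuple[str, float]]
StrongCall = Callable[[str], str]


def route(prompt: str, call_cheap: CheapCall, call_strong: StrongCall,
          threshold: float = 0.7) -> Tuple[str, str]:
    """Same escalation logic as the router, with the model calls passed in."""
    answer, confidence = call_cheap(prompt)
    if confidence >= threshold:
        return answer, "cheap"
    return call_strong(prompt), "strong"


# Stub models standing in for real API calls (illustration only).
def fake_cheap(prompt: str) -> Tuple[str, float]:
    return ("4", 0.95) if prompt == "2 + 2?" else ("not sure", 0.30)


def fake_strong(prompt: str) -> str:
    return "strong answer"


print(route("2 + 2?", fake_cheap, fake_strong))                   # ('4', 'cheap')
print(route("Summarize this contract", fake_cheap, fake_strong))  # ('strong answer', 'strong')
```

Keeping the decision separate from the HTTP calls also makes it easy to swap in a different confidence estimator later.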

Estimating Confidence
The simplest approach uses average token logprobs as a proxy:
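import math

def estimate_confidence(response) -> float:
    """Geometric-mean token probability of the first choice, in [0, 1]."""
    logprobs = response.choices[0].logprobs
    # Some models or providers return no logprobs; treat that as "unsure" so we escalate
    if logprobs is None or not logprobs.content:
        return 0.0
    tokens = logprobs.content
    avg_logprob = sum(t.logprob for t in tokens) / len(tokens)
    return math.exp(avg_logprob)  # geometric mean of token probabilities, between 0 and 1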
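To see what that number means: exponentiating the mean logprob gives the geometric mean of the per-token probabilities, so a single shaky token pulls the score down noticeably. The sketch below uses made-up logprobs and a hypothetical helper, `geometric_mean_prob`, that works on a plain list rather than an API response.

```python
import math


def geometric_mean_prob(token_logprobs: list[float]) -> float:
    """exp(mean logprob): the geometric mean of the per-token probabilities."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Made-up logprobs for two four-token answers (illustration only).
confident = [math.log(p) for p in (0.95, 0.90, 0.97, 0.92)]
shaky = [math.log(p) for p in (0.95, 0.20, 0.97, 0.92)]  # one uncertain token

print(round(geometric_mean_prob(confident), 3))
print(round(geometric_mean_prob(shaky), 3))
```

With the default threshold of 0.7, the first answer would stay on the cheap path and the second would escalate.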
For tasks where logprobs aren't ideal (open-ended generation), you can instead use a small classifier or a self-rating prompt.
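A self-rating prompt could look roughly like the sketch below: ask a model to score the cheap answer from 1 to 10, then map the reply to a 0-1 confidence. The template wording, `parse_rating`, and `self_rated_confidence` are all assumptions for illustration, and `ask_model` stands for any function that sends a prompt and returns the reply text.

```python
import re
from typing import Callable

SELF_RATING_TEMPLATE = (
    "Question:\n{question}\n\n"
    "Proposed answer:\n{answer}\n\n"
    "On a scale of 1 to 10, how confident are you that the proposed answer is "
    "correct and complete? Reply with the number only."
)


def parse_rating(reply: str) -> float:
    """Map the first standalone 1-10 rating in the reply to a 0-1 confidence."""
    match = re.search(r"\b(10|[1-9])\b", reply)
    if match is None:
        return 0.0  # unparseable reply counts as "unsure", which triggers escalation
    return int(match.group(1)) / 10


def self_rated_confidence(question: str, answer: str,
                          ask_model: Callable[[str], str]) -> float:
    """Build the rating prompt, send it through ask_model, and parse the reply."""
    prompt = SELF_RATING_TEMPLATE.format(question=question, answer=answer)
    return parse_rating(ask_model(prompt))
```

Self-ratings tend to be coarse, so it is worth checking them against labeled examples before trusting a threshold on them.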
Results
On my workload (mixed classification + extraction + generation):
| Metric | Single GPT-4o | Confidence Router |
| --- | --- | --- |
| Avg cost / 1K requests | $4.20 | $1.30 |
| Escalation rate | n/a | 22% |
| Quality (internal eval) | 93% | 92% |
That is roughly a 69% cost reduction ($4.20 → $1.30) for a one-point drop in quality. It's the trade most apps should take.
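Where you land on this trade-off depends on the confidence threshold. One way to pick it, sketched here under the assumption that you have logged cheap-model confidences alongside a correct/incorrect label, is to sweep a few thresholds and compare escalation rate against the accuracy of the answers kept on the cheap path. The data and the `sweep` helper are made up for illustration.

```python
from typing import List, Tuple


def sweep(samples: List[Tuple[float, bool]],
          thresholds: List[float]) -> List[Tuple[float, float, float]]:
    """For each threshold: (threshold, escalation rate, accuracy of kept cheap answers)."""
    if not samples:
        raise ValueError("need at least one sample")
    rows = []
    for t in thresholds:
        kept = [correct for conf, correct in samples if conf >= t]
        escalation_rate = 1 - len(kept) / len(samples)
        kept_accuracy = sum(kept) / len(kept) if kept else float("nan")
        rows.append((t, escalation_rate, kept_accuracy))
    return rows


# Made-up eval log: (cheap-model confidence, was the cheap answer acceptable?)
samples = [(0.95, True), (0.90, True), (0.85, True), (0.75, False),
           (0.72, True), (0.60, False), (0.55, True), (0.40, False)]

for t, esc, acc in sweep(samples, [0.5, 0.7, 0.8]):
    print(f"threshold {t:.1f}: escalate {esc:.0%}, cheap-path accuracy {acc:.0%}")
```

A higher threshold buys accuracy on the cheap path at the price of more escalations, so the right value depends on how much a wrong cheap answer costs you.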
Try It
NovaStack gives $50 in free credits on signup — enough to build and tune a router like this before committing. Point your OpenAI SDK at https://api.novapai.ai/v1 and you're set. Anthropic format works too.
