Serving cheap when two models agree: a measured cost lever

#testdev #ai #llm #infrastructure

The problem

A cost-efficient AI system sends easy work to a cheap model and escalates only hard work to an expensive frontier model. The trouble is knowing which is which. When a task has a test, like code with unit tests, you just run the test: if the cheap answer passes, serve it; if not, escalate. But most real prompts have no test. A question like "what time is the maintenance window" cannot be checked by running code. With no test, a careful system escalates almost everything, and you pay frontier prices for work a cheap model could have done.

We measured our own gateway and found exactly that. On no-test prompts in automatic mode, the system escalated to the frontier 100 percent of the time, at every context length. The cheap tier was capable, but the system did not trust it without a test, so it never served those answers.

The idea: agreement as a stand-in for a test

Instead of a test, ask a second, independent cheap model the same question. If the two cheap models agree, the answer is very likely correct, so serve it cheap. If they disagree, that is the genuinely hard case, so escalate to the frontier. Disagreement never serves a worse answer than before, because the disagreement path is the same escalation that used to happen anyway. Agreement only adds a chance to skip an unnecessary frontier call. The gate is conservative by construction: on the disagreement path, the worst it can do is pay for an avoidable escalation. The one way it can serve a wrong answer is if the two cheap models agree on the same wrong answer. That single risk is the whole ballgame, so we measured it directly.
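The rule can be sketched in a few lines of Python. This is an illustration only, not the production gate: the model callables, the `Routed` type, and the `normalize` helper are hypothetical names, and exact match after light normalization is just one assumed way to define "agree".

```python
from dataclasses import dataclass
from typing import Callable

# A model is assumed to be any callable that maps a prompt to an answer string.
Model = Callable[[str], str]


@dataclass
class Routed:
    answer: str
    tier: str  # "cheap" or "frontier"


def normalize(answer: str) -> str:
    """Assumed agreement criterion: ignore case, extra whitespace, trailing period."""
    return " ".join(answer.lower().split()).rstrip(".")


def route(prompt: str, cheap_a: Model, cheap_b: Model, frontier: Model) -> Routed:
    """Serve the cheap answer only when both cheap models agree; otherwise escalate."""
    a = cheap_a(prompt)
    b = cheap_b(prompt)
    if normalize(a) == normalize(b):
        # Agreement stands in for a passing test, so the frontier call is skipped.
        return Routed(answer=a, tier="cheap")
    # Disagreement takes the same escalation path that existed before the gate.
    return Routed(answer=frontier(prompt), tier="frontier")
```

The frontier model is only called on the disagreement branch, which is what keeps that branch no worse than the old behavior.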

The one number that can break it

The only way this gate ever serves a wrong answer is if two cheap models agree on the same wrong answer. We call the probability of that P(wrong given agree): how often the answer is wrong in the cases where the two models agreed. If it is zero, agreement is a safe stand-in for a test. So we stress-tested it.

We ran two architecturally different cheap models across four task families, including a set of brand new hard traps we wrote ourselves so they could not be memorized from training data (a reverse-percentage trap, a rise-then-fall price trap, a buy-two-get-one trap, a clock-chime interval trap, a snail-in-a-well trap, and more). The result, at n=160: P(wrong given agree) was 0.00 in every one of the four families (retrieval, reasoning, multi-fact, and the new adversarial traps), with zero agree-and-wrong cases out of 160 total.

When the two cheap models agreed, they were correct 100 percent of the time, across every category, including the traps designed to fool them. They agreed about 76 percent of the time, and each cheap check took about 0.9 seconds at the median. One honest note on rigor: an early run showed a few apparent failures that turned out to be a formatting bug in our scoring, not real errors. We caught it, fixed it, and re-ran clean. The boring re-check is the work.
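An observed rate of zero at n=160 is not the same as a true rate of zero. One standard way to bound it is the Wilson score interval. The sketch below assumes a 95 percent confidence level and uses the stated n=160 as the denominator; strictly, the denominator should be the number of cases where the models agreed, which is only given approximately here. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def wilson_interval(errors: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion errors / n."""
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    # Two-sided critical value, about 1.96 for 95 percent confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(errors=0, n=160)
print(f"{low:.3f} to {high:.3f}")  # upper bound is roughly 0.023
```

Under these assumptions, zero failures in 160 trials is still compatible with a true agree-and-wrong rate of about 2 percent, and a smaller denominator loosens the bound further.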

What it does to cost

We shipped the gate to production and re-measured escalation on no-test prompts:

| Context length | Frontier escalation before | Frontier escalation after | Accuracy after |
|---|---|---|---|
| 1,000 tokens | 100% | 40% | 100% |
| 4,000 tokens | 100% | 20% | 100% |
| 16,000 tokens | 100% | 0% | 100% |
| 32,000 tokens | 100% | 0% | 100% |

At long context, where frontier calls cost the most, escalation went from total to zero with no loss of accuracy. Across live traffic the system now serves about 91 percent of requests on the cheap tier. Our blended measured cost is about $0.002 per request, and a repeated question is served from cache at close to zero cost.
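The arithmetic behind the saving can be sketched as follows. The assumptions are loud ones: the gate always pays for two cheap calls, the old path paid for one frontier call, and the 1-to-20 price ratio is a made-up placeholder, not a measured price. Only the escalation rates come from the measurements above.

```python
def gated_cost(cheap: float, frontier: float, escalation_rate: float) -> float:
    """Expected cost per request: two cheap calls always, frontier only on disagreement."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return 2 * cheap + escalation_rate * frontier


def break_even_escalation(cheap: float, frontier: float) -> float:
    """Escalation rate below which the gate beats a frontier-only path."""
    return 1 - 2 * cheap / frontier


# Placeholder relative prices (hypothetical): cheap call = 1 unit, frontier call = 20 units.
CHEAP, FRONTIER = 1.0, 20.0
for tokens, rate in [(1_000, 0.40), (4_000, 0.20), (16_000, 0.0), (32_000, 0.0)]:
    print(tokens, gated_cost(CHEAP, FRONTIER, rate), "vs", FRONTIER)
```

With these placeholder prices the gate pays off whenever fewer than 90 percent of requests escalate. The real threshold depends on the actual price ratio between the tiers.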

Why it matters
The savings on verifiable work, like code with tests, were already real. This extends the same economics to the large class of work that has no test, which is most real questions, without guessing and without giving up accuracy. The hard cases still get the frontier, and those are exactly the cases worth caching a high-quality answer for, so the next identical question is served cheap too.

Honesty and limits

Every case we proved is a single final answer. We have not yet proven the gate on multi-step reasoning where a final answer can be right by luck on a broken chain, or where two models could agree on the same wrong chain. That is the next frontier and we are not claiming it here. The result above is for two specific cheap models; a different pair could behave differently, and widening model diversity is a known lever we hold in reserve. We are publishing the result and the honesty, not the gating engineering.

Top comments (2)

Max Quimby • Jun 30

The "conservative by construction" framing is the right instinct — making the only failure mode "pay for an avoidable escalation" is exactly how you want to bias a cost gate. The number I'd pressure-test isn't P(wrong|agree)=0 at n=160 (encouraging, but the Wilson interval at that n still leaves room for a rare correlated failure) — it's whether your two cheap models share a training-data blind spot. Architecturally different ≠ statistically independent; two models that both absorbed the same wrong Stack Overflow answer will agree confidently and wrongly, and your hand-written traps (smart, because they dodge memorization) can't surface the correlated errors neither you nor the models anticipated. Two questions: did you break the 76% agreement rate down by task family? If retrieval agrees 95% but reasoning agrees 50%, the cost win is concentrated and you could route by family. And have you considered a background cheap-vs-frontier disagreement-sampling loop to keep re-estimating P(wrong|agree) on live traffic, rather than trusting the offline 160? That turns the gate into something you can monitor for drift instead of a one-time proof.

Tom Jones • Jun 30

Hi Max,

I am so sorry it took me this long to respond. I wish I had read this yesterday :) I will be honest with you, I am learning a lot of this as I go, and your insights are exactly the kind of thing that helps me get it right. Thank you for the sharp questions. Please keep them coming.

You were right that the number to really test was not the zero from our trap questions. We wrote those ourselves, so a perfect score there mostly shows we can dodge the obvious memorized stuff, not that agreement actually means the answer is correct. So we went and ran it on a real public test set we did not write, about 380 cases. The honest result is that when the cheap models agree, they are still wrong sometimes. About 10 percent of the time on question answering, and about 29 percent of the time on summarizing.

So agreement is a real signal, but it depends on the type of task. That is your first question answered. We broke it down by task type, and the savings land exactly where you guessed. On question answering, agreement gets us most of the benefit. On summarizing it barely helps and the agreement is weak. So the right move is to decide per task type, not treat agreement as one single switch.

On independence, you nailed the real weakness. Two different models are not truly independent if they both learned from the same internet. They pick up the same blind spots and can be confidently wrong together, and the questions we wrote ourselves will never catch the shared mistakes nobody saw coming. We are not claiming they are independent. The honest next step is to measure agreement between two different models on real data we did not write, broken down by task type, and report the real range instead of one rosy number. Where we said zero, we will say it again with the real number and the honest range.

Your second idea is the one we are building right now: a background check that quietly compares the cheap answer against the expensive one on live traffic and keeps measuring how often agreement is wrong. That turns it from a one-time claim into something we watch over time, and it warns us if things drift.
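A rough sketch of the shape of such a check, in Python. Every name here is hypothetical, and the frontier answer is treated as the reference, which is itself an assumption since the frontier can also be wrong:

```python
import random
from collections import defaultdict
from typing import Callable, Optional


class AgreementAuditor:
    """Samples a fraction of agreed cheap answers and checks them against the frontier."""

    def __init__(self, sample_rate: float, rng: Optional[random.Random] = None):
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()
        self.audited = defaultdict(int)  # audited agreed answers, per task family
        self.wrong = defaultdict(int)    # audited answers the frontier contradicted

    def maybe_audit(
        self,
        family: str,
        prompt: str,
        cheap_answer: str,
        frontier: Callable[[str], str],
        same: Callable[[str, str], bool],
    ) -> None:
        # Only a sampled slice of traffic pays for the extra frontier call.
        if self.rng.random() >= self.sample_rate:
            return
        self.audited[family] += 1
        if not same(cheap_answer, frontier(prompt)):
            self.wrong[family] += 1

    def wrong_given_agree(self, family: str) -> Optional[float]:
        """Running estimate of P(wrong given agree) for one task family."""
        n = self.audited[family]
        return self.wrong[family] / n if n else None
```

Keeping the counts per task family is what lets the estimate for question answering drift separately from the one for summarizing.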

The one thing we keep no matter what is that the system is built so the worst case is paying for an extra check we did not need, never serving a wrong answer to save money.

Thank you again for pushing on this. It made the whole thing more honest and more useful.