Frontier-Quality Coding at Cheap-Tier Cost: What We Built, and How We Measured It

#ai #testing #llm #devops

This is a /dev post for people who read benchmark tables for a living. The thesis is simple: a cascade that serves most requests from a cheap local model, escalating only the hard ones to a frontier model, can hit frontier-quality coding scores at a fraction of the per-request cost. The harder claim, the one we care about, is that the reliability comes from the structure, not the model. Whether that holds over long horizons at scale is exactly what our unrun benchmarks are meant to settle, so we flag it as a goal, not a result. Below is what we measured, how we kept the scoring honest, and where we still have no number at all.

The architecture, in one paragraph

The system has two channels. The capability channel is the cheap tier: gpt-oss-120b, an open model of roughly 120B parameters that we run at a fraction of frontier price and that does the actual solving. The structure channel is a set of verification gates and guards that decide whether an answer is trustworthy or needs escalation. A cache sits in front so exact repeats do not re-solve. When the local model is confident and the guards pass, the request is served cheap. When the guards fail, it escalates to frontier. Most of the interesting behavior, and most of the measurement difficulty, lives in the structure channel.
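A minimal sketch of that routing order, to make the control flow concrete. Every name here (`serve`, `Served`, the guard signature) is hypothetical, the local model's confidence check is assumed to be just another guard, and writing served answers back to the cache is our assumption for the example, not a description of the production code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration only.
Solver = Callable[[str], str]
Guard = Callable[[str, str], bool]  # (prompt, answer) -> passes?


@dataclass
class Served:
    answer: str
    route: str  # "cache", "local", or "escalated"


def serve(prompt: str, cache: Dict[str, str], cheap: Solver,
          frontier: Solver, guards: List[Guard]) -> Served:
    # 1. Exact repeats are answered from the cache without re-solving.
    if prompt in cache:
        return Served(cache[prompt], "cache")

    # 2. The capability channel (cheap tier) does the actual solving.
    draft = cheap(prompt)

    # 3. The structure channel decides whether the draft is trustworthy.
    if all(guard(prompt, draft) for guard in guards):
        cache[prompt] = draft  # assumption: served answers are cached
        return Served(draft, "local")

    # 4. Any failed guard escalates the request to the frontier model.
    answer = frontier(prompt)
    cache[prompt] = answer
    return Served(answer, "escalated")
```

The point of the shape is that the frontier model is only reachable through a guard failure, which is why the guards, not the solver, determine the serve mix.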

How we scored coding, and why we trust it

The headline coding number comes from HumanEval+ run on the same harness as the public leaderboard. We score it leak-proof: the public/base tests gate the input (they decide whether a candidate is even admissible), and the hidden "plus" tests do the scoring. The model never sees the tests it is graded on.
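One way to read that split, as a minimal sketch. The function names and the `(args, expected)` test shape are assumptions for illustration, not the harness's actual API; the only thing the sketch is meant to show is that base tests decide admissibility while plus tests alone decide the score.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical test shape: (positional args, expected return value).
Test = Tuple[tuple, object]


def passes(candidate: Callable, tests: Iterable[Test]) -> bool:
    """True if the candidate returns the expected value on every test."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failed test.
            return False
    return True


def score_problem(candidate: Callable, base_tests, plus_tests) -> dict:
    # Base tests are the gate: a candidate that fails them is not admissible.
    admissible = passes(candidate, base_tests)
    # Plus tests are the score; they are only ever touched by the grader.
    return {"base": admissible,
            "plus": admissible and passes(candidate, plus_tests)}
```

A candidate that fits the visible base cases but misses an edge case scores `base=True, plus=False`, which is the gap between the 99.4% base and 94.5% plus figures.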

We also ran it PRISTINE: staging cache cleared first, zero cache recall, so the score reflects real solving and not memory of a prior run.

On that setup (2026-06-24, n=164 problems), the full Tirtha cascade scores 94.5% plus / 99.4% base, with 96% of problems served from the cheap tier, 3.7% escalated, and 0 cache hits (cold run). On the identical harness the same day, the frontier references land at 92.7% plus for Sonnet 4.6, 93.3% plus for Opus 4.8, and 90.2% plus for GPT-5.3-codex. The cascade sits with them, not behind them. That is the parity claim, and it is scoped to this harness, not a leaderboard submission.

The lift is the part that matters for the cost argument. The cascade's own local model, run solo (the "without Tirtha" baseline, via OpenRouter the same day), scores 84.8% plus. The cascade takes that to 94.5% plus. So roughly ten points of plus-correctness come from the structure channel, not from a bigger model.

Does the structure channel actually do the work? An ablation

We ran an ablation on our internal fleet harness (2026-06-27): full system 100% correct, 75% with verification removed, 50% with guards removed. Removing verification costs a quarter of correctness; removing the guards halves it. This is the clearest evidence we have that the structure channel, not the underlying model, is carrying the reliability lift. To be precise about scope: this is a correctness ablation, small-n, on our own internal harness. It is not a public benchmark, and it is not a long-horizon test. Whether the reliability holds across long horizons is exactly what the unrun benchmarks below are meant to settle. Read this as a directional internal result.

The cost side

Two live production snapshots. Blended cost was $0.00201 per request (313 prod requests, 2026-06-23), about 8x under the frontier per-request cost of $0.017. Serve mix was 91% local and 9% escalated, with a 7% cache-hit rate (324 requests, 2026-06-24, $4.72 saved). The cache is fast where it hits: about 0.16s retrieval, 24x to 185x faster than a fresh solve, median 71x (n=8). These are live snapshots with small n; the numbers move, so re-pull before quoting.

Long-horizon and long-context behavior

On token efficiency: for the same answer correctness on a distractor smoke test (a separate local 7B used for the context experiments, not the cascade's gpt-oss-120b; 2026-06-26), the compaction layer needed about 165 context tokens versus about 28,000 for raw full context, roughly 0.6% (about 170x fewer tokens for the same answer). On a single-needle multi-hop context-rot bench, the 7B held 100% to 28K with no rot found yet.

On the raw long-context ceiling: a single-prompt NIAH multi-hop probe (3 hops x 2 reps, n=6, 2026-06-28) was clean at 100% to 208k tokens, then hit a hard HTTP-500 cap at 216k and above. Read that cap correctly: it is a configured infrastructure limit that can be raised, not the 262k model window (224k fails too) and not a quality cliff. Requests are rejected, not degraded. Raw token-stuffing is also not how the system actually ingests long context; the compaction/memory layer is. So this probe is a floor on the plumbing, not a test of the real path.
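The rejected-versus-degraded distinction is easy to blur in a results table, so here is a minimal sketch of how a probe sweep could be summarized to keep the two apart. The `Probe` record and `summarize` helper are hypothetical, and the rows are illustrative values shaped like the probe described above, not real logs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Probe:
    tokens: int              # prompt length in tokens
    status: int              # HTTP status of the response
    correct: Optional[bool]  # None when the request was rejected


def summarize(probes: List[Probe]) -> dict:
    accepted = [p for p in probes if p.status == 200]
    rejected = [p for p in probes if p.status >= 500]
    return {
        # Longest prompt that was accepted and answered correctly.
        "max_clean_tokens": max((p.tokens for p in accepted if p.correct),
                                default=None),
        # Shortest prompt the server refused outright.
        "min_rejected_tokens": min((p.tokens for p in rejected),
                                   default=None),
        # A quality cliff would show up as an accepted but wrong answer.
        "degraded": any(p.correct is False for p in accepted),
    }


# Illustrative rows only (assumed values, not measurements).
rows = [
    Probe(200_000, 200, True),
    Probe(208_000, 200, True),
    Probe(216_000, 500, None),
    Probe(224_000, 500, None),
]
print(summarize(rows))
```

With rows like these, `degraded` stays `False` while `min_rejected_tokens` marks the cap, which is the pattern an infrastructure limit produces and a quality cliff does not.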

The numbers (one place)

Coding, cascade: 94.5% plus / 99.4% base, HumanEval+ n=164, leak-proof, PRISTINE cold run (2026-06-24).
Lift: 84.8% plus solo to 94.5% plus cascade (2026-06-24).
Frontier yardsticks, our harness, n=164: Sonnet 4.6 92.7%, Opus 4.8 93.3%, GPT-5.3-codex 90.2% plus (2026-06-24).
Ablation, our harness, small-n: 100% full / 75% no-verification / 50% no-guards (2026-06-27).
Cost: $0.00201/req, 8x under $0.017, n=313 (2026-06-23).
Serve mix: 91% local, 9% escalated, n=324 (2026-06-24).
Cache: ~0.16s, median 71x faster, n=8 (2026-06-23).
Compaction: ~165 vs ~28,000 ctx tokens at equal accuracy, ~170x (2026-06-26).
Context rot: 100% to 28K (2026-06-26).
Raw long-context: clean to 208k, hard infra cap at 216k+, n=6 (2026-06-28).
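
For readers who want to check the derived figures, a short recomputation from the raw numbers in the list above. Nothing here is newly measured; the variable names are ours.

```python
# Figures copied from the list above.
blended_cost = 0.00201   # $/request, cascade
frontier_cost = 0.017    # $/request, frontier
compact_tokens = 165     # context tokens with compaction
raw_tokens = 28_000      # context tokens, raw full context
solo_plus = 84.8         # % plus, local model solo
cascade_plus = 94.5      # % plus, full cascade

cost_ratio = frontier_cost / blended_cost
token_ratio = raw_tokens / compact_tokens
token_share = 100 * compact_tokens / raw_tokens
lift_points = cascade_plus - solo_plus

print(f"cost ratio:  {cost_ratio:.1f}x")
print(f"token ratio: {token_ratio:.0f}x ({token_share:.2f}% of raw)")
print(f"lift:        {lift_points:.1f} points")
```

The results round to the "about 8x", "about 170x", and "roughly ten points" quoted in the text.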

Honest gaps

The official long-horizon benchmarks are built but not run. RULER, LongMemEval, faithfulness, and SWE-bench all have runners written and merged, but none has executed on a clean box yet (the sandbox cannot clone or run Docker). So there is no official RULER number, no SWE-bench number, no official LongMemEval number from us today. LongMemEval in particular is the real test of the compaction/memory moat at greater than 200k and across sessions, and it is unrun. Our in-house NIAH saturates and is not citable as an official long-context result.

HumanEval+ scoring is leak-proof but the problems are public, so training contamination is possible. The ablation is small-n on our harness. The monotonic gate (a cheap draft is only served if a cheap review clears the same tests, so quality never regresses) needs tests to fire, so the no-test case is unproven. The 208k band is n=6. Cost and serve-mix are live snapshots that drift.
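
The no-test caveat is easiest to see in code. This is a minimal sketch of one reading of the monotonic gate, with the review step modeled as simply running the tests against the draft; `monotonic_gate` and its return labels are hypothetical names for illustration.

```python
from typing import Callable, List

# Hypothetical check shape: takes the draft, returns True if it passes.
Check = Callable[[str], bool]


def monotonic_gate(draft: str, tests: List[Check]) -> str:
    # With no tests the gate has nothing to fire on, so it can neither
    # clear nor reject the draft. This is the unproven case.
    if not tests:
        return "unverified"
    # The cheap draft is served only if it clears every test.
    if all(test(draft) for test in tests):
        return "serve_cheap"
    # Otherwise quality is protected by escalating instead of serving.
    return "escalate"
```

The first branch is the gap: `all()` over an empty list would return `True`, so a gate without the explicit empty check would quietly serve unverified drafts.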

What we are claiming, and what we are not

We are claiming: on our harness, leak-proof scored, the cascade matches frontier coding scores while serving most requests from a cheap local model at roughly 8x lower per-request cost, and the structure channel accounts for the reliability lift. We are not claiming an official long-horizon benchmark result, because we do not have one yet. The runners exist. The box does not. The moment we have one, the order is RULER and LongMemEval first (the real long-memory test of the compaction path), then faithfulness and SWE-bench. Each number goes here, with its date and its n, the same way these did.
