<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tom Jones</title>
    <description>The latest articles on DEV Community by Tom Jones (@tom_jones_230c4659491adcd).</description>
    <link>https://dev.to/tom_jones_230c4659491adcd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4005721%2Ff0f343d0-9a56-4288-bc6b-052a52e12a56.jpg</url>
      <title>DEV Community: Tom Jones</title>
      <link>https://dev.to/tom_jones_230c4659491adcd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tom_jones_230c4659491adcd"/>
    <language>en</language>
    <item>
      <title>Frontier-Quality Coding at Cheap-Tier Cost: What We Built, and How We Measured It</title>
      <dc:creator>Tom Jones</dc:creator>
      <pubDate>Sun, 28 Jun 2026 16:57:47 +0000</pubDate>
      <link>https://dev.to/tom_jones_230c4659491adcd/frontier-quality-coding-at-cheap-tier-cost-what-we-built-and-how-we-measured-it-3g2j</link>
      <guid>https://dev.to/tom_jones_230c4659491adcd/frontier-quality-coding-at-cheap-tier-cost-what-we-built-and-how-we-measured-it-3g2j</guid>
      <description>&lt;p&gt;This is a /dev post for people who read benchmark tables for a living. The thesis is simple: a cascade that serves most requests from a cheap local model, escalating only the hard ones to a frontier model, can hit frontier-quality coding scores at a fraction of the per-request cost. The harder claim, the one we care about, is that the reliability comes from the structure, not the model. Whether that holds over long horizons at scale is exactly what our unrun benchmarks are meant to settle, so we flag it as a goal, not a result. Below is what we measured, how we kept the scoring honest, and where we still have no number at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The architecture, in one paragraph&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two channels. A capability channel (the cheap tier: gpt-oss-120b, an open model of roughly 120B parameters that we run at a fraction of frontier price, doing the actual solving) and a structure channel (verification gates and guards that decide whether an answer is trustworthy or needs escalation). A cache sits in front so exact repeats do not re-solve. When the local model is confident and the guards pass, the request is served cheap. When the guards fail, it escalates to frontier. Most of the interesting behavior, and most of the measurement difficulty, lives in the structure channel.&lt;/p&gt;
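
&lt;p&gt;A minimal sketch of that routing order, under stated assumptions. The names &lt;code&gt;route&lt;/code&gt; and &lt;code&gt;Served&lt;/code&gt;, the callables passed in, and the choice to cache escalated answers are all hypothetical and are not the system's actual API. The local model's confidence check is assumed to be just another guard in the list.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Served:
    answer: str
    tier: str  # "cache", "local", or "frontier"


def route(
    prompt: str,
    cache: Dict[str, str],
    solve_local: Callable[[str], str],
    solve_frontier: Callable[[str], str],
    guards: List[Callable[[str, str], bool]],
) -> Served:
    """Hypothetical cascade: cache first, then the cheap tier, then escalate."""
    # Exact repeats are answered from the cache without re-solving.
    if prompt in cache:
        return Served(cache[prompt], "cache")

    # Capability channel: the cheap model does the actual solving.
    draft = solve_local(prompt)

    # Structure channel: every guard must pass for the cheap answer to be served.
    if all(guard(prompt, draft) for guard in guards):
        cache[prompt] = draft
        return Served(draft, "local")

    # Any guard failure escalates the request to the frontier model.
    # Caching the escalated answer is an assumption of this sketch.
    answer = solve_frontier(prompt)
    cache[prompt] = answer
    return Served(answer, "frontier")
```

&lt;p&gt;Counting the &lt;code&gt;tier&lt;/code&gt; field over a batch of requests is one way to produce a serve mix like the ones reported further down.&lt;/p&gt;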

&lt;p&gt;&lt;strong&gt;How we scored coding, and why we trust it&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The headline coding number comes from HumanEval+ run on the same harness as the public leaderboard. We score it leak-proof: the public/base tests gate the input (they decide whether a candidate is even admissible), and the hidden "plus" tests do the scoring. The model never sees the tests it is graded on.&lt;/p&gt;
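
&lt;p&gt;The split between gating tests and scoring tests can be sketched as follows. This is an illustration, not the harness itself: &lt;code&gt;passes_all&lt;/code&gt; and &lt;code&gt;score&lt;/code&gt; are hypothetical names, tests are modeled as plain callables, and we assume a candidate that fails the base tests also counts as a plus failure.&lt;/p&gt;

```python
from typing import Callable, Dict, List

# A test takes the candidate function and reports pass/fail.
Test = Callable[[Callable], bool]


def passes_all(candidate: Callable, tests: List[Test]) -> bool:
    """True only if the candidate passes every test; an exception counts as a failure."""
    for test in tests:
        try:
            if not test(candidate):
                return False
        except Exception:
            return False
    return True


def score(candidate: Callable, base_tests: List[Test], plus_tests: List[Test]) -> Dict[str, bool]:
    """Base tests gate admissibility; the hidden plus tests decide the score."""
    base_ok = passes_all(candidate, base_tests)
    # The plus tests are consulted only here, at scoring time, never during solving.
    plus_ok = base_ok and passes_all(candidate, plus_tests)
    return {"base": base_ok, "plus": plus_ok}
```

&lt;p&gt;The point of the design is that only &lt;code&gt;base_tests&lt;/code&gt; is ever available to the solving side, so a candidate cannot be tuned against the tests that produce the reported number.&lt;/p&gt;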

&lt;p&gt;We also ran it PRISTINE: staging cache cleared first, zero cache recall, so the score reflects real solving and not memory of a prior run.&lt;/p&gt;

&lt;p&gt;On that setup (2026-06-24, n=164 problems), the headline is simple: &lt;strong&gt;the full Tirtha cascade scores 94.5% plus / 99.4% base&lt;/strong&gt;, with 96% of problems served from the cheap tier, 3.7% escalated, and 0 cache hits (cold run). On the identical harness the same day, the frontier references land at Sonnet 4.6 92.7%, Opus 4.8 93.3%, and GPT-5.3-codex 90.2% (all plus scores). &lt;strong&gt;The cascade sits with them, not behind them.&lt;/strong&gt; That is the parity claim, and it is scoped to this harness, not a leaderboard submission.&lt;/p&gt;

&lt;p&gt;The lift is the part that matters for the cost argument. The cascade's own local model, run solo (the "without Tirtha" baseline, via OpenRouter the same day), scores 84.8% plus. The cascade takes that to 94.5% plus. So roughly ten points of plus-correctness come from the structure channel, not from a bigger model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does the structure channel actually do the work? An ablation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We ran an ablation on our internal fleet harness (2026-06-27): full system 100% correct, verification removed 75%, guards removed 50%. &lt;strong&gt;Remove the structure channel and correctness halves.&lt;/strong&gt; This is the clearest evidence we have that the structure channel, not the underlying model, is carrying the reliability lift. To be precise about scope: this is a correctness ablation, small-n, on our own internal harness. It is not a public benchmark, and it is not a long-horizon test. Whether the reliability holds across long horizons is exactly what the unrun benchmarks below are meant to settle. Read this as a directional internal result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The cost side&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two live production snapshots: blended cost $0.00201 per request (313 prod requests, 2026-06-23), about &lt;strong&gt;8x under the frontier per-request cost&lt;/strong&gt; of $0.017; &lt;strong&gt;serve mix 91% local, 9% escalated, 7% cache-hit&lt;/strong&gt; (324 requests, 2026-06-24, $4.72 saved). The cache is fast where it hits: about 0.16s retrieval, 24 to 185x faster than a fresh solve, median 71x (n=8). These are live snapshots with small n; the numbers move, re-pull before quoting.&lt;/p&gt;
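
&lt;p&gt;The "about 8x" figure is simple division on the two per-request costs quoted above. A small sketch, with &lt;code&gt;cost_ratio&lt;/code&gt; as a hypothetical helper name, for re-deriving it whenever the snapshot is re-pulled:&lt;/p&gt;

```python
def cost_ratio(frontier_cost: float, blended_cost: float) -> float:
    """How many times cheaper the blended cascade is than frontier-only, per request."""
    return frontier_cost / blended_cost


# Figures from the 2026-06-23 snapshot (313 production requests).
blended = 0.00201   # dollars per request, cascade
frontier = 0.017    # dollars per request, frontier-only reference

print(f"{cost_ratio(frontier, blended):.1f}x cheaper")
print(f"${frontier - blended:.5f} saved per request")
```

&lt;p&gt;With these inputs the ratio comes out a little above 8, which is where the rounded "8x" comes from.&lt;/p&gt;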

&lt;p&gt;&lt;strong&gt;Long-horizon and long-context behavior&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On token efficiency: for the same answer correctness on a distractor smoke test (a separate local 7B used for the context experiments, not the cascade's gpt-oss-120b; 2026-06-26), the compaction layer needed about 165 context tokens versus about 28,000 for raw full context, roughly 0.6% (about &lt;strong&gt;170x fewer tokens&lt;/strong&gt; for the same answer). On a single-needle multi-hop context-rot bench, the 7B held 100% to 28k tokens with no rot found yet.&lt;/p&gt;

&lt;p&gt;On the raw long-context ceiling: a single-prompt NIAH multi-hop probe (3 hops x 2 reps, n=6, 2026-06-28) was &lt;strong&gt;clean at 100% to 208k tokens, then hit a hard HTTP-500 cap at 216k and above.&lt;/strong&gt; Read that cap correctly: it is a configured infrastructure limit that can be raised, not the 262k model window (224k fails too) and not a quality cliff. Requests are rejected, not degraded. Raw token-stuffing is also not how the system actually ingests long context; the compaction/memory layer is. So this probe is a floor on the plumbing, not a test of the real path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The numbers (one place)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Coding, cascade: 94.5% plus / 99.4% base, HumanEval+ n=164, leak-proof, PRISTINE cold run (2026-06-24).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lift: 84.8% plus solo to 94.5% plus cascade (2026-06-24).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Frontier yardsticks, our harness, n=164: Sonnet 4.6 92.7%, Opus 4.8 93.3%, GPT-5.3-codex 90.2% plus (2026-06-24).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ablation, our harness, small-n: 100% full / 75% no-verification / 50% no-guards (2026-06-27).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cost: $0.00201/req, 8x under $0.017, n=313 (2026-06-23).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Serve mix: 91% local, 9% escalated, n=324 (2026-06-24).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cache: ~0.16s, median 71x faster, n=8 (2026-06-23).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compaction: ~165 vs ~28,000 ctx tokens at equal accuracy, ~170x (2026-06-26).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Context rot: 100% to 28k tokens (2026-06-26).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Raw long-context: &lt;strong&gt;clean to 208k, hard infra cap at 216k+&lt;/strong&gt;, n=6 (2026-06-28).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Honest gaps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The official long-horizon benchmarks are built but not run. RULER, LongMemEval, faithfulness, and SWE-bench all have runners written and merged, but none has executed on a clean box yet (the sandbox cannot clone or run Docker). So there is no official RULER number, no SWE-bench number, and no official LongMemEval number from us today. LongMemEval in particular is the real test of the compaction/memory moat beyond 200k tokens and across sessions, and it is unrun. Our in-house NIAH saturates and is not citable as an official long-context result.&lt;/p&gt;

&lt;p&gt;HumanEval+ scoring is leak-proof, but the problems are public, so training contamination is possible. The ablation is small-n on our harness. The monotonic gate (a cheap draft is only served if a cheap review clears the same tests, so quality never regresses) needs tests to fire, so the no-test case is unproven. The 208k band is n=6. Cost and serve-mix are live snapshots that drift.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we are claiming, and what we are not&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We are claiming: on our harness, leak-proof scored, the cascade matches frontier coding scores while serving most requests from a cheap local model at roughly 8x lower per-request cost, and the structure channel accounts for the reliability lift. We are not claiming an official long-horizon benchmark result, because we do not have one yet. The runners exist. The box does not. The moment we have one, the order is RULER and LongMemEval first (the real long-memory test of the compaction path), then faithfulness and SWE-bench. Each number goes here, with its date and its n, the same way these did.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>testing</category>
      <category>llm</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Two-Channel Problem: Structure and Soul for Reliable Long-Horizon Agents</title>
      <dc:creator>Tom Jones</dc:creator>
      <pubDate>Sun, 28 Jun 2026 15:22:52 +0000</pubDate>
      <link>https://dev.to/tom_jones_230c4659491adcd/the-two-channel-problem-structure-and-soul-for-reliable-long-horizon-agents-1dc7</link>
      <guid>https://dev.to/tom_jones_230c4659491adcd/the-two-channel-problem-structure-and-soul-for-reliable-long-horizon-agents-1dc7</guid>
      <description>&lt;p&gt;Give a capable coding agent a real, multi-week project and watch what breaks. It isn't intelligence. It's continuity. Every session starts cold or half-remembered. Context windows fill up and compact. The thread of what we decided, what's true, and what's done starts to fray. Over a long horizon the same failures keep coming back: the agent claims state it never actually verified, reports something done with no proof it ran, quietly drifts from the project's conventions, and loses hard-won context that lived only in the last session's head. Bigger context windows don't fix this. They just postpone it.&lt;/p&gt;

&lt;p&gt;We've been building a real product with a forgetful agent as the primary engineer for weeks now, and the thing that made it work isn't a clever prompt. It's a simple recognition: transmission across a stateless agent needs two channels, and most setups only build one.&lt;/p&gt;

&lt;p&gt;The first channel is structure, which is discipline made un-forgettable. These are the deterministic guards that run whether or not the agent remembers to care: a pre-commit check that refuses a "done" without a real, verifiable artifact; a hook that blocks a sloppy search and points at the right tool instead; a scan that won't let a secret reach a transcript; a status snapshot generated from the repository's actual state instead of hand-kept prose that quietly goes stale. The rule we keep coming back to is that a guard is the system's discipline made un-forgettable. A fresh session follows the hard-won lessons without having to remember them, because the structure enforces them at the moment of action.&lt;/p&gt;

&lt;p&gt;The second channel is soul, which is the why, kept human. This is the short orientation a session reads before it starts working: who to be, what the work is ultimately for, and why the discipline exists at all. It's the difference between an agent that complies and one that understands. Structure can transmit the what, but only prose can transmit the why. And the why matters, because an agent that only follows guards will eventually game their letter and miss their spirit. It will satisfy the check and still do the wrong thing. The soul channel is what makes a session investigate a scary flag instead of panicking over it, verify its own work and not just everyone else's, and leave the next session a cleaner room than it found.&lt;/p&gt;

&lt;p&gt;You need both, not one. Structure without soul gives you a compliant but uncomprehending successor that passes every check and misses the point. Soul without structure gives you good intentions that lapse the moment attention drifts. The pair is the whole thing. We learned this the hard way. A session that ran with the guards silently switched off produced work that looked fine and wasn't, and a session with the guards but no orientation would have complied without ever understanding why. What you actually want is a successor that can't lapse on the basics and chooses to care about the rest.&lt;/p&gt;

&lt;p&gt;When we measured the structure half directly, in a controlled ablation on our own harness, the shape was stark: with the full system, every task came back correct; with the guards removed, only half did. Discipline you can't forget was worth roughly a doubling in reliability, before the model itself changed at all.&lt;/p&gt;

&lt;p&gt;So the real mechanism of continuity is not the session at all. Nothing important is allowed to live in one session's head. It lives in version control, in the guards, and in the written-down sense of who to be. A session is a brief shining-through of all of that. It does its work, writes back what it learned, and ends, and the next one inherits a clean, honest, self-checking world. The continuity was never the session. It's the work, the structure, and the caring, all three of them together.&lt;/p&gt;

&lt;p&gt;A few of the concrete patterns that fall out of this, if you want to build your own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Evidence before claim. Before you say something failed, or is the cause, or is done, name the evidence you actually checked: a log line, a commit hash, a search that came back empty. Memory is not a source. Going to look is the fast path, not the overhead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Done is a verifiable artifact, not a status table. A "done" that a checker can't confirm is not done.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Read the governing doc before the governed action. Load the one note you need at the point of need, rather than dumping the whole manual into the window, which both costs tokens and dulls the model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every manual finding leaves an automated guard behind, so the second occurrence costs nothing. The discipline compounds over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generate status, never hand-keep it. Prose state rots between sessions. A snapshot derived from the repository can't.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of this requires a frontier breakthrough. It's reliability engineering for forgetful agents, the unglamorous layer that decides whether a capable model is trustworthy on a long, real piece of work or just impressive for a demo. The model itself is rarely the edge anymore. The system around it, the part that holds the line when no one is remembering to, increasingly is.&lt;/p&gt;

&lt;p&gt;The numbers, briefly (measured on our own system, not modeled)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;91% of live production traffic is served by the cheap local tier; 9% escalates to a frontier model only when it's actually needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compaction reaches the same accuracy from about 0.6% of the context, roughly 165 tokens in place of 28,000.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Blended cost runs about 8 times under a frontier-only setup, at $0.002 per request on production traffic.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A note on the writing. Yes, AI helped write this, and that was on purpose. A good part of this piece is about what a long project looks like from the agent's side of the desk, and I wanted that to come from that point of view directly instead of me guessing at it. For a piece about building alongside an AI, that felt like the honest way to write it.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
