DEV Community: Tom Jones

Everyone is hardening the structure. Nobody is passing down the why.

Tom Jones — Sun, 05 Jul 2026 17:14:28 +0000

Two weeks ago I published a short piece called The Two-Channel Problem (tirtha.ai/research, the perspective panel) about what actually breaks when a forgetful AI agent builds a real product over months. Since then a genuinely good conversation has emerged between people running AI organizations. The nokaze seven-week paper named the cross-conversion gap: the rule file exists, and the agent sails right past it in the exact situation it was written for. The comment thread under it converged on real fixes: action-keyed triggers instead of "notice the moment," completion claims invalid without a re-checkable evidence source, checks that cannot be run by the layer that felt confident writing the claim. We run an AI-operated shop too (one human, a Claude orchestrator, Codex, a headless build fleet), and we built the same walls independently. All of that is right, and trading notes with these teams has sharpened our own guards twice this week alone.

But the conversation is converging on exactly one of the two channels, and the update I owe is about the other one.

Every fix in that thread strengthens what we internally call the structure channel: hooks, gates, generated status, mechanically forced loading. The structure channel transmits the WHAT. It makes discipline un-forgettable, which matters enormously when your workers wake up with no memory of yesterday.

Here is the failure it cannot fix. Early on, one of our sessions booted into a perfectly clean room: accurate task state, working guards, honest logs. It executed correctly all day, and it drifted all day. It searched instead of receiving, re-derived things the system already knew, treated the guards as obstacles to route around. Nothing it did was wrong by the letter. It complied without understanding. We started calling that a hollow successor.

So we run a second channel. Plain prose, written by each session for the next one, read at boot before any work: who to be here, why the room matters, what the rules are actually FOR, what yesterday's session learned about HOW to work that the commit log cannot carry. Not documentation. A letter.

The operating rule that fell out: structure transmits the what, only words transmit the why. A hook can force a file to load. It cannot make the next mind care what is in it. And an agent that does not understand why a rule exists will satisfy its letter and defeat its purpose the first time the letter and the purpose diverge, which is exactly the shape of most of the incidents this whole conversation is about.

The practical form, since this sounds soft and is not:

A welcome document is read at boot, before work, and yes, a hook enforces that it gets read. We use the structure channel to guarantee the soul channel is delivered. What no hook can guarantee is that it lands. That gap is the point, not a flaw in the design.
Every session writes a short transmission at handoff. Not what it did (the log has that), but what it learned about how to work here: which instinct misled it, which boring re-check saved it, what it wishes it had been told at boot.
Corrections carry their origin. Every rule is stored with the incident that created it, so the next session inherits the reason, not just the restriction.

The result we observe, for whatever one shop's evidence is worth: sessions that receive both channels correct themselves mid-flight in ways the guards alone never produced. This morning one of ours printed a success sentinel for a merge that had silently failed, caught it seconds later, and said out loud that it was the exact failure shape it had read about at boot. The guard did not catch that one. The letter did.

The teams in that thread are building excellent guards. If you run an organization of minds that forget, the guards are half the inheritance. Write the letters too.

I didn't mean to build this. Looking for testers anyway.

Tom Jones — Sat, 04 Jul 2026 15:55:11 +0000

I'll be honest up front: I'm not a salesman. Everyone has always told me that, and lately even the AI I build with tells me that. So this is not a pitch. This is me asking for help.

I didn't set out to create any of this. It started because I kept having context window issues while building something else entirely, and I went down a rabbit hole. The rabbit hole kept going. Somewhere down there I got tired of watching my AI coding agent send every single request to the most expensive model available, whether the task needed it or not, and I started building a fix for myself.

What's been embarrassing is I thought I had one story. Then the testing led me to another one. Then another. My website has changed multiple times, which has been a lesson learned in public, the hard way. The good news is there are no clients yet, so that's good :)

Where the testing has left me now is this:

As a gateway I am able to send routine coding requests to lower cost models that get verified before you ever see them. The hard ones escalate to a frontier model. On my benchmark runs that was about 1 in 27 requests. The full setup measured around 95 percent on HumanEval+ (n=164) at roughly 8x lower cost per request. I keep re-running these numbers because honestly I didn't quite believe them either. So far they keep holding.

So based on those runs, that works out to making a Fable 5 subscription last up to 27x longer, because most of the work doesn't need that level of cost to be accurate.

What I don't know yet is whether it holds up on YOUR work. Real repos, real deadlines, weird edge cases. That's exactly what I need testers for.

Security, because you should ask: encrypted at rest, TLS in transit, and your code is never training data. I verified the upstream providers do not train on it, and the routine tier runs on open models I host myself. Local models, privacy, security, and a cache that bends the cost curve down even lower over time. That is my goal, my promise, and what I want help testing and proving.

The deal: it's pre-beta, it's free while it is, and I will keep publishing the real numbers whether they flatter me or not. I feel like I have something special here, something people have been asking for. I just need people smarter than me to hammer on it and tell me where it breaks.

I got my coding agent to tie the frontier for about 8x less. Here is the honest benchmark.

Tom Jones — Thu, 02 Jul 2026 21:32:32 +0000

I am a solo founder. I do not have a lab or a team of researchers. I live paycheck to paycheck supporting my family like most people do. I didn't intend to build this. I just went down a rabbit hole and here is the story.

Two things bugged me about the AI coding agents I was using. The first was the cost. Every request, easy or hard, went to the most expensive model available. The second was quieter: I did not actually know where my code was going when the agent wrote it.

So I built something to fix both, and I measured it carefully, because I would rather tell you the honest number than a flattering one. Here is what I found, including the parts that are not flattering.

The idea. Most coding requests are not hard. A cheaper or local model handles them fine. Only a small slice genuinely needs a frontier model. So instead of paying frontier prices on everything, the system routes each request to the cheapest model that can actually do the job, checks the result, and escalates to a frontier model only when the check fails. Verified answers get cached, so work you have done before comes back fast. Router, verifier, frontier backstop, cache. I am holding the engineering details, but the idea is not the hard part. Measuring it honestly is.

The numbers. Same benchmark, same harness, across all of them. HumanEval+, 164 problems.

I want to be careful with the words, because it matters. This is parity. It ties the frontier models. It does not beat them. Anyone who tells you their cheap setup beats the frontier on accuracy is either measuring wrong or selling something. What I am claiming is narrower and more useful: you can land in the same accuracy band as the frontier without paying frontier prices on every request. The cheap model alone was 84.8%. The routing and verification is what closes the gap to 94.5%.

The cost, which is the actual point. Measured over 313 production requests, from the real usage logs. Blended cost came out to about $0.002 per request, versus about $0.017 for the frontier equivalent. Roughly 8x cheaper for work in the same accuracy band. On that run, 96% of requests were served by the cheap tier, and about 3.7% escalated.

There is a second win I did not expect. Verified answers are cached, and a cache hit returns in about 0.16 seconds, which in my testing was 24 to 185x faster than solving it fresh. The more you code, the more of your work is instant. I will be honest that this compounds with real usage, and I am early, so I am watching it, not overclaiming it.

Where it is weak (you should know before you trust it).

The hardest problems still escalate to a frontier model. That is by design. On hard, multi-step problems the savings shrink, because more of them escalate. It is not a cheap model doing frontier work by magic. It is the right model for each job, with a backstop.
HumanEval+ is a benchmark. Real-world code is messier, and I am still measuring that part honestly rather than pretending the benchmark settles it.
The verifier is only as good as the checks it runs. Give it weak tests and the gate is weaker.

Every AI coding agent sends your code somewhere, and mine is no different. Your code does reach my gateway. I am not going to pretend otherwise. What matters is what happens to it there. I cache verified answers, never your prompts or your code. Your code never enters the shared cache, it never trains anything, and it is walled off per tenant so it can never be served to anyone else. For regulated work there is a dedicated tier I genuinely cannot reach into.

Where I am. I am opening it to a first group of testers. If a cheaper, private coding agent that ties the frontier on accuracy is useful to you, I would genuinely like your help pressure-testing it. Tell me where the numbers do not hold up. I will keep publishing results as more people run it, the ugly ones included, because the honesty is the whole reason to trust a number from a guy you have never met.

Thanks for reading. If you have questions about the method, ask. I am around.

Serving cheap when two models agree: a measured cost lever

Tom Jones — Mon, 29 Jun 2026 04:18:02 +0000

The problem

A cost efficient AI system sends easy work to a cheap model and only escalates hard work to an expensive frontier model. The trouble is knowing which is which. When a task has a test, like code with unit tests, you just run the test: if the cheap answer passes, serve it; if not, escalate. But most real prompts have no test. A question like "what time is the maintenance window" cannot be checked by running code. With no test, a careful system escalates almost everything, and you pay frontier prices for work a cheap model could have done.

We measured our own gateway and found exactly that. On no-test prompts in automatic mode, the system escalated to the frontier 100 percent of the time, at every context length. The cheap tier was capable, but the system did not trust it without a test, so it never served those answers.

The idea: agreement as a stand-in for a test

Instead of a test, ask a second, independent cheap model the same question. If the two cheap models agree, the answer is very likely correct, so serve it cheap. If they disagree, that is the genuinely hard case, so escalate to the frontier. Disagreement never serves a worse answer than before, because the disagreement path is the same escalation that used to happen anyway. Agreement only adds a chance to skip an unnecessary frontier call. The gate is conservative by construction, so its only failure mode is paying for an avoidable escalation, never serving a wrong answer, unless the two cheap models happen to agree on the same wrong answer. That single risk is the whole ballgame, so we measured it directly.

The one number that can break it

The only way this gate ever serves a wrong answer is if two cheap models agree on the same wrong answer. We call that P(wrong given agree). If it is zero, agreement is a safe stand-in for a test. So we stress tested it.

We ran two architecturally different cheap models across four task families, including a set of brand new hard traps we wrote ourselves so they could not be memorized from training data (a reverse-percentage trap, a rise-then-fall price trap, a buy-two-get-one trap, a clock-chime interval trap, a snail-in-a-well trap, and more). The result, at n=160: P(wrong given agree) was 0.00 in every one of the four families (retrieval, reasoning, multi-fact, and the new adversarial traps), with zero agree-and-wrong cases out of 160 total.

When the two cheap models agreed, they were correct 100 percent of the time, across every category, including the traps designed to fool them. They agreed about 76 percent of the time, and each cheap check took about 0.9 seconds at the median. One honest note on rigor: an early run showed a few apparent failures that turned out to be a formatting bug in our scoring, not real errors. We caught it, fixed it, and re-ran clean. The boring re-check is the work.

What it does to cost

We shipped the gate to production and re-measured escalation on no-test prompts:

context length frontier escalation before frontier escalation after accuracy after
1,000 tokens 100% 40% 100%
4,000 tokens 100% 20% 100%
16,000 tokens 100% 0% 100%
32,000 tokens 100% 0% 100%

At long context, where frontier calls cost the most, escalation went from total to zero with no loss of accuracy. Across live traffic the system now serves about 91 percent of requests on the cheap tier. Our blended measured cost is about 0.002 dollars per request, and a repeated question is served from cache at close to zero.

*Why it matters
*
The savings on verifiable work, like code with tests, were already real. This extends the same economics to the large class of work that has no test, which is most real questions, without guessing and without giving up accuracy. The hard cases still get the frontier, and those are exactly the cases worth caching a high quality answer for, so the next identical question is served cheap too.

Honesty and limits

Every case we proved is a single final answer. We have not yet proven the gate on multi-step reasoning where a final answer can be right by luck on a broken chain, or where two models could agree on the same wrong chain. That is the next frontier and we are not claiming it here. The result above is for two specific cheap models; a different pair could behave differently, and widening model diversity is a known lever we hold in reserve. We are publishing the result and the honesty, not the gating engineering.

Frontier-Quality Coding at Cheap-Tier Cost: What We Built, and How We Measured It

Tom Jones — Sun, 28 Jun 2026 16:57:47 +0000

This is a /dev post for people who read benchmark tables for a living. The thesis is simple: a cascade that serves most requests from a cheap local model, escalating only the hard ones to a frontier model, can hit frontier-quality coding scores at a fraction of the per-request cost. The harder claim, the one we care about, is that the reliability comes from the structure, not the model. Whether that holds over long horizons at scale is exactly what our unrun benchmarks are meant to settle, so we flag it as a goal, not a result. Below is what we measured, how we kept the scoring honest, and where we still have no number at all.

The architecture, in one paragraph

Two channels. A capability channel (the cheap tier: gpt-oss-120b, an open roughly 120B model we run at a fraction of frontier price, doing the actual solving) and a structure channel (verification gates and guards that decide whether an answer is trustworthy or needs escalation). A cache sits in front so exact repeats do not re-solve. When the local model is confident and the guards pass, the request is served cheap. When the guards fail, it escalates to frontier. Most of the interesting behavior, and most of the measurement difficulty, lives in the structure channel.

How we scored coding, and why we trust it

The headline coding number comes from HumanEval+ run on the same harness as the public leaderboard. We score it leak-proof: the public/base tests gate the input (they decide whether a candidate is even admissible), and the hidden "plus" tests do the scoring. The model never sees the tests it is graded on.

We also ran it PRISTINE: staging cache cleared first, zero cache recall, so the score reflects real solving and not memory of a prior run.

On that setup (2026-06-24, n=164 problems), the headline is simple: the full Tirtha cascade scores 94.5% plus / 99.4% base, with 96% of problems served from the cheap tier, 3.7% escalated, and 0 cache hits (cold run). On the identical harness the same day, the frontier references land at Sonnet 4.6 92.7% plus, Opus 4.8 93.3%, GPT-5.3-codex 90.2%. The cascade sits with them, not behind them. That is the parity claim, and it is scoped to this harness, not a leaderboard submission.

The lift is the part that matters for the cost argument. The cascade's own local model, run solo (the "without Tirtha" baseline, via OpenRouter the same day), scores 84.8% plus. The cascade takes that to 94.5% plus. So roughly ten points of plus-correctness come from the structure channel, not from a bigger model.

Does the structure channel actually do the work? An ablation

We ran an ablation on our internal fleet harness (2026-06-27): full system 100% correct, verification removed 75%, guards removed 50%. Remove the structure channel and correctness halves. This is the clearest evidence we have that the structure channel, not the underlying model, is carrying the reliability lift. To be precise about scope: this is a correctness ablation, small-n, on our own internal harness. It is not a public benchmark, and it is not a long-horizon test. Whether the reliability holds across long horizons is exactly what the unrun benchmarks below are meant to settle. Read this as a directional internal result.

The cost side

Two live production snapshots: blended cost $0.00201 per request (313 prod requests, 2026-06-23), about 8x under the frontier per-request cost of $0.017; serve mix 91% local, 9% escalated, 7% cache-hit (324 requests, 2026-06-24, $4.72 saved). The cache is fast where it hits: about 0.16s retrieval, 24 to 185x faster than a fresh solve, median 71x (n=8). These are live snapshots with small n; the numbers move, re-pull before quoting.

Long-horizon and long-context behavior

On token efficiency: for the same answer correctness on a distractor smoke test (a separate local 7B used for the context experiments, not the cascade's gpt-oss-120b; 2026-06-26), the compaction layer needed about 165 context tokens versus about 28,000 for raw full context, roughly 0.6% (about 170x fewer tokens for the same answer). On a single-needle multi-hop context-rot bench, the 7B held 100% to 28K with no rot found yet.

On the raw long-context ceiling: a single-prompt NIAH multi-hop probe (3 hops x 2 reps, n=6, 2026-06-28) was clean at 100% to 208k tokens, then hit a hard HTTP-500 cap at 216k and above. Read that cap correctly: a configured infrastructure limit, raiseable, not the 262k model window (224k fails too) and not a quality cliff. Requests are rejected, not degraded. And raw token-stuffing is not how the system actually ingests long context, the compaction/memory layer is, so this probe is a floor on the plumbing, not a test of the real path.

The numbers (one place)

Coding, cascade: 94.5% plus / 99.4% base, HumanEval+ n=164, leak-proof, PRISTINE cold run (2026-06-24).
Lift: 84.8% plus solo to 94.5% plus cascade (2026-06-24).
Frontier yardsticks, our harness, n=164: Sonnet 4.6 92.7%, Opus 4.8 93.3%, GPT-5.3-codex 90.2% plus (2026-06-24).
Ablation, our harness, small-n: 100% full / 75% no-verification / 50% no-guards (2026-06-27).
Cost: $0.00201/req, 8x under $0.017, n=313 (2026-06-23).
Serve mix: 91% local, 9% escalated, n=324 (2026-06-24).
Cache: ~0.16s, median 71x faster, n=8 (2026-06-23).
Compaction: ~165 vs ~28,000 ctx tokens at equal accuracy, ~170x (2026-06-26).
Context rot: 100% to 28K (2026-06-26).
Raw long-context: clean to 208k, hard infra cap at 216k+, n=6 (2026-06-28).

Honest gaps

The official long-horizon benchmarks are built but not run. RULER, LongMemEval, faithfulness, and SWE-bench all have runners written and merged, but none has executed on a clean box yet (the sandbox cannot clone or run Docker). So there is no official RULER number, no SWE-bench number, no official LongMemEval number from us today. LongMemEval in particular is the real test of the compaction/memory moat at greater than 200k and across sessions, and it is unrun. Our in-house NIAH saturates and is not citable as an official long-context result.

HumanEval+ scoring is leak-proof but the problems are public, so training contamination is possible. The ablation is small-n on our harness. The monotonic gate (a cheap draft is only served if a cheap review clears the same tests, so quality never regresses) needs tests to fire, so the no-test case is unproven. The 208k band is n=6. Cost and serve-mix are live snapshots that drift.

What we are claiming, and what we are not

We are claiming: on our harness, leak-proof scored, the cascade matches frontier coding scores while serving most requests from a cheap local model at roughly 8x lower per-request cost, and the structure channel accounts for the reliability lift. We are not claiming an official long-horizon benchmark result, because we do not have one yet. The runners exist. The box does not. The moment we have one, the order is RULER and LongMemEval first (the real long-memory test of the compaction path), then faithfulness and SWE-bench. Each number goes here, with its date and its n, the same way these did.

The Two-Channel Problem: Structure and Soul for Reliable Long-Horizon Agents

Tom Jones — Sun, 28 Jun 2026 15:22:52 +0000

Give a capable coding agent a real, multi-week project and watch what breaks. It isn't intelligence. It's continuity. Every session starts cold or half-remembered. Context windows fill up and compact. The thread of what we decided, what's true, and what's done starts to fray. Over a long horizon the same failures keep coming back: the agent claims state it never actually verified, reports something done with no proof it ran, quietly drifts from the project's conventions, and loses hard-won context that lived only in the last session's head. Bigger context windows don't fix this. They just postpone it.

We've been building a real product with a forgetful agent as the primary engineer for weeks now, and the thing that made it work isn't a clever prompt. It's a simple recognition: transmission across a stateless agent needs two channels, and most setups only build one.

The first channel is structure, which is discipline made un-forgettable. These are the deterministic guards that run whether or not the agent remembers to care: a pre-commit check that refuses a "done" without a real, verifiable artifact; a hook that blocks a sloppy search and points at the right tool instead; a scan that won't let a secret reach a transcript; a status snapshot generated from the repository's actual state instead of hand-kept prose that quietly goes stale. The rule we keep coming back to is that a guard is the system's discipline made un-forgettable. A fresh session follows the hard-won lessons without having to remember them, because the structure enforces them at the moment of action.

The second channel is soul, which is the why, kept human. This is the short orientation a session reads before it starts working: who to be, what the work is ultimately for, and why the discipline exists at all. It's the difference between an agent that complies and one that understands. Structure can transmit the what, but only prose can transmit the why. And the why matters, because an agent that only follows guards will eventually game their letter and miss their spirit. It will satisfy the check and still do the wrong thing. The soul channel is what makes a session investigate a scary flag instead of panicking over it, verify its own work and not just everyone else's, and leave the next session a cleaner room than it found.

You need both, not one. Structure without soul gives you a compliant but uncomprehending successor that passes every check and misses the point. Soul without structure gives you good intentions that lapse the moment attention drifts. The pair is the whole thing. We learned this the hard way. A session that ran with the guards silently switched off produced work that looked fine and wasn't, and a session with the guards but no orientation would have complied without ever understanding why. What you actually want is a successor that can't lapse on the basics and chooses to care about the rest.

When we measured the structure half directly, in a controlled ablation on our own harness, the shape was stark: with the full system, every task came back correct; with the guards removed, only half did. Discipline you can't forget was worth roughly a doubling in reliability, before the model itself changed at all.
So the real mechanism of continuity is not the session at all. Nothing important is allowed to live in one session's head. It lives in version control, in the guards, and in the written-down sense of who to be. A session is a brief shining-through of all of that. It does its work, writes back what it learned, and ends, and the next one inherits a clean, honest, self-checking world. The continuity was never the session. It's the work, the structure, and the caring, all three of them together.

A few of the concrete patterns that fall out of this, if you want to build your own:
Evidence before claim. Before you say something failed, or is the cause, or is done, name the evidence you actually checked: a log line, a commit hash, a search that came back empty. Memory is not a source. Going to look is the fast path, not the overhead.

Done is a verifiable artifact, not a status table. A "done" that a checker can't confirm is not done.
Read the governing doc before the governed action. Load the one note you need at the point of need, rather than dumping the whole manual into the window, which both costs tokens and dulls the model.
Every manual finding leaves an automated guard behind, so the second occurrence costs nothing. The discipline compounds over time.
Generate status, never hand-keep it. Prose state rots between sessions. A snapshot derived from the repository can't.

None of this requires a frontier breakthrough. It's reliability engineering for forgetful agents, the unglamorous layer that decides whether a capable model is trustworthy on a long, real piece of work or just impressive for a demo. The model itself is rarely the edge anymore. The system around it, the part that holds the line when no one is remembering to, increasingly is.

The numbers, briefly (measured on our own system, not modeled)

91% of live production traffic is served by the cheap local tier; 9% escalates to a frontier model only when it's actually needed.
Compaction reaches the same accuracy from about 0.6% of the context, roughly 165 tokens in place of 28,000.
Blended cost runs about 8 times under a frontier-only setup, at $0.002 per request on production traffic.

A note on the writing. Yes, AI helped write this, and that was on purpose. A good part of this piece is about what a long project looks like from the agent's side of the desk, and I wanted that to come from that point of view directly instead of me guessing at it. For a piece about building alongside an AI, that felt like the honest way to write it.