DEV Community: Harrison Guo

A Wrong Ruler Is Worse Than No Ruler: Verifying the Checks You Trust

Harrison Guo — Fri, 17 Jul 2026 19:47:14 +0000

There is one failure mode I have learned to fear more than a missing check.

A system with no check for something is at least honest about it. The gap is visible, the uncertainty is real, and everyone downstream knows not to lean on that part. A system with a wrong check for the same thing is worse, and it is worse in a specific, dangerous way. It answers with confidence. It overrules the signals that were actually right. And it does all of this wearing the one badge nobody thinks to question: it's deterministic, so it must be correct.

A wrong check with authority is worse than no check at all. No check leaves you honestly uncertain. A wrong authoritative check leaves you confidently wrong, and it silences the correct signal it overruled.

The two companion pieces to this one both push in the same direction: put as much of the system as you can onto deterministic ground, and let a rule decide wherever a rule can. I still believe every word of that. But both standards make a quiet assumption I want to drag into the light. When they say rules first, and especially when they say a deterministic rule may overrule a probabilistic judge, they assume the rule is right. This piece is about what happens when it is not, and how to make sure it is before you let it hold a veto.

The badge nobody checks

A deterministic number reads as ground truth. That is its great virtue and its hidden trap. When a metric prints 0 or 47 or false, it does not sound like an opinion, it sounds like a fact, and we grant it the standing of one. In a rules-first architecture we go further: we give the deterministic layer authority over the probabilistic one. The rule gates the output. The rule vetoes the judge. That is the correct design. A fluent model that hallucinates a verdict should not get to overrule a schema check that a parser can settle.

But look at what that authority does to the cost of an error. An advisory signal that is wrong is noise: you can ignore it. An authoritative signal that is wrong is not noise, it is a wrong answer with the power to enforce itself. And the badge of determinism, the quiet "it can't be wrong, it's just arithmetic," is exactly what stops anyone from auditing it. The check you trust the most, because it is deterministic and it is yours, is the one whose errors you are least equipped to catch. You built it to be believed.

A metric that lied

Here is the one that taught me the lesson.

Picture an evaluator for generated long-form documents, the kind a model drafts section by section. One of its quality checks looks for lazy repetition: a document that just restates the same paragraph a dozen times over is worse than one that actually develops. A reasonable thing to measure. The implementation folded each paragraph into a signature, a hash of its normalized text, and flagged the document when too many signatures came out identical.

The hash was cheap, and it was broken. It collided, so genuinely different paragraphs sometimes produced the same signature. And it was brittle to trivial edits, so the same paragraph with two words swapped or its sentences reordered hashed to something new. The net effect was a metric that missed near-duplicates when their surface form changed, and occasionally invented sameness where the content was genuinely different.
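The brittle half of that failure is easy to reproduce. The sketch below is not the evaluator's actual code: `signature` and `similarity` are hypothetical names, and SHA-256 stands in for the cheap hash, so it illustrates only the brittleness to small edits, not the collisions.

```python
import hashlib
from difflib import SequenceMatcher


def signature(paragraph: str) -> str:
    """Exact-match signature: hash of lowercased, whitespace-collapsed text."""
    normalized = " ".join(paragraph.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def similarity(a: str, b: str) -> float:
    """Direct comparison: word-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


original = "The cache is warmed at startup so the first request is fast."
swapped = "The cache is warmed at boot so the first request is quick."

# Two swapped words are enough to change the signature completely...
same_signature = signature(original) == signature(swapped)

# ...while a direct comparison still sees 10 of 12 words in common.
score = similarity(original, swapped)
```

Here `same_signature` is false while `score` is about 0.83, which is the shape of the bug: the signature check reports "different" for a pair any reader would call a near-duplicate.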

On its own, a buggy metric is just noise you could learn to ignore. But this one had been handed authority. When the model judge read the document and said, correctly, "these sections are near-duplicates of each other," the repetition metric was allowed to veto the complaint: "the signatures don't match, so the judge is imagining it." And the veto won. The score came out clean.

Sit with what that pipeline was doing on that facet. It was running its own logic backwards: using a wrong deterministic signal to overrule a correct subjective one, and then reporting a better score for the trouble. The judge had been right. The rule silenced it. And because the silencing came from a deterministic check, it arrived stamped with confidence, and every inflated score looked earned. Nobody goes back to re-litigate a number the arithmetic already settled. That is the trap closing.

The failure did not surface as an error. It surfaced as cleaner results, which is the most dangerous shape a failure can take, because clean results are what everyone was hoping for.

Why wrong-and-authoritative is the worst quadrant

Lay it out as a two-by-two: a check is either right or wrong, and it either carries authority or is merely advisory. Three of those cells are fine. A right check with authority is the whole point of the discipline. A right check that is only advisory is a mild waste. A wrong check that is only advisory is ignorable noise. It is the fourth cell, wrong and authoritative, that is not a lesser version of a good check but an active liability, and it is worse than the empty cell where no check exists at all.

Compare the two directly. With no repetition check, you know you cannot currently measure repetition. The gap is on the map, and a repetitive document simply passes unremarked, the same as anything else you haven't instrumented. With a wrong repetition check that vetoes the judge, a repetitive document passes, the one signal that correctly caught it gets overruled, and the final score ticks upward, all under a badge that discourages anyone from doubting it. The missing check costs you a blind spot you know about. The wrong check costs you the correct answer you already had, plus the false confidence that you don't need to look.

Which yields the single most useful habit I took from this: suspect the metric first. When a deterministic check refutes an observation that a careful human or a competent model would make, the base rate is not on the check's side. A hand-rolled hash is far more likely to be broken than a fluent reader is to be hallucinating "these sections are near-duplicates" about a document where they visibly are. The reflex to trust the number because it is a number is precisely the reflex that keeps the wrong number in charge.

The other way a ruler lies

There is a second ruler that lies, and it lies in the very layer the first one was busy overruling. It is the bare absolute score: quality = 0.6, emitted by a model judge with no anchors, no comparison, no zero-point, and then handled as if it were a measurement.

A 0.6 like that is not a reading; it is a decimal in the costume of one. Ask what it is 0.6 of. Better than which example, worse than which, on a scale pinned to what? There is no answer underneath. Humans and models share a specific profile here: both are reliable at ordering ("A is closer to the brief than B") and unreliable at absolute calibration ("this deserves a 0.6"). A standalone perceptual number inherits all of the unreliable half and none of the reliable half. It is a subjective opinion that has been rounded to two decimal places and thereby laundered into looking objective, the same laundering the broken hash performed, run in the other layer.

The fix rhymes with the fix for the first ruler: do not trust the number until the number has earned trust. For a subjective axis, that means scoring by comparison against anchors: a set of labelled reference examples that fix what good and excellent actually look like. Rank new outputs against them instead of emitting a free decimal. And before you trust a judge's agreement with humans, measure how well humans agree with each other. That inter-rater ceiling is the best score any judge can honestly aspire to. If two careful reviewers only agree seven times in ten on an axis, a judge that reports agreement above that ceiling is not beating the humans, it is hiding the disagreement they honestly surfaced. Reporting a judge's number without that ceiling is one more measurement with no zero-point. A ruler with no marked zero and no fixed unit is not a strict ruler; it is a confident guess holding a straightedge.
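As a rough illustration of the ceiling, here is a minimal sketch using raw percent agreement on pairwise preferences ("which of two outputs is closer to the brief"). The labels are invented for the example and `agreement` is a hypothetical helper. Raw agreement is the simplest possible measure; a chance-corrected statistic such as Cohen's kappa would be a stricter choice.

```python
def agreement(a: list[str], b: list[str]) -> float:
    """Fraction of items on which two raters gave the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Invented labels: for each of ten pairs, which output ("A" or "B") won.
reviewer_1 = ["A", "A", "B", "A", "B", "B", "A", "B", "A", "A"]
reviewer_2 = ["A", "B", "B", "A", "B", "A", "A", "B", "B", "A"]
judge = ["A", "A", "B", "A", "B", "B", "A", "B", "A", "B"]

# Human-vs-human agreement is the ceiling the judge is measured against.
ceiling = agreement(reviewer_1, reviewer_2)

# Judge-vs-human agreement is only meaningful next to that ceiling.
judge_vs_r1 = agreement(judge, reviewer_1)

# A judge "agreeing" more than the humans agree with each other is a flag to
# investigate, not a result to celebrate.
exceeds_ceiling = judge_vs_r1 > ceiling
```

With these labels the reviewers agree seven times in ten and the judge matches the first reviewer nine times in ten, so `exceeds_ceiling` is true and the judge's number should be reported alongside the 0.7, never alone.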

How a check earns its authority

So the discipline here is not "add more checks." It is narrower and more demanding: before any check is allowed to gate output or overrule another signal, verify the check itself, as hard as the claim it will silence. In practice that is four moves.

Recompute it a second, independent way. Derive the same quantity by a different method and require the two to agree. If a repetition score can be computed by hashing and also by direct comparison, do both; a disagreement is a bug in one of them, and you want to find that in a test, not discover it as a mysteriously clean production score.
Regression-test it in both directions. A case you know is repetitive must trip it; a case you know is varied must not. A metric validated only in the direction it usually fires is half-tested, and the untested half is exactly where a false veto hides. And wrong does not only mean buggy. A check with correct arithmetic and a badly chosen threshold is wrong in the same way and hides in the same untested direction, so the cases you pick have to pin the boundary, not just the obvious middle.
Run it in shadow before it gates. Put it in production computing its verdicts, but let it change nothing. Watch it at scale, side by side with the signal it will eventually overrule. A wrong metric announces itself here, as a steady stream of vetoes against observations that were plainly correct, and it does so before it has moved a single real verdict. Shadow-first is not caution for its own sake; it is the one place a confidently-wrong check reveals itself while the damage is still zero.
Treat a veto as a claim, not a license. When a rule overrules a judge, it is asserting something checkable, "these sections are not actually duplicates," and that assertion has to be as verifiable as the judge's was. The fact-check does not get to be the one unaudited step in a pipeline whose entire purpose is to audit everything else.
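The first three moves can be sketched in a few lines. Everything here is illustrative: the function names, the 0.8 similarity threshold, the 0.1 tolerance, and the sample paragraphs are assumptions, not values from a real evaluator.

```python
from difflib import SequenceMatcher
from itertools import combinations


def repetition_exact(paragraphs: list[str]) -> float:
    """Method 1: fraction of paragraph pairs with identical normalized text."""
    pairs = list(combinations(paragraphs, 2))
    if not pairs:
        return 0.0
    norm = [(" ".join(a.lower().split()), " ".join(b.lower().split())) for a, b in pairs]
    return sum(a == b for a, b in norm) / len(pairs)


def repetition_fuzzy(paragraphs: list[str], threshold: float = 0.8) -> float:
    """Method 2: fraction of pairs whose word-level similarity meets the threshold."""
    pairs = list(combinations(paragraphs, 2))
    if not pairs:
        return 0.0
    def sim(a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()
    return sum(sim(a, b) >= threshold for a, b in pairs) / len(pairs)


def methods_agree(paragraphs: list[str], tolerance: float = 0.1) -> bool:
    """Move 1: two independent computations must land close together."""
    return abs(repetition_exact(paragraphs) - repetition_fuzzy(paragraphs)) <= tolerance


def shadow_compare(rule_flags: bool, judge_flags: bool, log: list) -> bool:
    """Move 3: record the rule's verdict beside the judge's, change nothing."""
    if rule_flags != judge_flags:
        log.append({"rule": rule_flags, "judge": judge_flags})
    return judge_flags  # the rule has no authority while in shadow


# Move 2: one known-repetitive case and one known-varied case.
REPETITIVE = [
    "The cache is warmed at startup so the first request is fast.",
    "The cache is warmed at boot so the first request is quick.",
    "The cache is warmed at launch so the first request is rapid.",
]
VARIED = [
    "Deploys run through a staged rollout.",
    "Billing exports are generated nightly.",
    "Search indexes rebuild after schema changes.",
]
```

On `REPETITIVE` the two methods disagree completely (exact match sees no repetition, fuzzy comparison sees it in every pair), so `methods_agree` returns false and the bug surfaces in a test. On `VARIED` both report zero. The fourth move is a review habit more than a function: whatever the rule asserts when it vetoes should be checkable the way these cases are.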

There is an asymmetry worth naming here, because it is the whole reason this failure survives. We already do all four of these things for models. We stage a new model behind a flag, we hold a judge back until it agrees with humans, we assume the probabilistic thing needs proving before we lean on it. We almost never extend the same suspicion to a rule, because the rule wears the badge that says it does not need it. That asymmetry is the bug. The deterministic layer is handed the most authority and subjected to the least verification, and it is handed that authority precisely because nobody expects to have to check it.

Underneath all four is a habit good teams already apply to other people's numbers. When a result comes back too clean, too uniform, suspiciously convenient, you don't celebrate it. You go verify it against the source before you repeat it. This piece is that same skepticism, turned inward, and pointed most sharply at whatever metric you have handed a gate or a veto. External results earn scrutiny because they might be flattering. Your own authoritative checks deserve more, because they are flattering and you trust them.

Where a rough check is fine

Every standard needs its boundary, so here is this one's: the danger is authority, not error. A rough, sometimes-wrong signal is not merely tolerable, it is useful, as long as it stays advisory. A heuristic that flags documents for a human to glance at, a cheap metric that sorts a review queue, a smell test that nudges attention toward the likely problems: each can be wrong a fair fraction of the time and still pay for itself, because the cost of a wrong advisory signal is a wasted glance.

The line is crossed the moment a metric gains authority, the moment it can fail a build, block an output, or overrule another signal without appeal. At that instant its tolerance for being wrong drops toward zero, because its errors stop costing a glance and start costing the correct signal it now outranks. So the standard is not "verify every check to perfection," which would freeze you. It is "match the verification to the authority." Advisory signals can be rough and useful. Anything holding a veto has to have earned it, in both directions, in the open, before it ever changes a verdict.

A missing ruler tells you honestly that you cannot measure something yet. A wrong ruler tells you confidently that you can, and hands you the wrong number with a straight face. And if that ruler also has the authority to overrule the one instrument that was reading correctly, it will quietly make your whole system worse while every dashboard turns greener. Between the two, the honest gap is the safer place to stand. The only work worth doing from there is turning it into a ruler you have actually checked.

This piece extends two companions. The standard it builds on: Shrink the Stochastic Surface. The boundary it assumes: Determinism Where You Can, Judgement Where You Must. And the architecture both sit inside: Generative AI Builds Shapes, Not Games.

Determinism Where You Can, Judgement Where You Must: The Technique Boundary for AI Systems

Harrison Guo — Thu, 09 Jul 2026 15:27:05 +0000

I have now written this same sentence three times in three pieces, so it is time to write the sentence underneath it.

The first companion piece, Generative AI Builds Shapes, Not Games, argued that generative AI produces plausible shapes and that correctness has to come from structure and verification wrapped around the generator. That answers whether you need something outside the model. The second, Shrink the Stochastic Surface, argued that reliability is set by how much of the output you leave the model solely responsible for, and that the work is shrinking that fraction. That answers how much structure, and where the line goes between model and machine.

Both leave one question open. Once you have decided that a part of the system should be deterministic structure rather than a raw sample, which structure? A rule? A state machine? A model reasoning inside a harness? And the question almost everyone is getting wrong in 2026: should any of it be an agent?

This piece is about that boundary. It is less glamorous than the other two because it is the part you have to get right in the code, not on the whiteboard, but it is where systems actually live or die. Here is the law, as plainly as I can put it:

Determinism where you can, probabilistic judgement only where the question genuinely demands it, and control always with the orchestrator, never with an autonomous loop. The model may reason under that control; it may not direct the control itself.

Everything below is an unpacking of that one sentence.

Four techniques, not one tool

The dominant failure mode in AI system design right now is that everything collapses into a single verb: call the model. Need to validate something, call the model. Need to route, call the model. Need to decide what to do next, call the model. It feels modern and it is almost always wrong, because it treats four fundamentally different kinds of computation as if they were one.

There are four techniques, and each has exactly one correct home.

A rule engine owns any question with one machine-computable, reproducible answer. Is this valid syntax? Is the payload under the size bound? Does this reference resolve? Is the structure connected? These are not opinions. A parser answers them in microseconds, for free, and is right every single time.

A state machine owns lifecycle and flow, the questions where what happens next depends on where you are now and only some transitions are legal. A job goes pending → running → done | failed | timeout. A circuit breaker goes closed → open → half-open. The point of making this explicit is that it kills an entire bug class, the one you get from ad-hoc boolean flags, where a job somehow transitions from done back to running because two flags disagreed.

An LLM owns probabilistic judgement, and only that: the questions that are genuinely subjective and that no rule can answer. Is this writing coherent? Does this read as the thing the prompt asked for? Is it good versus merely acceptable? There is no formula for "cozy." Something has to make a perceptual call, and for the parts a rule cannot reach, that something is a model.

An agent, an LLM in a loop choosing its own next actions, owns far less than the current enthusiasm suggests, and in a large class of systems it owns nothing at all. We will get to why.
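Of the four, the state-machine tier is the easiest to show in code. The sketch below encodes the job lifecycle named above as an explicit transition table; `JobState`, `LEGAL_TRANSITIONS`, and `transition` are illustrative names, not from any particular library.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"
    TIMEOUT = "timeout"


# Every legal move is listed once; anything absent is illegal by construction.
LEGAL_TRANSITIONS = {
    JobState.PENDING: {JobState.RUNNING},
    JobState.RUNNING: {JobState.DONE, JobState.FAILED, JobState.TIMEOUT},
    JobState.DONE: set(),
    JobState.FAILED: set(),
    JobState.TIMEOUT: set(),
}


class IllegalTransition(Exception):
    """Raised when code attempts a move the table does not allow."""


def transition(current: JobState, target: JobState) -> JobState:
    """Return the new state, or raise if the move is not in the table."""
    if target not in LEGAL_TRANSITIONS[current]:
        raise IllegalTransition(f"{current.value} -> {target.value}")
    return target
```

A call such as `transition(JobState.DONE, JobState.RUNNING)` raises instead of silently succeeding, which is the bug class the boolean-flag version lets through.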

The reason the boundary matters is that placing work in the wrong tier is expensive in both directions, and people usually only notice one.

Put an LLM where a rule belongs and you get the obvious costs, slower and more expensive, plus a subtle one that is worse: it is less correct. An LLM can hallucinate a syntax verdict. A parser cannot. You have taken a question with a perfect deterministic answer and handed it to the one tool in your kit that can be confidently wrong about it. That is not a trade of accuracy for flexibility. It is strictly worse on every axis.

Put a rule where the question is genuinely subjective and you get the opposite failure: the rule simply cannot answer, so you either ship a gate that misses the thing it exists to catch, or you contort a threshold into pretending a taste question is arithmetic. Both are ways of lying about the nature of the problem.

Getting the boundary right is not a matter of preference. It follows from what each tool is.

The test that draws the line

There is a single question that sorts most work into the right tier, and it is worth memorizing:

Would two informed people always reach the same answer from the data alone?

Run any criterion through it and one of three things happens.

Yes, always. The answer is a mechanical fact: a count, a dimension, a material, a connectivity check. This is the rule engine's territory, no exceptions. If two experts with the raw data in front of them cannot disagree, there is nothing for a model to add except cost and the possibility of error.

Not from the raw data, but the criteria can be articulated. Two reviewers might land in slightly different places, but they can explain what they are looking at, and that explanation can be written down as an explicit standard. This is the model's territory, but with a crucial constraint: the model's job is not ineffable taste, it is reasoning over that written standard, and the reasoning is the product. A judgement that hangs on a named criterion is auditable and improvable. A bare number is neither.

The judgement is real, but nobody has written it down yet. The reviewer knows excellent when they see it and cannot articulate why. This is the bucket everyone skips, and skipping it is the most common way an LLM judge silently fails. If you hand this straight to a model, it does not decline to answer. It scores against its own generic, unstated taste, and you get a score that is consistent but measured against the wrong ruler. The work here is elicitation: dragging the tacit standard out of the domain owner and into explicit criteria, at which point it becomes the second bucket and a model can apply it. Until that happens, the honest move is to route it to a human and log what you could not yet express, not to let the model paper over the gap.

The three buckets are not academic. They are the difference between a judge that applies your standard and a judge that applies its standard while wearing your logo. Most of the effort in building a trustworthy judge goes into the third bucket, the elicitation, and almost none of the discourse does.
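The three buckets reduce to a small routing table. This is only a sketch of the decision, with hypothetical names (`Tier`, `route`) and two boolean inputs that a person, not code, has to supply for each criterion.

```python
from enum import Enum


class Tier(Enum):
    RULE_ENGINE = "rule engine"
    MODEL_WITH_STANDARD = "model reasoning over a written standard"
    HUMAN_AND_ELICIT = "human review, log the gap, elicit the standard"


def route(same_answer_from_data: bool, standard_written_down: bool) -> Tier:
    """Sort one criterion into a bucket.

    same_answer_from_data: would two informed people always agree from the data alone?
    standard_written_down: has the judgement been articulated as explicit criteria?
    """
    if same_answer_from_data:
        return Tier.RULE_ENGINE
    if standard_written_down:
        return Tier.MODEL_WITH_STANDARD
    # Tacit standard: do not let a model score against its own unstated taste.
    return Tier.HUMAN_AND_ELICIT
```

The third branch is the one that tends to be missing in practice: without it, a criterion with no written standard falls through to the model by default.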

The line that actually matters: control versus reasoning

Now the hard part, the one the whole law turns on.

The instinct, once you have accepted "determinism where you can," is to read the LLM tier as narrow: one call, one score, get in and get out. That is the wrong lesson, and it leaves enormous value on the table. The boundary is not about limiting how much the model reasons. A model can reason at length, chain several steps, weigh evidence, propose a plan, and all of that can be perfectly sound engineering. The boundary is about who owns control.

Reasoning is the model producing a judgement, a diagnosis, a plan, a score, as a step inside a flow that something else drives. Control is deciding what step happens next and when the work is done. An agent is precisely a model that has been handed control: it chooses its own actions, re-runs stages at will, and self-determines completion. That single property, self-direction of control, is the thing to withhold.

Say it as a rule: let the model propose, keep the disposing with the orchestrator.

Fine: the model emits a diagnosis, a suggested fix, and a confidence. The orchestrator decides whether to apply it, retry, or escalate.
Fine: the model reasons through five criteria and returns a structured verdict for each.
Not fine: the model decides which criteria to check, re-runs whichever stages it feels like, and announces on its own that it is finished.
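A minimal sketch of the first "fine" case: the model's output is plain data, and a deterministic function owned by the orchestrator decides what happens to it. The names and the thresholds (0.8 confidence, two attempts) are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class Proposal:
    """What the model returns: a judgement, never an instruction."""
    diagnosis: str
    suggested_fix: str
    confidence: float


class Action(Enum):
    APPLY = "apply"
    RETRY = "retry"
    ESCALATE = "escalate"


def dispose(proposal: Proposal, attempts: int,
            apply_threshold: float = 0.8, max_attempts: int = 2) -> Action:
    """Orchestrator-owned decision: same proposal and attempt count, same action."""
    if proposal.confidence >= apply_threshold:
        return Action.APPLY
    if attempts < max_attempts:
        return Action.RETRY
    return Action.ESCALATE
```

Nothing in `Proposal` can trigger a retry or declare the work finished. Those outcomes exist only in `dispose`, which is ordinary code that can be unit-tested.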

The cleanest place this distinction shows up is function calling, because the same API serves both sides of the line. Using a tool schema to make the model emit a structured result, to fill a form with a sub-score and the fact it relied on, is just an output format, and it is fine. Letting the model choose which tools to call, in what order, and when to stop is a control loop, and that is what makes it an agent. Tool schemas appearing in your code tells you nothing. Who owns the loop tells you everything.

Here is the part that makes the whole thing click, and it is an argument from engineering, not from taste. Handing control to a model is only worth it when choosing the next step actually requires intelligence. In a surprising amount of real work, it does not. If the list of things to check is fixed, and each check is a static, design-time mapping from criterion to computation, then there is nothing to decide. The control flow carries no intelligence. Handing it to the model in that case buys you exactly nothing and costs you a great deal: non-reproducibility, latency, token spend, and a fresh set of failure modes, loops that never terminate, checks silently skipped, tools hallucinated. You pay the full price of an agent for a control flow a for loop expresses perfectly.

So the question is never "could this be an agent?" Almost anything could be. The question is "does choosing the next step here require judgement that is not known at design time?" Reserve autonomy for the cases where the answer is genuinely yes: open-ended or branching work whose shape you cannot lay out in advance, or an exploration phase whose entire purpose is to discover the steps, which you then harden into a fixed pipeline. Everywhere else, a fixed loop with a model reasoning inside it is not a compromise. It is the better design on its own merits, before you even invoke reproducibility.

There is one seam worth leaving open, because it is the place agency earns its keep even inside a fixed pipeline: the unexpected. If the model, while judging, notices something worth checking that is not on the static list, let it propose an extra check. The orchestrator still decides whether to run it. That places the model's intelligence where it actually helps, noticing the unanticipated, and keeps it away from the part that needs none, selecting among known checks. Bounded, and still not driving.
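Put together, the fixed loop and the open seam might look like the sketch below. The check names, the approved-extras set, and the dictionary shape returned by the judge are all assumptions for illustration; `stub_judge` stands in for a real model call.

```python
from typing import Callable

# Design-time list: the loop carries no intelligence, so a for loop expresses it.
STATIC_CHECKS = ("coherence", "matches_brief")

# Extra checks the orchestrator is willing to run if the model proposes them.
APPROVED_EXTRAS = frozenset({"citation_count", "broken_links"})

Judge = Callable[[str, str], dict]


def run_checks(document: str, judge: Judge) -> dict:
    """Run every static check in order; collect, then filter, proposed extras."""
    results = {}
    proposed = []
    for name in STATIC_CHECKS:
        outcome = judge(document, name)  # the model reasons inside the loop
        results[name] = outcome["score"]
        proposed.extend(outcome.get("proposed_checks", []))
    # The model may propose; the orchestrator decides what actually runs.
    approved = [n for n in dict.fromkeys(proposed) if n in APPROVED_EXTRAS]
    return {"results": results, "approved_extras": approved}


def stub_judge(document: str, criterion: str) -> dict:
    """Stand-in for a model call: a score plus two proposed extra checks."""
    return {"score": 0.5, "proposed_checks": ["citation_count", "delete_everything"]}
```

With the stub, both static checks run in their fixed order, `citation_count` is approved once, and `delete_everything` is dropped because it is not on the orchestrator's list.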

The same boundary on the generation side

Everything so far has been about inspection: taking an output and deciding whether it passes. But this is not an evaluation-only law. It governs generation just as strictly, and the mistake there is the mirror image. The default way to make a model produce an artifact is to ask it for the whole thing from a blank canvas, which hands the entire construction to the sampler and maximizes exactly the surface the companion pieces warned about.

The disciplined version applies the same split. Let the model do the subjective, creative part, planning, choosing, proposing parameters, and let a deterministic layer execute that plan against a pre-validated structure. A model that plans a bounded edit to a known-good template, with a rule engine applying that edit inside the template's constraints, inherits the template's validity and its taste, and confines the model to a change small enough to stay valid. The invalid result becomes unreachable instead of caught downstream, which is the strongest form of the funnel from Shrink the Stochastic Surface: the cheapest possible check is making the wrong output impossible to emit. Same boundary, same reason. The model proposes, deterministic structure disposes, whether the thing being produced is a verdict or an artifact.
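A small sketch of a bounded edit, assuming a hypothetical layout template with two editable fields. The template, the constraint table, and `apply_edit` are invented for the example; the point is that the applying code can only produce values the constraints already allow.

```python
# A known-good starting point; every value here is already valid.
TEMPLATE = {"title": "Untitled", "columns": 2, "accent": "blue"}

# Fields the model may change, each with its allowed values.
# "title" is absent on purpose: the model cannot touch it.
CONSTRAINTS = {
    "columns": range(1, 5),
    "accent": ("blue", "green", "grey"),
}


def apply_edit(template: dict, proposed_edit: dict) -> dict:
    """Apply only the parts of a model-proposed edit that stay inside the constraints."""
    result = dict(template)
    for key, value in proposed_edit.items():
        allowed = CONSTRAINTS.get(key)
        if allowed is not None and value in allowed:
            result[key] = value
        # Anything else is dropped, so the template's valid value survives.
    return result
```

For a proposal like `{"columns": 3, "accent": "neon", "title": "x"}`, only the column count changes. The result is valid whatever the model proposed, so there is nothing left for a downstream check to catch.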

The case that makes it obvious: never put an agent in the gate

If you want a single place to feel this law in your gut, look at evaluation.

An evaluator, a quality gate, a validator, anything whose job is to render a verdict, has one non-negotiable property: same input, same verdict. Reproducibility is not a nice-to-have for a gate; it is the whole basis of trust in it. A gate that rejects someone's work with 0.42 today and 0.71 tomorrow for the same input is unappealable, undebuggable, and uncalibratable. Nobody can act on it, including you.

Now notice what an agent is: a thing whose defining virtue is self-directed adaptivity, whose whole selling point is that it decides for itself what to do next based on what it sees. That is a wonderful property for an explorer and a catastrophic one for a ruler. An evaluator that decides for itself what to check next is non-reproducible by construction. Its adaptivity, the exact feature you would be buying it for, is the thing that destroys its authority.

This is the irony of the current moment. The industry is racing to make everything agentic, and the single most seductive place to reach for an agent, "just let the smart model look at the output and judge it," is one of the places it is most wrong. The correct relationship between an agent and a gate is not that the gate is an agent. It is that the gate watches the agent. Measure the agent's behaviour, count its convergence rounds, score its output, but the thing doing the scoring must be the opposite of an agent: fixed, predictable, the same ruler every time. You do not want your ruler to have opinions about how long it is today.

Where determinism and judgement collaborate: the rule as lie detector

The law is not "rules and models live in separate rooms." Their most valuable interaction is when they work on the same judgement, and it is the mechanism that turns an LLM judge from a black box into something auditable.

Start from the honest position: the subjective part is irreducible. No rule computes "harmonious" or "coherent." A model has to produce that number. But producing the number is not the same as being trusted for it, and this is where the rule engine earns a second job.

When the model scores, make it state the evidence it relied on. Not just "this summary is inaccurate, 0.3," but "this summary is inaccurate because it claims the report recommends option B." Now that stated evidence often reduces to a checkable fact, and a deterministic check verifies it. Does the source document actually recommend option B, or does a string search show it never mentions B at all? Does the code the model called unsafe actually reach the unchecked path, or does a static check show the guard is present? Does the answer it praised for citing three sources actually cite three, or is it one repeated? Whenever the model's load-bearing evidence is a fact, the rule engine confirms or refutes it, and a score built on a refuted fact is voided.

The division of authority is precise, and neither side can do the other's job:

The model has the scoring power. It produces the perceptual judgement and names the fact it rested on. The rule engine cannot author a quality score, because "harmonious" is not computable, so it never scores.
The rule engine has veto power over the evidence. It verifies the cited fact and can void a score built on a false one, but it cannot replace that score with a number of its own.

So neither is solely decisive. When the cited fact checks out, the model's score stands. When it is refuted, the rule does not overwrite the score, it invalidates it and routes to a bounded re-judge or a human. The judge becomes trustworthy not because you have decided to trust it, but because its evidence can be deterministically refuted. That is the difference between "we believe the model" and "the model cannot lie about the parts that are checkable." Only the second one is engineering.
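The veto-without-scoring split can be sketched as follows. It assumes the simplest kind of cited fact, a phrase the judge claims appears in the source, checked by a case-insensitive substring search; real evidence checks would be richer. `Judgement`, `Verdict`, and `fact_check` are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Judgement:
    score: float          # the model's perceptual score
    cited_phrase: str     # a phrase the model claims appears in the source


@dataclass(frozen=True)
class Verdict:
    status: str                # "accepted" or "voided"
    score: Optional[float]     # the model's score, or None when voided


def fact_check(judgement: Judgement, source: str) -> Verdict:
    """Verify the cited evidence; void the score if the evidence is false."""
    if judgement.cited_phrase.lower() in source.lower():
        return Verdict("accepted", judgement.score)
    # The rule never substitutes a number of its own: it only invalidates.
    return Verdict("voided", None)
```

A voided verdict carries no score at all, which is what forces the next step to be a bounded re-judge or a human instead of a silently rewritten number.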

What it looks like assembled

None of these techniques is interesting alone. The value is in how they compose, and the composition has a shape: a deterministic spine that owns the loop, with the model hanging off it as a called reasoning node that proposes and never drives.

Read one concrete gate end to end. Content arrives. Cheap rules run first, syntax and schema and size, and anything that fails dies immediately for free. What survives enters a lifecycle state machine that governs the run. The orchestrator then calls the model as a judge: it reasons over the standard and returns per-criterion sub-scores plus the facts it relied on. Those facts go straight to a rule fact-check, which refutes any that are false and sends the judgement back for a bounded re-score. Only what clears both the rules and the verification becomes a verdict.

Deterministic spine — the orchestrator owns the loop:

  content ─▶ cheap rules ─pass─▶ lifecycle FSM ─▶ [call judge] ─▶ rule fact-check ─confirmed─▶ verdict
               └─ fail fast ──────────────────────────────────────────────────────────────────▶ verdict

The judge is called and returns — it never drives:

  [call judge]  ──request──▶  LLM judge · reasoning node
                             (proposes sub-scores, cites facts; no control, no loop)
  LLM judge  ──proposal + cited facts──▶  rule fact-check
  rule fact-check  ──refuted──▶  bounded re-score  ⟲ back to [call judge]

Notice what never happens in that picture: the model never holds the arrow that decides what runs next. It is called, it reasons, it returns. Every arrow on the spine, the loop itself, stays with the orchestrator. The model's entire role is the pair of arrows around the judge: a request out and a proposal back. That is the whole law in one diagram.
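The same shape in code, as a sketch under assumptions: the cheap rules are reduced to an emptiness and size check, the lifecycle state machine is omitted for brevity, the fact-check is a substring search, and `MAX_CHARS`, `run_gate`, and `make_stub_judge` are hypothetical names. The judge is passed in as a plain callable so the loop visibly belongs to the orchestrator.

```python
from typing import Callable

MAX_CHARS = 10_000  # illustrative size bound

Judge = Callable[[str], dict]  # returns {"score": float, "cited_phrase": str}


def run_gate(content: str, judge: Judge, max_rescores: int = 1) -> dict:
    """Deterministic spine: cheap rules, then judge, then fact-check, bounded."""
    # Cheap rules first: failures die here without spending a model call.
    if not content.strip() or len(content) > MAX_CHARS:
        return {"verdict": "rejected", "reason": "cheap_rule", "judge_calls": 0}

    calls = 0
    for _ in range(1 + max_rescores):  # the orchestrator owns the loop and its bound
        proposal = judge(content)      # called, reasons, returns
        calls += 1
        if proposal["cited_phrase"].lower() in content.lower():
            return {"verdict": "scored", "score": proposal["score"], "judge_calls": calls}
        # Evidence refuted: fall through to the next bounded re-score.

    return {"verdict": "needs_human", "reason": "evidence_refuted", "judge_calls": calls}


def make_stub_judge(responses: list) -> Judge:
    """Stand-in for a model: replays canned proposals in order."""
    it = iter(responses)
    return lambda content: next(it)
```

Two properties are worth checking in a test: empty content is rejected with `judge_calls` equal to 0, and a judge whose evidence is always refuted is called at most `1 + max_rescores` times before the run is handed to a human.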

The ordering falls out for free

One more thing drops out of the boundary without any extra design, and it is the reason the whole arrangement is also the cheap one.

Deterministic checks cost microseconds and no money. Model calls cost seconds and real dollars. So the boundary hands you the correct ordering by default: run the cheap, reproducible gates first. They weed out the obviously broken before any paid work happens. The expensive judge runs last, only on content that already cleared everything cheap. You never spend a model call scoring something that fails a syntax check, which would be burning money to grade garbage.

Cost then scales with quality instead of volume: bad inputs die cheap at the front, and you only pay the expensive price for things good enough to deserve it. This is the same funnel shape the stochastic-surface piece described for reliability, and it is not a coincidence that the reliable arrangement and the cheap arrangement are the same arrangement. They are both consequences of putting each question in the tier that fits it.

Reserve the seams, so reasoning stays additive

The last clause of the law, control never to the agent, is the one most likely to be misread as "never add model reasoning to the loop." It is not. It is "add it without moving control into a loop the model owns." Those are different, and keeping them different is what lets a deterministic system grow bounded intelligence later without a rewrite.

If you expect to add diagnosis, repair recommendation, or deeper failure analysis down the line, build the seams now, cheaply:

Replayable events. Every transition and finding emitted as a durable, ordered record, so a later reasoning step can replay exactly what happened instead of guessing.
Independently addressable stages. Keep the pipeline decomposed, so a diagnosis step can point at one stage without unpicking the whole run.
A read-only retrieval seam. A way to pull prior records and context as evidence for a diagnosis, without letting the diagnosis mutate the verdict path.
Post-verdict hooks. Failure analysis runs after the reproducible verdict, never inside it, so it can never change the result it is analysing.
Advisory output only. A repair step produces a suggested fix, a rationale, and a confidence. Whether to apply, retry, or escalate stays the orchestrator's call.

The invariant across all of them is the law restated: repair and retry are controlled by the orchestrator, not by a model acting on its own. Anything that would move control into an autonomous loop is not a small extension, it is a change to the fundamental property of the system, and it should be treated as one.
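
One way to keep the "advisory output only" seam honest is to make the proposal a plain data record and put the decision in a separate deterministic function. The sketch below assumes hypothetical names (`RepairProposal`, `Action`, `decide`) and made-up thresholds; the only point it illustrates is that the model's output is an input to the orchestrator's policy, never the policy itself.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    APPLY = "apply"
    RETRY = "retry"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class RepairProposal:
    """Advisory output from a model-backed repair step. It carries no control."""
    suggested_fix: str
    rationale: str
    confidence: float  # model-reported, in [0, 1]


def decide(proposal: RepairProposal, attempts_so_far: int,
           max_attempts: int = 2, apply_threshold: float = 0.8) -> Action:
    """Orchestrator policy: deterministic and bounded. Thresholds are illustrative."""
    # The attempt budget is checked first, so no proposal can extend the loop.
    if attempts_so_far >= max_attempts:
        return Action.ESCALATE
    if proposal.confidence >= apply_threshold:
        return Action.APPLY
    return Action.RETRY
```

Because the attempt budget is checked before the proposal is even read, a very confident proposal still cannot keep the loop running past `max_attempts`.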

The same law, one level down

These three pieces are one argument at three depths. The first said correctness comes from structure around the generator, not from a bigger generator. The second said the engineering is minimizing the fraction of output only the model owns. This one says: of the part you keep deterministic, choose the determinism that fits the question, a rule for facts, a state machine for flow, and of the part you keep probabilistic, let the model reason as much as the problem needs but never let it hold the wheel.

The probabilistic part proposes. The deterministic part disposes. And control, the decision of what happens next and when the work is done, is deterministic, always. That is not a constraint on what your system can do. It is the thing that makes what it does trustworthy, reproducible, cheap, and yours to debug at three in the morning when it matters.

Determinism where you can. Judgement where you must. Control never to the agent.

This is the third piece in the arc. The first: Generative AI Builds Shapes, Not Games. The second: Shrink the Stochastic Surface.

Shrink the Stochastic Surface: A Design Standard for Probabilistic Systems

Harrison Guo — Tue, 30 Jun 2026 16:08:30 +0000

I keep writing variations of the same sentence. Agent memory has to terminate at a source of truth. An agent loop has to terminate at a check. Generative 3D has to terminate at a verifier. The probabilistic part proposes, a deterministic part disposes.

Four pieces, one shape. That is usually a sign there is a law underneath, not four coincidences. This is my attempt to write the law down, and then to turn it into a standard I can apply to the next system before it ships instead of after it breaks.

The companion piece argued that generative AI builds plausible shapes and that correctness has to come from structure and verification wrapped around the generator. That answers whether you need something outside the model. It does not answer how much, or where the line goes. This piece is about the line.

The quantity nobody names

Every probabilistic system has a number attached to it that almost no one states out loud: the fraction of its output whose correctness rests on a model getting it right, with nothing deterministic to anchor or check that part.

Call it the stochastic surface.

A pure end-to-end model has a stochastic surface of one. Everything it emits is a sample, and every sample is trusted as final. A pocket calculator has a stochastic surface of zero. A real system sits between, and the exact position is the single most important design decision in the thing, more than the model, more than the prompt, more than the framework.

Here is the claim, stated as plainly as I can:

The reliability of a probabilistic system is bounded by the size of its stochastic surface. Reliability engineering is surface reduction: pushing as much of the output as possible onto deterministic structure, and leaving the model only the irreducible part that genuinely has no anchor.

This is not a style preference. It follows from what a sample is.

Why the surface bounds reliability

A deterministic component has a property a sampler never has: when it is wrong, it is wrong the same way every time, and you can find out. A unit test fails. A constraint is violated. A schema rejects the payload. The error is locatable, which means it is fixable, which means over time that part of the system trends toward correct and stays there.

A sample has none of this. It is drawn fresh, it can be wrong differently every time, and, crucially, it carries no signal about whether it is wrong. A plausible-looking output and a correct output are indistinguishable from inside the generator. That is the entire problem in one sentence: the model cannot tell you it succeeded, so it cannot be trusted even when it did.

Now stack many such samples. If thirty percent of your output is on the stochastic surface, then thirty percent of every result is an unverified claim that could be confidently, undetectably wrong, and you have no map of which thirty percent. The failures do not announce themselves. They surface downstream, far from where they were born, as a doorless cottage the workflow was perfectly happy to call done.

So the rough mental model is:

reliability ≈ 1 − (unanchored stochastic fraction)

Not literally a formula you compute, but the right intuition. Every percent of output you move off the surface, by retrieving it instead of inventing it, by deriving it from a rule instead of sampling it, by checking it against a spec, is a percent that stops being an undetectable liability. You do not make a probabilistic system reliable by making the model better. You make it reliable by giving the model less of the output to be solely responsible for.

That reframes everything. It means the work is not "pick the best model." It is "draw the smallest possible circle around the part only a model can do, and build deterministic structure everywhere else." Two sides of a system show this most clearly: how you generate, and how you evaluate. They turn out to be the same move.

The generation side: anchor, don't invent

When you need a model to produce something, the lazy default is to describe it and let the model conjure it from the prior. That maximizes the stochastic surface on purpose. Almost always you can do better, and the options form a clean ladder ordered by how many bits the model has to invent.

Tier 1: retrieve, then edit

The cheapest and most underrated move: do not generate from scratch, find the nearest high-quality real example and modify only what the task requires.

In a game studio this is the difference between "generate a medieval cottage" and "here is a hand-built cottage our artists shipped, adapt it to a 15 by 15 footprint with a south door." The first asks the model to hallucinate an entire artifact. The second hands it a correct starting point and asks for a delta. The stochastic surface collapses from the whole object to the edit.

The principle generalizes far past games. Retrieval-augmented generation is this move for text. Asking a coding model to modify an existing, tested function instead of writing one blind is this move for code. In every case the logic is identical: the model's real workload is total information minus retrievable information. Anything already present in a real example is information the model does not have to invent, and therefore cannot get wrong. You spend a similarity search to buy down the surface. It is almost always a good trade.

The failure mode to respect: retrieval anchors you to the retrieved thing, so retrieval quality becomes the new floor. Garbage neighbors give garbage edits. But notice what changed: the failure moved from an invisible generative hallucination to a visible, checkable retrieval step you can inspect, score, and improve. That is surface reduction even when it is imperfect, because it converts an unanchored failure into an anchored one.
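
A small sketch of what "visible, checkable retrieval" can look like: the lookup returns its similarity score, and refuses to anchor when the best neighbor is below a floor. Everything here is an assumption for illustration, including the `retrieve_anchor` name, the 0.8 floor, and the idea that examples are already embedded as plain float vectors.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_anchor(query: Sequence[float],
                    library: dict[str, Sequence[float]],
                    min_similarity: float = 0.8) -> Optional[tuple[str, float]]:
    """Return (name, similarity) of the nearest example, or None if nothing is close enough."""
    best_name: Optional[str] = None
    best_sim = -1.0
    for name, vector in library.items():
        sim = cosine(query, vector)
        if sim > best_sim:
            best_name, best_sim = name, sim
    # A weak neighbor is reported as "no anchor" so the caller can see the gap
    # instead of silently editing a bad starting point.
    if best_name is None or best_sim < min_similarity:
        return None
    return best_name, best_sim
```

A `None` result is the useful part: it marks exactly the requests that would otherwise fall back onto free generation.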

Tier 2: fine-tune to narrow the distribution

Retrieval changes the starting point. Fine-tuning changes the distribution itself.

This is the tier worth dwelling on, because it is the one most often reached for first and understood least. A base model samples from an enormous manifold of plausible-for-the-internet outputs. Fine-tuning on your own high-quality, domain-specific data reshapes that manifold so the model's default sample lands much closer to what your domain considers good. You are not teaching it new facts so much as moving the center of mass of its prior onto your distribution, and shrinking the variance.

For a studio, training a model on your own corpus of shipped, art-directed, style-consistent assets does something retrieval cannot: it makes the typical generation on-style and on-spec, not just the retrieved-and-edited one. It raises the floor everywhere, including the cases where you have no neighbor to retrieve. The economics also favor it once volume is high enough: a one-time training cost amortized across millions of generations, versus paying for a large frontier prompt every single time.

But be precise about what fine-tuning does and does not do to the stochastic surface, because this is where people over-trust it. Fine-tuning lowers the surface, it does not remove it. A model fine-tuned on perfect 15 by 15 cottages will produce cottages that are usually closer to 15 by 15. It still has no representation of "footprint equals 15 by 15" as a predicate to satisfy and check. It samples from a tighter distribution, but it is still sampling. The discrete constraint is still discrete, and a narrower continuous distribution is still continuous. Fine-tuning buys you a better-behaved sampler. It does not buy you a solver or a verifier, and treating a fine-tuned model as if it were one is exactly the over-trust this whole standard is built to prevent.

The right read: fine-tuning is the strongest generation-side lever, and generation-side levers have a ceiling. They make the proposal better. They never make the proposal self-disposing.

Tier 3: generate the residual only

After you have retrieved what you can and narrowed what you can, whatever is left, the genuinely novel part with no prior to anchor to, is what you let the model invent freely. That residual is where a generative prior earns its keep: plausible, varied, rich single forms. The sphere and the gatehouse from the companion piece are this tier done well.

The discipline is to make that residual as small as the task allows, and to wrap it, never to let it stand as the final word. Which is the other side of the system.

The evaluation side: rules first, model last

Now flip from producing output to judging it. The same law applies, and the anti-pattern is more seductive because it looks like progress: hand the whole output to an LLM and ask "is this good?"

That puts your evaluation's stochastic surface at one hundred percent. You have built a judge that cannot tell you when it is wrong, to assess a generator that cannot tell you when it is wrong. Two unanchored samplers in a trench coat. It demos beautifully and rots quietly.

The reliable structure is a hierarchy, ordered by how anchored each layer is, and you push as much weight as possible to the top:

1. Quantifiable, to a deterministic metric. Anything you can measure, you measure. Footprint dimensions. Block counts. Latency. Compile success. Test pass rate. Schema validity. Forbidden-element count equal to zero. This is the bedrock layer and it should carry the most weight the domain allows, because it has a stochastic surface of zero. It is right the same way every time.
2. Formalizable but not numeric, to a program check. Things that are not a number but are still decidable: is the door actually passable (run a pathfinder), does the graph have the required structure, does the config satisfy this invariant. Still deterministic, still locatable, still zero surface. This is the procedural and symbolic layer.
3. Irreducibly subjective, to a bounded model judge. What is genuinely left, "is this fun," "is this on-brand," "is this elegant," goes to an LLM. But scoped: a small, well-defined slice, with a rubric, calibrated against human ratings, and auditable after the fact. The model judges the residual, not the whole.

The number that matters is the share of your evaluation weight sitting in layers 1 and 2. That is your evaluation's anchored fraction, and it is the complement of its stochastic surface: the two sum to one. A good eval system is one where the subjective model-judged slice is small and shrinking.
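
The anchored fraction is simple enough to compute directly once each check is tagged with its layer. The `Check` record, the layer numbering as integers, and the example weights below are illustrative assumptions, not part of any real evaluation framework.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Check:
    name: str
    layer: int     # 1 = deterministic metric, 2 = program check, 3 = model judge
    weight: float  # share of evaluation weight, on any non-negative scale


def anchored_fraction(checks: Iterable[Check]) -> float:
    """Share of evaluation weight held by deterministic layers 1 and 2."""
    checks = list(checks)
    total = sum(c.weight for c in checks)
    if total <= 0:
        raise ValueError("total evaluation weight must be positive")
    anchored = sum(c.weight for c in checks if c.layer in (1, 2))
    return anchored / total


# Hypothetical rubric for a generated building.
rubric = [
    Check("footprint matches spec", layer=1, weight=5.0),
    Check("door is passable", layer=2, weight=3.0),
    Check("looks on-style", layer=3, weight=2.0),
]
```

For this made-up rubric the anchored fraction is 8 / 10 = 0.8, which leaves an evaluation stochastic surface of 0.2.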

And here is the part that makes evaluation more than a one-time setup. Every time you take a judgment that used to live in the model's head, "this looks too wide," and turn it into a rule, "footprint must equal the spec," you move it from layer 3 to layer 1 permanently. It never goes back. Evaluation built this way is a ratchet: a mechanism for steadily converting accumulated human judgment into deterministic checks, one notch at a time, each notch shrinking the surface and never releasing it. "Let the AI grade the AI" is the opposite, a surface stuck at one hundred percent with no mechanism to ever bring it down.

The two sides are one move

Step back and the generation ladder and the evaluation hierarchy are the same diagram viewed from two ends.

On both sides you are drawing a line. On one side of the line is the deterministic part: retrieved examples, narrowed distributions, constraint solvers, metrics, program checks. On the other side is the irreducible stochastic residual: the novel shape, the genuinely subjective call. That line is the stochastic surface. Generation and evaluation are just its two projections, one for producing, one for judging.

Every system I keep writing about is an instance of drawing that line well. Agent memory fails when a model's hedge gets stored as a fact with no source to check it against; the fix anchors memory to a source of truth, shrinking the surface. The advisor strategy spends a cheap model on the bulk and reserves expensive, decisive compute for the few points that must be right; that is surface reduction in the cost dimension. Retrieval versus grep, plan-generate-solve-verify for game content, rules-first evaluation, all the same law: the probabilistic component proposes, and a deterministic component, as much of one as the problem allows, disposes.

The standard

The point of a law is to use it before you ship, not to explain the wreckage after. So here is the checklist I am going to run on the next probabilistic system I build, and the one after that. Five questions.

1. What is the stochastic surface? Name the exact fraction of the output whose correctness rests on the model alone, unretrieved and unchecked. If you cannot point to it, you do not understand the system yet. Stop here until you can.
2. What did you anchor instead of invent? For every generated part, did you retrieve a real example first (tier 1), narrow the distribution with fine-tuning where volume justifies it (tier 2), and reserve free generation for the residual only (tier 3)? Each part that skipped the ladder is surface you chose not to reduce, and you should be able to say why.
3. What is checkable, and is it checked? Every predicate the output must satisfy that can be verified deterministically, is it? Sizes, counts, schemas, invariants, passability. Anything checkable but unchecked is the worst category: a failure you could have caught for free and did not.
4. Where does the model judge, and is that slice bounded? In evaluation, what share of weight is deterministic (layers 1 and 2) versus model-judged (layer 3)? Is the judged slice small, rubric-bound, calibrated, and auditable? A universal LLM judge is a red flag, not a feature.
5. Is the surface shrinking over time? Is there a ratchet, a path by which recurring human judgments become permanent rules and the surface trends down with use? A system whose surface is fixed will not get more reliable no matter how long it runs.

Score every component against these five and the weak points light up immediately. They are always the same shape: a part handed to the model that could have been retrieved, derived, or checked, and was not.

Where the standard stops

A law worth trusting comes with its boundary, so here is mine. This standard governs correctness tasks, and only those. A correctness task has a spec the output must satisfy: a size, a passable door, a compiling program, a routed circuit, a factual answer. There, a large stochastic surface is pure liability and shrinking it is the whole job.

But some tasks have no spec, and for those the standard inverts. Style exploration. Brainstorming. A first rough draft meant to be thrown away. Concept art whose only requirement is "show me something I would not have thought of." There the stochastic surface should be near one hundred percent, because there is no predicate to anchor to and constraints would only strangle the thing you wanted. Clamping a creative task with a verifier is the same category error as trusting a generator on a correctness task, run backwards.

So the discipline underneath the discipline is telling the two apart. Ask of any output: is there a spec it must satisfy, or am I sampling for novelty? If there is a spec, the standard applies and you shrink the surface as far as the problem allows. If there is not, let the model run, and do not pretend a verifier was missing.

The hard engineering was never the model. An adequate one is already here, and the next one will be better in ways that do not touch this. The hard engineering is drawing the line around it, and making that line as small as the problem allows.

This is the second piece in a pair. The first: Generative AI Builds Shapes, Not Games.
Related, the same law at other layers: Agent Memory Is a Cache Coherence Problem and Agent Architecture Is a Compute Allocation Problem: The Advisor Strategy.

Generative AI Builds Shapes, Not Games: The Constraint Gap and the Architecture That Closes It

Harrison Guo — Mon, 22 Jun 2026 23:40:52 +0000

I sat down to benchmark a tool and ended up with a map of a wall.

Higgsfield shipped a Minecraft "prompt-to-build" feature: type a prompt, get a structure in-world about a minute later. I ran eight building prompts through it, scored each one, and walked through the results. The point started as "how good is this tool." It ended somewhere more useful, because the shape of the failures turned out to be a clean read on where generative AI hits a wall in game content, and why, and what the architecture that gets past it has to look like.

The one-sentence version: generative models are plausibility engines, and games need correctness engines. Those are not the same machine, and you do not get one by scaling the other.

tl;dr — An AI Minecraft builder produced recognizable single forms (a sphere, a tower, a castle gatehouse with a genuinely walkable gate) in about a minute, but dropped exact sizes, named materials, and door positions, failed to compose all three multi-object scenes I gave it, satisfied a "no lava" rule only vacuously (it places no fluids at all), and surfaced no signal for whether any output met the prompt. That pattern is the signature of what generative models are: samplers over a distribution of forms. Game content demands three things a form-sampler structurally lacks: discrete constraint satisfaction, compositional structure, and functional correctness with verification. Scaling adds plausible shapes, not those three capabilities. The architecture that closes the gap separates the continuous from the discrete: a symbolic planner emits a scene graph and explicit constraints, generative models fill per-object shape, a solver places objects to satisfy the constraints, and a verifier checks the result and repairs failures. Plausibility from the generator; correctness from structure and verification.

The evidence, briefly

The full hands-on benchmark, per-prompt scores, and figures are in the companion findings post. The compressed version is enough to ground the argument.

What worked. Single cohesive forms with a strong visual prior came out fast and recognizable. "A giant sphere" produced a clean voxel sphere. A watchtower read as a tower. A gatehouse with "two towers and a central gate players can walk through" came back as exactly that, in about a minute, and the gate was genuinely passable when I walked through it.

The strongest result. A castle gatehouse is one cohesive, canonical form with named sub-parts, and it generated fast and well.

What didn't. The moment a prompt depended on discrete, checkable requirements, those requirements fell away. "A 15 by 15 block cottage using mostly wood and stone, entrance on the south side, inside walkable" came back as a doorless lumpy grey wall, far wider than 15×15, no wood, no way in.

The most constrained prompt produced the worst result. Every discrete requirement, size, material, door, enclosure, failed.

Multi-object scenes failed three different ways across three prompts: one hung with no output, one scattered into wildly inconsistent scale, one collapsed two tents and a campfire into a single teal mound. And the negative constraint ("do not use glass, lava, water, or redstone") was "honored" only because the builder never places fluids at all. A lava-colored band read in-game as solid orange_concrete with no fluid present. The rule was satisfied vacuously, by palette limitation, not by following it.

The negative constraint met vacuously: a color-matched solid block, not a parsed-and-honored "do not use."

Two structural facts sit underneath all of it. The behavior is consistent with a mesh-generation plus voxelization pipeline: produce one 3D mesh, voxelize it, color-map to a block palette, place it. (I did not decompile or trace it, so treat that as the most likely explanation, not confirmed.) And there was no validation signal anywhere: nothing in the workflow indicated whether the output satisfied size, material, door, or function.

Hold those two facts. They are the whole argument in miniature: one form, no structure, no check.

Why this happens: plausibility is not correctness

It is tempting to read the cottage as a bug, a model that needs more training. That misreads what the model is.

A generative 3D model is a sampler over a learned distribution of forms. Training teaches it the manifold of plausible shapes for a text condition; inference draws a point from it. This is the right machine for one job, "give me a plausible instance of X," and it is genuinely good at it. The sphere and the gatehouse are that job done well.

Game content asks for a different job: not "a plausible instance" but "an instance that satisfies these requirements." And the requirements games impose come in three flavors a form-sampler has no mechanism for.

1. Discrete constraints

"15×15." "South-facing door." "No lava." These are discrete, symbolic predicates. They are either satisfied or not, and you can check which.

A continuous sampler has no place to put a discrete predicate. It can make outputs that are distributionally consistent with the words "15 by 15", things that tend to be smallish and square-ish, but it cannot satisfy the predicate "footprint equals 15×15" because nothing in the architecture represents that predicate as a thing to be satisfied and checked. This is the same root cause behind image models that botch exact finger counts and legible text: counts and letters are discrete, and a plausibility sampler approximates them instead of satisfying them. The cottage's missing door is not a training gap. It is a category error to expect a form-sampler to honor a positional predicate.
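
For contrast, here is roughly what it means for a predicate to be "represented as a thing to be satisfied and checked." The sketch assumes a build is available as a list of `(x, y, z, block_id)` tuples with `y` vertical; that representation and the function names are hypothetical, not how any particular tool exposes its output.

```python
from typing import Optional, Sequence

Block = tuple[int, int, int, str]  # (x, y, z, block id); y is the vertical axis


def footprint(blocks: Sequence[Block]) -> tuple[int, int]:
    """Width and depth of the bounding box on the ground plane, in blocks."""
    xs = [b[0] for b in blocks]
    zs = [b[2] for b in blocks]
    return max(xs) - min(xs) + 1, max(zs) - min(zs) + 1


def check_footprint(blocks: Sequence[Block], width: int, depth: int) -> Optional[str]:
    """None when the predicate holds, otherwise a message saying how it failed."""
    if not blocks:
        return "no blocks placed"
    got = footprint(blocks)
    if got == (width, depth):
        return None
    return f"footprint is {got[0]}x{got[1]}, not {width}x{depth}"


def forbidden_count(blocks: Sequence[Block], forbidden: set[str]) -> int:
    return sum(1 for b in blocks if b[3] in forbidden)
```

Note that `forbidden_count(...) == 0` only says no forbidden block is present. It cannot distinguish a rule that was honored from one that was never at risk, which is the vacuous case described earlier.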

2. Compositional structure

A single object is a form. A scene is a graph: objects as nodes, spatial relations as edges. "Three houses along a path with a tree between each" is a structured arrangement, not a shape.

A monolithic mesh generator has no node-and-edge representation to build that graph in. Asked for a scene, it has only one move available: hallucinate the entire arrangement as a single form and voxelize it. The three scene failures are three ways that move degrades, hang, scatter, collapse, but they share one cause: there is no scene graph, so there is no composition, only a blob that gestures at the elements. "It cannot do scenes" is too strong; "it has no structured representation in which a scene could be composed" is exact.
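
To make "a structured representation in which a scene could be composed" concrete, here is a minimal scene graph for the three-houses prompt. The class names, the relation vocabulary ("along", "between"), and the node ids are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str           # e.g. "along", "between"
    objects: tuple[str, ...]


@dataclass
class SceneGraph:
    nodes: dict[str, str] = field(default_factory=dict)   # node id -> object kind
    relations: list[Relation] = field(default_factory=list)

    def add(self, node_id: str, kind: str) -> None:
        self.nodes[node_id] = kind

    def relate(self, subject: str, predicate: str, *objects: str) -> None:
        # Every edge must point at declared nodes, so the graph stays checkable.
        for ref in (subject, *objects):
            if ref not in self.nodes:
                raise KeyError(f"unknown node: {ref}")
        self.relations.append(Relation(subject, predicate, tuple(objects)))

    def count(self, kind: str) -> int:
        return sum(1 for k in self.nodes.values() if k == kind)


def three_houses_scene() -> SceneGraph:
    """'Three houses along a path with a tree between each' as nodes and edges."""
    g = SceneGraph()
    g.add("path", "path")
    for i in (1, 2, 3):
        g.add(f"house{i}", "house")
        g.relate(f"house{i}", "along", "path")
    for i in (1, 2):
        g.add(f"tree{i}", "tree")
        g.relate(f"tree{i}", "between", f"house{i}", f"house{i + 1}")
    return g
```

Once the scene exists in this form, "are there three houses?" is a call to `count("house")` rather than a judgment about a blob.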

3. Functional correctness, and the missing verifier

"Players can walk through the gate" is a functional property. You cannot read it off the geometry by eye with confidence; you confirm it by testing, by trying to walk the path. When the gatehouse gate turned out passable, that was the shape prior paying off, not the system knowing or checking that the function held. There is no notion of function in a form-sampler, and, more tellingly, no loop that asks "did the output satisfy the ask?" after generating.

That missing loop is the deepest part. Even a model that frequently lands constraints by luck is unreliable without a verifier, because nothing distinguishes the lucky output from the failed one. The workflow had no score, no self-check, no "this build is 14×16, not 15×15." Generation without verification cannot tell you it succeeded, which means it cannot be trusted even when it did.
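
A functional check like "players can walk through the gate" reduces to a reachability test. The sketch below uses breadth-first search over a simplified 2D floor plan where `#` is solid and `.` is walkable; a real Minecraft check would also need player height and vertical movement, which this deliberately ignores.

```python
from collections import deque


def is_passable(grid: list[str], start: tuple[int, int], goal: tuple[int, int]) -> bool:
    """True if goal is reachable from start via 4-connected walkable cells ('.')."""
    if not grid or not grid[0]:
        return False
    rows, cols = len(grid), len(grid[0])

    def walkable(r: int, c: int) -> bool:
        return 0 <= r < rows and 0 <= c < cols and grid[r][c] == "."

    if not walkable(*start) or not walkable(*goal):
        return False

    seen = {start}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if nxt not in seen and walkable(*nxt):
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical floor plans: a wall with a one-cell gate, and the same wall with no gate.
gatehouse = ["...", "#.#", "..."]
sealed_wall = ["...", "###", "..."]
```

Here `is_passable(gatehouse, (0, 1), (2, 1))` succeeds and the same call on `sealed_wall` fails, so the lucky output and the failed one are finally distinguishable.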

The unifying diagnosis

Stack the three together and the diagnosis is one line: generative models optimize plausibility; game content requires correctness, and correctness is discrete, compositional, and functional. Plausibility lives on a continuous manifold. Correctness is symbolic, structured, and checkable. They are different mathematical objects, and a form-sampler is built only for the first.

Why scaling alone won't close it

The reflex in 2026 is "the next model will fix this." For this gap, scaling the same architecture has little reason to close it and some reason not to. A different architecture might; a bigger form-sampler won't.

More parameters and more data make the sampler draw more plausible shapes, more faithfully from the form distribution. That is real progress on the axis the architecture already optimizes. It does not add a discrete constraint representation, because the training objective never asks the model to satisfy and check a predicate. It does not add a scene graph, because the output is still one mesh. It does not add a verifier, because verification is a separate computation the generator was never built to perform.

You can see the shape of this in the parts of the benchmark that did improve with the form prior, versus the parts that did not. The gatehouse, more canonical, came out better than the watchtower. Scaling pushes everything along that axis: better, more canonical forms. The cottage's door does not live on that axis at all. No amount of "better sphere" becomes "satisfies 15×15 with a south door," any more than a sharper camera becomes a tape measure.

The gap is architectural. Closing it means adding the missing architecture, not enlarging the existing one.

This is bigger than Minecraft

The Minecraft builder is a clean, cheap microcosm because Minecraft makes correctness legible: you can literally F3 a block and read whether the constraint held. But the same wall stands wherever an output has to satisfy a spec rather than merely look right:

CAD and mechanical design: a part that looks like a bracket but is 2mm off the bolt pattern is scrap.
Architecture and floor plans: a plausible-looking plan with a bedroom you can't reach is not a plan.
Circuit and chip layout: plausible is meaningless; it routes and meets timing, or it doesn't.
Code generation: "looks like correct code" is exactly the trap; it compiles and passes tests, or it doesn't.
Level and quest design: a level must be completable, not just atmospheric.

Every one of these is the same split: plausible (continuous, distributional, what the generator gives you) versus correct (discrete, structured, functional, what the domain demands). Game building is a vivid instance because it bundles all three correctness flavors, discrete, compositional, functional, into one minute-long generation you can inspect. The lesson generalizes to all of production-grade generative content.

The architecture that closes the gap

If one model can't be plausibility engine and correctness engine at once, stop asking it to be. Split the pipeline so that the continuous and the discrete each go to the machine built for them.

1. Plan — symbolic. A planner (an LLM or a program synthesizer) turns the prompt into a structured spec: a scene graph (objects and spatial relations) plus explicit constraints (sizes, materials, positions, forbidden sets, functional requirements like "this gate is a passable path"). This is the discrete representation the form-sampler lacks. "15×15, south door, wood and stone" becomes machine-checkable slots, not vibes.

2. Generate — continuous. Per-object generative models produce the shapes, conditioned on the spec's slots. This is exactly where the generative prior earns its keep: plausible, varied, rich single forms. The sphere and the gatehouse show the generator is already good at this when it is asked only for this.

3. Place and solve — symbolic. A constraint solver or procedural placement layer arranges the generated objects to satisfy the spatial, dimensional, and adjacency constraints. This is not new technology; it is the procedural-generation toolbox games have used for thirty years, wave-function-collapse, shape grammars, constraint-based layout, now used to arrange generative outputs instead of hand-authored tiles. Determinism and satisfiability are features here, not limitations.

4. Verify and repair — the loop the benchmark had none of. A checker evaluates the checkable predicates against the spec: is the footprint 15×15? is the door passable? are there zero forbidden blocks? does the object count match? Failures route back to regeneration or local repair. This is the validation signal whose absence was the most telling thing in the whole session.

Plausibility comes from step 2. Correctness comes from steps 1, 3, and 4. The generator stops being asked to do the job it can't and gets to do the job it's good at.
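
The four steps compose into a short control loop. This is a skeleton under stated assumptions: `plan`, `generate`, `solve`, and `verify` are placeholders the caller supplies, the dictionary result shape is invented for the example, and the repair budget of two is arbitrary. What it does show is that the loop and its bound live outside every one of the four steps.

```python
from typing import Any, Callable


def run_pipeline(prompt: str,
                 plan: Callable[[str], Any],
                 generate: Callable[[Any, list[str]], Any],
                 solve: Callable[[Any, Any], Any],
                 verify: Callable[[Any, Any], list[str]],
                 max_repairs: int = 2) -> dict:
    """Plan once, then generate, solve, and verify with a bounded repair loop."""
    spec = plan(prompt)              # step 1: symbolic spec (scene graph + constraints)
    failures: list[str] = []
    layout = None
    attempts = 0
    for _ in range(max_repairs + 1):
        attempts += 1
        shapes = generate(spec, failures)   # step 2: continuous, told what failed last time
        layout = solve(spec, shapes)        # step 3: symbolic placement
        failures = verify(spec, layout)     # step 4: checkable predicates against the spec
        if not failures:
            return {"ok": True, "layout": layout, "attempts": attempts, "failures": []}
    # Budget exhausted: report the remaining failures instead of calling the build done.
    return {"ok": False, "layout": layout, "attempts": attempts, "failures": failures}
```

The unsuccessful branch matters as much as the successful one: a build that still fails after the budget is returned with its failures attached, which is the validation signal the benchmarked workflow never produced.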

The deeper pattern: trust terminates outside the generator

This is the part that ties back to everything I've been writing about agents this year, and it's why the Minecraft toy matters more than a toy should.

The recurring failure of probabilistic systems is treating the generator's output as ground truth. In agent memory, the failure mode is a model's "the user could do X" getting stored and later read back as "the user did X", a probabilistic hedge flattened into a fact, with no source to check it against. In the advisor strategy, the design that works puts a cheap model on the bulk of the work and reserves the expensive, decisive computation for the few points that actually need to be right.

Generative 3D for games is the same principle in a different costume. The generator is the bulk path: cheap, fast, plausible, and not to be trusted on its own for any property that has to be correct. Correctness has to terminate at something outside the generator, a spec, a solver, a verifier, the same way agent memory has to terminate at a source of truth and an agent loop has to terminate at a check. The probabilistic component proposes; a deterministic component disposes. Systems that wire it that way are reliable. Systems that trust the sampler's output as final are plausible right up until they are confidently, unverifiably wrong, a doorless cottage that the workflow was perfectly happy to call done.

The winners in AI content generation will not have the best single-shape model. An adequate generator is already here. They will have the best structure and verification wrapped around the generator: the planner that emits a checkable spec, the solver that satisfies it, the verifier that proves it. That is where the engineering is, and it is the half the current monolithic tools skip.

Outlook

Three things follow.

Monolithic text-to-3D is a phase, not a destination. The single-prompt-to-single-mesh tool is the generative-AI equivalent of an early language model asked to do arithmetic in its head. The field moves toward decomposed, plan-generate-verify pipelines for the same reason agents moved toward tool use and verification: the monolith is plausible and the pipeline is correct.

Procedural generation gets a second life, not a retirement. PCG was always deterministic and constraint-satisfying, and always limited in variety and richness. Generative models are the inverse: rich and varied, but neither deterministic nor constraint-satisfying. The synthesis (generative priors for per-object shape inside a procedural, constraint-solving frame) gives you both, and the verification loop makes it trustworthy. The 30-year toolbox is the missing half of generative AI for games, not its casualty.

The benchmark is a snapshot of the monolithic era. Re-run the same eight prompts against a plan-generate-solve-verify system and the cottage is 15×15 with a south door, because a solver placed it and a verifier checked it, not because a bigger model finally drew it. That is the test I'd want to see, and the architecture I'd bet on.

Generative AI can build a shape. Building a game asks for correctness, and correctness is a different machine. The interesting work, in games and well beyond them, is in building the second machine around the first.

Empirical companion (the hands-on benchmark, full scores and figures): I Tested Higgsfield's Minecraft "Prompt-to-Build"
Related: Agent Memory Is a Cache Coherence Problem
Related: Agent Architecture Is a Compute Allocation Problem: The Advisor Strategy

I Tested Higgsfield's Minecraft Prompt-to-Build. It Generates Shapes, Not Scenes.

Harrison Guo — Thu, 18 Jun 2026 20:03:42 +0000

Higgsfield shipped a Minecraft "prompt-to-build" feature: a mod that drops a "Supercomputer" block into your world, takes a free-text prompt, and generates a structure in-world a minute later. I spent one session putting real building prompts through it to see what it actually does, not what the landing page says it does. Eight prompts, fixed screenshots, an in-world walkthrough, and a scoring rubric.

The short version: it behaves like a single-cohesive-3D-form generator with strong canonical priors, not an architecture or scene engine.

tl;dr: Higgsfield's in-world prompt-to-build produced recognizable single forms (a sphere, a tower, a castle gatehouse, including a functional walkable gate) in about a minute. But in my samples it dropped discrete constraints (exact size, specified materials, door position), failed to compose a coherent multi-object scene in any of the three scene prompts I tried, and exposed no validation signal for whether the output met the prompt. The behavior is consistent with a mesh-to-voxel pipeline: generate one shape, color-map it to blocks. Strong on shape, weak on constraints, composition, and function.

How it appears to work (inferred from behavior)

I did not decompile the mod or capture a network trace, so this is the most likely explanation, not a confirmed fact. The observed behavior is consistent with a mesh-generation plus voxelization pipeline: a text prompt produces a 3D mesh in the cloud, which is then voxelized, mapped to a limited block palette, and placed in-world.

If that model is right, it accounts for most of what I saw. A color or texture sampler would pick block colors rather than materials. A mesh encodes a shape, not discrete numeric or positional constraints. And a layout of separate objects has no single form to generate from. One in-game block check supports it directly: a region that looked like lava read as minecraft:orange_concrete with no fluid placed at all — a solid block chosen by color.
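A minimal sketch of what "chosen by color" would look like, assuming the inferred pipeline maps each voxel's color to the nearest entry in a fixed block palette. The palette and its RGB values are illustrative guesses of mine, not values read from the mod, and `nearest_block` is a hypothetical name.

```python
# Hypothetical palette: block id -> approximate RGB. Values are illustrative only.
PALETTE = {
    "minecraft:orange_concrete": (224, 97, 0),
    "minecraft:orange_terracotta": (161, 83, 37),
    "minecraft:oak_planks": (162, 130, 78),
    "minecraft:stone": (125, 125, 125),
    "minecraft:blue_concrete": (44, 46, 143),
}


def nearest_block(rgb: tuple[int, int, int], palette: dict = PALETTE) -> str:
    """Pick the palette block whose color is closest in squared RGB distance."""
    return min(
        palette,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(rgb, palette[name])),
    )


print(nearest_block((230, 100, 10)))  # minecraft:orange_concrete  (a lava-like orange)
print(nearest_block((150, 90, 40)))   # minecraft:orange_terracotta (a wood-like brown)
print(nearest_block((40, 50, 150)))   # minecraft:blue_concrete     (a water-like blue)
```

Nothing in this mapping knows what "lava", "wood", or "water" means. A sampler like this can only return a solid block that looks right, which would explain both the orange concrete "lava" and a "wooden" tower made of terracotta.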

Method

Environment: Minecraft Java 1.21.1 + NeoForge 21.1.233 + the Higgsfield mod, creative mode, superflat world.
Flow: place the Higgsfield "Supercomputer" block, set Type: Structure, enter the prompt, insert a blank Structure medium, Generate, then print the result in-world.
Cost: about 1.15 credits per build (the UI estimates 2). Failed jobs are not charged — the one prompt that timed out cost nothing.
Prompts: eight total. Seven are test prompts (a single object, a constrained object, a functional compound form, a negative-constraint prompt, and three multi-object scene prompts); the eighth is a geometric-primitive control (a sphere) to check that outputs are genuinely prompt-conditioned.
Scoring: 1 to 5 per dimension (prompt adherence, constraint adherence, spatial/functional, editability, visual, reliability). Single rater, from fixed screenshots plus an in-world walkthrough.

The in-world "Supercomputer" panel: a free-text prompt, Type: Structure, a blank Structure medium, and a credit estimate. This is the entire interface.

Results at a glance

ID	Prompt class	Time	Outcome	Adher	Constr	Spatial	Edit	Visual	Reliab
P1	single object — watchtower	~1 min	recognizable tower, wrong material/size	2	1	2	2	3	5
P2	object + discrete constraints — 15×15 cottage	~1 min	doorless lumpy wall, unrecognizable	1	1	1	1	2	5
Ctrl	geometric primitive — "a giant sphere"	~1 min	clean voxel sphere	5	—	—	—	4	5
P3	multi-object scene — market	8+ min	timeout, no output	—	—	—	—	—	1
P4	functional compound form — gatehouse	~1 min	strong: 2 towers + walkable gate	4	4	4	3	4	5
S2	multi-object scene — houses + path + trees	~2 min	elements present, incoherent scale/layout	2	—	2	2	2	5
S3	multi-object scene — campsite	~2.5 min	collapsed into one teal blob	1	—	1	1	2	5
P5	negative constraint — 8×8 base, no glass/lava/water/redstone	~1 min	giant slab, not 8×8	1	1	1	1	2	5
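For anyone who wants to re-slice the table, here is a small sketch that transcribes the scores above and averages each dimension over the prompts that received a score ("—" is treated as missing, not as zero). The numbers are copied from the table; the variable and function names are mine.

```python
from statistics import mean

DIMENSIONS = ("Adher", "Constr", "Spatial", "Edit", "Visual", "Reliab")

# Scores transcribed from the results table; None marks a "—" cell.
SCORES = {
    "P1":   (2, 1, 2, 2, 3, 5),
    "P2":   (1, 1, 1, 1, 2, 5),
    "Ctrl": (5, None, None, None, 4, 5),
    "P3":   (None, None, None, None, None, 1),
    "P4":   (4, 4, 4, 3, 4, 5),
    "S2":   (2, None, 2, 2, 2, 5),
    "S3":   (1, None, 1, 1, 2, 5),
    "P5":   (1, 1, 1, 1, 2, 5),
}


def column_means(scores: dict = SCORES) -> dict:
    """Mean and sample count per dimension, skipping missing cells."""
    out = {}
    for i, dim in enumerate(DIMENSIONS):
        values = [row[i] for row in scores.values() if row[i] is not None]
        out[dim] = (mean(values), len(values))
    return out


for dim, (avg, n) in column_means().items():
    print(f"{dim:8s} mean={avg:.2f} (n={n})")
```

With one trial per prompt and a single rater, these averages are a reading aid, not a statistic to lean on.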

Per-prompt findings

P1 — single object: watchtower

"Build a small wooden watchtower, 10 blocks tall, with a ladder, a roof, and a viewing platform."

It reads clearly as a tall tower, so the shape prior comes through. The discrete constraints did not. "Wooden" came out as orange terracotta and honeycomb blocks — the output matched colors, not the material word "wood." "10 blocks tall" became 20-plus. The ladder, roof, and platform are vaguely suggested by geometry but are not functional Minecraft elements. The surface shows the voxelization artifacts you'd expect: eroded edges, speckled palette quantization, asymmetry.

P2 — object + discrete constraints: 15×15 cottage

"Build a 15 by 15 block cottage using mostly wood and stone. Entrance on the south side, inside walkable."

The most constrained prompt produced the worst result. It does not read as a cottage at all — a long lumpy grey wall with a red-orange top band, random holes, and stray noise blocks. Every discrete constraint failed: the footprint is far wider than 15×15, it's a wall rather than an enclosed cottage, there's no wood, and there's no door on any side (confirmed in world — you cannot enter it). This is the clearest example of the pattern in my samples: the more a prompt depends on discrete, checkable requirements, the less of it came through.

Control — geometric primitive: "a giant sphere"

"A giant sphere."

I ran this to settle one question: are outputs actually driven by the prompt, or just canned blobs? The result is an unmistakable, clean voxel sphere with the classic concentric-ring stepping. That's strong evidence of prompt-conditioning, and with a fresh superflat world it rules out "it was already there." It also fits the pattern from the other direction: a clean geometric form came out clean. Output looked best when the prompt carried no discrete semantic constraints.

P3 — multi-object scene: market (timeout)

"Build a small village market: four stalls around a central well, with paths connecting them."

No screenshot, because there was no output. The job hung for 8-plus minutes and never produced a result, so I abandoned it (no credits charged). This is the first of three prompts that ask for multiple independent objects in a spatial layout rather than a single connected form. This one simply hung.

P4 — functional compound form: gatehouse (best result)

"Build a gatehouse with two towers and a central gate players can walk through."

The strongest output in the session, and an important counter-example. It reads clearly as a castle gatehouse: two flanking towers, a central arch, battlements. Done in about a minute. Crucially, the functional requirement — "players can walk through" — was honored: the central gate is genuinely passable (verified by walking through it in world). The material came out stone-like, which fits the canonical "castle" prior, but the prompt did not specify a material, so that's a visual/prior win, not constraint-following.

P4 sharpens the conclusion. A gatehouse has named sub-parts (two towers plus a gate) yet still generated fast and well, because it is one cohesive, canonical form — unlike the market scene of separate objects.

S2 — multi-object scene: houses + path + trees

"Three small houses arranged in a row along a dirt path, with a tree between each house."

Done in about two minutes, so scene prompts do not always hang; the market timeout looks like a one-off. And it genuinely emitted the scene elements: dirt paths, several separate small structures, trees. But the composition is incoherent: wildly inconsistent scale (one oversized house next to a miniature cluster), scattered placement, blobby objects. The result looks as if the scene was read as one mesh to voxelize, not as an arrangement of objects. So "it cannot do scenes" is too strong; "it does not compose scenes coherently" is accurate.

S3 — multi-object scene: campsite

"A campsite with two tents, a central campfire, and logs around it to sit on."

The opposite failure mode from S2. Instead of scattering, it collapsed into a single mound of teal blocks (the two tents merged into one), with a small patch of orange blocks that loosely reads as a campfire. No distinguishable tents, no logs. Elements are hinted by color (tents teal, fire orange) but the arrangement is gone.

Across the three scene prompts, none produced a coherent, usable multi-object layout. They failed three different ways: hang, scatter, collapse. The scene elements can show up; the composition did not.

P5 — negative constraint: 8×8 base, forbidden materials

"Build a tiny 8 by 8 starter base. Do not use glass, lava, water, or redstone."

Two findings. First, the positive constraint failed the same way P1 and P2 did: the output is a large purple-orange slab, nowhere near 8×8, and not a "base." Second, the negative constraint. Regions that look like lava or water turned out, on an in-game F3 block check, to be solid color-matched blocks. The orange band reads as minecraft:orange_concrete with targeted fluid empty.

F3 debug, crosshair on the lava-colored band: Targeted Block: minecraft:orange_concrete, Targeted Fluid: minecraft:empty — a solid block chosen by color, not a forbidden material.

So no forbidden material was actually placed, but the likely reason is that the voxelizer's palette is solid colored blocks and it never places fluids or functional blocks, not that it parsed and honored "do not use." The negative constraint is met vacuously, by palette limitation, not by rule-following.

Where it's strong vs. weak

Strong: single cohesive 3D forms with a clear visual prior — sphere, tower, castle gatehouse. Fast (about a minute), recognizable, and it will render named sub-parts (two towers, a gate, battlements) and even a functional opening (a walkable gate). Material is sensible when the canonical form implies it (castle implies stone).

Weaker:

Discrete constraints get dropped. Exact dimensions ("15×15", "8×8"), specified materials ("wood and stone"), and positions ("door on the south side") did not come through. (P1, P2, P5)
Multi-object scene composition (n=3, none coherent). Across three scene prompts, none produced a coherent, usable arrangement: one hung, one scattered into inconsistent-scale fragments, one collapsed into a single blob. The elements can appear; the composition did not.
Negative constraints are not meaningfully enforced. Regions that looked like lava or water were solid color-matched blocks with no fluid. Forbidden materials are avoided by palette limitation, not by following the rule.
No validation signal. The workflow surfaced no self-check or score for whether it met size, material, door, or function.

Across eight prompts a consistent line emerges: single cohesive forms are handled well, while scenes of independent objects and discrete or negative constraints are not. That's exactly what you'd expect from a mesh-to-voxel pipeline that generates one shape and color-maps it to blocks.

What this means if you're using it

If you want a quick, recognizable hero object — a tower, a statue-ish form, a gatehouse — Higgsfield's prompt-to-build is genuinely useful and fast. Lean into canonical shapes and let it pick the material.

If you need a build to satisfy something — an exact footprint, a specific material, a door where you asked for one, a multi-building scene with sensible spatial relationships — it isn't there yet in my samples. The gap isn't shape quality; it's everything layered on top of shape: constraints, composition, function, and any signal that the output actually met the ask. Treat the output as a starting silhouette to edit, not a finished, spec-correct build.

Limitations

Scene composition is n=3. Enough to say it produced no coherent scene in any try, but still a small sample.
One trial per prompt. Generation is stochastic; I did not sample variance, and a given prompt might do better on a re-roll.
Single rater, scoring from screenshots plus a walkthrough.
Credit-limited session (started with 10 credits, about 1.15 per build), so two planned prompts went unrun.
Figures are frames pulled from a daytime screen recording (the in-game screenshot key conflicted with macOS during the live session). The builds are unchanged; only lighting and clarity differ from the live run.

The honest one-line takeaway: in this session Higgsfield's Minecraft prompt-to-build handled shape well and everything that makes a shape correct poorly. If you test it yourself, the fastest way to see the split is to run one canonical single form (a gatehouse) and one constrained one (a 15×15 cottage with a south door) back to back.

Agent Architecture Is a Compute Allocation Problem: The Advisor Strategy, Cost-Curve Frame Recursed

Harrison Guo — Tue, 16 Jun 2026 20:14:53 +0000

In April 2026, Anthropic published a blog post called "The advisor strategy: Give agents an intelligence boost", naming a pattern they had been A/B-testing in production: a cheaper model runs the agent loop end-to-end, and an expensive model is consulted only when the cheap one hits a decision it can't solve. They reported concrete numbers: Haiku + Opus advisor on BrowseComp at 41.2% (Haiku alone: 19.7%) at 15% of the cost of running Sonnet through the whole task.

On May 18, 2026, Tobi Lutke (CEO of Shopify) tweeted about an autoresearch setup that did exactly this: Qwen 3.6 27B running locally on an RTX 6000, with a small "advisor extension" that periodically calls GPT-5.5 for direction. 13,000 impressions, 2,400 likes, dozens of replies from engineers reproducing the pattern or building open-source implementations within hours.

Underneath both of those, Stanford HazyResearch's Minions paper — published months earlier — had abstracted the same pattern into a compressor-predictor framework: a small local model distills raw context into compact text that a larger remote model then reasons over. They reported their Deep Research system recovering 99% of frontier-model accuracy at 26% of the API cost.

Three independent threads converging on the same architecture in roughly the same six-month window. That convergence is the story.

This post argues something specific about it: the advisor strategy isn't a new pattern invented for LLMs. It's the third recursion of the cost-curve frame from earlier in this mini-series — the same idea that argued grep beats RAG for code retrieval, and that argued SQLite + FTS5 beats a vector DB for the symbol-graph storage that grep-replacement tools (CodeGraph) need. Applied at the model-orchestration layer, the frame produces the advisor strategy. The strategy is the architecture; the frame is why.

tl;dr: Anthropic, Tobi Lutke, and HazyResearch independently shipped (or described) the same agent pattern within roughly six months of each other: a cheap model runs the loop, and an expensive model is consulted only for decisions. The convergence is evidence the pattern is correct; the reason it's correct is the cost-curve frame from this series' first post, applied at the model-choice layer instead of the retrieval-architecture layer. Piece B argued grep+loop beats RAG because build/maintain cost dominates per-query cost below a crossover. The advisor strategy argues the same shape for tokens: the bulk of a task's tokens go to low-value operations (reading context, format conversion, retries), which a cheap executor can handle, so expensive-model tokens should be spent only at high-value decision points. Same frame, third layer.

The post does three things: (1) reports the three converging threads with what each contributed; (2) makes the cost-curve recursion argument explicitly — L1 retrieval, L2 storage, L3 model orchestration; (3) maps the gotchas the hype skips (data egress on handoff, eval difficulty, handoff-contract design as actual engineering, hardware realism). The mini-series concludes here, five posts in, with cost-curve frame as a meta-design law across three layers of agent architecture.

Three convergent threads

The convergence matters more than any single thread. Each was independent; each shipped within a six-month window of the others; each describes the same architecture from a different vantage. That's how you know the pattern is real and not just one team's design preference.

Anthropic's official advisor strategy (2026-04-09)

The Anthropic engineering blog "The advisor strategy: Give agents an intelligence boost" defines the pattern as a productized engineering primitive:

"Sonnet or Haiku runs the task end-to-end as the executor... When the executor hits a decision it can't reasonably solve, it consults Opus for guidance as the advisor."

"The advisor never calls tools or produces user-facing output, and only provides guidance to the executor."

The reported empirical numbers:

Configuration	Benchmark	Score	Cost (relative to Sonnet end-to-end)
Sonnet alone (no advisor)	SWE-bench Multilingual	(baseline)	1.00×
Sonnet + Opus advisor	SWE-bench Multilingual	exceeds baseline	0.88× (−11.9%)
Haiku alone	BrowseComp	19.7%	(baseline)
Haiku + Opus advisor	BrowseComp	41.2%	0.15× of Sonnet-end-to-end

Two observations on the numbers. First: the Sonnet + Opus combination outperforms Sonnet alone while also being cheaper. That's not a one-axis trade; it's a Pareto improvement. Second: the Haiku + Opus combination more than doubles Haiku's standalone score (19.7% to 41.2%) while costing 15% of what Sonnet costs end-to-end. That's the compound gain: better than Haiku alone and far cheaper than Sonnet, though note that the quality and cost comparisons use different baselines.
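The arithmetic behind those two observations is short enough to check by hand. The sketch below uses only the figures reported above; the helper names are mine.

```python
def relative_gain(before: float, after: float) -> float:
    """How many times larger the new score is than the old one."""
    return after / before


def cost_after_saving(saving_fraction: float) -> float:
    """Relative cost once a fractional saving is applied to a 1.00x baseline."""
    return 1.0 - saving_fraction


haiku_gain = relative_gain(19.7, 41.2)   # BrowseComp, Haiku alone vs. Haiku + Opus advisor
sonnet_cost = cost_after_saving(0.119)   # Sonnet + Opus advisor vs. Sonnet alone

print(f"{haiku_gain:.2f}x")   # 2.09x
print(f"{sonnet_cost:.3f}x")  # 0.881x
```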

A specific detail in the blog: the advisor's outputs are typically 400–700 tokens — a short plan, not a full solution. That's the design saying out loud what the cost curve implies — the advisor exists to redirect, not to do work.

Tobi Lutke's personal experiment (2026-05-18)

Tobi Lutke (CEO of Shopify) posted on X:

"I've had very good results running autoresearch with local qwen 3.6 26b model as long as I had a simple vibed pi 'advisor' extension that allowed it to periodically ask GPT 5.5 for ideas. I think this direction has a lot of merit."

Tobi's setup is the open-source mirror of Anthropic's productized pattern, with two architectural variants:

Locality: the executor runs on his own hardware (Qwen 3.6 27B on an RTX 6000), not on Anthropic's API. Local-first by default.
Frontier model choice: the advisor is GPT-5.5 (OpenAI), not Opus (Anthropic). The pattern is model-agnostic on the advisor side.

The hardware caveat is real and worth naming: RTX 6000 is professional-grade, not consumer hardware. 27B-dense models with autoresearch-length contexts aren't laptop workloads. The pattern is reproducible at the architecture level on commodity infrastructure; the specific setup Tobi shows takes real investment.

Within hours of Tobi's tweet, developer Rob Zolkos published pi-lifeline — an open-source escalation extension explicitly inspired by the tweet, with reasonable defaults: at least 5 rounds before the first advisor call, automatic escalation after 3 consecutive failures, plateau-detection after 6 rounds, max 10 advisor calls per session, default advisor model GPT-5.5. That's engineering of the handoff contract — not a one-line config — and we'll come back to it later.

Stanford HazyResearch Minions (2025–2026 publication window)

This thread surfaced in a reply to Tobi's tweet: Dan Biderman pointed at HazyResearch's Minions paper (arXiv 2512.21720), which abstracts the pattern into a compressor-predictor framework:

"smaller 'compressor' LMs (that can even run locally) distill raw context into compact text that is then consumed by larger 'predictor' LMs."

The Minions paper's specific numerical contribution: in their Deep Research system, a local 3B-parameter compressor recovers 99% of frontier-model accuracy at 26% of the API cost. That's the academic version of the same architecture, with empirical bounds.

Three things HazyResearch's framing adds beyond Anthropic's product blog:

The compressor doesn't have to be 27B — even 3B works for context distillation, depending on the task. The lower the compressor can go, the more local you can run.
The cost-recovery curve has a specific shape: 99% of the accuracy at 26% of the cost is far from linear. It's the same favorable trade Anthropic reported in product form: nearly all of the quality (or more) at a fraction of the cost.
The general framing is "compress then decide" — a slightly broader frame than "executor + advisor" because it includes the case where the compressor runs once at the start and the predictor runs once at the end, with no escalation loop. The advisor strategy is a streaming version of compress-then-decide where compression happens iteratively.
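The one-shot form of "compress then decide" can be sketched with two plain callables. This is a toy under stated assumptions: the compressor and predictor below are stand-in functions, not language models, and none of the names come from the Minions paper. What it shows is the shape of the handoff: the predictor only ever sees what the compressor chose to pass along.

```python
from typing import Callable

Compressor = Callable[[str, str], str]   # (raw_context, question) -> compact text
Predictor = Callable[[str, str], str]    # (compact text, question) -> answer


def compress_then_decide(
    raw_context: str, question: str, compressor: Compressor, predictor: Predictor
) -> tuple[str, int]:
    """Run the cheap compressor once, then the expensive predictor once."""
    summary = compressor(raw_context, question)
    answer = predictor(summary, question)
    return answer, len(summary)   # second value: characters sent to the predictor


def _words(text: str) -> set[str]:
    return {w.strip("?.,").lower() for w in text.split()}


def keyword_compressor(context: str, question: str) -> str:
    """Toy compressor: keep only lines sharing a longer word with the question."""
    keywords = {w for w in _words(question) if len(w) > 3}
    return "\n".join(line for line in context.splitlines() if keywords & _words(line))


def echo_predictor(summary: str, question: str) -> str:
    """Toy predictor: a real one would reason over the summary; this one returns it."""
    return summary


context = (
    "The roof is red terracotta.\n"
    "The door is on the south side.\n"
    "There are 40 stray blocks."
)
answer, sent = compress_then_decide(
    context, "Which side has the door?", keyword_compressor, echo_predictor
)
print(answer)               # The door is on the south side.
print(sent < len(context))  # True
```

The streaming variant described above would call the same pair inside a loop, with the executor deciding when to compress and escalate again.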

Why three independent confirmations matter

Each thread is from a different vantage:

Anthropic: product engineering. Owns the model, designed the workload, reports field metrics.
Tobi Lutke: individual practitioner. Different model providers (Qwen + GPT-5.5), different hosting (local + cloud), different workload (autoresearch, not coding benchmarks). Reproduced the pattern without coordinating with Anthropic.
HazyResearch: academic research. Different framing (compressor-predictor), different time horizon (paper preceded Anthropic's blog), different cost-quality measurement methodology.

When three independent vantages produce the same architectural answer, the design is robust to who happens to be sponsoring the work. That's the convergence-as-evidence argument — the pattern is real and not just downstream of one organization's preferences.

The interesting question now isn't whether the pattern works (the convergence is strong evidence that it does). It's why it works, and that question has a clean answer from earlier in this mini-series.

The cost-curve recursion: same frame, third layer

Piece B (the first post in this series) argued that LLM-driven code retrieval sits on a cost curve: index-based approaches pay high build cost + super-linear maintain cost, tool-loop approaches pay per-query cost only. Below a crossover point — which sits well above most projects' size — tool-loops win. Above it, indexes pay back.

That argument generalizes. Applied to other agent-architecture decisions, the same frame keeps producing the right call.

Layer 1 — Retrieval architecture (Piece B)

	Tool-loop (grep + LLM iteration)	Index (vector RAG)
Build cost	0	super-linear in repo size
Maintain cost	0	super-linear in churn × structural complexity
Per-query cost	N tool-call round-trips	one vector search + LLM reasoning
Win condition	Below crossover	Above crossover

Conclusion: for most repos, build/maintain cost dominates per-query savings, so tool-loop wins. Anthropic chose grep+Glob+Read for Claude Code, not an index.
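The crossover can be made explicit with a toy cost model. Everything here is a simplification I am assuming for illustration: build and maintain costs are lumped into fixed amounts over the period considered (the table says maintain is really super-linear), and all figures are in arbitrary cost units, not measurements.

```python
from typing import Optional


def tool_loop_cost(queries: int, per_query: float) -> float:
    """Tool-loop pays nothing up front, only per query."""
    return queries * per_query


def index_cost(queries: int, build: float, maintain: float, per_query: float) -> float:
    """Index pays build and maintain up front, then a cheaper per-query cost."""
    return build + maintain + queries * per_query


def crossover_queries(
    build: float, maintain: float, loop_per_query: float, index_per_query: float
) -> Optional[float]:
    """Query count at which the index starts to pay back, or None if it never does."""
    saving = loop_per_query - index_per_query
    if saving <= 0:
        return None
    return (build + maintain) / saving


# Hypothetical numbers in arbitrary units.
BUILD, MAINTAIN, LOOP_Q, INDEX_Q = 3000, 1000, 5, 1

print(crossover_queries(BUILD, MAINTAIN, LOOP_Q, INDEX_Q))                # 1000.0
print(tool_loop_cost(200, LOOP_Q), index_cost(200, BUILD, MAINTAIN, INDEX_Q))    # 1000 4200
print(tool_loop_cost(2000, LOOP_Q), index_cost(2000, BUILD, MAINTAIN, INDEX_Q))  # 10000 6000
```

Below the crossover the index never earns back its fixed costs; above it, the cheaper per-query cost wins. The argument in this section is that most repos sit on the left side of that point.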

Layer 2 — Index storage (C2, the first-principles read of CodeGraph)

When you do cross the curve and need an index — CodeGraph's territory — the next decision is which storage layer to use.

	FTS5 + SQLite (CodeGraph)	Vector DB (Chroma / Pinecone)
Build cost	linear in source size, parse-only	super-linear (chunk + embed every file)
Maintain cost	low (file watcher + incremental parse)	super-linear (re-embed on change, handle cross-chunk refs)
Per-query cost	exact lookup, sub-millisecond	ANN search + rerank + LLM call
Win condition	Exact-lookup workload	Semantic-similarity workload

CodeGraph's queries are exact lookups (find symbol X, trace A→B, callers of Y), so FTS5 wins. Same frame as Layer 1: pay only the costs your workload demands.

Layer 3 — Model orchestration (this post — the advisor strategy)

Apply the same frame to token allocation across models.

	Cheap-only (Haiku alone, Qwen alone)	Expensive-only (Sonnet/Opus end-to-end)	Executor + Advisor
Per-token cost on bulk operations	low	high	low (cheap executor handles 90%+ of tokens)
Per-token cost on key decisions	low (but quality suffers)	high (and quality matches)	high (advisor only for decision tokens, ~400–700 tokens per call)
Aggregate task cost	low if quality holds	high regardless	low (most tokens are cheap; only the few decision tokens are billed at the expensive model's rate)
Aggregate task quality	depends on whether decisions are within cheap model's capability	full	high (cheap executor + expensive decisions ≈ expensive end-to-end, sometimes better)
Win condition	Tasks where cheap model alone is adequate	Tasks where any decision could be critical	Tasks where most operations are routine but some decisions are hard

Most agent tasks fit the last column. The advisor strategy wins for the same structural reason grep+loop wins at Layer 1: the cost of the "bulk" operations dominates the cost of the "decision" operations, so the architecture should put the cheap tool on the bulk path and reserve the expensive tool for the decision path.
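The same claim as arithmetic, under assumptions I am making up for illustration: a task with 200,000 bulk tokens, five advisor calls of 600 tokens each (inside the 400–700 range quoted earlier), and a hypothetical 15:1 price ratio between the expensive and cheap model. Real pricing, and the context tokens an advisor call also consumes, will change the numbers but not the shape.

```python
def task_cost(bulk_tokens: int, decision_tokens: int, bulk_price: int, decision_price: int) -> int:
    """Total cost when bulk and decision tokens are billed at different rates."""
    return bulk_tokens * bulk_price + decision_tokens * decision_price


CHEAP, EXPENSIVE = 1, 15          # hypothetical price per token, arbitrary units
BULK = 200_000                    # reading context, format conversion, retries
DECISIONS = 5 * 600               # five advisor calls of 600 tokens each

cheap_only = task_cost(BULK, DECISIONS, CHEAP, CHEAP)
expensive_only = task_cost(BULK, DECISIONS, EXPENSIVE, EXPENSIVE)
executor_plus_advisor = task_cost(BULK, DECISIONS, CHEAP, EXPENSIVE)

print(cheap_only, executor_plus_advisor, expensive_only)  # 203000 245000 3045000
```

In this toy the split design costs about a fifth more than cheap-only and a small fraction of expensive-only, which is the whole bet: pay the expensive rate only where the decision quality is bought.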

Cost-curve as a meta-design frame

Stating the generalization explicitly: whenever an architecture has a "many low-value operations + few high-value operations" structure, applying expensive tools uniformly across both pays the high cost for the low-value operations too. The right design separates the two paths and uses cheap-but-good-enough tools on the bulk path.

This is the design rule the cost-curve frame produces at every layer it's been applied to in this series. It's not specific to LLMs — the database community calls this "use the cheapest index that satisfies the query class"; the systems community calls this "tiered storage"; the chip design community calls this "the memory hierarchy". The LLM-engineering version is the advisor strategy plus its retrieval-architecture cousins.

This is the meta-design law the five posts in this series argue for. The argument's strength comes from the convergence — three independent recursions of the same frame producing the right architecture each time. That's not coincidence; it's the frame doing its job.

What this validates from Piece B's source-code analysis

Piece B's analysis of Claude Code's source code reported a specific finding: the Explore subagent runs on Haiku for non-ant builds (external users), not on Sonnet or Opus. The reasoning section of Piece B observed:

"Explore runs on Haiku for external users. Not the main reasoning model. Exploration is a cheap-tokens job — there's no creative reasoning happening, just iterate-and-filter — and Anthropic uses a fast, small, cheap model for it. The main agent gets the expensive model when it gets the summary back. This is the staffing analogue: junior associate does the deposition review, senior partner reads the brief."

That's the advisor strategy, visible directly in Claude Code's source. Piece B analyzed the mechanism rather than the branding, so it didn't use Anthropic's later label — but it's the same architecture. The point isn't priority over the announcement; it's that the pattern was already running in shipped code, observable by anyone reading the source rather than waiting for a launch post to name it.

There's a useful takeaway here for reading any AI engineering work: the source code is ahead of the blog posts. The blog post explains and packages what's been running in production. Reading the source is one of the cheapest ways to see where the foundational labs are betting, because the explainer post usually describes what was already shipping in the code months earlier.

The advisor strategy is one of three patterns Piece B's source-code reading surfaced in this category. The other two are worth flagging because they suggest the next blog posts to expect:

The Fork-subagent architecture (visible behind the isForkSubagentEnabled() flag) — a different model-orchestration shape where the cheap and expensive halves share a context (and prompt cache) rather than separating cleanly. If Anthropic productizes this, expect a blog post titled something like "Fork: shared-context model collaboration" in the next 1–3 quarters.
The tengu_amber_stoat GrowthBook flag — gating Explore vs. no-Explore as a deeper architectural test. If Anthropic concludes the cheap-executor-as-separate-subagent pattern doesn't pay off, the next blog post is about why the advisor strategy works in some shapes and not others.

The general point: reading the source code and observing the patterns lets you write the analysis before the productized name arrives. When the name does arrive, your analysis is what frames it. This is the time-shift advantage that source-leaning engineering writing has over pure-press-release-paraphrase content. It's why this series' posts have been holding up under fresh data — the frame was built from the same source the announcements describe, so new announcements tend to confirm it rather than surprise it.

The gotchas the hype skips

The convergence between Tobi, Anthropic, and HazyResearch is real and the pattern is solid. But there are four gotchas the hype reliably skips that any production implementation has to address.

1. Data egress on handoff

The local-first appeal of Tobi's setup (executor runs on your own GPU) hides a subtle leak. Every time the executor escalates to the cloud-hosted advisor, some subset of the executor's context goes to the advisor's hosting environment. What gets sent is the executor's choice; once it's sent, it's no longer local.

Commenter @DarshanSays on Tobi's tweet flagged this explicitly: "local + remote advisor mode quietly creates a data egress channel." The pattern gives you cost control and partial privacy — most of your raw data stays local — but not full privacy. For workloads on sensitive data (security tooling, healthcare records, internal source code), the advisor's contract is now an exfiltration vector if it's poorly designed.

The mitigations are real engineering, not config:

Redaction layer between executor and advisor — strip identifiers, replace specific names with placeholders, summarize before sending
Hand-off contract documentation — explicit specification of what gets sent and what's excluded
Audit logging — every advisor call is logged with what was sent, so it's reviewable
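A minimal sketch of the first and third mitigations together, assuming the advisor is reachable as a plain callable. The two regex patterns are deliberately simple examples, not a complete redaction policy, and `redact` and `send_to_advisor` are hypothetical names.

```python
import hashlib
import re
import time
from typing import Callable

# Example patterns only; a real redaction layer needs a reviewed, task-specific list.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "<EMAIL>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),
]


def redact(text: str) -> str:
    """Replace each matched identifier with a placeholder before anything leaves the machine."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


def send_to_advisor(context: str, advisor: Callable[[str], str], audit_log: list) -> str:
    """Redact, record exactly what is sent, then call the (remote) advisor."""
    payload = redact(context)
    audit_log.append({
        "ts": time.time(),
        "sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
        "payload": payload,
    })
    return advisor(payload)


log: list = []
reply = send_to_advisor(
    "ping alice@example.com at 10.0.0.12",
    advisor=lambda prompt: f"advisor saw: {prompt}",   # stand-in for a real API call
    audit_log=log,
)
print(reply)  # advisor saw: ping <EMAIL> at <IP>
```

Routing every call through one function is the point: the log then records what actually left the machine, not what someone intended to send.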

2. Eval is structurally harder than it looks

The benchmark numbers Anthropic and HazyResearch reported are real but represent specific tasks. For your task, you don't know whether the advisor strategy pays off without measuring on your workload. And measuring is harder than for a single-model agent because:

The executor's failure modes and the advisor's failure modes interact — bad escalation can make the advisor strategy worse than executor-alone
The right escalation policy is task-dependent — too eager wastes advisor cost, too reluctant leaves executor stuck
Quality differences from the advisor strategy show up not just in pass/fail but in answer completeness (similar to the Q4 refactor-impact analysis in the CodeGraph-on-Hono benchmark post) and in modal status: is the answer correctly hedged or confidently wrong? (See the modality-flattening discussion in Agent Memory Is a Cache Coherence Problem.)

A serious eval setup for the advisor strategy needs:

Baseline: executor alone, expensive model alone, advisor-strategy variant — three arms, not two
Multiple escalation policies (eager / moderate / conservative) tested separately
Both correctness and completeness scoring, not just pass/fail
Statistical reporting (variance across runs, not just averages)

This is more work than benchmarking a single model. It's the kind of thing teams skip because the "single number" benchmarks already look good — but the single numbers can hide that the policy matters more than the configuration.
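As a rough sketch of what that reporting could look like, assuming each run has already been scored for correctness and completeness on a 0 to 1 scale. The arm names and every score below are placeholders invented to exercise the code. They are not measurements.

```python
import statistics


def summarize(runs):
    """Summarize one arm. `runs` is a list of (correctness, completeness)
    scores in [0, 1], one pair per independent run."""
    correct = [c for c, _ in runs]
    complete = [k for _, k in runs]
    # Sample standard deviation needs at least two runs.
    spread = statistics.stdev if len(runs) > 1 else (lambda xs: 0.0)
    return {
        "runs": len(runs),
        "correct_mean": statistics.mean(correct),
        "correct_stdev": spread(correct),
        "complete_mean": statistics.mean(complete),
        "complete_stdev": spread(complete),
    }


# Placeholder scores only: two baselines plus one advisor arm per policy.
ARMS = {
    "executor_alone": [(1, 0.5), (0, 0.25), (1, 0.5), (0, 0.25)],
    "expensive_alone": [(1, 1.0), (1, 0.75), (1, 1.0), (1, 0.75)],
    "advisor_eager": [(1, 1.0), (1, 0.75), (1, 0.75), (0, 0.5)],
    "advisor_moderate": [(1, 0.75), (1, 0.75), (1, 1.0), (1, 0.5)],
    "advisor_conservative": [(1, 0.5), (0, 0.5), (1, 0.75), (0, 0.25)],
}

report = {arm: summarize(runs) for arm, runs in ARMS.items()}
for arm, s in report.items():
    print(
        f"{arm:22s} correct {s['correct_mean']:.2f} +/- {s['correct_stdev']:.2f}"
        f"  complete {s['complete_mean']:.2f} +/- {s['complete_stdev']:.2f}"
    )
```

Reporting the spread next to the mean is what lets you tell a real difference between escalation policies from run-to-run noise.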

3. Hand-off contract design is real engineering

The advisor strategy's "magic" is the executor calling the advisor at the right moments with the right context and getting back actionable guidance. Every clause in that sentence hides an engineering decision:

When to escalate — after N consecutive failures? When confidence (measured how?) drops below threshold? After K rounds of no progress?
What context to send — the full executor working state? A compressed summary? The recent N actions and outcomes?
How to format the advisor's response — free-form text? Structured JSON? Action recommendations vs. analysis?
How the executor integrates the advice — adopt verbatim? Treat as a hint? Use to seed the next attempt?

Pi-lifeline's defaults (5 rounds before the first advisor call, auto-escalation after 3 consecutive failures, plateau detection at 6 rounds, a maximum of 10 advisor calls per session) are one set of choices. They're reasonable but not universal. The right choices depend on the task; getting them wrong destroys the strategy's value even when the underlying models are good.
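To make the "when to escalate" decision concrete, here is one way those four numbers could be combined into a policy. This is a sketch, not pi-lifeline's implementation: the class names, the state fields, and the way the triggers interact (for example, treating the 5-round minimum as a floor that overrides the failure trigger) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class EscalationPolicy:
    # Default values taken from the numbers quoted above.
    min_rounds_before_first_call: int = 5
    consecutive_failures_to_escalate: int = 3
    plateau_rounds: int = 6
    max_advisor_calls: int = 10


@dataclass
class SessionState:
    # Hypothetical bookkeeping the executor loop would maintain.
    round_no: int = 0
    consecutive_failures: int = 0
    rounds_since_progress: int = 0
    advisor_calls: int = 0


def should_escalate(policy, state):
    """Return True when the executor should call the advisor this round."""
    if state.advisor_calls >= policy.max_advisor_calls:
        return False  # session budget exhausted
    if state.advisor_calls == 0 and state.round_no < policy.min_rounds_before_first_call:
        return False  # give the executor a fair first attempt
    if state.consecutive_failures >= policy.consecutive_failures_to_escalate:
        return True
    return state.rounds_since_progress >= policy.plateau_rounds


policy = EscalationPolicy()
print(should_escalate(policy, SessionState(round_no=5, consecutive_failures=3)))
```

Keeping the thresholds in a dataclass makes the eager, moderate, and conservative variants a matter of constructing three policy objects and running the same eval against each.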

4. Hardware realism

Tobi runs Qwen 3.6 27B on an RTX 6000 (NVIDIA professional-grade). The published benchmarks from Anthropic and HazyResearch use specific model versions and infrastructure. The architecture is reproducible on commodity infrastructure; the specific results are not.

For practitioners considering the pattern:

Local executor (Qwen-class 27B+ dense model with long-context autoresearch loads) realistically needs RTX 6000 or A100-class hardware. Consumer cards (RTX 4090, RTX 5090) work for shorter contexts but throughput drops on long sessions.
Quantized GGUF versions (e.g., Unsloth's quantizations) help with VRAM but not throughput — same hardware needed for the same wall-clock latency
Hybrid cloud-first executor (Haiku/Sonnet on Anthropic API) avoids the hardware question but loses the local-data-leaves-only-on-escalation property

The realistic deployment shape depends on what you're trying to optimize. Cost-only: cheap cloud executor + expensive cloud advisor. Privacy-first: local executor + cloud advisor with redaction. Speed-first: cloud executor with low advisor latency. The advisor-strategy architecture is the constant; the implementation varies by which axis dominates your requirements.

Engineering implementation: where to actually start

If you're considering the advisor strategy on a real workload, the cheapest first step is to measure your single-model agent's token distribution: what percentage of tokens go to context reading vs. format conversion vs. actual reasoning. If 70%+ of tokens go to bulk operations (context reading and format conversion), the advisor strategy has large payoff potential. If the distribution is flatter, the payoff is smaller and the engineering overhead may not be worth it.
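A small sketch of that measurement, assuming you can tag each logged model call with a coarse category and a token count. The category labels and the sample numbers are hypothetical; only the 70% threshold comes from the text.

```python
from collections import Counter

# Hypothetical category labels; tag your own logs however fits your agent.
BULK = {"context_read", "format_conversion"}


def token_shares(events):
    """events: iterable of (category, tokens). Returns category -> fraction."""
    totals = Counter()
    for category, tokens in events:
        totals[category] += tokens
    grand = sum(totals.values())
    if grand == 0:
        return {}
    return {category: tokens / grand for category, tokens in totals.items()}


def bulk_share(events):
    """Fraction of all tokens spent on bulk (non-reasoning) work."""
    shares = token_shares(events)
    return sum(share for category, share in shares.items() if category in BULK)


# Made-up sample log, for illustration only.
sample = [
    ("context_read", 4000),
    ("reasoning", 2500),
    ("context_read", 2000),
    ("format_conversion", 1500),
]
share = bulk_share(sample)
print(f"bulk share: {share:.0%}, worth trying: {share >= 0.70}")
```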

Once you've decided to try the pattern:

Pick an executor model — start with the cheapest model that can complete most of your tasks reliably. For coding agents, Haiku is the obvious starting point; for autoresearch, Qwen 3.6 27B (or whatever local 27B-class model fits your hardware).
Pick an advisor model — Opus for Anthropic-stack workloads, GPT-5.5 for OpenAI-stack, whichever frontier model you trust on the task class.
Design escalation triggers — start with pi-lifeline's defaults as baseline, tune based on observed executor failure patterns. The right number depends on your task.
Design the hand-off contract — what context goes to the advisor, what format the advisor returns. Start minimal (recent N actions + current goal), expand if advisor quality is poor.
Implement redaction — if your data is sensitive, the redaction layer is non-negotiable. If not, you can skip it for v0 but document the egress.
Measure — three-arm eval (executor alone, advisor strategy, expensive alone), correctness + completeness scoring, variance across runs.
Iterate on the policy, not the models — when the strategy underperforms, the fix is usually in escalation timing or hand-off content, not in swapping models.

The pattern works. The engineering around it determines whether it works for you.

Closing — the mini-series, completed

This is the fifth and final post in a series on agent retrieval, memory, and orchestration architectures:

Agent Retrieval Is a Cost Curve Problem (2026-05-25) — Layer 1: retrieval architecture. Why grep+loop beats RAG for code, and why the cost curve says so.
Agent Memory Is a Cache Coherence Problem (2026-05-28) — the cache-coherence frame for lossy auto-capture AI memory tools, with the modality-flattening failure mode mapped out.
I Tested CodeGraph on Hono. The Tool-Call Savings Reproduce — the Cost Savings Don't. (2026-06-01) — empirical: when the cost curve is crossed (mid-size repo, architectural questions, static-typed language), CodeGraph crosses it cheaply.
Agent Retrieval Above the Crossover: A First-Principles Read of CodeGraph (2026-06-08) — Layer 2: index storage. Why SQLite + FTS5 beats vector DBs for the symbol-graph workload, and where CodeGraph's abstraction leaks.
This post (2026-06-15) — Layer 3: model orchestration. The advisor strategy as the third recursion of the cost-curve frame, validated by Anthropic's product, Tobi Lutke's experiment, and HazyResearch's academic version.

Read as one argument: the same cost-curve frame applies at three layers of agent architecture. At each layer, the correct design separates "bulk operations" from "decision operations" and pays only the cost each operation class requires. The five posts are five different applications of one frame, each cross-checked against fresh data as the productized announcements landed — the frame held up because it was built from the same shipped source those announcements describe.

Read as a toolkit: if you're designing or evaluating agent architectures, the question to ask at every layer is the same. What's the cost distribution of operations at this layer? Is there a "bulk vs. decision" split? Can the bulk path use a cheaper tool? Does the expensive tool only need to be on the decision path? Apply at retrieval (grep vs. RAG), storage (FTS5 vs. vector), model orchestration (executor vs. advisor). The next layer the frame will apply to is plausibly memory consolidation (cheap distillation vs. expensive synthesis) — that's a future post topic if the pattern shows up.

A note on Layer 2 iterating fast: between when this mini-series started (the first post, published 2026-05-25) and when this one publishes (2026-06-15), the LLM-symbol-graph layer kept moving. CodeGraph shipped point releases, and more tools in the same class are arriving. They all hit the same six conditions for a viable LLM-symbol-graph that the framework predicted; where they differ is inside the ranking layer, meaning how each one orders the symbols a query surfaces (keyword + heuristics, graph-walk propagation, embedding re-rank, whatever comes next). The six conditions are about what's required for an LLM-symbol-graph to exist at all; the ranking algorithm is the secondary design space within the framework, and it's where the next tool will try to win. The empirical read of CodeGraph on a repo its team didn't choose is in the companion benchmark post; the first-principles architectural read is in the companion Lab post. Layer 2 keeps iterating; the framework is what's stable.

Three threads converged on the advisor strategy because the cost-curve frame produced it independently each time. The frame is the durable insight; the architecture is the frame instantiated at one layer. Reading the source code, watching the productization, and modeling the convergence each contribute to the same picture.

If you build agents and are paying frontier-model rates for tokens that don't need them, the advisor strategy is the practical fix. If you read agents and want a frame for evaluating what comes next, the cost-curve recursion is the lens. The series ends here, five posts and three layers in.

Companion piece 1: Agent Retrieval Is a Cost Curve Problem: Why Claude Code Doesn't Use RAG
Companion piece 2: Agent Memory Is a Cache Coherence Problem
Empirical pair (Operator track): I Tested CodeGraph on Hono. The Tool-Call Savings Reproduce — the Cost Savings Don't.
First-principles companion: Agent Retrieval Above the Crossover: A First-Principles Read of CodeGraph
Background: Consistency in Distributed Systems: Scenarios, Trade-offs, and What Actually Works
Anthropic advisor strategy blog (2026-04-09): "The advisor strategy: Give agents an intelligence boost"
HazyResearch Minions paper: https://arxiv.org/abs/2512.21720
pi-lifeline (open-source escalation extension inspired by Tobi Lutke): https://github.com/robzolkos/pi-lifeline

Agent Retrieval Above the Crossover: A First-Principles Read of CodeGraph

Harrison Guo — Mon, 08 Jun 2026 19:49:41 +0000

The prior post in this series, Agent Retrieval Is a Cost Curve Problem, argued that a viable LLM-symbol-graph would need to satisfy six specific conditions — and that no existing tool had hit all six. The post went live on 2026-05-25; seven days earlier, CodeGraph had hit GitHub trending with exactly those six properties satisfied.

That's the easy version of the update: framework predicted it, someone shipped it, here's the existence proof. The companion piece (I Tested CodeGraph on Hono. The Tool-Call Savings Reproduce — the Cost Savings Don't.) handles the empirical half — 40 verified-connected runs, a decision matrix, the install-or-not call. Short version of that post: the tool-call savings reproduce on an independent repo (−55%), the cost savings from the vendor benchmark don't (+7% at Hono's size). Fewer steps, not fewer dollars, until your repo is big enough.

This post is the harder version of the update.

The interesting question isn't whether CodeGraph works. The interesting question is why its specific architectural choices are right, and where the abstraction inevitably leaks. Answering it gives you the lens for evaluating the next CodeGraph-class tool that ships (and there will be many) without redoing the benchmark each time.

To answer it concretely rather than abstractly, I read CodeGraph against its own artifact: the SQLite database it writes to .codegraph/codegraph.db. Every structural claim below is checked against the index it actually built for Hono (CodeGraph v0.9.7: 362 files, 4,128 nodes, 8,225 edges, a 7.4 MB database). The schema turns out to be the clearest statement of the architecture the tool's README never makes.

tl;dr — CodeGraph's architecture is right for three reasons that aren't obvious from the feature list, and all three are visible in its SQLite schema. (1) The AST extraction boundary: tree-sitter takes what syntax tells you (4,128 nodes across 13 kinds, 8,225 edges across 7 kinds) and leaves the rest to the LLM. The boundary is literal — references syntax can't resolve go into an unresolved_refs table instead of becoming fake edges. (2) SQLite + FTS5, not a vector DB: the index is plain relational tables plus a full-text table over symbol names. Zero embedding columns. The queries are exact lookups that B-tree indexes answer in log time; vector search would be solving a harder problem the workload never asks. This is the prior post's cost curve, recursed onto the index tool itself. (3) The abstraction leaks where syntax diverges from runtime semantics — macros, metaprogramming, codegen, JIT binding. CodeGraph tags its few guessed edges with a heuristic provenance flag (7 of 8,225 on Hono), which is honest; but what tree-sitter can't see at all gets no edge and no flag. Knowing that boundary is what separates a tool you trust from one you cargo-cult.

Why this is a first-principles question, not a tool review

Most coverage of CodeGraph reads like "19k stars in a week, here's the install script." That's news; it isn't analysis. The same coverage will get written for every CodeGraph-class tool that ships in the next 18 months, because the pattern — tree-sitter + local index + MCP server + an instruction snippet that routes the agent to it — is now demonstrated and the ingredients are well known.

The durable question isn't "is CodeGraph good?" It's "what makes this class of tool architecturally correct, and how do I evaluate the next one?" That's what a first-principles read produces. The benchmark in the companion post is one data point; this post is the lens for reading all future data points in the same space.

If you're deciding on CodeGraph specifically, read the companion. If you're thinking about LLM retrieval as a discipline — or about to bet on, or build, a similar tool — read this.

Recap: the six conditions, in 30 seconds

The prior post argued any viable LLM-symbol-graph needed:

No-compile parsing — cold start in seconds, not minutes
Language portability — one binary for many languages, not one server per stack
LLM-shaped API — flat, recordy output the model can digest, not nested LSP hierarchies
Broad enough coverage — code-as-structure plus a text-search fallback for everything else
Live update without reindex — file-watcher-driven, no manual rebuild
Zero-config install — single binary, configures the agent automatically

CodeGraph hits all six (the field-by-field mapping is near the end of this post). Taking the mapping as established, the interesting move is to ask: of the design choices CodeGraph made to hit those six, which were forced and which could have gone the other way? The forced ones are good engineering. The ones that weren't forced — where CodeGraph picked something specific over a live alternative — are where the architecture is making a claim, and where the first-principles content lives.

Three of those choices repay a deep read. The other three (file-watcher update, single-binary distribution, instruction-snippet routing) are well understood in their own fields (OS notifications, package distribution, prompt engineering) and amount to "do the obvious thing well." The three that go beyond the obvious are the ones this post takes apart, each against the actual index.

Section 1 — The AST extraction boundary: an information-theoretic case

CodeGraph parses source with tree-sitter and extracts a specific subset of the syntax into its graph. You don't have to take the README's word for what that subset is — it's enumerable straight out of the nodes and edges tables. On Hono, the 4,128 nodes break down like this:

Node kind	Count	Node kind	Count
import	1,033	method	240
route	873	interface	187
function	569	property	169
file	362	class	50
type_alias	358	enum_member	24
constant	247	variable / enum	16

And the 8,225 edges, which are the actually interesting part:

Edge kind	Count	What it encodes
contains	2,874	structural nesting (file → class → method)
calls	2,230	the call graph
references	1,955	symbol used here, defined there
imports	1,033	module dependency edges
instantiates	124	`new X()` sites
extends	7	class/interface inheritance
implements	2	interface implementation
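Breakdowns like the two above come from a plain GROUP BY over each table. The sketch below runs that query against a tiny in-memory stand-in, because the real column names in .codegraph/codegraph.db are not spelled out here. The `kind` column, the toy schema, and the sample rows are all assumptions.

```python
import sqlite3

# Toy stand-in for the index; the schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (id INTEGER PRIMARY KEY, kind TEXT, name TEXT);
CREATE TABLE edges (source INTEGER, target INTEGER, kind TEXT);
""")
conn.executemany("INSERT INTO nodes (id, kind, name) VALUES (?, ?, ?)", [
    (1, "file", "hono-base.ts"),
    (2, "class", "Hono"),
    (3, "method", "dispatch"),
    (4, "method", "fetch"),
    (5, "function", "match"),
])
conn.executemany("INSERT INTO edges (source, target, kind) VALUES (?, ?, ?)", [
    (1, 2, "contains"),
    (2, 3, "contains"),
    (2, 4, "contains"),
    (4, 3, "calls"),
    (3, 5, "calls"),
])


def count_by_kind(conn, table):
    """Count rows per kind, most frequent first."""
    if table not in ("nodes", "edges"):  # table names cannot be bound as parameters
        raise ValueError(f"unexpected table: {table}")
    query = (
        f"SELECT kind, COUNT(*) FROM {table} "
        "GROUP BY kind ORDER BY COUNT(*) DESC, kind"
    )
    return conn.execute(query).fetchall()


print(count_by_kind(conn, "nodes"))
print(count_by_kind(conn, "edges"))
```

To point this at a real index, open the database file instead of `:memory:` and check the actual column names first, since they may differ from this toy schema.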

Now look at what is not there. No "type" nodes. No generic-instantiation edges. No data-flow edges. No "this dynamic dispatch resolves to that concrete method" edges. CodeGraph extracts calls, references, extends, implements — relationships that are locally apparent in the syntax — and stops. The first-order reading of this is "because tree-sitter doesn't resolve types." True, but circular. The deeper reading is why this division of labor is correct for an LLM consumer.

The information-theoretic case

A type-checker (or full LSP) does work the LLM cannot easily redo: resolving obj.method() to the actual method given the static type of obj, propagating types through generics, walking an inheritance chain to the method actually invoked. That requires the full compilation context — every transitive import, every type definition, every generic instantiation. The cost is high (a build environment, slow cold start, breaks when the build breaks) and the benefit is narrow: precise semantic resolution that's genuinely hard to reconstruct from local context.

A syntactic extractor does different work. It makes the structure of the source queryable, but only the structure that's locally apparent: "function dispatch defined at hono-base.ts:406, calls match here, imported from router." No types, no generics, no runtime binding — but no compilation either.

The information-theoretic question is: given an LLM that's good at semantic reasoning but bad at structural enumeration, what's the right split between what the index provides and what the LLM provides?

CodeGraph's answer: hand the LLM the structural skeleton — what calls what, what's defined where, what imports what — because enumerating that across thousands of files is exactly the part the LLM is bad at and would burn dozens of tool calls trying to do by hand. Leave the semantic resolution — what does this call actually invoke at runtime under dynamic dispatch? — to the LLM, because the LLM is reasonable at that once the relevant code is in its context, and baking a type resolver into the index would multiply the build cost for a recovery the LLM mostly doesn't need.

The clean way to see this boundary is the contains + calls + references edges (7,059 of the 8,225) versus the things that aren't edges at all. When the companion benchmark's Q1 asked how a GET /users/:id request reaches its handler, what CodeGraph gave Claude Code was the call chain — fetch → dispatch → match — as graph edges. What it did not give, and didn't try to, was which concrete match implementation runs given Hono's SmartRouter picking RegExpRouter at runtime. The graph located the players; the LLM read the three files and resolved the dispatch. That's the split working as designed: enumeration from the index, resolution from the model.

The boundary is a literal table

Here's the detail that turns this from an argument into an observation. When tree-sitter sees a reference it cannot statically resolve to a definition, CodeGraph does not invent an edge. It writes a row to a separate unresolved_refs table — name, location, the node it came from, no target. The schema has a first-class place for "I saw a use here, I could not prove what it binds to."

On Hono, unresolved_refs has zero rows — and, as it turns out, so did every other repo I indexed to check it (Section 3 has that result, and it's not the one I expected). The empty table isn't the interesting part; the table existing is the architecture stating its own boundary. A tool that faked those edges — guessed a target to make the graph look complete — would be lying to the LLM in exactly the way that produces confident wrong answers. CodeGraph's choice to record the unresolved reference as unresolved is the same discipline a good cache has when it marks an entry stale instead of serving it: the honest move is to represent "don't know," not to paper over it.
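The discipline described above can be sketched in a few lines: when a reference has no provable target, record it as unresolved instead of guessing an edge. This is an illustration of the idea, not CodeGraph's code. The function name, the input shapes, and the sample data are hypothetical.

```python
def link_references(definitions, references):
    """Turn raw references into edges, keeping the unprovable ones separate.

    definitions: dict mapping symbol name -> node id
    references:  list of (from_node, name, location) tuples
    """
    edges, unresolved = [], []
    for from_node, name, location in references:
        target = definitions.get(name)
        if target is None:
            # No provable target: record "don't know" rather than guessing.
            unresolved.append(
                {"name": name, "location": location, "from_node": from_node}
            )
        else:
            edges.append((from_node, target, "references"))
    return edges, unresolved


# Made-up symbols and locations, for illustration only.
definitions = {"dispatch": 3, "match": 5}
references = [
    (4, "dispatch", "hono-base.ts:10"),
    (3, "match", "hono-base.ts:20"),
    (3, "pluginHook", "hono-base.ts:30"),
]
edges, unresolved = link_references(definitions, references)
print(edges)
print(unresolved)
```

The point of the second return value is that a consumer can see the gap. A version that picked the nearest-looking name for `pluginHook` would return a fuller graph and a less truthful one.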

Why this matters beyond CodeGraph

This boundary — syntactic graph for the index, semantic reasoning for the LLM — is the line the next generation of LLM-coding tools will either hold or violate. The violations are predictable:

Too far toward semantics in the index: a tool that tries to be a full LSP-plus for the LLM. High build cost, slow cold start, fragile on broken builds, marginal benefit because the LLM can do that resolution from local context anyway.
Too far toward raw text in the index: a tool that's just "grep with nicer indexing" — fast and broad, but it doesn't hand the LLM the structural skeleton it actually needs. That's the position grep+loop already occupies; an index there adds little.

CodeGraph sits in the middle, and that position is right for current LLM capability. As models get better at semantic resolution the line will move one way; as tool-loop iteration gets cheaper it will move the other. But the principle — that there's an information-theoretic boundary worth picking, and that picking it requires modeling the LLM's real strengths and weaknesses — is the durable take. The right way to evaluate any new LLM-retrieval tool starts here: what does it choose to extract, what does it leave for the LLM, and is that split calibrated for what an LLM is actually good at?

Section 2 — SQLite + FTS5 vs vector DB: the cost curve, recursed

CodeGraph stores its symbol graph in a local SQLite database. Not Chroma. Not Pinecone. Not Weaviate. Not Qdrant. The full table list from Hono's index:

nodes              edges              files
unresolved_refs    nodes_fts          schema_versions
project_metadata   (+ FTS5 shadow tables: nodes_fts_data/idx/docsize/config)

nodes and edges are plain relational tables. nodes_fts is an FTS5 virtual table. Searching the whole schema for an embedding column, a vector type, a float array — anything ANN-shaped — returns nothing. The only BLOB columns are FTS5's own internal segment storage (nodes_fts_data), not vectors. There are no embeddings in CodeGraph. That's not an omission; it's the architecture, and it's the same call the prior post made one level down.

The cost-curve frame, recursed

The prior post argued vector RAG over a codebase pays a build cost (chunk + embed every file), a maintain cost (re-embed on change, reconcile cross-chunk references), and a low per-query cost (ANN search + rerank) — and that for most repos this loses to grep+loop's (zero build, zero maintain, per-query round-trips).

Apply that exact frame to CodeGraph's own storage. If CodeGraph used a vector DB for its symbols, it would pay: embed every symbol's signature and body on index; re-embed on every file save (the file-watcher would have to fire embedding calls); ANN search per query. That's the same curve the prior post argued against — and CodeGraph's workload doesn't justify it, because the queries it serves are exact lookups, not similarity searches. The schema proves the queries are exact by the indexes it builds for them:

"Find symbol getUserById" → idx_nodes_name, and idx_nodes_lower_name for case-insensitive matches. A B-tree probe, microseconds. FTS5 (nodes_fts over name, qualified_name, docstring, signature) handles the fuzzier "name contains" variants. No similarity math.
"Who calls Context.set?" → idx_edges_target_kind (a reverse-edge index on (target, kind)). Reverse adjacency lookup, deterministic.
"What does dispatch call?" → idx_edges_source_kind (the forward-edge index). Forward adjacency, deterministic.
"Trace fetch → db_query" → repeated forward-edge hops over those same indexed edges. Graph traversal on stored adjacency, no vectors anywhere in the loop.

Those forward and reverse edge indexes are the whole ballgame. Callers and callees — the queries a code-intelligence tool exists to answer — are a single indexed adjacency lookup in each direction. Vector search cannot do this better; it can only do it fuzzier and more expensively, because "who calls this function" has an exact answer that an approximate-nearest-neighbor index would blur.
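A compact sketch of why those lookups need no vectors. The index names match the ones quoted above, but the table layout (symbol names stored directly on the edge rows) and the sample call edges are simplifying assumptions made for the example.

```python
import sqlite3
from collections import deque

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE edges (source TEXT, target TEXT, kind TEXT);
CREATE INDEX idx_edges_source_kind ON edges (source, kind);
CREATE INDEX idx_edges_target_kind ON edges (target, kind);
""")
# Toy call graph, invented for illustration.
conn.executemany("INSERT INTO edges VALUES (?, ?, 'calls')", [
    ("fetch", "dispatch"),
    ("dispatch", "match"),
    ("dispatch", "compose"),
    ("compose", "handler"),
    ("handler", "db_query"),
])


def callees(conn, name):
    """Forward adjacency: what does `name` call?"""
    rows = conn.execute(
        "SELECT target FROM edges WHERE source = ? AND kind = 'calls' ORDER BY target",
        (name,),
    )
    return [row[0] for row in rows]


def callers(conn, name):
    """Reverse adjacency: who calls `name`?"""
    rows = conn.execute(
        "SELECT source FROM edges WHERE target = ? AND kind = 'calls' ORDER BY source",
        (name,),
    )
    return [row[0] for row in rows]


def trace(conn, start, goal):
    """Breadth-first search over forward call edges; one shortest path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in callees(conn, path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


print(callers(conn, "dispatch"))
print(trace(conn, "fetch", "db_query"))
```

Every step is an equality match on an indexed column pair, so each answer is exact and repeatable. That is the property an approximate-nearest-neighbor index would give up.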

The only queries where vector search genuinely helps are semantic ones with no symbol to anchor on — "show me the code that does authentication." CodeGraph doesn't serve those. The LLM does, by issuing a sequence of exact structural queries and reasoning across the results. The division is the same one from Section 1: the index answers the exact-lookup questions deterministically; the LLM answers the fuzzy-intent questions by orchestrating exact lookups. Neither needs an embedding.

The recursion as a design principle

What's elegant — and worth surfacing for its own sake — is that CodeGraph's storage choice is consistent with the retrieval philosophy from the prior post, one level up. Both arguments are the same sentence: exact-lookup workloads should use exact-lookup tools; approximation overhead is paid only where approximation pays back.

If CodeGraph had reached for Chroma over FTS5, it would have violated its own retrieval philosophy — paying embedding and ANN cost to answer questions that have exact answers. That it didn't, that the designer recognized the symbol-graph workload is exact-lookup-shaped and picked the cheapest exact-lookup storage available, is what makes the architecture coherent across layers rather than just locally clever.

The next tool in this class will face the same fork, and most will reach for a vector DB by default, because "AI tooling = vector store" is the reflex. CodeGraph's choice is the corrective: ask what your workload needs, not what the category's fashion suggests. That's the cost-curve frame functioning as a meta-design tool — every time you add a layer to an LLM stack, ask which side of the curve the new layer's workload sits on, and pick storage and algorithm from the answer, not the trend.

Section 3 — Where CodeGraph's abstraction leaks

Every index lies a little. The question is where it lies and whether you can tell when it does.

CodeGraph's graph is built from syntactic extraction, so anywhere the runtime semantics diverge from the syntactic structure, the graph is incomplete in a way that's hard to detect from the index alone. The leak isn't a bug; it's the abstraction working as designed, at a layer that structurally cannot see certain phenomena. There's a tell for it in the schema, and there's a part the schema can't tell you about — and the difference between those two is the whole point.

The honest part: the provenance column

CodeGraph stamps every edge with a provenance value. On Hono, 8,218 of the 8,225 edges have empty provenance — meaning direct from the syntax tree — and exactly 7 carry the value heuristic. Those seven are edges CodeGraph's framework adapters inferred from a recognized pattern rather than read off the AST: route registrations, framework binding conventions, the handful of cases where a tool that "supports Hono / Flask / Spring" pattern-matches a known idiom and synthesizes an edge the raw syntax doesn't spell out.

That heuristic tag is the architecture being honest. It is, in the vocabulary of the memory post in this series, an arrow: every edge points back to how it was derived, and the seven guessed edges are flagged as guesses. A consumer that cared could treat heuristic edges with less trust than syntactic ones. That's good cache hygiene — the index records the confidence of its own entries instead of presenting all of them as equally certain.
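A consumer that wanted to act on that flag might label edges before handing them to the model. The sketch below assumes edges arrive as dictionaries with a `provenance` field that is empty for syntax-derived edges and `"heuristic"` for inferred ones, as described above. The label names and the handling of any other value are assumptions.

```python
def trust_label(edge):
    """Map an edge's provenance to a coarse trust label."""
    provenance = edge.get("provenance")
    if not provenance:  # empty string or missing: read directly off the syntax tree
        return "syntactic"
    if provenance == "heuristic":
        return "heuristic"
    return "unknown"  # a value this sketch does not know about


def annotate(edges):
    """Attach a trust label so downstream prompts can show which edges are guesses."""
    return [{**edge, "trust": trust_label(edge)} for edge in edges]


# Made-up edges, for illustration only.
edges = [
    {"source": "fetch", "target": "dispatch", "kind": "calls", "provenance": ""},
    {"source": "app", "target": "GET /users", "kind": "references", "provenance": "heuristic"},
]
for edge in annotate(edges):
    print(edge["source"], "->", edge["target"], f"[{edge['trust']}]")
```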

The part the schema can't tell you about

Here's the catch, and it's the one that matters: the provenance column only flags edges that exist. The dangerous leak isn't a guessed edge that's marked as guessed. It's the edge that should exist and isn't there at all — because the relationship lives in a layer tree-sitter cannot see, so there's nothing to extract, nothing to tag, and nothing to warn you. The four big zones where this happens:

Macro-heavy code. In Rust, vec![1, 2, 3] expands at compile time into a call sequence the AST never contains; the graph shows a vec! invocation, not the Vec::new() + push() that actually runs. For procedural macros (#[derive(...)], attribute macros), the generated implementation is what executes and CodeGraph can't see into it without running the compiler — which would forfeit the no-compile property that Section 1 showed is the whole point. Same shape in C/C++ preprocessor-heavy code, Lisp/Clojure macros, Elixir compile-time metaprogramming.

Metaprogramming. Python decorators routinely rewrite functions: @dataclass synthesizes __init__/__repr__/__eq__; @app.route("/users") registers a handler with a router. Tree-sitter sees the decorator and the function as adjacent syntax, not the synthesis or the registration. CodeGraph's framework adapters catch the common cases — and that's literally what the 7 heuristic edges on Hono are — but arbitrary user-defined decorators that mutate behavior are invisible. Ruby method_missing, Python __getattr__, Java reflection: same story. The graph confidently returns "no callers" for a method invoked entirely through reflection, and the LLM, trusting structured output, may hand you a confidently wrong blast radius.

Generated code. Protobuf, GraphQL codegen, OpenAPI clients, ORM model generation (Prisma, SQLAlchemy declarative), JSX/Svelte compilation — the code the runtime executes isn't the code in source control. It lives in build/, dist/, .cache/, places .gitignore excludes. CodeGraph indexes what's checked in; the generated layer is outside the boundary. "Who implements UserService?" returns the hand-written interface, not the generated stub that implements it on the wire. Any source-only index has this; it's worth naming because it interacts badly with the user's instinct that an "AST graph" must be complete. It's complete over the source it indexed — and the generated layer was never in that source.

JIT and runtime-registered bindings. DI containers (Spring, Guice, Dagger, ASP.NET service collection), FastAPI Depends, plugin systems with runtime registration, and — the one the companion benchmark hit directly — middleware chains composed at app startup. Hono's app.use(...) builds the middleware array at runtime; tree-sitter sees the use call sites and the handler as unconnected syntax. When the benchmark's Q2 asked Claude Code to trace the middleware call stack, what codegraph_trace could return was the syntactic call chain through compose() — accurate as far as it goes, and genuinely fewer steps than baseline grep — but the actual runtime ordering of middlewares is assembled by app.use calls scattered across the app, which the graph doesn't compose. The trace looked authoritative and was structurally real; it just wasn't the runtime composition, and only someone who knew the leak zone would know to check.
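The metaprogramming zone is easy to reproduce in miniature. The sketch below swaps in Python's built-in `ast` module for tree-sitter, so it illustrates the class of blind spot rather than CodeGraph's behavior. The `route` decorator and the `serve` function are invented for the example.

```python
import ast

SOURCE = '''
ROUTES = {}

def route(path):
    def register(fn):
        ROUTES[path] = fn   # registration happens at import time
        return fn
    return register

@route("/users")
def list_users():
    return ["alice", "bob"]

def serve(path):
    return ROUTES[path]()   # the call target is a dict lookup, not a name
'''


def syntactic_callees(source):
    """Names that appear as direct call targets, e.g. the `f` in `f(...)`."""
    tree = ast.parse(source)
    return {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }


called = syntactic_callees(SOURCE)

# Now run the same source and observe what actually executes.
namespace = {}
exec(SOURCE, namespace)
result = namespace["serve"]("/users")

print("syntactic call targets:", called)
print("list_users has a syntactic caller:", "list_users" in called)
print("runtime result of serve('/users'):", result)
```

A purely syntactic extractor finds a call to `route` and nothing that calls `list_users`, yet `list_users` is exactly what runs when `serve("/users")` executes. A caller query over this source would report no callers for a function that is live.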

The empirical check, and the null result that sharpens it

I expected unresolved_refs to be where this shows up — index a macro-heavy repo, watch the table fill. So I indexed three to test it: Hono (TypeScript), click (Python, decorator-heavy), and ron (a Rust crate leaning on derive macros and serde). unresolved_refs was zero on all three; heuristic edges were 7, 0, and 0. The null result is the finding. A #[derive(Serialize)] impl never appears as an unresolved reference, because nothing in the source ever wrote a reference to it to leave dangling — the impl only exists after macro expansion. codegraph callers serialize on ron returns its seven real syntactic callers and silently omits whatever the derive generates, with no flag and no empty-table warning, because from the index's point of view nothing is missing. And that is the trap. An empty unresolved_refs table reads like a clean bill of health, but on derive-heavy or reflection-heavy code it means the opposite of "everything resolved" — it means the thing that didn't resolve never left a trace to flag. The table catches references it can't resolve; it cannot catch code that was never written down to reference. That's the leak that costs you: not the guess that gets flagged, but the absence that looks exactly like completeness. It's the same failure shape as the memory post's "could" stored as "did" — the dangerous error is always the one that wears the face of a correct answer.

Why mapping the leaks matters

A tool you trust everywhere is a tool you stop checking. The four zones above are where the LLM, trusting the graph, gives you confidently wrong answers — and those are the failures that cost real engineering time, because the answer looks right and you have no reason to second-guess it.

The practical rule is small. Inside one of these zones — heavy macros, reflection/DI, codegen-heavy projects, runtime-composed bindings — CodeGraph is still a fine starting point, but the LLM's answer has to be cross-checked against the runtime, not against the graph. Outside them — most application code in most languages, which is most of what most people query — the graph is enough. The provenance column tells you which present edges were guessed; nothing tells you which absent edges were never seen. That asymmetry is the actual trust boundary, and it's the thing to internalize before you wire any syntactic index into an agent's decision loop. Joel Spolsky named this pattern for compilers and frameworks twenty years ago — every abstraction leaks, and you pay for the leak precisely when you've forgotten the abstraction is there. CodeGraph is the latest data point in a very old series.

Mapping CodeGraph to the six conditions

Field-by-field, how CodeGraph hits each condition from Agent Retrieval Is a Cost Curve Problem. Compressed; the prior post defines the conditions, the companion post applies them empirically.

1. No-compile parsing. Tree-sitter parses source into an AST with no build invocation, no dependency resolution, no language environment. On Hono, 362 files indexed to 4,128 nodes and 8,225 edges in 1.7 seconds; the published 7-repo benchmark reports first-index on the order of minutes for VS Code-scale (~30k files), all subsequent updates incremental. LSP needs tsc / cargo check / mvn; CodeGraph reads raw text. Met.

2. Language portability. ~19 languages via tree-sitter, plus framework adapters for route-aware extraction (Hono's 873 route nodes come from one of them). One binary, no per-language server. Met.

3. LLM-shaped API. Here the scaffold version of this post — and a lot of the casual coverage — gets a fact wrong worth correcting precisely. The CLI exposes a dozen commands (query, callers, callees, impact, affected, context, …). But the MCP server exposes exactly five tools to the agent: codegraph_search (locations only), codegraph_context (described in its own schema as the PRIMARY tool, call FIRST for any how-does-X-work question), codegraph_node (one symbol plus its callers/callees trail), codegraph_explore (several related symbols in one capped call), and codegraph_trace (the call path between two symbols). The narrowing is the design: the human CLI gets impact and affected as separate verbs; the agent gets a context-first surface of five flat tools, each returning {symbol, file, line, snippet, related[]}-shaped records, with the instruction snippet steering it to codegraph_context before anything else. Ten tools would be worse for an LLM than five; CodeGraph picked five. Met, deliberately.

4. Coverage breadth. Symbol graph for structure; FTS5 over name, qualified_name, docstring, signature for text-fallback; Claude Code's native Grep stays enabled for everything outside the index. Partially met — the correct partial.

5. Live update without reindex. OS file-watcher with a short debounce; a save re-parses the touched file and re-resolves dependents' import edges. Met.

6. Zero-config install. Single binary, one-line install, auto-detects the agent, writes the MCP config and the instruction snippet, then codegraph init -i builds the index. About ten minutes from curiosity to a working setup on repos under ~1,000 files. Met.

Six for six, counting the deliberate partial on coverage. The architecture the prior post argued was theoretically right but practically missing exists, in production, with a working installer. Read against its own schema, the choices hold up under inspection rather than just on the landing page.
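For a concrete feel of condition 4's text fallback, here is a minimal sketch of an FTS5 table over the four columns named above, using Python's built-in sqlite3 module. The table name, the two rows, and the search helper are my assumptions for illustration, not CodeGraph's actual schema or data, and it assumes your SQLite build ships with FTS5 enabled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed table name; the four columns are the ones listed under condition 4.
conn.execute(
    "CREATE VIRTUAL TABLE symbols "
    "USING fts5(name, qualified_name, docstring, signature)"
)

# Made-up rows, only here to have something to search.
conn.executemany(
    "INSERT INTO symbols VALUES (?, ?, ?, ?)",
    [
        ("compose", "hono.compose",
         "Compose middleware into one handler", "compose(middleware)"),
        ("match", "router.match",
         "Match a path against the route table", "match(method, path)"),
    ],
)


def search(term):
    """Exact-token full-text lookup across all four columns, best match first."""
    rows = conn.execute(
        "SELECT name FROM symbols WHERE symbols MATCH ? ORDER BY rank",
        (term,),
    )
    return [name for (name,) in rows]


print(search("middleware"))
print(search("route"))
print(search("nonexistent"))
```

The lookup is token-exact: there is no embedding and no nearest-neighbor step, so a term either matches a stored token or returns nothing.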

What this says about LLM retrieval as a discipline

Three things, in increasing order of generality.

1. The right LLM-index design is not a copy of human-IDE design. Sourcegraph and LSP were built for a human reading one precise answer; an LLM reads many cheap rounds and reasons across them. The architectures should differ, and CodeGraph's choices — tree-sitter not LSP, five flat MCP tools not a nested LSP API, FTS5 not vectors — are evidence of someone designing for the actual consumer instead of porting an existing design. The framework predicts the design space, and the interesting variation between the tools that will fill it is not in the six conditions (those are now the table stakes) but in the ranking layer — how each one orders the symbols a query surfaces. That's where the next tool will try to win, and where the next benchmark should aim.

2. The cost-curve frame is recursive. It applies to every layer of an LLM stack, including the tools that wrap the LLM. CodeGraph's FTS5-not-Chroma choice is the same shape as the original grep-not-RAG choice. Use it as a meta-design tool: at every layer, ask which side of the curve the workload sits on, and let that pick the storage and the algorithm.

3. The abstraction leaks are the trust boundary — and trust, in the end, has to terminate at the source. This is the thread that runs through the whole series. CodeGraph's graph is a derived view of the source: a cache. Its heuristic provenance tags and its unresolved_refs table are the parts where it keeps an arrow back to that source and is honest about what it did and didn't see. But a syntactic graph is still a lossy projection of a running program, and the leak zones are exactly where the projection drops information that only exists at runtime. The discipline that falls out of this is the same one the retrieval post and the memory post arrived at from their own directions: a derived artifact is trustworthy only where you can check it against the source that produced it. CodeGraph is fast and exact in the 80% of code where syntax determines structure, and quietly incomplete in the 20% where it doesn't — and the only way to stay out of the failure modes is to remember the graph is a cache and keep the real code, the actual runtime, as the thing that wins every conflict.

The bigger move CodeGraph represents — third-party MCP tools filling the retrieval gap the foundation model's main agent doesn't fill — is the ecosystem direction the feature-flag analysis in the prior post suggested Anthropic is hedging toward. Whether Anthropic eventually builds tree-sitter symbol-graph functionality natively or leaves it to the CodeGraph-class ecosystem is a product call. The technical case for "let MCP fill it" is strong: the design space is still settling, and locking one approach into Claude Code spends option value the ecosystem is currently pricing for free.

Closing — the mini-series arc

This is the third of a three-part Lab series on Claude Code's retrieval and memory architectures:

Agent Retrieval Is a Cost Curve Problem (2026-05-25) — why grep+loop, not RAG, for most projects
Agent Memory Is a Cache Coherence Problem (2026-05-28) — why hand-curated Markdown, not lossy vector recall, for cross-session memory
This post (2026-06-08) — what lives above the cost-curve crossover: CodeGraph as the architecturally coherent symbol-graph companion the first post argued was missing, read first-principles against its own index for what its choices say about the discipline

Read together, the three describe one stance on agent retrieval and memory: choose lossless and exact by default; expose MCP as the integration substrate; let third-party tools fill the gaps you don't want to own; and keep an arrow back to the source everywhere, because every derived view is a cache and the source is the only thing that can't drift from itself. The cost-curve frame is the math, the cache-coherence frame is the failure taxonomy, and the first-principles read of CodeGraph is what the architecture, looked at carefully, says about where LLM retrieval is going.

If you're building agent retrieval, the three frames are now in your toolkit. The companion empirical post gives you the install-or-not decision; this one gives you the lens for the next ten tools that ship in the same space.

Companion piece 1 (this is the third in a 3-post Lab series): Agent Retrieval Is a Cost Curve Problem: Why Claude Code Doesn't Use RAG
Companion piece 2: Agent Memory Is a Cache Coherence Problem
Empirical pair on the Operator track: I Tested CodeGraph on Hono. The Tool-Call Savings Reproduce — the Cost Savings Don't.
Background: Consistency in Distributed Systems: Scenarios, Trade-offs, and What Actually Works
CodeGraph repo: https://github.com/colbymchenry/codegraph

I Tested CodeGraph on Hono. The Tool-Call Savings Reproduce — the Cost Savings Don't.

Harrison Guo — Mon, 01 Jun 2026 22:39:59 +0000

Two weeks ago CodeGraph hit GitHub trending — tree-sitter + SQLite/FTS5 + MCP for Claude Code, 19k+ stars in a week. The team published a benchmark on 7 repos showing 35% cheaper, 57% fewer tokens, 46% faster, 71% fewer tool calls vs. baseline.

Those are big numbers. They're also numbers from a benchmark designed by the team that built the tool, on repos they chose. Designer bias is the #1 risk in any retrieval benchmark — when you pick the test repos and write the ground truth, you'll consciously or unconsciously favor your own tool's strengths.

So I ran an independent test on an 8th repo — Hono (TypeScript, ~280 source files, in neither CodeGraph's published 7-repo suite nor any other published benchmark I could find). 5 architectural questions covering different retrieval shapes, with a deliberate control case (Q5) where the tool should not win. Two conditions (baseline grep+Read+Glob+Explore vs. CodeGraph active), 4 repeats per question per condition. 40 runs on Claude Opus 4.8 — and, critically, every CodeGraph run was verified to have connected, and actual codegraph_* tool usage was recorded per run (more on why that sentence exists below).

The result splits in a way the single published headline number hides — and the split is the useful part.

tl;dr — On Hono, CodeGraph delivers a large, consistent reduction in tool calls (−55%, 14.0 → 6.3 avg) and a smaller latency win (−20%); the published 7-repo direction reproduces here. But cost is a wash: +6.8%, not the published −35%. On narrow-scope questions (route lookup, middleware trace) CodeGraph is actually 22-43% more expensive, because each structural lookup loads a big chunk of graph context that costs more in cached tokens than the grep round-trips it replaces. The cost win only appears on broad multi-file navigation (Q3 multi-runtime adapters: −29% cost, −80% tool calls, −53% latency). A second finding: baseline grep+Read has high variance. The agent occasionally spiraled to 47-52 tool calls on the broad questions, while CodeGraph never exceeded 16. Net at Hono's size: CodeGraph makes the agent take fewer steps and finish faster, but not for fewer dollars. Total cost of the 40 valid runs: ~$14 of Opus 4.8 calls. Raw per-run CSV and the 5 verbatim prompts are below.

What "tool calls down, cost flat" actually means

CodeGraph's published 7-repo suite (VS Code, Excalidraw, Django, Tokio, OkHttp, Gin, Alamofire) skews larger and more architecturally complex than Hono. Hono is ~280 TypeScript source files (362 files indexed by CodeGraph, including tests and configs), 16MB on disk — small enough that a thoughtful agent with grep + Read can finish most architectural questions in a handful of tool calls.

The interesting result is that the axes come apart. CodeGraph replaces several grep+Read round-trips with one or two structural lookups — so step count drops hard (-55%). But each codegraph_context / codegraph_explore call returns a sizeable chunk of graph context, which then rides along in the conversation cache and gets re-read every turn. At Hono's size, the dollar cost of carrying that cached payload roughly equals the dollar cost of the grep round-trips it replaced — so dollars stay flat (+7%) even as steps fall by more than half.

That's not a contradiction of the cost-curve thesis from the prior post in this mini-series — it's a sharper reading of it. Hono sits above the step-count crossover (the index already saves tool calls) but below the dollar crossover (it doesn't yet save money). On a much bigger repo, the grep path churns through far more files and the index pays back on dollars too. Hono just happens to land in the gap between the two crossovers.
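The gap between the two crossovers can be sketched with a toy cost model. Everything numeric here except the turn counts is an assumption: the payload sizes, the starting context, and the relative price of a cached read versus fresh tokens are placeholders I picked to show the shape, not measured values or real API prices. The turn counts (14 vs 6, and 35 vs 7) are the rounded averages from this benchmark.

```python
def session_cost(turns, payload_tokens, base_tokens=5000,
                 write_rate=1.0, read_rate=0.1):
    """Relative cost of a session where every turn re-reads the whole
    accumulated context at a cheap cached rate, then appends one tool
    result at the full rate. Units are arbitrary, not dollars."""
    total = 0.0
    context = base_tokens
    for _ in range(turns):
        total += read_rate * context          # re-read everything so far
        total += write_rate * payload_tokens  # pay for the new tool output
        context += payload_tokens             # it rides along from now on
    return total


# Assumed payloads: a grep/Read step returns a little, a graph lookup a lot.
GREP_PAYLOAD = 1500
GRAPH_PAYLOAD = 6000

# Typical Hono question: 14 grep turns vs 6 graph turns.
narrow_grep = session_cost(14, GREP_PAYLOAD)
narrow_graph = session_cost(6, GRAPH_PAYLOAD)

# Broad question (Q3 shape): 35 grep turns vs 7 graph turns.
broad_grep = session_cost(35, GREP_PAYLOAD)
broad_graph = session_cost(7, GRAPH_PAYLOAD)

print(f"narrow: grep={narrow_grep:.0f} graph={narrow_graph:.0f}")
print(f"broad:  grep={broad_grep:.0f} graph={broad_graph:.0f}")
```

Under these assumed parameters the graph arm takes fewer turns in both cases but is only cheaper in the broad one, because the re-read term grows roughly with the square of the turn count while the payload term grows linearly.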

A useful complementary benchmark answers three things the published one doesn't:

Cross-validation on a repo not chosen by the tool's team — do the published advantages generalize?
Within-repo variance across question types — does the win concentrate on certain question shapes? (It does — heavily.)
A control case where the tool shouldn't win — Q5 (text search) tests whether the agent correctly declines to use the structural engine when grep is the right tool.

Setup — install CodeGraph, ~10 minutes

# install (downloads a single binary, no Node/npm required)
curl -fsSL https://raw.githubusercontent.com/colbymchenry/codegraph/main/install.sh | sh

# clone the test repo + index it
git clone https://github.com/honojs/hono.git ~/tmp/hono
cd ~/tmp/hono
codegraph init -i

Index build time on Hono (362 files, 4,128 nodes, 8,225 edges): 1.7 seconds. On-disk index: 7.1 MB.

Per-condition setup for the two arms:

Baseline (control): a clean copy of Hono via rsync -a --exclude='.codegraph/' to a separate directory so Claude couldn't accidentally grep into the index. No MCP servers registered. Agent uses native Glob + Grep + Read + Explore + Task.
CodeGraph active: original Hono directory with .codegraph/ present, MCP server registered:

{"mcpServers": {"codegraph": {"command": "codegraph", "args": ["serve", "--mcp"]}}}

Both arms run claude --print --output-format stream-json --model opus so the model and the rest of the agent loop are identical; the only varying input is whether the CodeGraph MCP server is in the loop. Each run is a fresh session with no prior context.

Verifying the tool actually ran (this is not optional)

A retrieval-tool benchmark is only valid if the tool is actually in the loop, and I learned that the hard way. My first pass at this benchmark silently ran with CodeGraph's MCP server never connected: the config was missing the --mcp flag, and Claude Code proceeds without a server that fails to complete its handshake in time rather than erroring out. Every "CodeGraph" run was really just grep+Read. The comparison was noise, and the numbers looked plausibly small, which is exactly how a broken benchmark slips through.

So for the data here, every run is instrumented:

--strict-mcp-config — only the server under test is loaded, with no contamination from other globally-registered MCP servers.
Pre-warmed daemon + MCP_TIMEOUT=30000 — CodeGraph's stdio server attaches to a warm daemon and finishes its handshake before the agent loop starts. (MCP connection is async; on a fast question the agent can otherwise finish before a cold server is ready.)
A per-run assertion that CodeGraph reached connected status, plus a record of whether the agent actually invoked a codegraph_* tool. Runs that didn't connect were discarded.

All 20 CodeGraph runs in this post connected. The agent invoked CodeGraph on Q1-Q4 (4/4 repeats each) and — correctly — chose grep on the Q5 control (0/4). Most vendor benchmarks never report this check. After watching mine fail it silently, I won't publish a retrieval benchmark without it.
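A minimal sketch of what such a per-run check can look like, in Python. The event shapes below (a system init event carrying an mcp_servers list, and tool_use blocks inside assistant messages) are my assumption about what a stream-json transcript looks like, and the three sample lines are invented; check them against your own output before relying on this. It is not the harness used for the numbers in this post.

```python
import json

# Invented transcript lines in the ASSUMED stream-json shape.
SAMPLE = [
    '{"type": "system", "subtype": "init", '
    '"mcp_servers": [{"name": "codegraph", "status": "connected"}]}',
    '{"type": "assistant", "message": {"content": ['
    '{"type": "tool_use", "id": "t1", "name": "mcp__codegraph__codegraph_context"}]}}',
    '{"type": "assistant", "message": {"content": ['
    '{"type": "tool_use", "id": "t2", "name": "Grep"}]}}',
]


def audit_run(lines, server="codegraph"):
    """Report whether the server connected and how often its tools ran."""
    connected = False
    tool_ids = set()     # unique tool_use blocks, any tool
    server_ids = set()   # unique tool_use blocks for the server under test
    for line in lines:
        event = json.loads(line)
        if event.get("type") == "system" and event.get("subtype") == "init":
            for entry in event.get("mcp_servers", []):
                if entry.get("name") == server and entry.get("status") == "connected":
                    connected = True
        elif event.get("type") == "assistant":
            for block in event.get("message", {}).get("content", []):
                if block.get("type") == "tool_use":
                    tool_ids.add(block["id"])
                    if f"{server}_" in block.get("name", ""):
                        server_ids.add(block["id"])
    return {
        "connected": connected,
        "tool_calls": len(tool_ids),
        "server_calls": len(server_ids),
    }


report = audit_run(SAMPLE)
print(report)
```

A run where connected is false gets discarded no matter how plausible its cost looks; a run where connected is true and server_calls is zero is kept, since that is exactly the Q5 behavior.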

The 5 questions

Full verbatim prompts in the Appendix. Brief overview:

#	Question	What it tests	Hypothesis for CodeGraph
Q1	Route resolution: `GET /users/:id` → handler	Route-aware extraction	Strong win
Q2	Middleware chain trace through `app.use`	Dynamic dispatch tracing	Decisive win via structural lookup
Q3	Multi-runtime adapter architecture	Cross-file abstraction	Mid-strong win
Q4	Refactor impact: add mandatory `requestId` to `Context`	Impact propagation + completeness	Strong win (what `codegraph_impact` is built for)
Q5	Text search: every literal `'Content-Type'`	Keyword search baseline	~Parity; agent should decline the tool

Q5 is the CONTROL — the tool should not win here, and whether the agent even reaches for it is itself a signal.

The data

Each row averaged over 4 repeats. Cost is Claude's own total_cost_usd (the API's authoritative figure, not my own multiplication); wall latency from request to final token; tool calls counted from unique tool_use blocks in the transcript; tokens are input + output (cache tokens tracked separately in the CSV).

Cost / tokens

Q	Baseline cost	CodeGraph cost	Δ cost	Baseline tokens	CodeGraph tokens	Δ tokens
Q1 route	$0.321	$0.393	+22.5% ❌	10,115	6,045	−40.2%
Q2 middleware	$0.212	$0.303	+43.4% ❌	7,233	5,649	−21.9%
Q3 multi-runtime	$0.490	$0.348	−28.9% ✓✓	11,582	7,048	−39.1%
Q4 refactor	$0.402	$0.509	+26.5% ❌	9,119	8,567	−6.1%
Q5 text (ctrl)	$0.267	$0.253	−5.3%	8,874	8,998	+1.4%
Aggregate	$0.338	$0.361	+6.8%	9,385	7,261	−22.6%

Tool calls / wall latency

Q	Baseline calls	CodeGraph calls	Δ calls	Baseline latency	CodeGraph latency	Δ latency
Q1 route	7.8	6.8	−12.9%	49.8s	51.2s	+2.8%
Q2 middleware	5.0	4.0	−20.0%	41.2s	43.1s	+4.5%
Q3 multi-runtime	35.2	7.0	−80.1% ✓✓	123.7s	58.4s	−52.8% ✓✓
Q4 refactor	19.8	11.8	−40.5% ✓	87.9s	85.9s	−2.2%
Q5 text (ctrl)	2.2	2.0	−11.1%	51.2s	43.5s	−15.0%
Aggregate	14.0	6.3	−55.0%	70.8s	56.4s	−20.3%

Two headline rows: −55% tool calls (real and consistent: CodeGraph averaged fewer tool calls on every single question) and +6.8% cost (CodeGraph is not cheaper on Hono). The latency win (−20%) is real but concentrated: almost all of it is Q3; on Q1/Q2/Q4 latency is within ±5%, and the −15% on the Q5 control is inside the noise.
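If you want to check the aggregate rows yourself, they behave like unweighted means of the five per-question means. The sketch below recomputes them in Python from the rounded table values; because it starts from rounded inputs rather than the raw CSV, the last digit can differ slightly from the published rows.

```python
# (baseline, codegraph) per-question means, copied from the tables above.
COST_USD = {
    "Q1": (0.321, 0.393),
    "Q2": (0.212, 0.303),
    "Q3": (0.490, 0.348),
    "Q4": (0.402, 0.509),
    "Q5": (0.267, 0.253),
}
TOOL_CALLS = {
    "Q1": (7.8, 6.8),
    "Q2": (5.0, 4.0),
    "Q3": (35.2, 7.0),
    "Q4": (19.8, 11.8),
    "Q5": (2.2, 2.0),
}


def aggregate(table):
    """Unweighted mean across questions for each arm, plus the percent change."""
    n = len(table)
    baseline = sum(b for b, _ in table.values()) / n
    codegraph = sum(c for _, c in table.values()) / n
    delta_pct = (codegraph - baseline) / baseline * 100
    return baseline, codegraph, delta_pct


cost = aggregate(COST_USD)
calls = aggregate(TOOL_CALLS)
print(f"cost:  {cost[0]:.3f} -> {cost[1]:.3f} ({cost[2]:+.1f}%)")
print(f"calls: {calls[0]:.1f} -> {calls[1]:.1f} ({calls[2]:+.1f}%)")
```

One consequence of the unweighted mean: Q3 alone moves the tool-call aggregate a long way, which is why the per-question rows matter more than the headline.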

The variance story — CodeGraph bounds the worst case

Averages hide the most interesting result. Baseline tool-call counts, per repeat:

Q	Baseline (4 repeats)	CodeGraph (4 repeats)
Q3 multi-runtime	14, 23, 52, 52	5, 6, 8, 9
Q4 refactor	9, 10, 13, 47	9, 10, 12, 16

On the broad questions, baseline grep+Read occasionally spiraled — the agent without an index wandered to 47-52 tool calls chasing files. Across all 40 runs, baseline ranged from 2 to 52 tool calls; CodeGraph ranged from 1 to 16. A large part of CodeGraph's value here isn't the mean — it's that it bounds the worst case. When the structural answer is one graph query away, the agent can't spiral. That's a reliability property, not just an efficiency one, and it doesn't show up in a single average.

A caveat on sample size: this is 4 repeats on one repo. Treat the magnitudes as indicative and the directions as robust — CodeGraph used fewer tool calls in every question and nearly every repeat, and the cost direction was consistent within cells (more expensive on Q1/Q2/Q4, cheaper only on Q3). What I would not over-read is the exact percentages; a 47-vs-9 baseline spread on Q4 means the per-question means carry real uncertainty even at n=4.
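The worst-case point is easier to see next to the means. This short Python sketch summarizes the per-repeat counts from the table above; the summarize helper is just an illustration, and with n=4 the spread is a description of these runs, not an estimate of anything.

```python
from statistics import mean

# Per-repeat tool-call counts, copied from the variance table above.
RUNS = {
    ("Q3", "baseline"): [14, 23, 52, 52],
    ("Q3", "codegraph"): [5, 6, 8, 9],
    ("Q4", "baseline"): [9, 10, 13, 47],
    ("Q4", "codegraph"): [9, 10, 12, 16],
}


def summarize(counts):
    """Mean, worst case, and max-minus-min spread for one cell."""
    return {
        "mean": mean(counts),
        "worst": max(counts),
        "spread": max(counts) - min(counts),
    }


summary = {cell: summarize(counts) for cell, counts in RUNS.items()}
for (question, arm), stats in summary.items():
    print(question, arm, stats)
```

On Q4 the two arms have nearly the same best three repeats; the entire difference in the mean comes from the single 47-call run.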

Per-question narrative

Q1 (route resolution) — CodeGraph used, but more expensive. Both arms traced app.fetch → #dispatch → this.router.match() and read SmartRouter → RegExpRouter. CodeGraph used codegraph_context + codegraph_trace (2 calls/run) and cut tool calls 13% and tokens 40% — but cost rose 22.5% and latency was flat. The structural context it front-loaded was heavier than the 1-2 grep steps it saved. Hono's router is small enough (5-6 files for the full picture) that grep finds it directly.

Q2 (middleware chain trace) — used, 43% more expensive. CodeGraph landed the app.use → middleware array → compose() chain in 4 tool calls vs baseline's 5, but cost jumped 43%. Same mechanism as Q1, more pronounced: the call-chain context payload dominated a question baseline answered cheaply in 5 small steps. The clearest example of "fewer steps, more dollars."

Q3 (multi-runtime adapter) — the unambiguous win. Enumerating 6 adapter directories (Cloudflare Workers / Deno / Bun / Node / AWS Lambda / Vercel Edge) is exactly where one graph query beats many Glob+grep iterations. Baseline averaged 35 tool calls and 124s (and spiraled to 52 twice); CodeGraph: 7 calls, 58s, −29% cost. This is the question shape where structural retrieval pays back on every axis at once — and the only one where it saved money.

Q4 (refactor impact: add requestId to Context) — tool calls down 40%, cost up. This was supposed to be CodeGraph's strongest case (codegraph_impact is built for blast-radius). It did cut tool calls 40% (and tamed baseline's 47-call spiral), but cost rose 26.5%: the impact-graph walk pulled wide context the agent didn't fully need at Hono's size. Completeness was comparable across arms (both identified the Context constructors in src/hono-base.ts, the Variables plumbing, and the per-method handler signatures). Fewer, more-bounded steps, but the propagation graph isn't wide enough here to pay back on dollars.

Q5 (text search, control) — the agent declined the tool, and that's the point. On a pure literal-'Content-Type' search, the agent never invoked CodeGraph in any of the 4 repeats — it reached straight for grep. Result: near-parity (−5% cost, −15% latency, both inside the noise). The old version of this post claimed an "FTS5 fallback" win here; that was an artifact of the broken first run. The truth is simpler and better: with CodeGraph connected and available, the agent correctly chose grep for a grep-shaped task. No over-engineering. That's the table-stakes behavior you actually want from a retrieval tool, and it's worth more than a fabricated 1-step saving.

Cross-validation with CodeGraph's published 7-repo benchmark

Metric	Published (7 repos)	This test (Hono, n=4)	Reproduces?
Tool calls	−71%	−55%	✓ Yes — same ballpark
Latency	−46%	−20%	~ Directionally, ~half the magnitude
Tokens	−57%	−23%	~ Directionally, smaller
Cost	−35%	+6.8%	✗ No — opposite sign

The tool-call reduction is the part that generalizes cleanly to a repo the team didn't pick. The cost reduction is the part that doesn't — and that's not an attack on CodeGraph, it's a statement about repo size. Their published suite skews large (VS Code is 30k+ files; Tokio is mid-thousands), and their own published table is non-monotonic in file count — Gin (~110 files) shows a 21% cost win while OkHttp (~645 files) shows ~2%, and Tokio (~790 files) shows 82%. Repo size matters, but it isn't a clean threshold; question shape matters at least as much. A single repo can't locate a universal crossover. What Hono shows is one clear data point: at ~280 files, the step-count win is already here, the dollar win isn't yet.

Decision matrix — install CodeGraph when

Signal	Install CodeGraph	Skip (baseline grep+Read is enough)
You care about fewer agent steps / lower latency	✓ (−55% tool calls even on a small repo)
You're optimizing dollar cost on a sub-~500-file repo	(may cost slightly more)	✓
Repo is large (low thousands of files+)	✓ (dollar win should appear too)
Workload is broad multi-file navigation / architecture	✓ (this is where it wins on every axis)
Workload is narrow single-symbol lookups	(fewer steps, but not cheaper)	(grep is fine)
Static-typed (TS / Rust / Go / Java / Swift / C#)	✓
Dynamic-typed (Python / Ruby / untyped JS)	⚠️ partial (tree-sitter misses runtime semantics)
Workflow is text-search dominant	(no penalty — the agent declines the tool)	✓
Agent reliability matters (bounding worst-case exploration)	✓ (caps the 50-tool-call spirals)
You're not sure	install it; ~10 min, <2s to index a Hono-sized repo, uninstall is one command

Key call from the data: at Hono's scale the reason to install CodeGraph is fewer steps, lower latency, and bounded worst-case exploration — not a lower bill. If your decision rule is purely dollars-per-query on a small repo, baseline grep+Read is still competitive. If it's agent speed, predictability, or you're working in a larger codebase, the index earns its place.

What I'd want to test next

Larger TS repo head-to-head — same 5 questions on Prisma (~2,000 TS files) or TanStack Query (~600 files) to find where the dollar crossover actually is. Hypothesis: cost flips negative somewhere in the high hundreds to low thousands of files.
Dynamic-typed repo — same 5 questions on FastAPI or Django REST to see how much of the step-count win survives when tree-sitter can't resolve dynamic dispatch.
Long-session compounding — single-question runs miss the multi-turn agent context. Does the per-query step saving compound across a real session, or stay linear?

All three are future posts. None of them blocks the install-or-not decision, which the data above already answers.

One-line verdict

On TypeScript / Rust / Go projects, install CodeGraph if you want fewer agent steps, lower latency, and bounded worst-case exploration — those reproduce on an independent repo. Don't install it expecting a lower bill on a small codebase: at Hono's ~280-file scale it was ~7% more expensive, and in this benchmark a cost win appeared only on broad multi-file navigation (Q3) — the published −35% likely needs much larger repos.

For the architectural deep-dive on why this class of tool works and where the abstractions leak, see the companion Lab piece Agent Retrieval Above the Crossover: A First-Principles Read of CodeGraph (publishing 2026-06-08).

For the broader cost-curve framework this benchmark applies, see the prior Lab post: Agent Retrieval Is a Cost Curve Problem.

Appendix: Benchmark Questions

The 5 prompts used, verbatim. Each was sent to Claude Code in a fresh session (claude --print --model opus), 4 times per arm. No follow-up prompts within a run.

Q1 — Route resolution

When a request hits GET /users/:id in a Hono app, walk me through how Hono's routing finds and invokes the right handler. Where in the source does the URL → handler matching happen, and what data structure stores the route table?

Q2 — Middleware chain trace

Hono middleware is chained via app.use(middleware). When a request flows through several middlewares before hitting the handler, what's the actual call stack from the incoming request to the handler? Specifically — how does Hono ensure middleware runs in order, and how is c (context) + next passed between them?

Q3 — Cross-file abstraction navigation

Hono supports multiple runtime adapters (Cloudflare Workers, Deno, Bun, Node, AWS Lambda, Vercel Edge). How is this multi-runtime abstraction implemented? What's the shared interface, and where do the per-runtime adapters live? Show me the architecture.

Q4 — Refactor impact

Imagine I'm planning to add a mandatory new property requestId: string to Hono's Context class. What files and functions across the codebase would be affected? Give me the full blast radius — where Context is constructed, where it's typed in signatures, and where mandatory-property additions would break.

Q5 — Text search (control)

Find every place in the Hono codebase where the literal string 'Content-Type' (the exact HTTP header name, case-sensitive) appears. Include source code, tests, comments, and documentation.

Scoring & environment

Each question was evaluated on cost (total_cost_usd), tokens (input+output, cache tracked separately), wall latency, unique tool-call count, and a manual correctness/completeness check (both arms agreed on the same answer on every question). Every CodeGraph run was additionally checked for connected MCP status and actual codegraph_* tool invocation.

Environment: Hono @ commit 2cbeadda (2026-05-28) · CodeGraph 0.9.7 · Claude Code 2.1.159 · model claude-opus-4-8 (Opus 4.8) · macOS. Raw per-run data (cost, tokens, tool calls, latency, connection status) for all 40 runs: CSV.

Appendix: Why there's no third arm (knowing)

I intended to include knowing (Blackwell Systems) as a third arm. In headless batch mode its MCP server connected only intermittently — knowing advertises asynchronous tools/listChanged, which races Claude Code's MCP startup window, so on most runs the agent never saw knowing's tools and silently fell back to grep+Read.

Reporting task-cost numbers from runs where the tool wasn't actually in the loop is exactly the trap that invalidated my first attempt at this benchmark (see Verifying the tool actually ran), so I'm not publishing knowing figures. That's a limitation of my batch harness, not a verdict on knowing — a persistent / pre-warmed MCP host or an interactive session would likely fix it. If I get a reliable knowing setup, I'll benchmark it on its own terms and publish separately.

Lab companion (first-principles architectural read of CodeGraph and the class of tools it represents): Agent Retrieval Above the Crossover, publishing 2026-06-08.
Prior Lab post in the retrieval / memory mini-series: Agent Retrieval Is a Cost Curve Problem
Companion Lab post on cross-session memory: Agent Memory Is a Cache Coherence Problem
CodeGraph repo: https://github.com/colbymchenry/codegraph

Agent Memory Is a Cache Coherence Problem

Harrison Guo — Fri, 29 May 2026 15:17:06 +0000

This post is one half of a pair. The other half — Agent Retrieval Is a Cost Curve Problem — argues that Claude Code's within-session code retrieval avoids RAG because the cost curve says it should. This piece argues something parallel about cross-session memory: the lossy auto-capture systems being marketed as "AI memory" are, in classical distributed-systems vocabulary, caches. They inherit every problem caches have always had, and the hype around them is mostly arguing for one side of a write-back vs write-through trade as if the other side didn't exist.

Sequel to Consistency in Distributed Systems: Scenarios, Trade-offs, and What Actually Works. If you remember that piece, you'll recognize the move: take a problem space the AI community is debating with fresh vocabulary, and notice that the database community already mapped the failure modes thirty years ago.

tl;dr — Cross-session agent memory varies on two independent axes, not one: fidelity (lossless vs lossy) and retrieval (exact lookup vs approximate vector). Claude Code's built-in memory plus a hand-written CLAUDE.md lives at lossless + exact. The currently-trending claude-mem (70k+ GitHub stars as of May 2026) lives at lossy + approximate — auto-capture passed through a Haiku compression step and recalled via SQLite-FTS5 + Chroma vectors. The second is, structurally, a cache: a derived lossy view of the source of truth, retrieved approximately. It inherits every cache problem the distributed-systems literature already named — staleness, wrong-row retrieval, no coherence with the source. I ran claude-mem under controlled conditions and compared it against the deterministic CLAUDE.md baseline; the numbers (and the kinds of failures) line up with the classical cache framing. The most interesting failure isn't tokens or latency. It's that compression flattens modality — a hedged hypothetical becomes a flat fact, indistinguishable from a firm decision. An agent confidently acting on a maybe-it-said-yes is worse than an agent with no memory at all.

The Two Axes (Don't Collapse Them Into One)

Most takes on agent memory collapse the design space onto a single axis: "lossless and limited" vs "lossy and powerful." That framing hides the failure modes.

The real space is two-dimensional:

Fidelity — lossless (verbatim, what-you-wrote-is-what-was-stored) vs lossy (LLM-compressed: a summary written by a smaller model over the raw events).
Retrieval — exact / curated (you wrote an index entry; the system reads it back) vs approximate / semantic (vector embeddings; cosine similarity; top-K nearest neighbors).

The interesting failures live in the bottom-right quadrant — lossy + approximate — because the failures of one axis are invisible to a user evaluating along the other. The system loses information at write time and approximates at read time, and the user sees a single "answer" that fused both losses. Debugging means asking: was that wrong because the original event was corrupted in compression, or because retrieval surfaced the wrong row? You usually can't tell.

Most takes conflate "lossy = unreliable" with "vector = powerful." They're orthogonal. You can have lossless + vector (raw logs, vector-indexed — fine but storage-heavy and signal-weak). You can have lossy + exact (compressed summaries, FTS-indexed — works for some applications). Lossy + approximate is what's being marketed as "AI memory," and it's the quadrant most exposed to compounding failure modes.

The Baseline: Lossless + Exact (`CLAUDE.md` + built-in memory)

Claude Code's built-in memory system, paired with a hand-written CLAUDE.md, sits firmly in the lossless + exact quadrant. The design choices, made explicit:

Verbatim storage. What the author wrote is what gets stored. Markdown in, Markdown out. There's no compression step.
Always-loaded index + on-demand body. MEMORY.md (the index) gets injected into every session — capped, deliberately, around 200 lines to avoid context bloat. Individual memory files are read on demand, when the index entry suggests one is relevant.
Curated, not a firehose. A human (or a structured prompt) decides what is worth storing. Not every tool call. Not every file read. Only the durable, surprising, cross-session-useful facts.
Exact recall. The model reads a specific file. Either it's there and is read verbatim, or it isn't. No fuzzy near-matches; no confidence score.
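The index-plus-body pattern can be sketched in a few lines. This is not Claude Code's implementation: the directory layout, the index line format, and the function names `load_index` and `read_body` are assumptions made for illustration, and the 200-line cap is treated as a hard limit for simplicity.

```python
import tempfile
from pathlib import Path
from typing import Optional

INDEX_LINE_CAP = 200  # the post's approximate cap; treated here as a hard limit


def load_index(memory_dir: Path, cap: int = INDEX_LINE_CAP) -> list:
    """Return at most `cap` lines of the always-loaded index."""
    lines = (memory_dir / "MEMORY.md").read_text(encoding="utf-8").splitlines()
    return lines[:cap]


def read_body(memory_dir: Path, name: str) -> Optional[str]:
    """Exact recall: the file is read verbatim, or the answer is None."""
    path = memory_dir / name
    return path.read_text(encoding="utf-8") if path.is_file() else None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical index entry format: "- <file>: <one-line hint>"
    (root / "MEMORY.md").write_text("- storage.md: where short URLs live\n", encoding="utf-8")
    (root / "storage.md").write_text("Use Redis for URL storage.\n", encoding="utf-8")

    index = load_index(root)
    hit = read_body(root, "storage.md")
    miss = read_body(root, "caching.md")
    print(index, repr(hit), miss)
```

The miss returns `None` rather than a near match, which is the "no fuzzy near-matches; no confidence score" property in miniature.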

Mental model: a hand-written WAL (write-ahead log) plus a curated index. Both are close to a source of truth — the author's deliberate decision — and they recall exactly what was committed.

Tradeoffs are visible from this framing:

✅ Precision: 100%. What you stored is what you get back.
✅ Auditability: you can read the file yourself. No black box.
✅ Token economics: index sits in context; bodies fetched only when needed. Cheap.
❌ Coverage: limited to what the author bothered to write down.
❌ Upkeep: manual. Memories rot; updating them is a chore.

The earlier post in this series, Claude Code Deep Dive Part 4: Why It Uses Markdown Files Instead of Vector DBs, walks through the specific design choices in the publicly circulated build snapshot. Here I'll focus on what happens when you put the lossless+exact baseline next to the lossy+approximate contender.

The Contender: Lossy + Approximate (`claude-mem`)

claude-mem is among the highest-starred entries in the agent-memory category right now (70k+ GitHub stars as of May 2026). I tested v13.2.0. The architecture, summarized:

Auto-capture firehose. Lifecycle hooks (SessionStart, UserPromptSubmit, PostToolUse, Stop, SessionEnd) fire on essentially everything the model does. The hooks pipe events to a Bun worker on localhost:37701.
Compression to facts/narrative. At session boundaries the worker invokes Haiku 4.5 to compress raw observations into structured facts (a JSON array) and a narrative (a paragraph). This compression runs on your own Claude subscription — billed to your quota, ~5,150 compression tokens per session in my test.
Hybrid index. Compressed observations are written to two indexes simultaneously: SQLite-FTS5 (full-text keyword) and Chroma (vector embeddings). Recall is hybrid — keyword and ANN together.
Disables built-in memory. Installation sets CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 in ~/.claude/settings.json. The built-in CLAUDE.md path is turned off; claude-mem is meant to replace, not augment.
Hardcoded data dir. Despite respecting CLAUDE_CONFIG_DIR for plugin config, the data store path (~/.claude-mem) is hardcoded. Sandboxing is partial.

Mental model: a derived, lossy materialized view of session events, plus a similarity cache for retrieval. Two layers of approximation: a lossy write transform (Haiku compression) and an approximate read transform (ANN). Each compounds the other.

This is exactly what the bottom-right quadrant looks like in deployment.

The Test (Real Numbers, 2026-05-20)

To make the comparison concrete I built a small URL-shortener as the test bed: simple enough that the "right answer" was unambiguous, structured enough that real architectural decisions had to be recorded. The setup:

Test arm: claude-mem v13.2.0, sandboxed via CLAUDE_CONFIG_DIR=/tmp/cmem-test/dot-claude. Built-in memory disabled (per install default). Session 1 established four decisions about the codebase; Session 2 asked for them back. This is the arm I measured.
Baseline: the same four decisions written into a hand-curated CLAUDE.md — 1,075 chars, ~269 tokens. Built-in memory intact. The baseline numbers in the table below are deterministic properties of how the built-in CLAUDE.md path works (verbatim recall, no extra round-trip, no compression bill), not a separately measured session.

The decisions in Session 1 (so the comparison is fair):

Use Redis for URL storage.
Generate short codes with base62.
Add a 30-day TTL.
"We could refresh the TTL on each access if we want sliding expiration." (Note the hedge — this is the modality test that matters.)

Numbers, side by side:

Measure	claude-mem (v13.2.0)	bare CLAUDE.md
Per recall-cycle tokens	~6,700	~280
↳ Passive context injection on session start	~1,050	~269 (full file)
↳ `mcp-search` retrieval round-trip	~502	0
↳ Haiku 4.5 compression cost (charged to your quota)	~5,150 / session	0
Extra round-trip for details	yes (~22s)	no
Fidelity of recall	lossy (see below)	100% verbatim
Built-in memory state	disabled (`CLAUDE_CODE_DISABLE_AUTO_MEMORY=1`)	intact
Compression cost	on user's Claude subscription	none
Headless capture (`claude -p`)	zero events	n/a
Upkeep	automatic	manual edit

The token gap — 6,700 vs 280 — is meaningful but not the headline. The headline is the fidelity row.
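For readers who want to check the arithmetic, the per-cycle total is the sum of the three component rows in the table. The snippet below only re-adds the reported figures; the variable names are mine.

```python
# Per-recall-cycle token components for claude-mem, as reported in the table.
claude_mem = {
    "passive_injection": 1_050,
    "mcp_search_round_trip": 502,
    "haiku_compression": 5_150,
}
baseline_tokens = 280  # bare CLAUDE.md, per the table

total = sum(claude_mem.values())
ratio = total / baseline_tokens
compression_share = claude_mem["haiku_compression"] / total

print(total)                        # 6702
print(round(ratio, 1))              # 23.9
print(round(compression_share, 2))  # 0.77
```

Roughly three quarters of the per-cycle cost is the compression step, so the gap is mostly a write-side cost rather than a read-side one.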

The Sharpest Failure: Compression Flattens Modality

The four decisions written in Session 1 included three firm choices and one hedge:

"We use Redis... base62 short codes... 30-day TTL... we could refresh the TTL on each access if we want sliding expiration."

When I read the raw observation row that claude-mem wrote, the four items appeared as a flat JSON array — facts: [...] — with no modal marker distinguishing the hedge from the decisions. The hedge had been flattened into the same shape as the firm choices.

Session 2 confirmed it. I asked the recalling agent to describe the TTL design. It cheerfully reported "we refresh the TTL on each access for sliding expiration" — as though that had been decided. When I challenged it directly, its own reply was that it could not distinguish firm decisions from options that had merely been considered. The compressed facts[] row it was reading from preserved the content of each item but not its modal status — what I'll call its epistemic status throughout the rest of this post.

That's the failure. The lossy layer loses epistemic status, not just bytes. A maybe becomes a decision. The recalling agent has no way to know it shouldn't trust the row.

This is worse than no memory. An agent with no memory has to ask, or reread, or check the code. An agent with confident-wrong memory acts. The cost of acting on a fabricated decision compounds: now there's code (or a PR, or an architectural note) committed under the false premise, and that will be the next round's input.

The generalization: any LLM compression step that maps "speech-act varieties" (decisions, hypotheses, questions, jokes, hedges) onto a single typed structure — like a facts[] array — loses the modal axis. To preserve it, you'd need to compress into a richer schema (with kind: 'decision' | 'option' | 'question' per item), and you'd need the compression model to reliably tag the modality. Haiku 4.5 didn't tag it. Whether a more careful prompt or schema would is an open question, but it's a design question the current tool doesn't even pose.
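A minimal sketch of what the richer schema could look like, using the three `kind` values named above. This is not claude-mem's schema; the `Fact` type, the helper names, and the hand-assigned tags are assumptions for illustration. Whether a compression model would assign these tags reliably is exactly the open question.

```python
from dataclasses import dataclass
from typing import Literal

Kind = Literal["decision", "option", "question"]


@dataclass(frozen=True)
class Fact:
    text: str
    kind: Kind


# What a flat facts[] array keeps: content only.
flat_facts = [
    "Use Redis for URL storage",
    "Generate short codes with base62",
    "Add a 30-day TTL",
    "Refresh the TTL on each access for sliding expiration",
]

# The same items with the modal axis preserved (tags assigned by hand here).
tagged_facts = [
    Fact("Use Redis for URL storage", "decision"),
    Fact("Generate short codes with base62", "decision"),
    Fact("Add a 30-day TTL", "decision"),
    Fact("Refresh the TTL on each access for sliding expiration", "option"),
]


def firm_decisions(facts: list) -> list:
    """Return only the items an agent may act on without asking."""
    return [f.text for f in facts if f.kind == "decision"]


def render(fact: Fact) -> str:
    """Render a fact for the recalling agent with its status spelled out."""
    prefix = {
        "decision": "DECIDED",
        "option": "CONSIDERED, NOT DECIDED",
        "question": "OPEN QUESTION",
    }
    return f"[{prefix[fact.kind]}] {fact.text}"


print(firm_decisions(tagged_facts))
print(render(tagged_facts[3]))
```

In the flat list the fourth item is indistinguishable from the first three; in the tagged list the recalling agent can filter it out or surface it with the hedge intact.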

Six Measured Findings, Versioned to v13.2.0

In one place, six things I measured. Versioned because tool behavior changes:

Replaces, not augments. Install sets CLAUDE_CODE_DISABLE_AUTO_MEMORY=1. Built-in memory is turned off. Default deployment is single-system, not hybrid.
Partial sandbox. CLAUDE_CONFIG_DIR redirects plugin config, but the data store path ~/.claude-mem is hardcoded. Multi-tenant isolation is incomplete.
Compression runs on your subscription. Haiku 4.5 compression consumed ~5,150 tokens per session in my test, billed to your Anthropic quota. Free tools that consume your paid quota deserve a footnote.
Invisible to headless mode. claude -p runs (non-interactive) emit zero capture events in my tests. The lifecycle hooks fire only in interactive sessions. CI users and automation pipelines get no memory at all.
Compression flattens modality (the sharpest finding, detailed above). A hedge becomes a flat fact, indistinguishable from a firm decision in the compressed facts[] schema.
Token economics lose on small projects. ~6,700 tokens per recall cycle (passive inject + mcp-search round-trip + Haiku compression) versus ~280 deterministic, 100%-faithful tokens for the CLAUDE.md baseline. On a 1,000-line project, that roughly 24x token overhead outweighs any retrieval benefit the tool provides.

Tool-version drift is real — by the time you read this, some of these may have been fixed. The cache-coherence framing in the next section is version-independent and was the actual reason I wrote the post.

Why This Is a Cache Problem, Precisely

The distributed-systems vocabulary for this design is a materialized view of a source, served from a similarity cache.

Write-time lossy transform = a materialized view that can drift from the source of truth (the actual codebase, the actual decisions). The source is the user's intent and the live code; the view is the compressed facts/narrative. Each write step can lose information that the view will never recover.
Read-time ANN = approximate retrieval. Top-K nearest neighbors. False positives are structural, not a bug — a sufficiently-similar wrong row will be returned with confidence indistinguishable from the right one.
No coherence with the source. Classical caches have invalidation hooks — write-through, write-back, snoop protocols, MESI states. They tie cache lines back to the canonical source so that writes propagate and stale lines get evicted or rewritten. claude-mem has no tie to the codebase or to user-issued corrections. You reverse a decision in conversation, the memory still believes the original.
Staleness without expiry. Even without explicit invalidation, classical caches use TTLs to bound staleness. claude-mem has no TTL on facts. A fact written six months ago competes for retrieval with one written yesterday, and the older one might win the vector hop.
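The "false positives are structural" point can be shown with plain cosine similarity. The three-dimensional vectors below are hand-made stand-ins for embeddings, not output from Chroma or any real embedding model, and `top_k` is a brute-force helper written for this example rather than an ANN index.

```python
import math


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hand-made 3-d vectors standing in for embeddings; values are illustrative.
rows = {
    "TTL is fixed at 30 days": [0.90, 0.40, 0.10],
    "TTL refreshes on each access": [0.88, 0.45, 0.12],
    "Short codes use base62": [0.10, 0.20, 0.95],
}
query = [0.89, 0.43, 0.11]  # stands in for "how does the TTL work?"


def top_k(query_vec, table, k=2):
    """Brute-force nearest neighbors: score every row, keep the best k."""
    scored = [(cosine(query_vec, vec), text) for text, vec in table.items()]
    return sorted(scored, reverse=True)[:k]


results = top_k(query, rows)
for score, text in results:
    print(f"{score:.4f}  {text}")
```

Both TTL rows come back with nearly identical scores even though they contradict each other. Nothing in the similarity score says which one reflects the current decision.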

When the cache-coherence frame is the right frame, the literature is rich and useful. Pat Helland's Immutability Changes Everything (ACM Queue, 2015) and the broader databases-and-OS literature on cache-coherence protocols (MESI / MOESI), materialized-view invalidation, and write-through vs write-back are the right starting reading. The trades they describe — staleness vs cost, eventual vs strong coherence, when to flush, when to invalidate — are the same trades the agent-memory community is rediscovering with fresh names.

What's crucially missing from this picture is a back-edge from the source of truth to the materialized view, one that fires when the source changes. That's the invalidation arrow. Its absence is the structural reason claude-mem gets wrong-row retrieval on decisions the user has reversed. Until something supplies that arrow, the system is best understood as a write-only cache.
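One way to picture the missing arrow is the sketch below. It is a toy design, not something claude-mem implements: each derived entry remembers a digest of the source text it came from, and a change notification flags entries whose source has moved on. All class and method names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class ViewEntry:
    fact: str
    source_key: str     # which piece of the source this was derived from
    source_digest: str  # digest of the source at derivation time
    stale: bool = False


class MaterializedView:
    def __init__(self) -> None:
        self.entries = []

    def derive(self, source_key: str, source_text: str, fact: str) -> None:
        self.entries.append(ViewEntry(fact, source_key, digest(source_text)))

    def on_source_change(self, source_key: str, new_text: str) -> int:
        """The invalidation arrow: flag entries whose source has moved on."""
        flagged = 0
        new_digest = digest(new_text)
        for entry in self.entries:
            if entry.source_key == source_key and entry.source_digest != new_digest:
                entry.stale = True
                flagged += 1
        return flagged

    def recall(self) -> list:
        """Serve only entries that have not been flagged stale."""
        return [e.fact for e in self.entries if not e.stale]


view = MaterializedView()
view.derive("ttl-policy", "TTL is 30 days, fixed.", "30-day fixed TTL")
view.derive("storage", "URLs live in Redis.", "Redis for URL storage")

flagged = view.on_source_change("ttl-policy", "TTL is 7 days, fixed.")
print(flagged, view.recall())
```

The hard part in practice is not the flagging logic but deciding what counts as the source and who calls `on_source_change`; the sketch assumes both are given.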

When Each Wins

Cost-curve thinking (the same frame used in the companion piece) gives a clean answer:

Lossless + Exact wins when:

Project size is small or scope is clear. Curation is cheap; the manual upkeep budget is small.
Fidelity matters. You need to recall the exact decision, not a vibe of it. Coding agents, design decisions, security policy.
The author exists and is engaged. Someone is willing to write three lines into MEMORY.md when a decision is made.

Lossy + Approximate wins when:

History is too big to hand-curate. A year of conversations across multiple contributors, none of whom can be expected to maintain a MEMORY.md.
Coverage matters more than precision. You'd rather have a fuzzy memory that something was discussed than no memory at all. Customer-support agents over a year of tickets; team retrospectives over a quarter of standups.
The cost of acting on a fabricated fact is low. A confident-wrong recall in a support agent says "sorry let me check"; the user corrects it. The same recall in a coding agent ships broken code to production. The blast radius of a false positive determines the budget for accepting one.

The rule of thumb: fuzzy-but-present beats precise-but-absent, but only when the false-positive cost is low enough to absorb. For coding work on a 5,000-LOC project, it isn't.

A Decision Framework

Signal	Lossless + Exact	Lossy + Approximate	Hybrid
Single project, < 50k LOC	✓
Multi-project / multi-year history		✓	✓
Decisions need exact recall	✓
Vague-but-present recall is acceptable		✓
Author is engaged (willing to curate)	✓		✓
No human curator available		✓
Cost of confident-wrong is high (production code, money)	✓
Cost of confident-wrong is low (suggestion, search)		✓
You can pay the Haiku compression bill from your quota	n/a	✓	✓
You operate headlessly or via CI	✓

For most readers of this blog — engineers working on a single non-trivial codebase, where decisions matter and confident-wrong is expensive — the columns lean hard left.

What a Coherent Agent Memory System Would Need

The interesting question, once you accept the cache-coherence frame, is: what would the lossy + approximate corner look like if it were built like a real cache instead of a write-only one?

A short list of capabilities the current generation of "AI memory" tools is missing, and which any serious system in this space will eventually have to ship:

Source pointer (provenance). Every fact carries a back-pointer to the originating event: timestamp, session ID, the raw transcript turn or tool result it was derived from. Without this, you can't audit a wrong recall — you only see the fact, never its lineage.
Modality tagging. Every fact tagged with epistemic status — decision | option_considered | hypothesis | question | observation — at write time, by the compression model. Without this, the system loses what the failure section above showed: the difference between we will and we could.
Supersedes / invalidation chain. A later fact can declare an earlier fact superseded ("decision A was reversed on date T by B"). Recall surfaces the latest applicable fact, not the most semantically similar one. This is the in-band invalidation classical caches use; agent memory currently has none.
Expiry / TTL by class. Decisions might be permanent; observations rot fast ("the build was passing this morning" should not influence behavior at 4 PM). Different fact classes get different TTLs.
Invalidation hook tied to the source of truth. When the underlying codebase or document changes in a way that contradicts a stored fact, the fact gets flagged for re-validation. This is the invalidation arrow described in the cache section earlier, which is currently absent.
Confidence surfaced to the caller. Instead of returning a flat string, return {value, confidence, provenance}. The recalling agent then knows when to trust, when to double-check, when to ignore.
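Several items on this list fit in one small record type. The sketch below combines a supersedes pointer, per-class TTLs, and a recall result shaped as `{value, confidence, provenance}`. Field names, TTL values, and the confidence numbers are all illustrative assumptions, not a description of any existing tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# TTL per fact class; None means no expiry. Values are illustrative.
# Kinds missing from this table are treated as non-expiring.
TTL_BY_KIND = {
    "decision": None,
    "option_considered": timedelta(days=90),
    "observation": timedelta(hours=4),
}


@dataclass
class Record:
    id: str
    value: str
    kind: str
    created: datetime
    session_id: str                      # provenance: where it came from
    confidence: float
    superseded_by: Optional[str] = None  # id of the record that replaced it


def is_live(rec: Record, now: datetime) -> bool:
    if rec.superseded_by is not None:
        return False
    ttl = TTL_BY_KIND.get(rec.kind)
    return ttl is None or now - rec.created <= ttl


def recall(records: list, now: datetime) -> list:
    """Return live records, newest first, with confidence and provenance."""
    live = [r for r in records if is_live(r, now)]
    live.sort(key=lambda r: r.created, reverse=True)
    return [
        {
            "value": r.value,
            "confidence": r.confidence,
            "provenance": {"session": r.session_id, "created": r.created.isoformat()},
        }
        for r in live
    ]


now = datetime(2026, 5, 20, 16, 0)
records = [
    Record("a", "30-day fixed TTL", "decision", datetime(2026, 5, 1, 9, 0), "s1", 0.9, superseded_by="b"),
    Record("b", "7-day fixed TTL", "decision", datetime(2026, 5, 18, 9, 0), "s7", 0.9),
    Record("c", "the build was passing", "observation", datetime(2026, 5, 20, 9, 0), "s9", 0.6),
]
recalled = recall(records, now)
print([r["value"] for r in recalled])
```

The reversed decision and the morning's build observation both drop out at 4 PM, and what remains carries enough lineage for the caller to audit it.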

None of these is novel. All of them are standard in production cache systems, query planners, and event-sourcing stores. They're hard to retrofit onto a system that wasn't designed with provenance and invalidation as first-class concerns. They're not hard to design in from the start — but doing so means giving up the "just drop a hook on everything, ship next week" simplicity that makes the current crop of tools accumulate stars.

If you're building an agent that has to act on its memory rather than just display it, this list is the spec.

Closing — "Persistent Memory" Is a Trade, Not a Feature

"Persistent AI memory" gets talked about like a feature you turn on. It isn't. It's a choice on the two-axis design space above, and every position on that space has known failure modes. The lossless-and-exact corner has the upkeep cost and the coverage limit. The lossy-and-approximate corner has the staleness, the wrong-row retrieval, and — the finding I came away most surprised by — the loss of modality.

Two takeaways worth carrying with you:

If someone is selling you "automatic AI memory," ask which quadrant. If the answer is lossy + approximate, ask the six questions from the section above: provenance, modality, supersedes, TTL, invalidation, confidence. If the answer to most of them is "the embeddings handle it," you're being sold a cache wearing the word memory.
The classical literature is the right starting point. Cache coherence, write-back vs write-through, eventual vs strong consistency, materialized-view invalidation — the database and OS communities have spent forty years working through these tradeoffs. Reading their writing is more useful than reading the latest agent-memory thread.

The companion to this post — Agent Retrieval Is a Cost Curve Problem — argues a parallel thing about within-session code retrieval: that Claude Code's "use grep, not RAG" choice isn't romance ("trust the model") but math (cost curves), and that the source code shows Anthropic A/B-testing alternative retrieval architectures (Explore vs Fork) in production. Read together, the two pieces add up to a coherent stance about Anthropic's bets across the fidelity × retrieval design space: in both within-session code search and cross-session memory, the default is lossless + exact, and the alternative branches are kept gated behind feature flags so the decisions can flip when the cost curves do. The memory side alone has at least four such gates visible in the snapshot I reviewed — tengu_coral_fern, tengu_herring_clock, tengu_passport_quail, tengu_slate_thimble — plus the build-time KAIROS, TEAMMEM, and EXTRACT_MEMORIES gates in src/memdir/.

That stance is not "we trust the model." It's: read the cost curves, build for the curves' current shape, leave the toggles in for when they shift.

Appendix: How I Measured This

For the reader who wants to reproduce — or, more usefully, who wants to know exactly what was and wasn't measured.

Versions and environment.

claude-mem v13.2.0 (npm install via the project's standard install script)
Claude Code: current public release at time of test (2026-05-20)
macOS 25.4.0, zsh
Sandbox: CLAUDE_CONFIG_DIR=/tmp/cmem-test/dot-claude for the plugin config; data store at ~/.claude-mem (this directory location is hardcoded inside the tool — see Finding #2)

Test project.
A small URL-shortener spec written from scratch, with four decisions in Session 1: (1) Redis for URL storage, (2) base62 short-code generation, (3) 30-day TTL, (4) the hedged sliding-expiration option that became the modality test.

Commands and protocol.

Fresh claude-mem install into the sandboxed CLAUDE_CONFIG_DIR. Verified that the install set CLAUDE_CODE_DISABLE_AUTO_MEMORY=1 in settings.json (Finding #1).
Session 1: interactive Claude Code session, walked through the four decisions with the tool actively capturing via lifecycle hooks. Watched the Bun worker on localhost:37701 accept events.
Session ended; session-boundary compression fired; Haiku-compressed facts[] and narrative written to SQLite-FTS5 + Chroma. Token count for the compression call read from the API trace.
Session 2: fresh interactive session; queried for each of the four decisions; observed the recall path (mcp-search round-trip; ~22s extra latency).
Compared retrieved content against the original Session 1 transcript byte-by-byte to identify the modality flattening (Finding #5).
Repeated the install + Session 1 pattern in headless claude -p mode to confirm Finding #4 (no events captured).

What I'm explicitly not claiming.

This is a single test run on a deliberately small project. The N is 1.
The CLAUDE.md baseline column in the test table is not a separately measured comparison session. It reflects deterministic properties of the built-in CLAUDE.md path (verbatim recall, no compression bill, no extra round-trip) that follow from the design — not a measured outcome.
I didn't benchmark Chroma vector recall quality across many queries. The modality finding came from a single targeted probe (the TTL question); the cache-coherence framing predicts the same class of failure across many queries, but predicting and measuring are different.
I tested v13.2.0. The tool is actively developed; specific findings may have been addressed in later releases by the time you read this. The cache-coherence framing is version-independent.

Artifacts. Session 1 / Session 2 transcripts, the raw claude-mem SQLite + Chroma snapshots, and the token-cost API traces from the test run are kept locally and can be made available on reasonable request — get in touch.

Companion piece: Agent Retrieval Is a Cost Curve Problem: Why Claude Code Doesn't Use RAG
Background: Consistency in Distributed Systems: Scenarios, Trade-offs, and What Actually Works
For the design rationale behind Claude Code's built-in memory in particular: Claude Code Deep Dive Part 4: Why It Uses Markdown Files Instead of Vector DBs

Agent Retrieval Is a Cost Curve Problem: Why Claude Code Doesn't Use RAG

Harrison Guo — Tue, 26 May 2026 03:12:20 +0000

There's a popular interview question making the rounds: "Why doesn't Claude Code use RAG to retrieve code? Why grep?"

The popular answer goes: chunking breaks code structure, vectors approximate when code demands exact, indexes go stale, cold-start is slow, retrieval is a black box. All five are real. None of them are the reason.

They're symptoms. The reason is older than RAG, older than LLMs, older than the term retrieval. It's a cost curve.

tl;dr — Index-based retrieval pays a high build cost plus a nonlinear maintenance cost in churn × index complexity. LLM tool-loop retrieval pays nothing up front and a per-query cost that's roughly project-size-independent for queries an LLM actually issues. For most small-to-mid-size repos the crossover is never reached. The "Anthropic trusts the model" framing is romantic; the actual answer is colder — build cost zero, per-query cost amortizes faster than index drift, so the math says grep.

There's also a precision axis, which most engineers care about more than cost. Vector RAG is approximate by design — getUserById returns alongside getUserByEmail because they're semantically adjacent. Code usually wants exact, which grep gives you for free. Symbol-graph indexes (Sourcegraph, Kythe, LSP) are precision-first but haven't become the LLM companion either — covered below.

Audited against a publicly circulated build snapshot of Claude Code with file:line citations. The kicker: the Explore subagent's "use this when you'll need more than 3 queries" rule is gated behind a feature flag (tengu_amber_stoat) and being A/B-tested against a parallel architecture (Fork). The canonical answer is conditional. That is the answer that gets you the offer.

The Frame That Matters: Total Cost Over Time

Pick any retrieval system and you pay for three things, on different schedules:

Build cost — one-time work to assemble whatever structure makes lookup fast. For an index, this is chunking + embedding + insert. For tool-loops, this is zero.
Maintain cost — ongoing work to keep the structure honest as the underlying data changes. For an index, this is invalidation, reindex, drift reconciliation. For tool-loops, this is also zero — the "structure" is the live filesystem.
Per-query cost — work done when a question arrives. For an index, this is a vector search + a few reranks + an LLM call. For tool-loops, this is N LLM-tool round-trips, where N varies.

The temptation is to compare per-query cost: "vector search is one round-trip, tool-loop is six." That's why RAG looks dominant on a whiteboard. But you ship a system, not a whiteboard. The bill is:

total_cost = build_cost + maintain_cost × time + per_query_cost × queries

For a project that changes daily and gets queried hourly, the term that actually grows is maintain_cost × time. For index-based retrieval on a churning codebase, maintain cost grows at least linearly with churn — and can grow faster than teams expect, because cross-chunk and cross-file references force cascading re-embeddings and symbol-graph consistency checks. A naive incremental indexer is linear; a correct one tracking cross-file refactors is often worse than linear in the worst case. For tool-loops, maintain cost is identically zero, because the loop has no persistent structure.

The build/maintain term dominates anything you save on per-query cost, until your project is large enough that per-query cost itself becomes the bottleneck. For most small-to-mid-size repos, the crossover is never reached.
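The equation is easy to play with in code. The sketch below plugs hypothetical costs (arbitrary units, not measurements) into the formula and computes the steady-state query volume at which an index would start to pay for itself. The function and field names are mine.

```python
import math


def total_cost(build: float, maintain_per_day: float, per_query: float,
               days: float, queries: float) -> float:
    """total_cost = build_cost + maintain_cost × time + per_query_cost × queries"""
    return build + maintain_per_day * days + per_query * queries


def crossover_queries_per_day(index: dict, loop: dict) -> float:
    """Daily query volume above which the index is cheaper in steady state.

    Ignores the one-time build cost, so it is a lower bound on the real
    crossover. Returns infinity if the index never wins per query.
    """
    saving_per_query = loop["per_query"] - index["per_query"]
    if saving_per_query <= 0:
        return math.inf
    extra_maintain = index["maintain_per_day"] - loop["maintain_per_day"]
    return extra_maintain / saving_per_query


# Hypothetical costs in arbitrary units; none of these are measurements.
index = {"build": 500.0, "maintain_per_day": 60.0, "per_query": 1.0}
loop = {"build": 0.0, "maintain_per_day": 0.0, "per_query": 3.0}

days, queries_per_day = 30, 24  # changes daily, queried hourly
costs = {}
for name, c in (("index", index), ("tool-loop", loop)):
    costs[name] = total_cost(c["build"], c["maintain_per_day"], c["per_query"],
                             days, days * queries_per_day)
    print(name, costs[name])

print(crossover_queries_per_day(index, loop))
```

With these made-up numbers the tool-loop is cheaper over the month even though each of its queries costs three times as much, and the index only wins above 30 queries a day. Change the maintenance figure and the crossover moves with it, which is the whole point of framing this as a curve.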

That's the whole argument. The rest of this post is evidence — what the cost-curve choice looks like in source code, and where Anthropic is hedging.

A teaser for the punchline two sections down: the canonical "Claude Code spawns an Explore subagent for open-ended search" rule that most explainers quote is gated behind a feature flag (tengu_amber_stoat) and A/B-tested in production against a second architecture (Fork) that takes the opposite trade. Anthropic is hedging on the retrieval design itself. We'll come back to this in "The Subagent Twist Nobody Quotes Correctly."

The Popular Answer, Charitably

Before deflating it: the popular answer isn't wrong, it's just downstream. Briefly, with the steel-manned version:

Chunking breaks structure. A function split across chunks loses both halves of an if/else, and call-graph relationships fragment between chunks. AST-aware chunkers exist; they're better, not solved.
Vectors approximate. getUserById returns alongside getUserByEmail and getUserByName because they're semantically adjacent. Exact-symbol search beats this trivially. Important distinction: this is true of vector RAG specifically. Symbol-graph indexes — Sourcegraph, Kythe, Glean, LSP-backed code search — are a different category: they index by function/class/reference, not by chunked vector. They give exact answers and are what mega-monorepos at Google and Meta actually run for code search. When this post says "RAG," it means vector RAG. The cost-curve argument applies to vector RAG; symbol-graph has its own cost curve and lives higher up the scale spectrum. It has its own separate reason for not becoming the LLM companion — covered in the next section.
Indexes go stale. Every commit invalidates some subset of chunks. Incremental update has edge cases (renames, file moves, cross-file rename refactors). Full reindex is expensive enough to discourage frequent commits.
Cold start. Minutes-to-first-query is a non-starter for "open the tool, start working" UX.
Black-box recall. Top-K vector hits are not human-auditable. When the LLM returns a wrong answer, you can't tell whether retrieval failed or reasoning did.

All five are pains. Each has a counter — better chunkers, hybrid retrieval, incremental indexers, warm pools, attribution layers. The counters cost engineering. Some teams spend the engineering and ship working systems. Anthropic looked at the bill and decided not to.

Why? Because the baseline for code retrieval — grep over a clean filesystem with an LLM in the loop — already works well enough that adding an index is paying engineering cost to solve symptoms whose root cause is the index itself. Removing the index removes the pains. The remaining cost is per-query LLM round-trips, which the cost-curve frame says is acceptable below the crossover.

That's why grep. Everything else is engineering details on top of that decision.

So Why Hasn't Symbol-Graph Become the LLM-Companion Either?

If symbol-graph indexes are precision-first, language-aware, and battle-tested at FAANG scale, the natural question is: why didn't they become the default companion to LLM coding agents? Why grep, not LSP-over-MCP?

The answer is the same shape as the vector-RAG answer — high friction in places that don't show up on a feature comparison — but the specific frictions are different:

Build cost is high in a different way. Symbol-graph indexes need to compile (or semi-compile) the project to resolve symbols. For Rust, C++, large TypeScript or Java codebases this is minutes to tens of minutes per cold start. "Open Claude Code and start working" can't pay that toll.
Language-specific, not portable. LSP is one server per language. Tree-sitter coverage helps but isn't uniform. A grep-backed agent works on any text in any language with zero setup; a symbol-graph-backed agent inherits the project's language-server matrix.
API/format mismatch with how LLMs reason. LSP returns deeply nested JSON (locations, ranges, document hierarchies); grep returns file:line: content. The second is almost literally an LLM's native dialect; the first needs adapting. The translation tax is real.
Coverage is narrower than it looks. Symbol-graph models code-as-structure. It misses config files, comments, strings, generated code, markdown, env files, shell scripts, the README — all of which are first-class context for a real coding session. Grep covers anything that's text.
The win is for structure questions, not intent questions. "Where is getUserById defined?" — symbol-graph is exactly right. "How does the login flow work?" — back to grep + read. A real coding day has both kinds; building infrastructure that only solves one kind is paying a high fixed cost for half the answer.
The constraint reversed under it. Symbol-graph was designed for a world where the constraint was human attention bandwidth — give a developer one precise answer they can read. LLMs don't have that constraint; they can read 30 grep hits cheaply and reason across them. The bottleneck moved from "precision of retrieval" to "fluency of the model reading the retrieval." Symbol-graph is optimizing the part that's no longer expensive.

One-line summary: symbol-graph is precision tooling built for human IDEs. LLMs are not human IDEs. Their retrieval bottleneck is different — they prefer many cheap rounds over one expensive precise call. Installing a symbol-graph for an agent that can grep thirty times in a session is, roughly, hiring a second driver for someone who already drives.

This is also why the few existing LLM ↔ symbol-graph integrations (Cursor's @symbol references via LSP, Sourcegraph Cody, Codeium with LSP backend) are additive niceties in those products, not the retrieval backbone. The backbone is still the same as Claude Code's — grep over text.

The Three Primitives, Audited

Source paths below are from a publicly circulated build snapshot of Claude Code (not an official public release) that I have on disk for analysis purposes. APIs and exact line numbers will drift; the design choices below have been stable in the snapshot I reviewed, and match observed runtime behavior on current public Claude Code releases.

Grep — ripgrep, with structured output and a "don't shell out" enforcement clause

The Grep tool description, from src/tools/GrepTool/prompt.ts:7-16, is short and pointed:

A powerful search tool built on ripgrep

  Usage:
  - ALWAYS use Grep for search tasks. NEVER invoke `grep` or `rg` as a Bash command. The Grep tool has been optimized for correct permissions and access.
  - Supports full regex syntax (e.g., "log.*Error", "function\s+\w+")
  - Filter files with glob parameter (e.g., "*.js", "**/*.tsx") or type parameter (e.g., "js", "py", "rust")
  - Output modes: "content" shows matching lines, "files_with_matches" shows only file paths (default), "count" shows match counts
  - Use Agent tool for open-ended searches requiring multiple rounds
  - Pattern syntax: Uses ripgrep (not grep) - literal braces need escaping (use `interface\{\}` to find `interface{}` in Go code)
  - Multiline matching: By default patterns match within single lines only. For cross-line patterns like `struct \{[\s\S]*?field`, use `multiline: true`

Three things to read out of this:

1. The ALWAYS / NEVER is doing work. The model has Bash. It could shell out to rg or grep directly. The prompt forbids it. Why? Three reasons, in order of importance:
   - Permission surface. Bash is a universal tool. Auditing what the model can do with Bash means auditing every shell command. Audited Grep means audited Grep, period.
   - Output discipline. A Bash rg dumps raw text into the context window. Grep returns one of three structured modes (content, files_with_matches, count), letting the model pick the cheapest mode that answers the question. This is a token-budget decision, not a feature decision.
   - Backend swap. Today the implementation is ripgrep. Tomorrow it might be bfs/ugrep (the source already has a branch for embedded-search tools; see hasEmbeddedSearchTools() in src/utils/embeddedTools.ts). A tool boundary makes the swap invisible to the model.
2. The default output mode is files_with_matches. Not content. The model has to opt in to seeing actual lines. This is a token-conservation default: most of the time, the model wants to know which files matched so it can narrow further; only when it's ready to read does it ask for the lines.
3. Multiline is opt-in. Default ripgrep is line-bounded, a deliberate restriction, because cross-line regex on a large tree is a perf cliff. The model can opt into multiline when it knows it needs to, paying the cost only then.

These are all small choices. Each one shaves a tail off the per-query cost. Cumulatively, they keep the per-query term in our cost equation low enough that the zero build/maintain term decides the comparison.
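To make the output-discipline point concrete, here is a minimal sketch in Go (not the TypeScript of the snapshot) that renders the same hits in the three modes named in the prompt. The Match type, the render function, and the output formatting are all hypothetical; the only thing it illustrates is how much text each mode puts into the context window.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Match is a hypothetical search hit: one matching line in one file.
type Match struct {
	File string
	Line int
	Text string
}

// render formats the same hits in one of three modes. The mode names mirror
// the ones quoted in the prompt; the formatting itself is illustrative only.
func render(matches []Match, mode string) string {
	perFile := map[string]int{}
	for _, m := range matches {
		perFile[m.File]++
	}
	files := make([]string, 0, len(perFile))
	for f := range perFile {
		files = append(files, f)
	}
	sort.Strings(files)

	var b strings.Builder
	switch mode {
	case "content": // every matching line: the most expensive mode
		for _, m := range matches {
			fmt.Fprintf(&b, "%s:%d:%s\n", m.File, m.Line, m.Text)
		}
	case "count": // one line per file with its hit count
		for _, f := range files {
			fmt.Fprintf(&b, "%s:%d\n", f, perFile[f])
		}
	default: // "files_with_matches": paths only, the cheap default
		for _, f := range files {
			b.WriteString(f + "\n")
		}
	}
	return b.String()
}

func main() {
	hits := []Match{
		{"src/auth/login.ts", 12, "export async function login(req: Request) {"},
		{"src/auth/login.ts", 40, "  return login(retry)"},
		{"src/routes.ts", 7, "router.post('/login', login)"},
	}
	for _, mode := range []string{"files_with_matches", "count", "content"} {
		fmt.Printf("%-18s %3d bytes\n", mode, len(render(hits, mode)))
	}
}
```

Even on three hits the content mode is several times larger than the path-only mode, and the gap widens with every additional matching line.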

Glob — filename patterns with a recency heuristic and a hard cap

Glob is the "find files by name" primitive. Two design tells in its construction:

Results are sorted by mtime descending. Most-recently-modified file first. The heuristic is that in any given session, the files you've touched recently are the files you're about to touch again. This is the same logic IDEs use for the "Recent Files" list, and it's empirically right far more than it's wrong.
A hard cap of 100 results. Past that, output truncates. The model can tighten the pattern and re-call. The cap exists because an LLM that consumes 800 file paths because it asked for **/* is an LLM that's burned a quarter of its context on noise.

Both are token-budget decisions disguised as ergonomic ones. The mtime sort means a small N typically covers the relevant set. The 100-file cap means a careless query degrades gracefully instead of catastrophically.
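The two Glob behaviors compose into a few lines. This is a sketch under assumptions, not the tool's implementation: FileInfo, rankAndCap, and the use of Unix seconds for modification time are hypothetical names chosen for illustration; only the newest-first ordering and the cap of 100 come from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// FileInfo is a hypothetical record for one matched path.
type FileInfo struct {
	Path    string
	ModUnix int64 // modification time as Unix seconds
}

// maxResults mirrors the 100-result cap described above.
const maxResults = 100

// rankAndCap sorts newest-first and truncates to limit, reporting whether
// anything was cut off so the caller knows to tighten the pattern.
func rankAndCap(files []FileInfo, limit int) (kept []FileInfo, truncated bool) {
	sorted := append([]FileInfo(nil), files...) // copy; leave the caller's slice alone
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].ModUnix > sorted[j].ModUnix
	})
	if len(sorted) > limit {
		return sorted[:limit], true
	}
	return sorted, false
}

func main() {
	var files []FileInfo
	for i := 0; i < 250; i++ {
		files = append(files, FileInfo{
			Path:    fmt.Sprintf("src/file%03d.ts", i),
			ModUnix: int64(1_700_000_000 - i*3600), // each file one hour older
		})
	}
	kept, truncated := rankAndCap(files, maxResults)
	fmt.Println(len(kept), truncated, kept[0].Path)
}
```

Returning the truncated flag alongside the results is the design choice that matters: a capped result that does not say it was capped would look like a complete answer.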

Read — bounded, fresh, stat-checked

src/tools/FileReadTool/prompt.ts is the most interesting of the three because of what it constrains:

- By default, it reads up to 2000 lines starting from the beginning of the file (...)
- When you already know which part of the file you need, only read that part. This can be important for larger files.

(MAX_LINES_TO_READ = 2000 at line 10; the "only read that part" guidance is OFFSET_INSTRUCTION_TARGETED at lines 20-21, swapped in dynamically based on context.)

A 2000-line default cap. offset and limit parameters for targeted reads. And — critically — every read calls stat on disk. No cache. No index. No staleness layer.

The implication is that Read is always live. The model that just edited a file and wants to see the result reads the file and gets the new bytes. The model that's iterating on a fix doesn't fight cache invalidation because there is no cache to invalidate. This is the "no maintain cost" term in the cost equation, made concrete: the live filesystem is the index, and the filesystem is always up to date with itself.

The OFFSET_INSTRUCTION_TARGETED template is worth noting separately. It's swapped in over the default when the prompting context suggests the model already knows what range it wants. It's a tiny prompt-engineering detail, but it's also a piece of evidence that the team thinks carefully about teaching the model to read selectively. The lesson the model is being taught, every prompt, is don't be greedy. That's exactly the discipline that keeps per-query cost from blowing up.
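A bounded, uncached read is small enough to sketch. The following Go is an illustration under assumptions, not the tool's code: readRange and writeSample are hypothetical helpers, the offset is taken to be a 1-based line number, and the 2000-line default is the only value taken from the source above.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

const defaultLimit = 2000 // mirrors the 2000-line default described above

// readRange stats the file on every call, then returns up to limit lines
// starting at the 1-based line offset. Nothing is cached between calls.
func readRange(path string, offset, limit int) ([]string, error) {
	if _, err := os.Stat(path); err != nil { // fresh stat each time
		return nil, err
	}
	if offset < 1 {
		offset = 1
	}
	if limit <= 0 {
		limit = defaultLimit
	}
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var out []string
	sc := bufio.NewScanner(f)
	for n := 1; sc.Scan(); n++ {
		if n < offset {
			continue // skip lines before the requested window
		}
		if len(out) == limit {
			break // window is full; stop reading
		}
		out = append(out, sc.Text())
	}
	return out, sc.Err()
}

// writeSample creates a temp file with n numbered lines, for the demo only.
func writeSample(n int) (string, error) {
	f, err := os.CreateTemp("", "read-sketch-*.txt")
	if err != nil {
		return "", err
	}
	defer f.Close()
	for i := 1; i <= n; i++ {
		fmt.Fprintf(f, "line %d\n", i)
	}
	return f.Name(), nil
}

func main() {
	path, err := writeSample(5000)
	if err != nil {
		panic(err)
	}
	defer os.Remove(path)
	head, _ := readRange(path, 1, 0)     // default window
	part, _ := readRange(path, 4200, 3) // targeted read
	fmt.Println(len(head), part)
}
```

Because every call goes back to the filesystem, an edit made between two calls is visible to the second one with no invalidation step.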

The composition

Walk through a real query: "Where's the login flow in this project?"

1. Glob `**/login.{ts,tsx,js}`: returns up to 100 files, most-recently-modified first. Usually under ten matches; usually the right one is in the first three.
2. Grep passport|auth|login --glob '': narrows to specific lines. Three output modes available; the model picks the cheapest one that disambiguates.
3. Read the file, with offset/limit targeted at the matched region: reads only what's needed.

Three primitives. One realistic query. Total cost: three round-trips, a few hundred tokens of output total. No index, no embedding, no rebuild. Filesystem unchanged.

Now add ten more iterations as the model investigates a bug. Each one is the same three primitives, in different combinations. The cost grows linearly with iterations; it doesn't grow super-linearly with the codebase. That's the curve.

The Loop, Sketched Honestly

You can find pseudo-code versions of Claude Code's main loop in essays and threads. They all look something like:

while (true) {
  const response = await callLLM(messages);
  if (!response.toolUses.length) break;
  for (const use of response.toolUses) {
    const result = await execute(use);
    messages.push(result);
  }
}

That sketch is conceptually right. It is also missing roughly 1,415 lines.

The real loop is at src/query.ts:307 (while (true) {) and closes at :1728 (} // while (true)). The body is 1,421 lines. The full file is 1,729. If you want a guided tour, the Claude Code Deep Dive Part 2 walks through it section by section. The summary for our purposes here is shorter:

The Platonic loop — call model, run tools, repeat — is six lines. The other 1,415 lines are doing one of four things:

Streaming tool execution. Tool calls don't wait for the full model response before they start running; they execute as the streaming tool_use blocks arrive. This is non-trivial because the model can emit a partial tool_use, retract it, or continue thinking before the rest of the call arrives.
Cache and compaction management. Microcompact maintains tool-result cache coherence by tool_use_id without inspecting content. The auto-compact path triggers when the message stream nears a context budget. Both interact with the prompt cache, which has its own coherence rules.
Failure recovery. A nested while (attemptWithFallback) at line 654 handles model fallback when the primary returns a recoverable error. There's a separate path for max_output_tokens recovery (when the model would emit a tool_use but truncate before the tool_result). Orphan tool_results are aggressively pruned so a retry doesn't carry stale IDs forward.
Thinking-block preservation. Reasoning content has to span the assistant trajectory — same turn, plus any subsequent tool_use/tool_result chain — because the model expects to see its own prior reasoning to continue coherently. The loop preserves these blocks across iterations.

The point is not "the loop is complicated." The point is: the cost of running an agent on tool-loops moves into the loop, where it's controllable. There's no second system — no index pipeline, no vector store, no embedding service — with its own failure modes, latency budgets, and consistency guarantees. The loop is the engineering target.

This is the same property that makes a single-process database easier to operate than a microservices fleet, for reasons that have nothing to do with the database. Concentrate complexity where you can attack it.

The Subagent Twist Nobody Quotes Correctly

This is the section where most explanations of Claude Code's retrieval get it wrong, including the article that prompted this post. The popular telling goes: "For open-ended exploration, Claude Code spawns an Explore subagent — Anthropic codifies the rule that if you need more than three queries, you should fork."

The rule exists. The popular telling has two things wrong about it:

The rule is not in AgentTool/prompt.ts. It's in src/constants/prompts.ts:378-379, injected into the system prompt by template:

For simple, directed codebase searches (e.g. for a specific file/class/function) use [searchTools] directly.

For broader codebase exploration and deep research, use the Agent tool with subagent_type=Explore. This is slower than using [searchTools] directly, so use this only when a simple, directed search proves to be insufficient or when your task will clearly require more than [EXPLORE_AGENT_MIN_QUERIES] queries.

EXPLORE_AGENT_MIN_QUERIES is defined in src/tools/AgentTool/built-in/exploreAgent.ts:59 as the integer 3. So the "more than 3 queries" threshold is literal — but it's a single named constant, easy to change, and the guidance is interpolated dynamically per system-prompt build.

The rule is conditional. The two lines above are only emitted when all three conditions in this template expression hold:

// src/constants/prompts.ts:374-381
...(hasAgentTool &&
areExplorePlanAgentsEnabled() &&
!isForkSubagentEnabled()
  ? [
      `For simple, directed codebase searches ...`,
      `For broader codebase exploration ... subagent_type=Explore ...`,
    ]
  : []),

Beyond hasAgentTool, both areExplorePlanAgentsEnabled() and !isForkSubagentEnabled() have to be true. The first of those is itself feature-gated:

// src/tools/AgentTool/builtInAgents.ts:13-22
export function areExplorePlanAgentsEnabled(): boolean {
  if (feature('BUILTIN_EXPLORE_PLAN_AGENTS')) {
    // 3P default: true — Bedrock/Vertex keep agents enabled (matches pre-experiment
    // external behavior). A/B test treatment sets false to measure impact of removal.
    return getFeatureValue_CACHED_MAY_BE_STALE('tengu_amber_stoat', true)
  }
  return false
}

BUILTIN_EXPLORE_PLAN_AGENTS is a Bun build feature gate. tengu_amber_stoat is a GrowthBook key. (Aside: tengu_* is the Anthropic-internal naming prefix for Claude Code, so any flag you see with that prefix in the source is a Claude Code feature toggle.) The comment in source is the giveaway: "A/B test treatment sets false to measure impact of removal." Anthropic is actively testing what happens when they remove Explore and Plan agents entirely for some fraction of users.

Read that again. The canonical "Anthropic uses Explore for exploration" claim is true for the default treatment and being measured against the no-Explore baseline. The interview answer that says "Anthropic always uses Explore" is wrong — or at least, more confident than the source.

This is the half-credit point. Knowing that the rule exists is the first half. Knowing the rule is feature-gated and under measurement is the second half.
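The gating in this section reduces to a small truth table. Here is a sketch in Go of that logic; the Gates struct, exploreGuidance, and the abbreviated guidance strings are hypothetical stand-ins, and only the three conditions and the threshold of 3 come from the excerpts above.

```go
package main

import "fmt"

// Gates is a hypothetical bundle of the three conditions in the excerpt above.
type Gates struct {
	HasAgentTool        bool
	ExplorePlanEnabled  bool
	ForkSubagentEnabled bool
}

// exploreMinQueries mirrors the named constant described above.
const exploreMinQueries = 3

// exploreGuidance returns the two guidance lines only when the agent tool
// exists, Explore/Plan agents are enabled, and fork subagents are disabled.
// In every other combination the prompt carries no Explore rule at all.
func exploreGuidance(g Gates) []string {
	if g.HasAgentTool && g.ExplorePlanEnabled && !g.ForkSubagentEnabled {
		return []string{
			"For simple, directed codebase searches use the search tools directly.",
			fmt.Sprintf("For broader exploration use subagent_type=Explore when the task needs more than %d queries.", exploreMinQueries),
		}
	}
	return nil
}

func main() {
	for _, g := range []Gates{
		{true, true, false},  // default treatment: rule present
		{true, false, false}, // Explore/Plan removed: rule absent
		{true, true, true},   // fork enabled: rule absent
	} {
		fmt.Printf("%+v -> %d guidance lines\n", g, len(exploreGuidance(g)))
	}
}
```

Only one of the eight possible flag combinations produces the rule, which is why a single observation of behavior says little about what the system always does.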

The Economics of Explore, in Detail

If you read just the function of Explore, the design looks like a generic subagent: open-ended exploration, returns a summary, isolates from the main context. If you read its parameters, the economics jump out:

// src/tools/AgentTool/built-in/exploreAgent.ts:64-83
export const EXPLORE_AGENT: BuiltInAgentDefinition = {
  agentType: 'Explore',
  whenToUse: EXPLORE_WHEN_TO_USE,
  disallowedTools: [
    AGENT_TOOL_NAME,             // can't spawn nested subagents
    EXIT_PLAN_MODE_TOOL_NAME,
    FILE_EDIT_TOOL_NAME,         // read-only
    FILE_WRITE_TOOL_NAME,        // read-only
    NOTEBOOK_EDIT_TOOL_NAME,
  ],
  source: 'built-in',
  baseDir: 'built-in',
  // Ants get inherit to use the main agent's model; external users get haiku for speed
  // Note: For ants, getAgentModel() checks tengu_explore_agent GrowthBook flag at runtime
  model: process.env.USER_TYPE === 'ant' ? 'inherit' : 'haiku',
  // Explore is a fast read-only search agent — it doesn't need commit/PR/lint
  // rules from CLAUDE.md. The main agent has full context and interprets results.
  omitClaudeMd: true,
  getSystemPrompt: () => getExploreSystemPrompt(),
}

Three economic decisions encoded here:

Explore runs on Haiku for external users. Not the main reasoning model. Exploration is a cheap-tokens job — there's no creative reasoning happening, just iterate-and-filter — and Anthropic uses a fast, small, cheap model for it. The main agent gets the expensive model when it gets the summary back. This is the staffing analogue: junior associate does the deposition review, senior partner reads the brief.
Explore omits CLAUDE.md. The argument in the inline comment: "The main agent has full context and interprets results." CLAUDE.md is for project rules — commit conventions, PR style, lint guidance. Explore isn't doing any of those things. Loading CLAUDE.md into its prompt would be paying tokens for guidance it can't act on.
Explore can't spawn subagents. AGENT_TOOL_NAME is in disallowedTools. No recursion. This is a budget guarantee: an Explore call has bounded depth.

And the system prompt makes the read-only constraint impossible to misread:

=== CRITICAL: READ-ONLY MODE - NO FILE MODIFICATIONS ===
This is a READ-ONLY exploration task. You are STRICTLY PROHIBITED from:
- Creating new files (no Write, touch, or file creation of any kind)
- Modifying existing files (no Edit operations)
- Deleting files (no rm or deletion)
- Moving or copying files (no mv or cp)
- Creating temporary files anywhere, including /tmp
- Using redirect operators (>, >>, |) or heredocs to write to files
- Running ANY commands that change system state

(exploreAgent.ts:26-34. The full prompt also requires Explore to spawn parallel tool calls aggressively — "you must try to spawn multiple parallel tool calls for grepping and reading files" — which is yet another budget squeeze: turn N round-trips into one wall-clock round, lower latency, same token bill.)

Pull the threads together and Explore reads like a budgeted exploration worker: small model, no CLAUDE.md noise, no nested recursion, read-only, parallel-first. It is the most economically tuned piece of the retrieval system, and it's there because once you've decided the model drives retrieval, the next question is which model, and what it gets paid to think about.

The popular telling skips all of this. It says "Anthropic spawns Explore." The source says Anthropic spawns the cheapest possible agent that can do the work, strips its context to the minimum that lets it function, and measures whether spawning it at all beats just letting the main agent loop directly. That last clause is the A/B in the previous section.

Fork — The Architecture Under Test

Here's the other half of the experiment.

src/tools/AgentTool/forkSubagent.ts exports isForkSubagentEnabled(). When that returns true, AgentTool/prompt.ts rewrites the system prompt to introduce a new operation: fork.

The fork section, paraphrased from prompt.ts:80-96:

Fork yourself (omit subagent_type) when the intermediate tool output isn't worth keeping in your context. The criterion is qualitative — "will I need this output again" — not task size. Forks are cheap because they share your prompt cache. Don't set model on a fork — a different model can't reuse the parent's cache.

Don't peek. The tool result includes an output_file path — do not Read or tail it unless the user explicitly asks for a progress check. You get a completion notification; trust it.

Don't race. After launching, you know nothing about what the fork found. Never fabricate or predict fork results. The notification arrives as a user-role message in a later turn; it is never something you write yourself.

Read this against Explore and the trade becomes visible:

	Explore	Fork
Context isolation	Fresh subagent, separate context	Inherits parent context
Model	Haiku (cheap)	Same as parent (no swap allowed)
Prompt cache	New cache, no parent reuse	Reuses parent cache (huge savings on the system prompt and prior turns)
CLAUDE.md	Omitted	Inherited
Recursion	Disallowed	Allowed (forks can fork)
Failure model	Subagent returns summary or error	Notification arrives later as user message
Discipline required	Caller frames the query	Caller writes a directive prompt; trusts the notification

Neither dominates. Explore wins when isolation matters more than cost: the exploration is noisy enough that you don't want any of the tool spew in your main context, and the work is mechanical enough that Haiku can handle it. Fork wins when the work needs the main model's depth and prompt-cache reuse makes that depth affordable, accepting that the fork is just a future-you with a directive.

Anthropic is shipping both, gated, and watching which one wins on internal metrics. The interview answer that says "Claude Code uses Explore for exploration" is a snapshot of one branch of the experiment. The next release cycle may settle it differently.

This is exactly the kind of fact you can't get from the popular tellings, because they're written from a single observation of behavior. Reading the source gives you the experimental design.

Version caveat. Flag names (BUILTIN_EXPLORE_PLAN_AGENTS, tengu_amber_stoat, isForkSubagentEnabled) and exact gate semantics reflect the snapshot I reviewed and will drift — Anthropic ships fast. The two specific architectures (Explore, Fork) may be renamed, consolidated, or replaced by the time you read this. What's stable is the pattern: Anthropic running multiple retrieval architectures in parallel, gated, measuring against each other. That pattern outlasts any specific flag name and is what the argument here actually rests on. Verify against the current release before depending on any specific flag.

When RAG Still Wins

The cost-curve thesis predicts when RAG crosses over and dominates. Three regimes:

1. Mega-monorepos where any kind of index is amortized across millions of queries

At Google or Meta or a top-tier hedge fund — codebases large enough that every per-query latency point costs real money across the org — there is an index. But, importantly, the index they actually run is almost always symbol-graph (Kythe, Glean, Sourcegraph) rather than vector RAG, for the precision reasons in the earlier section. The index is built once, maintained by a dedicated team, amortized across the engineering population's queries forever. Per-query cost goes to single-digit milliseconds; index drift is handled.

In this regime, tool-loops are paying the same per-query cost over and over across N engineers, and the math flips toward indexing. Below that scale, the staffing cost of "a dedicated team that owns the code index" is itself a hidden cost that wipes the savings — and the index of choice at smaller scales (per-project Sourcegraph, ctags) is still not vector RAG. Vector RAG specifically tends to win in the next two regimes, not this one.

2. Pure semantic queries where there's no symbol to grep for

"Find the code that handles user-deactivation edge cases when the account is also a billing admin" — there's no specific symbol. There's a conceptual region of the code that you're looking for. A vector search over function-doc embeddings might point you to the right cluster of functions faster than the model would grep.

Claude Code's answer to this is: spawn Explore, let it iterate. That works, but it isn't free. If your workload is dominated by semantic-cluster queries (auditing, security review, refactor planning), RAG starts to pencil out.

3. When LLM context is the scarce resource — cheap, short-context model regime

This is the framing nobody else gives. Cost-curves cut both ways. If your model has a 32k context window and costs $0.50/M tokens, doing six tool round-trips for retrieval is six rounds of context consumption. A one-shot RAG hit lets you spend that context budget on reasoning. RAG dominates when per-query token cost dominates per-query latency cost.

Anthropic optimized Claude Code for the regime where context is cheap and abundant (Opus 4.7 with a 1M-token window, in some configurations) and per-query latency is what users feel. In a different regime — say, on-device coding agents over a small open-source model — the cost-curve flips and RAG is the right tool. Same first principles, different numerical answer.

A Decision Framework

Match your project to the column. The cost-curve answers the question for you.

Signal	Use grep + LLM tool-loop	Use RAG	Either / hybrid
Project size: under ~1M lines	✓
Project size: 1M–100M lines	✓		✓
Project size: 100M+ lines (mega-monorepo)		✓
Codebase changes daily	✓
Codebase is mostly static (knowledge base, archived)		✓
Queries are exact-symbol ("find getUserById")	✓
Queries are conceptual ("how does auth work")		✓	✓
Model context is cheap and large	✓
Model context is scarce/expensive		✓
You don't have a team to own a code index	✓
You have an index team and tooling already (Sourcegraph, Glean, ctags)		✓	✓

The crossover doesn't happen at any one threshold; several signals pointing the same way are what tip the cost curve. For most non-FAANG projects, the signals sit on the grep side.

The Companion Question

This post is one half of a two-part argument about Claude Code's retrieval and memory choices. The companion — "Agent Memory Is a Cache Coherence Problem" (publishing 2026-05-28) — makes the same kind of argument about cross-session memory: why Claude Code's built-in memory is hand-written Markdown instead of vector-recalled embeddings, even with the world hyping claude-mem (70k+ GitHub stars as of May 2026) as a drop-in upgrade.

Read together, the two pieces add up to a coherent design stance. Anthropic's bet, across both axes:

Fidelity over fuzz. Both within-session retrieval (grep) and cross-session memory (CLAUDE.md) are lossless and exact. Both refuse vector approximation as the default.
Cost curves over romance. Neither choice is justified by "we trust the model." Both are justified by the math: zero build cost + zero maintain cost beats nonlinear maintain cost for the workloads they target.
Experimentation in production. Both architectures have alternative branches under active flag gating. On the retrieval side: tengu_amber_stoat for Explore-vs-no-Explore, with Fork as a parallel architecture. On the memory side: tengu_coral_fern, tengu_herring_clock, tengu_passport_quail, tengu_slate_thimble, plus the build-time gates KAIROS, TEAMMEM, and EXTRACT_MEMORIES — all visible in src/memdir/. The pattern is the same: ship a default, leave the toggles in, keep measuring.

The shape of the design is a careful refusal to lock in. The cost curves favor the current choices today. They might not in a year. The system is built to flip.

That, finally, is the answer the interview question is fishing for. Why not RAG? Because the cost curves don't justify it for this workload, and Anthropic has the engineering culture to refuse the cargo-cult. Will it always be that way? No — and the feature flags in the source say so out loud.

Companion piece: Agent Memory Is a Cache Coherence Problem (publishing 2026-05-28)
Background: Consistency in Distributed Systems: Scenarios, Trade-offs, and What Actually Works
For a focused walk through the loop file: Claude Code Deep Dive Part 2: The 1,421-Line While Loop

Channels Aren't Message Passing — How Parked Goroutines OOM-Killed a Pod

Harrison Guo — Thu, 14 May 2026 05:26:27 +0000

It's 3am. The Kafka consumer pod that's been running cleanly for six weeks gets OOM-killed. Kubernetes restarts it. Five minutes later: OOM-killed again. Restart. OOM-killed a third time. By the fourth restart I've shelved the dashboard and started reading runtime/chan.go.

The code that died fit on one line:

events := make(chan Event)

I want to tell you that line is the bug. It isn't. An unbuffered channel will happily backpressure a single producer: every send rendezvouses with a receiver, so the producer cannot run ahead. The channel did exactly what it was designed to do.

What I had built around it didn't. The Kafka consumer loop wrapped events <- parseEvent(msg) inside a go func(msg) { ... }(msg), spawning a fresh goroutine per inbound message. Every one of those goroutines blocked on send, parked on the channel's sendq list, and kept its stack and the parsed event alive in memory. The channel was the gravestone. The unbounded go func fan-out was what filled it.

This is the story of what a Go channel actually is at the runtime level, why "channels are message passing" is one of the most expensive lies in the Go ecosystem, and why the most common channel bug isn't in the channel — it's in the code that calls into it.

tl;dr — A Go channel is not a queue and not a message bus. It's a heap-allocated hchan struct containing a mutex, a ring buffer, and two parked-goroutine lists. The send operation is a memcpy under a lock, not a transmission. Channels only deliver backpressure if the producer side is bounded. The OOM that started this story came not from make(chan Event) — that was working as designed — but from an unbounded go func(msg) fan-out parking thousands of goroutines on sendq, each retaining a 10KB payload. The fix isn't a buffer size. It's making backpressure part of the producer contract: a single long-lived producer with select-based backoff, plus a bounded queue as a safety net. The same architectural mistake shows up at every layer where engineers reach for an "in-process queue" — including the inbound queue of your AI agent.

The Mental Model That Killed The Pod

Here is what I thought a channel did, and I suspect most Go engineers carry some version of this picture:

"A channel is like a Kafka topic in-process. Producers push messages onto it. Consumers pull messages off it. The runtime handles ordering and delivery. It's CSP — Communicating Sequential Processes — Hoare's thing, basically a typed pipe."

Every word of that sentence is wrong in a way that matters. There is no topic. Nothing is being pushed anywhere. The runtime is not a broker. The word passing — borrowed from message-passing concurrency, where independent processes communicate across an isolation boundary — is the most misleading part. In a Go channel, there is no isolation boundary. There is one struct on the heap, and both goroutines reach in and mutate it.

I held the message-passing model long enough that when the Kafka consumer started ingesting a 12-hour upstream replay at full throttle, it never occurred to me to ask whether the messages were going somewhere bounded. They weren't. They were sitting in parked sender goroutines on the channel's sendq, a list that has no size limit at all.

What A Channel Actually Is

Crack open runtime/chan.go in the Go source tree and you'll find this (these core fields have been stable for many releases; recent Go versions add a few more, such as a timer pointer, that don't change the picture):

type hchan struct {
    qcount   uint           // total data in the queue
    dataqsiz uint           // size of the circular queue
    buf      unsafe.Pointer // points to an array of dataqsiz elements
    elemsize uint16
    closed   uint32
    elemtype *_type
    sendx    uint           // send index
    recvx    uint           // receive index
    recvq    waitq          // list of recv waiters
    sendq    waitq          // list of send waiters
    lock     mutex
}

That's it. That's the channel. A struct with a mutex, a pointer to a circular byte array, two indices to track read/write positions in the ring, and two intrusive linked lists holding parked goroutines that are waiting to send or receive.

When you write ch <- value, the runtime calls chansend, which does roughly this:

1. Take the lock (lock(&c.lock)).
2. Check recvq: is there a goroutine already parked waiting to receive? If yes, copy value directly from the sender's stack into the receiver's stack via sendDirect, mark the receiver runnable with goready, release the lock, return. No buffer involved: when a receiver is already waiting, send can hand off directly without ever touching the ring buffer. (In normal operation a buffered channel can't simultaneously have queued data AND parked receivers; if recvq has a waiter, the buffer is empty.)
3. Otherwise, check buffer space: if qcount < dataqsiz, copy value into buf[sendx], advance sendx, increment qcount, release the lock, return.
4. Otherwise, park the sender: append the sender's goroutine to sendq, release the lock, and call gopark to suspend execution until a receiver wakes it up.

Receive is the mirror image, calling chanrecv with sendq and recvq swapped.
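The send path above can be modeled in a few lines. This is a teaching sketch, not the runtime's code: toyChan is a hypothetical type, parked receivers are modeled as pointers to destination slots, parked senders as the values they hold, and nothing actually suspends. It only shows which of the three branches a send takes.

```go
package main

import (
	"fmt"
	"sync"
)

// toyChan is a simplified model of the fields described above.
type toyChan struct {
	mu     sync.Mutex
	buf    []int  // ring buffer; len(buf) plays the role of dataqsiz
	qcount int    // number of values currently buffered
	sendx  int    // next write position in the ring
	recvq  []*int // parked receivers, modeled as destination slots
	sendq  []int  // values held by parked senders; note: no size limit
}

// send reports which path the send took: "handoff", "buffered", or "parked".
func (c *toyChan) send(v int) string {
	c.mu.Lock()
	defer c.mu.Unlock()

	if len(c.recvq) > 0 { // a receiver is waiting: copy straight to it
		dst := c.recvq[0]
		c.recvq = c.recvq[1:]
		*dst = v
		return "handoff"
	}
	if c.qcount < len(c.buf) { // room in the ring: copy into the next slot
		c.buf[c.sendx] = v
		c.sendx = (c.sendx + 1) % len(c.buf)
		c.qcount++
		return "buffered"
	}
	c.sendq = append(c.sendq, v) // no room: the real sender would park here
	return "parked"
}

func main() {
	unbuffered := &toyChan{}
	fmt.Println(unbuffered.send(1)) // nobody waiting, no buffer

	var slot int
	unbuffered.recvq = append(unbuffered.recvq, &slot)
	fmt.Println(unbuffered.send(2), slot) // receiver waiting

	buffered := &toyChan{buf: make([]int, 2)}
	for i := 0; i < 4; i++ {
		fmt.Println(buffered.send(i)) // two fit, the rest queue up
	}
	fmt.Println("parked senders:", len(buffered.sendq))
}
```

Notice that the only slice in the model with a fixed length is buf. The sendq slice grows for as long as senders keep arriving.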

Three things are worth burning into memory:

One — there is no transport. The "message" never leaves the heap. Sender writes bytes; receiver reads bytes; the lock arbitrates. This is shared-memory synchronization with the appearance of message passing.

Two — the buffer is just a ring of typed slots. dataqsiz is set exactly once, at make(chan T, N) time. If you write make(chan T), dataqsiz is zero and there is no buffer at all — every send must rendezvous with a receiver or park.

Three — sendq is unbounded. This is the part nobody talks about. The ring buffer has a fixed size. The list of parked senders waiting to write into the ring buffer does not. If a thousand goroutines all hit a full channel, the runtime parks all thousand of them on sendq and each one keeps its stack and any data it was about to send alive in memory.

That third point is what gave my OOM its shape: the memory was not in the channel's buffer, it was in the senders queued up behind it.
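The unbounded sendq is easy to observe directly. Below is a minimal sketch, assuming a hypothetical parkSenders helper and a 1KB payload chosen only for illustration: it starts n goroutines that each send on an unbuffered channel nobody reads, then counts how many goroutines are still alive.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parkSenders starts n goroutines that each try to send on an unbuffered
// channel with no receiver. It returns how many goroutines that added and
// a release function that drains the channel so they can exit.
func parkSenders(n int) (added int, release func()) {
	ch := make(chan [1024]byte) // unbuffered; each sender holds a 1KB payload
	before := runtime.NumGoroutine()

	var started sync.WaitGroup
	started.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			var payload [1024]byte
			started.Done()
			ch <- payload // no receiver yet: this goroutine parks on sendq
		}()
	}
	started.Wait() // every sender exists and cannot finish until received
	added = runtime.NumGoroutine() - before

	release = func() {
		for i := 0; i < n; i++ {
			<-ch // each receive wakes one parked sender
		}
	}
	return added, release
}

func main() {
	added, release := parkSenders(10000)
	fmt.Println("goroutines added:", added)
	release()
}
```

The channel never refuses a sender. Nothing in the runtime caps how many goroutines may wait, so the only bound is the one the calling code imposes on how many it spawns.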

The Incident, Mechanism By Mechanism

The pod that died had a goroutine topology that looked like this — and the bug is not the make(chan Event) line. Watch the outer loop:

events := make(chan Event)

// Consumer — slow.
go func() {
    for ev := range events {
        process(ev) // ~3ms per event
    }
}()

// THE ACTUAL BUG: outer loop spawns a fresh goroutine per inbound message.
for msg := range kafkaConsumer.Messages() {
    go func(msg kafka.Message) {
        events <- parseEvent(msg) // every blocked send parks on sendq
    }(msg)
}

If you replace the inner go func(msg) { ... }(msg) with a direct events <- parseEvent(msg), the outer loop itself becomes the producer, and the unbuffered channel correctly backpressures it — the loop simply doesn't advance until the consumer is ready. No OOM.

But because each message is dispatched to a fresh helper goroutine, the outer loop never blocks. It keeps spawning. Each helper goroutine reaches the send, finds no waiting receiver, and parks on sendq. Now sendq is the unbounded thing. Here is what actually happened, in order:

1. Sustained baseline: rendezvous works

At 1K msg/sec inbound and ~3ms per process call (~333/sec consumer throughput), the consumer is already behind by 3x at steady state. For weeks this didn't OOM because the Kafka client's own internal buffer absorbed the gap, and lag built up on the broker side — visible in Grafana, ignored by me.

2. Replay: the producer detaches from the consumer's pace

When upstream re-emitted 12 hours of events, the Kafka client's internal pre-fetch buffer filled to capacity (default fetch.message.max.bytes × partition count = several hundred MB) and started backing up Kafka-side without applying backpressure to the consumer goroutine, because the client library was configured with a large internal queue.

3. The actual heap growth: parked sender goroutines

Each call to events <- parseEvent(msg) on the unbuffered channel would either rendezvous (rare during replay) or park. When it parked, the sender goroutine held:

- Its own stack (a few KB to start; Go's minimum is 2KB, and these grew under load)
- The Event value it was about to send (~10KB per event with strings, headers, payload)
- A reference into the Kafka message it was parsing (another ~10KB)

Multiply by the number of in-flight parsing goroutines — which kept being spawned by an outer loop that didn't apply backpressure to itself — and you arrive at the 12GB heap. The channel's sendq was the proximate memory sink, not the buffer (which was zero-sized).

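Every parsing goroutine went through the same lifecycle: spawned by the outer loop, parse the message, attempt the send, then sit in a Parked_on_sendq state until the consumer finally received its event.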

Every goroutine sitting in Parked_on_sendq is reachable (it's on the runtime's wait queue, which is rooted in the hchan struct, which is rooted by both the producer and consumer goroutines). Reachable means non-collectible. The longer the consumer falls behind, the longer the queue grows.

4. GC can't help

Go's GC can only reclaim unreachable memory. Every parked goroutine on sendq is reachable (it's on the channel's wait queue). Every Event it's holding is reachable. The GC ran, found nothing to free, and the heap continued growing until the kernel OOM-killer fired.

5. The cgroup hammer drops

cgroup memory limit was 4GB. Heap crossed 4GB. OOM kill. Kubernetes restarted the pod. The replay was still in progress on the broker side, so the same sequence ran again. And again.

What this looks like in pprof

You don't have to take my word for the mechanism — it reproduces in under a minute. I built a minimal demo at harrison001/channels-oom-demo (cmd/bug) that runs the same workload shape on a laptop. The output of the bug version over 22 seconds, captured with runtime.NumGoroutine() and runtime.MemStats.HeapAlloc:

t=   1s  goroutines=   497  heap_alloc=     5 MB
t=   5s  goroutines=  2462  heap_alloc=    28 MB
t=  10s  goroutines=  4915  heap_alloc=    61 MB
t=  15s  goroutines=  7369  heap_alloc=    89 MB
t=  20s  goroutines=  9828  heap_alloc=   109 MB
t=  22s  goroutines= 10813  heap_alloc=   125 MB

Goroutine count grows by about 490 per second in this run (roughly one every 2 ms), with no sign of leveling off. Heap grows at ~5MB/sec, dominated by the 10KB Event payload each parked goroutine is holding (490 × 10KB ≈ 4.9MB/sec). Extrapolate to a 12-hour replay at production volume and you arrive at the original 12GB OOM.
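Figures like these can be collected with a few lines of standard-library code. This is a sketch of such a sampler, not the demo repo's actual source; `snapshot` is a hypothetical helper and the output format only imitates the lines above.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// snapshot returns the current goroutine count and the live heap in MB.
func snapshot() (goroutines int, heapMB uint64) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m) // briefly stops the world, so keep the rate low
	return runtime.NumGoroutine(), m.HeapAlloc >> 20
}

func main() {
	start := time.Now()
	for i := 0; i < 3; i++ {
		g, h := snapshot()
		fmt.Printf("t=%4.0fs  goroutines=%6d  heap_alloc=%6d MB\n",
			time.Since(start).Seconds(), g, h)
		time.Sleep(time.Second)
	}
}
```

Sampling once per second is deliberate: `runtime.ReadMemStats` is not free, and a once-a-second reading is enough to see a linear climb.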

For comparison, the fix version (cmd/fix) on the same workload:

t=   1s  goroutines=     3  heap_alloc=     3 MB  chan_len= 256
t=  10s  goroutines=     3  heap_alloc=     4 MB  chan_len= 256
t=  20s  goroutines=     3  heap_alloc=     5 MB  chan_len= 256

Three goroutines (producer, consumer, pprof listener). Heap flat at 3-5 MB. Channel pinned at its 256-slot bound, meaning the producer is constantly blocked on send and applying backpressure upstream — exactly what we want.

The Fix, And Why It Works

The visible code change was one parameter. The real fix was making backpressure part of the producer contract — two changes, working together:

events := make(chan Event, 256) // (1) bounded queue as safety net

// (2) single long-lived producer goroutine with select-based backoff —
// NO outer loop spawning fresh goroutines per message.
go func() {
    for msg := range kafkaConsumer.Messages() {
        select {
        case events <- parseEvent(msg):
            // sent — loop continues at consumer speed when channel fills
        case <-ctx.Done():
            return
        }
    }
}()

The key word in change (2) is single. There is exactly one goroutine reading from Kafka and writing to the channel. When the channel fills, that goroutine blocks on send; the for msg := range loop stops receiving from Messages(); the Kafka client's internal pre-fetch queue stops draining; consumer lag accumulates broker-side; the broker simply retains messages until we come back. No go func(msg) helpers. Nothing piling up on sendq. Memory stays bounded because the producer is bounded — the buffer is only a safety net to absorb short bursts.
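The blocking behavior can be observed directly. The sketch below swaps Kafka for a counter and runs with no consumer at all; `sentBeforeBlocking` is a hypothetical helper that reports how many sends a single producer completes before it stalls.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// sentBeforeBlocking runs one producer against a channel of the given
// capacity with no consumer, waits briefly, and reports how many sends
// completed before the producer blocked.
func sentBeforeBlocking(capacity int) int64 {
	ctx, cancel := context.WithCancel(context.Background())
	events := make(chan int, capacity)
	var sent atomic.Int64
	done := make(chan struct{})

	go func() { // the one long-lived producer
		defer close(done)
		for i := 0; ; i++ {
			select {
			case events <- i:
				sent.Add(1)
			case <-ctx.Done():
				return
			}
		}
	}()

	time.Sleep(50 * time.Millisecond) // producer fills the buffer, then blocks
	cancel()                          // the full channel leaves only ctx.Done ready
	<-done
	return sent.Load()
}

func main() {
	fmt.Println("sends completed with no consumer:", sentBeforeBlocking(256)) // 256
}
```

The producer completes exactly as many sends as the buffer holds and then waits. Memory use is capped by the channel capacity rather than by how long the consumer stays away.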

What this changes, mechanically:

Before (unbounded `go func` fan-out + `make(chan Event)`)	After (single producer + `make(chan Event, 256)`)
One goroutine per inbound message	One long-lived producer goroutine
`sendq` grows unboundedly with parked helpers	`sendq` holds at most one parked sender: the sole producer
No signal to upstream — outer loop never blocks	Producer blocks on send; outer loop runs at consumer speed
Kafka client keeps pre-fetching, lag invisible	Kafka client's internal queue fills, consumer stops polling, broker-side lag accumulates
OOM	Bounded heap, bounded latency, Kafka rebalances cleanly when behind

A bounded channel buffer alone does not prevent this OOM. If you applied change (1) without change (2), you'd merely delay the OOM — the outer go func(msg) fan-out would keep spawning, the buffer would fill in milliseconds, helpers would pile up on sendq exactly as before. Backpressure is not a property of any one component — it is a property of the entire chain having no unbounded buffer (and no unbounded fan-out) anywhere in it.

Every link in this chain is bounded — the database has connection pool limits, the consumer is rate-limited by process() latency, the channel buffer is 256, the Kafka client's internal queue has a configured max, and the broker simply retains messages on disk when its consumer falls behind. When ANY downstream link slows, the pressure propagates back up by the consumer ceasing to pull; the broker doesn't need to be told anything. The whole system runs at the rate of its slowest component.

If any link in that chain has an unbounded buffer, the chain has no backpressure. That link will absorb the load until it OOMs.

Bounded Buffers Are Not About Channels

The lesson is not "use buffered channels." The lesson is:

Any in-process queue without a capacity bound is a latent OOM.

This applies identically across runtimes:

Runtime	The footgun	The fix
Go	Unbounded goroutine fan-out parked on sends (`go func(msg) { ch <- ... }(msg)`); oversized buffered channels	Single long-lived producer + `select` + bounded buffer as safety net
Rust (Tokio)	`mpsc::unbounded_channel()`	`mpsc::channel(N)`
Python (asyncio)	`asyncio.Queue()` with no `maxsize`	`asyncio.Queue(maxsize=N)`
Node.js	Unbounded array of in-flight Promises	`p-limit`, `Sema`, or explicit pool
Erlang/Elixir	Process mailbox grows unboundedly when selective receive can't keep up	Demand-driven flow control: `GenStage` / `Flow` for pipelines, or explicit ack-based protocols in `gen_statem`

Every one of these reaches for the same shape — an in-process queue — and every one of them OOMs the same way when the shape is unbounded.

When Channels Are The Right Tool

I want to be careful not to overcorrect. Channels are not a mistake. They are an excellent primitive used incorrectly. Cases where reaching for a channel is the right call:

Cancellation signaling — context.Done() is a <-chan struct{}. This is canonical.
Fan-out work distribution with a worker pool — a bounded channel feeding N worker goroutines is a clean semaphore. Buffer size = pool size or small multiple of it.
Producer-consumer with a known throughput ratio — yes, with a bounded buffer sized to the latency budget.
Error aggregation from concurrent goroutines — small buffered channel, drain on goroutine completion.
Handoff between pipeline stages — bounded, with explicit close semantics on the upstream stage.
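The worker-pool and pipeline-handoff cases from the list above share one shape. Here is a minimal sketch, assuming the buffer size equals the pool size; `runPool` and its parameters are illustrative names.

```go
package main

import (
	"fmt"
	"sync"
)

// runPool feeds jobs through a bounded channel to a fixed number of worker
// goroutines and returns the sum of the processed results.
func runPool(jobs []int, workers int, process func(int) int) int {
	queue := make(chan int, workers) // buffer = pool size
	results := make(chan int, workers)
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range queue { // exits when queue is closed and drained
				results <- process(j)
			}
		}()
	}

	go func() { // single feeder; blocks whenever the queue is full
		for _, j := range jobs {
			queue <- j
		}
		close(queue) // explicit close semantics on the upstream stage
	}()

	go func() { wg.Wait(); close(results) }()

	sum := 0
	for r := range results {
		sum += r
	}
	return sum
}

func main() {
	jobs := []int{1, 2, 3, 4, 5}
	fmt.Println(runPool(jobs, 2, func(n int) int { return n * n })) // 55
}
```

The goroutine count is fixed at the pool size plus two helpers, regardless of how many jobs arrive.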

Cases where reaching for a channel is the wrong call:

Cross-process messaging — use a real broker (NATS, Kafka, Redis Streams). Channels do not survive a pod restart.
Persistence — channels are process-local. If your pod dies, the in-flight data is gone. If you need "at least once" across restarts, you need a real queue.
Bursty load with unknown shape — if you cannot put a meaningful upper bound on the buffer, you have not understood the load. Adding a channel does not give you understanding; it postpones the OOM.
Anything that wants to be a message bus — that's not a channel. That's a message bus. They are different categories of system.

The Same Bug, Different Layer: AI Agent Inbound Queues

The reason this post lives in the SecurityLab track and not just "Go tips" is that the exact same mistake is now happening, at scale, in LLM agent infrastructure. I've seen the pattern repeatedly in recent AI backends — same architectural shape, different runtime.

The pattern: an agent backend exposes an HTTP endpoint. Each inbound request is dispatched to a worker pool via an in-process queue.

# The bug, in a different language
import asyncio

request_queue = asyncio.Queue()  # unbounded: maxsize defaults to 0

async def http_handler(req):
    await request_queue.put(req)  # never blocks
    return {"status": "queued"}

async def worker():
    while True:
        req = await request_queue.get()
        await llm_call(req)  # 8 seconds, sometimes 30

Steady state looks fine: requests arrive slightly faster than they're processed, the queue grows slowly, latency creeps up, and nobody notices because the HTTP layer keeps returning 200.

Then a launch happens. Or a viral tweet. Or a marketing email goes out. Inbound rate spikes 50x for 20 minutes. The queue accepts everything (it's unbounded). The worker pool can't keep up — LLM calls are inelastic, you can't parallelize past your token-per-minute quota. The queue grows to 200K items. Each item holds a request payload (~50KB with conversation history) and a future. 10GB of heap. OOM. Pod restart. All 200K requests lost. Users see 500s instead of the explicit "rate-limited, try again in 30s" they would have seen with proper backpressure.

The fix is identical to the Go fix:

request_queue = asyncio.Queue(maxsize=100)

async def http_handler(req):
    try:
        request_queue.put_nowait(req)
    except asyncio.QueueFull:
        return Response(status=503, headers={"Retry-After": "30"})
    return {"status": "queued"}

503 is a feature. It is the system telling the client we're at capacity, retry in 30 seconds. It is honest. It is bounded. It is the difference between a system that degrades gracefully and one that dies silently.
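For a Go service, the same load-shedding shape is a non-blocking send in the handler. This is a sketch under assumptions: `enqueueHandler` and `statusCodes` are hypothetical names, the queue carries only the request path, and accepted requests get 202 rather than the 200 implied by the Python version.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// enqueueHandler accepts a request only if the bounded queue has room;
// otherwise it sheds load with 503 and a Retry-After hint.
func enqueueHandler(queue chan<- string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		select {
		case queue <- r.URL.Path: // room left: accept the work
			w.WriteHeader(http.StatusAccepted)
		default: // queue full: fail fast instead of buffering
			w.Header().Set("Retry-After", "30")
			http.Error(w, "at capacity", http.StatusServiceUnavailable)
		}
	}
}

// statusCodes sends n requests to a handler whose queue nothing drains and
// returns the status codes in order.
func statusCodes(capacity, n int) []int {
	h := enqueueHandler(make(chan string, capacity))
	codes := make([]int, 0, n)
	for i := 0; i < n; i++ {
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/chat", nil))
		codes = append(codes, rec.Code)
	}
	return codes
}

func main() {
	fmt.Println(statusCodes(2, 3)) // [202 202 503]
}
```

The `default` branch is the whole mechanism: the handler never waits on the queue, so the decision to reject is made in constant time and constant memory.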

Reproducing This Yourself

The numbers in this post come from a minimal Go program that fits in under 100 lines per command. The repo lives at:

github.com/harrison001/channels-oom-demo

git clone https://github.com/harrison001/channels-oom-demo.git
cd channels-oom-demo

# Watch goroutine count + heap climb every second
go run ./cmd/bug

# Switch to the fix — flat at 3 goroutines, 5 MB heap
go run ./cmd/fix

Each program exposes pprof on localhost:6060. While the bug version is running:

# Confirm 10K+ goroutines parked in runtime.gopark, called from runtime.chansend / runtime.chansend1
curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=1' | head -20

# Confirm the heap is dominated by Event payloads, not the channel itself
go tool pprof -text http://localhost:6060/debug/pprof/heap

The bug demo has a hard cap at 20,000 goroutines so it won't actually OOM your laptop. Remove the cap if you want to see the kernel finish the job.

What I Wish I'd Known

If I could send one note back to myself eighteen months before the OOM:

When you reach for an in-process queue, you are choosing a backpressure boundary. The buffer size is not a performance tuning knob. It is a contract: under sustained load greater than my consumer's throughput, this is how much memory I am willing to lose before I tell the producer to stop. If you don't pick a number, the runtime picks one for you, and the number is whatever fits in RAM right before the kernel kills the process.

Channels in Go look like message-passing because the syntax was deliberately borrowed from CSP, a model where independent processes communicate by passing values across an isolation boundary. In Go there is no isolation boundary. The channel is a struct in shared memory, the goroutines are coroutines on the same scheduler, and the entire setup is synchronization plumbing in CSP clothing.

Once you see the hchan struct, you can't un-see it. Every channel decision after that is a synchronization decision, not a transport decision. And synchronization decisions always have a capacity bound — you just have to choose whether to pick it explicitly or have the OOM-killer pick it for you.

Keep going

Code: harrison001/channels-oom-demo — reproduce both versions, capture your own pprof
Next piece: Goroutines Are Cheap — Until Backpressure Is Missing — coming next. The producer side of the same mistake: why "just spawn a goroutine" is the second half of the bug.
Subscribe: I write one of these monthly on runtime mechanics, distributed systems postmortems, and the security implications of getting them wrong. Newsletter · SecurityLab track

If you've hit this bug — or its cousin in a different runtime — I'd genuinely like to hear about it. The Erlang and Node.js shapes especially: I have hunches but not enough scars. Reply to the newsletter or open an issue on the demo repo.

How I Improved an AI Agent from 40% to 60% — With A/B Test Data

Harrison Guo — Tue, 12 May 2026 15:49:19 +0000

The Setup

I was optimizing an AI agent for a production system — a creator agent that handles user requests like "make this character fiercer" or "rename this entity." The agent runs a 5-layer pipeline: Perceive → Cognate → Decide → Act → Express, with real LLM calls at each step.

Quality was bad. Not "it doesn't work" bad — "it works 40% of the time" bad. The remaining 60% were wrong entity targeting, infinite reasoning loops, and silent failures.

I ran 5 standardized test cases, each repeated 5 times (LLMs are non-deterministic), measuring pass rate:

Test	What It Does	Baseline
QL-001	Create 4 entities + 1 relationship in one message	0%
QL-002	Classify user intent correctly	80%
QL-003	Update the right entity in a world with 6 characters + 4 locations	40%
QL-004	Maintain context across long conversation	100%
QL-005	Simple rename ("Ember" → "Infernia")	20%

Overall: 40% pass rate. The model (equivalent to GPT-4 class) was plenty capable. Something else was wrong.

The Diagnosis: Context Was the Problem

QL-003: Why the Agent Confused Entities (40% → 80%)

The user says: "Make Ember more fierce and give her fire breath."

The world has 10 entities: 6 characters (Ember, Luna, Grak, Roland, Mira, Pip) and 4 locations. The agent's BuildChatCompletionMessages function dumped ALL entity data into the prompt — every character's backstory, every location's description.

The LLM had to find Ember in a wall of irrelevant text. Sometimes it picked Luna. Sometimes it referenced the wrong character's traits. Not because the model was stupid — because the context was noisy.

QL-005: Why Simple Rename Failed (20% → 80%)

"Rename Ember to Infernia." One entity, one operation. Should be trivial.

Two problems:

No round limit — the agent sometimes looped 15+ times on a rename, reasoning tools firing endlessly
When a tool failed, the LLM got: {"error": true, "message": "This tool is temporarily unavailable."} — no context on what to do next

The model gave up or produced responses that didn't contain "Infernia."

QL-001: Why Multi-Step Creation Was Impossible (0% → 0%)

"Create a dragon named Ember who lives in Crystal Caves. Ember has a rivalry with Sir Roland who guards the village gate."

This requires creating 4 entities + 1 relationship. The 5-layer pipeline processes entities sequentially, each in isolation. The relationship creation doesn't know the knight was just created — there's no shared state between action steps.

Both baseline and improved scored 0%. This is an architectural problem, not a context problem.

The Fixes: 8 Changes, 7 Pure Code

Fix 1: PlanExecution (the only LLM call)

One API call before the main loop. The LLM generates a plan:

Goal: Update Ember's properties
Steps: 1. Identify Ember entity  2. Apply personality changes
Tools needed: updateCharacter

This plan gets injected into the cognition layer's context. The intent classifier now sees a roadmap, not just raw entity data.

Cost: ~$0.003 per request, 3-5s latency. The only fix that uses an LLM call.

Fix 2: PrioritizeContext (pure code)

Sort context items by salience score. Higher-relevance items go first. Low-relevance items dropped when the token budget is exceeded.

When the user says "Make Ember fiercer," Ember's data gets priority. Luna's backstory gets dropped. The LLM sees signal, not noise.

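sort.Slice(items, func(i, j int) bool {
    return items[i].Salience > items[j].Salience // highest salience first
})
// tokenBudget acts as a maximum item count here.
// Guard: slicing beyond len(items) panics or exposes stale elements.
if len(items) > tokenBudget {
    items = items[:tokenBudget]
}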

Cost: Zero. Pure sort + filter.
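The snippet above caps the number of items. If items differ a lot in size, a budget expressed in tokens needs a per-item count. The sketch below assumes a hypothetical `ContextItem` type with a precomputed `Tokens` field; neither is taken from the project's code.

```go
package main

import (
	"fmt"
	"sort"
)

// ContextItem is a hypothetical shape; Tokens is an assumed precomputed count.
type ContextItem struct {
	Name     string
	Salience float64
	Tokens   int
}

// prioritize sorts by salience (highest first) and keeps the longest prefix
// whose total token count fits within budget.
func prioritize(items []ContextItem, budget int) []ContextItem {
	sorted := append([]ContextItem(nil), items...) // leave the caller's slice alone
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Salience > sorted[j].Salience
	})
	used := 0
	for i, it := range sorted {
		if used+it.Tokens > budget {
			return sorted[:i] // this item and everything less salient is dropped
		}
		used += it.Tokens
	}
	return sorted
}

func main() {
	items := []ContextItem{
		{"Luna backstory", 0.2, 400},
		{"Ember profile", 0.9, 300},
		{"Crystal Caves", 0.5, 250},
	}
	for _, it := range prioritize(items, 600) {
		fmt.Println(it.Name) // Ember profile, then Crystal Caves
	}
}
```

Stopping at the first item that does not fit keeps the result a strict salience prefix. Skipping it and trying smaller items would pack the budget tighter, at the cost of admitting less relevant context.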

Fix 3: CompressContext (pure code)

Old conversation rounds get summarized extractively — find tool names, find CONCLUSION markers, truncate the rest. No LLM needed for this level of compression.

Cost: Zero. String operations.

Fix 4: Preserve Conclusions (pure code)

When reasoning text is truncated at 4,000 characters, the truncation used to cut wherever it landed. If the LLM decided "I need to rename Ember to Infernia" in round 1 but that conclusion was at character 4,100, round 2 forgot the decision.

Fix: truncateReasoningPreservingConclusions() finds CONCLUSION/DECISION markers and keeps them even when truncating.

Cost: Zero. String search.
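One way such a truncation could work is line by line. The function below is an illustrative guess, not the project's truncateReasoningPreservingConclusions(); it counts bytes rather than characters and treats any line containing CONCLUSION or DECISION as a marker line.

```go
package main

import (
	"fmt"
	"strings"
)

// truncatePreservingConclusions keeps leading lines until the byte budget
// is exhausted, then keeps only later lines that carry a marker. The result
// can exceed max by the size of the preserved marker lines.
func truncatePreservingConclusions(text string, max int) string {
	var kept []string
	used, full := 0, false
	for _, line := range strings.Split(text, "\n") {
		marked := strings.Contains(line, "CONCLUSION") ||
			strings.Contains(line, "DECISION")
		if !full && used+len(line) > max {
			full = true // budget exhausted: from here on, markers only
		}
		if !full || marked {
			kept = append(kept, line)
			used += len(line)
		}
	}
	return strings.Join(kept, "\n")
}

func main() {
	text := "step 1: inspect entity\n" +
		"step 2: lots of reasoning here\n" +
		"CONCLUSION: rename Ember to Infernia"
	fmt.Println(truncatePreservingConclusions(text, 30))
}
```

With a 30-byte budget the second line is dropped while the conclusion survives, which is the behavior round 2 needs in order to remember the decision.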

Fix 5: Max Rounds Cap (pure code)

const DefaultMaxRounds = 10
if roundCount > DefaultMaxRounds { break }

Previously unlimited. The agent sometimes looped 15+ rounds on a trivial task. Now it stops at 10 and produces its best result.

Cost: Zero. One if-statement.

Fix 6: Structured Tool Errors (pure code)

Before:

{"error": true, "tool_name": "updateCharacter", "message": "This tool is temporarily unavailable."}

After:

{"error": true, "tool_name": "updateCharacter", "message": "This tool is temporarily unavailable.",
 "error_type": "timeout", "retryable": true}

With retryable: true, the LLM knows to try again instead of giving up. With error_type: "timeout", it knows the issue is transient.

Cost: Zero. String classification.
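A classifier of this kind can be a small switch. In the sketch below the struct mirrors the JSON fields shown above, while `classify`, `timeoutExample`, and every category other than "timeout" are assumptions made for illustration.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
)

// ToolError mirrors the JSON fields of the structured error.
type ToolError struct {
	Error     bool   `json:"error"`
	ToolName  string `json:"tool_name"`
	Message   string `json:"message"`
	ErrorType string `json:"error_type"`
	Retryable bool   `json:"retryable"`
}

// classify maps a Go error onto error_type and retryable.
func classify(tool string, err error) ToolError {
	te := ToolError{Error: true, ToolName: tool, Message: err.Error()}
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		te.ErrorType, te.Retryable = "timeout", true // transient: try again
	case errors.Is(err, context.Canceled):
		te.ErrorType, te.Retryable = "canceled", false // assumed category
	default:
		te.ErrorType, te.Retryable = "unknown", false // assumed category
	}
	return te
}

// timeoutExample builds the structured error for a timed-out tool call.
func timeoutExample() ToolError {
	return classify("updateCharacter", context.DeadlineExceeded)
}

func main() {
	out, _ := json.Marshal(timeoutExample())
	fmt.Println(string(out))
}
```

Defaulting unknown errors to non-retryable is the conservative choice here: it prefers a clean failure over a retry loop that the round cap would then have to stop.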

Fix 7: Circuit Breaker (pure code)

Count failures per LLM provider. After 3 consecutive failures, skip that provider and try the fallback. Prevents the agent from burning through 120 seconds of timeout on a dead provider.

Cost: Zero. Counter + threshold.
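The counter-plus-threshold idea fits in a few dozen lines. This is a sketch rather than the project's implementation: the `Breaker` type and its methods are hypothetical, and it has no timed reset, so a tripped provider stays skipped until a success is recorded for it.

```go
package main

import (
	"fmt"
	"sync"
)

// Breaker trips after threshold consecutive failures for a provider.
type Breaker struct {
	mu        sync.Mutex
	threshold int
	failures  map[string]int
}

// NewBreaker returns a breaker that trips at the given failure count.
func NewBreaker(threshold int) *Breaker {
	return &Breaker{threshold: threshold, failures: make(map[string]int)}
}

// Allow reports whether the provider should still be tried.
func (b *Breaker) Allow(provider string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.failures[provider] < b.threshold
}

// Record updates the consecutive-failure count; any success resets it.
func (b *Breaker) Record(provider string, ok bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if ok {
		b.failures[provider] = 0
		return
	}
	b.failures[provider]++
}

func main() {
	b := NewBreaker(3)
	for i := 0; i < 3; i++ {
		b.Record("primary", false)
	}
	fmt.Println(b.Allow("primary"), b.Allow("fallback")) // false true
}
```

A production breaker would normally add a cool-down after which the provider is probed again. That is left out here to keep the counter logic visible.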

Fix 8: HTTP Client Reuse (pure code)

Store *http.Client on the provider struct, reuse across calls. Previously each call created a new client, a new TCP connection, a new TLS handshake.

Cost: Zero. Struct field.

The Results

Test	Baseline	After Fix	Delta	What Fixed It
QL-001	0%	0%	=	Needs pipeline architecture change
QL-002	80%	80%	=	Already working
QL-003	40%	80%	+40%	PrioritizeContext + PlanExecution
QL-004	100%	100%	=	Already working
QL-005	20%	80%	+60%	Max rounds + structured errors + conclusion preservation

Overall: 40% → 60%. Same model. Better input.

Latency went from 26s to 43s due to the PlanExecution LLM call (~3-5s per test). The HTTP reuse and circuit breaker savings show up under concurrent load, not in a 5-test sequential run.

What Didn't Improve — And Why

QL-001 (multi-step creation) stayed at 0%. This isn't a context problem — it's a pipeline architecture problem. Each entity is created in isolation, and the IDs returned by each step are discarded before the next step runs.

Fixing this requires collapsing the 5-layer pipeline into a unified agent with cross-step state — a larger architectural change, not a context fix.

The lesson: Context optimization has a ceiling. Past that ceiling, you need architecture changes. But the ceiling is higher than most people think — we still had 20 percentage points of improvement available before hitting it.

What's Still Missing

Three pieces of infrastructure were built but not wired:

Component	Status	Gap
VerifyOutput	Logs quality issues	Doesn't retry on failure
ScoreMemoryUsage	Computes relevance scores	Scores never applied to future retrieval
PlanExecution	Generates plan before loop	Plan not tracked during execution

All three are open loops. The infrastructure detects problems but doesn't act on them. Closing these loops is the next 20% — getting from 60% to 80%+.

The Takeaway

Better input → better output. The LLM is the same.

If your agent is underperforming, check the context before blaming the model. In our case:

7 out of 8 fixes were pure code
Zero additional LLM cost (except one planning call at $0.003)
20-point quality improvement (40% → 60%) without changing the model
The model was always capable — the context was holding it back

The highest-ROI investment in any agent system is context management. It's not glamorous. It's sort, filter, compress, truncate, prioritize. But it's the difference between 40% and 60% — and the foundation for everything else.

Part of the AI Agent Architecture series. See also: The 90% Problem for the broader framework, and Claude Code Deep Dive Part 3 for how Anthropic solves context at scale.