I keep writing variations of the same sentence. Agent memory has to terminate at a source of truth. An agent loop has to terminate at a check. Generative 3D has to terminate at a verifier. The probabilistic part proposes, a deterministic part disposes.
Four pieces, one shape. That is usually a sign there is a law underneath, not four coincidences. This is my attempt to write the law down, and then to turn it into a standard I can apply to the next system before it ships instead of after it breaks.
The companion piece argued that generative AI builds plausible shapes and that correctness has to come from structure and verification wrapped around the generator. That answers whether you need something outside the model. It does not answer how much, or where the line goes. This piece is about the line.
The quantity nobody names
Every probabilistic system has a number attached to it that almost no one states out loud: the fraction of its output whose correctness rests on a model getting it right, with nothing deterministic to anchor or check that part.
Call it the stochastic surface.
A pure end-to-end model has a stochastic surface of one. Everything it emits is a sample, and every sample is trusted as final. A pocket calculator has a stochastic surface of zero. A real system sits between, and the exact position is the single most important design decision in the thing, more than the model, more than the prompt, more than the framework.
Here is the claim, stated as plainly as I can:
The reliability of a probabilistic system is bounded by the size of its stochastic surface. Reliability engineering is surface reduction: pushing as much of the output as possible onto deterministic structure, and leaving the model only the irreducible part that genuinely has no anchor.
This is not a style preference. It follows from what a sample is.
Why the surface bounds reliability
A deterministic component has a property a sampler never has: when it is wrong, it is wrong the same way every time, and you can find out. A unit test fails. A constraint is violated. A schema rejects the payload. The error is locatable, which means it is fixable, which means over time that part of the system trends toward correct and stays there.
A sample has none of this. It is drawn fresh, it can be wrong differently every time, and, crucially, it carries no signal about whether it is wrong. A plausible-looking output and a correct output are indistinguishable from inside the generator. That is the entire problem in one sentence: the model cannot tell you it succeeded, so it cannot be trusted even when it did.
Now stack many such samples. If thirty percent of your output is on the stochastic surface, then thirty percent of every result is an unverified claim that could be confidently, undetectably wrong, and you have no map of which thirty percent. The failures do not announce themselves. They surface downstream, far from where they were born, as a doorless cottage the workflow was perfectly happy to call done.
So the rough mental model is:
reliability ≈ 1 − (unanchored stochastic fraction)
Not literally a formula you compute, but the right intuition. Every percent of output you move off the surface, by retrieving it instead of inventing it, by deriving it from a rule instead of sampling it, by checking it against a spec, is a percent that stops being an undetectable liability. You do not make a probabilistic system reliable by making the model better. You make it reliable by giving the model less of the output to be solely responsible for.
That reframes everything. It means the work is not "pick the best model." It is "draw the smallest possible circle around the part only a model can do, and build deterministic structure everywhere else." Two sides of a system show this most clearly: how you generate, and how you evaluate. They turn out to be the same move.
The generation side: anchor, don't invent
When you need a model to produce something, the lazy default is to describe it and let the model conjure it from the prior. That maximizes the stochastic surface on purpose. Almost always you can do better, and the options form a clean ladder ordered by how many bits the model has to invent.
Tier 1: retrieve, then edit
The cheapest and most underrated move: do not generate from scratch, find the nearest high-quality real example and modify only what the task requires.
In a game studio this is the difference between "generate a medieval cottage" and "here is a hand-built cottage our artists shipped, adapt it to a 15 by 15 footprint with a south door." The first asks the model to hallucinate an entire artifact. The second hands it a correct starting point and asks for a delta. The stochastic surface collapses from the whole object to the edit.
The principle generalizes far past games. Retrieval-augmented generation is this move for text. Asking a coding model to modify an existing, tested function instead of writing one blind is this move for code. In every case the logic is identical: the model's real workload is total information minus retrievable information. Anything already present in a real example is information the model does not have to invent, and therefore cannot get wrong. You spend a similarity search to buy down the surface. It is almost always a good trade.
The failure mode to respect: retrieval anchors you to the retrieved thing, so retrieval quality becomes the new floor. Garbage neighbors give garbage edits. But notice what changed: the failure moved from an invisible generative hallucination to a visible, checkable retrieval step you can inspect, score, and improve. That is surface reduction even when it is imperfect, because it converts an unanchored failure into an anchored one.
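A minimal sketch of the retrieve-then-edit step, under stated assumptions: the `Asset` type, the tag sets, the Jaccard similarity, and the `min_score` threshold are all hypothetical stand-ins for whatever similarity search a real pipeline would use. The point it illustrates is that retrieval returns a score you can inspect, and that a weak match is reported as a miss instead of silently anchoring the edit to a bad neighbor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    tags: frozenset[str]


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Overlap between two tag sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_anchor(query_tags, library, min_score=0.5):
    """Return (best_asset, score), or (None, score) when no neighbor is close enough."""
    query = frozenset(query_tags)
    best, best_score = None, 0.0
    for asset in library:
        score = jaccard(query, asset.tags)
        if score > best_score:
            best, best_score = asset, score
    if best_score < min_score:
        # Visible, checkable failure: the caller knows there is no good anchor.
        return None, best_score
    return best, best_score


# Hypothetical library of shipped, hand-built assets.
library = [
    Asset("cottage_a", frozenset({"medieval", "cottage", "thatched", "small"})),
    Asset("watchtower_b", frozenset({"medieval", "tower", "stone"})),
]

anchor, score = retrieve_anchor({"medieval", "cottage", "small"}, library)
miss, miss_score = retrieve_anchor({"scifi", "spaceship"}, library)
```

When `retrieve_anchor` returns `None`, the request has no usable neighbor and falls through to the tiers below; when it returns an asset, the model is asked only for the delta against it.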
Tier 2: fine-tune to narrow the distribution
Retrieval changes the starting point. Fine-tuning changes the distribution itself.
This is the tier worth dwelling on, because it is the one most often reached for first and understood least. A base model samples from an enormous manifold of plausible-for-the-internet outputs. Fine-tuning on your own high-quality, domain-specific data reshapes that manifold so the model's default sample lands much closer to what your domain considers good. You are not teaching it new facts so much as moving the center of mass of its prior onto your distribution, and shrinking the variance.
For a studio, training a model on your own corpus of shipped, art-directed, style-consistent assets does something retrieval cannot: it makes the typical generation on-style and on-spec, not just the retrieved-and-edited one. It raises the floor everywhere, including the cases where you have no neighbor to retrieve. The economics also favor it once volume is high enough: a one-time training cost amortized across millions of generations, versus paying for a large frontier prompt every single time.
But be precise about what fine-tuning does and does not do to the stochastic surface, because this is where people over-trust it. Fine-tuning lowers the surface; it does not remove it. A model fine-tuned on perfect 15 by 15 cottages will produce cottages that are usually closer to 15 by 15. It still has no representation of "footprint equals 15 by 15" as a predicate to satisfy and check. It samples from a tighter distribution, but it is still sampling. The discrete constraint is still discrete, and a narrower continuous distribution is still continuous. Fine-tuning buys you a better-behaved sampler. It does not buy you a solver or a verifier, and treating a fine-tuned model as if it were one is exactly the over-trust this whole standard is built to prevent.
The right read: fine-tuning is the strongest generation-side lever, and generation-side levers have a ceiling. They make the proposal better. They never make the proposal self-disposing.
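To make the ceiling concrete, here is a toy sketch in which a random number generator stands in for the model. Everything in it is an assumption for illustration: `sample_footprint`, the `spread` parameter (a smaller spread imitates a fine-tuned, narrower distribution), and the retry budget. It shows that narrowing the sampler raises the hit rate, while only the deterministic predicate can say whether any given sample met the spec.

```python
import random

SPEC_FOOTPRINT = (15, 15)


def sample_footprint(rng, spread):
    """Stand-in for a generator: draws a footprint near 15 x 15.

    A smaller `spread` imitates a fine-tuned, narrower distribution.
    """
    width = 15 + rng.randint(-spread, spread)
    depth = 15 + rng.randint(-spread, spread)
    return (width, depth)


def meets_spec(footprint):
    """Deterministic predicate: same answer for the same input, every time."""
    return footprint == SPEC_FOOTPRINT


def propose_until_accepted(rng, spread, max_attempts=200):
    """The sampler proposes, the predicate disposes. Returns (footprint, attempts)."""
    for attempt in range(1, max_attempts + 1):
        candidate = sample_footprint(rng, spread)
        if meets_spec(candidate):
            return candidate, attempt
    return None, max_attempts  # budget exhausted: report failure, do not guess


rng = random.Random(0)
base_hits = sum(meets_spec(sample_footprint(rng, spread=5)) for _ in range(2000))
tuned_hits = sum(meets_spec(sample_footprint(rng, spread=1)) for _ in range(2000))
accepted, attempts = propose_until_accepted(random.Random(1), spread=1)
```

The narrower sampler hits the spec far more often, but never every time. Anything `propose_until_accepted` returns satisfies the spec by construction, and that guarantee comes from `meets_spec`, not from the sampler.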
Tier 3: generate the residual only
After you have retrieved what you can and narrowed what you can, whatever is left, the genuinely novel part with no prior to anchor to, is what you let the model invent freely. That residual is where a generative prior earns its keep: plausible, varied, rich single forms. The sphere and the gatehouse from the companion piece are this tier done well.
The discipline is to make that residual as small as the task allows, and to wrap it, never to let it stand as the final word. Which is the other side of the system.
The evaluation side: rules first, model last
Now flip from producing output to judging it. The same law applies, and the anti-pattern is more seductive because it looks like progress: hand the whole output to an LLM and ask "is this good?"
That puts your evaluation's stochastic surface at one hundred percent. You have built a judge that cannot tell you when it is wrong, to assess a generator that cannot tell you when it is wrong. Two unanchored samplers in a trench coat. It demos beautifully and rots quietly.
The reliable structure is a hierarchy, ordered by how anchored each layer is, and you push as much weight as possible to the top:
1. Quantifiable, to a deterministic metric. Anything you can measure, you measure. Footprint dimensions. Block counts. Latency. Compile success. Test pass rate. Schema validity. Forbidden-element count equal to zero. This is the bedrock layer and it should carry the most weight the domain allows, because it has a stochastic surface of zero. It is right the same way every time.
2. Formalizable but not numeric, to a program check. Things that are not a number but are still decidable: is the door actually passable (run a pathfinder), does the graph have the required structure, does the config satisfy this invariant. Still deterministic, still locatable, still zero surface. This is the procedural and symbolic layer.
3. Irreducibly subjective, to a bounded model judge. What is genuinely left ("is this fun," "is this on-brand," "is this elegant") goes to an LLM. But scoped: a small, well-defined slice, with a rubric, calibrated against human ratings, and auditable after the fact. The model judges the residual, not the whole.
The number that matters is the share of your evaluation weight sitting in layers 1 and 2. That is your evaluation's anchored fraction, and it is the complement of its stochastic surface: one minus the surface. A good eval system is one where the subjective model-judged slice is small and shrinking.
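A sketch of the three layers and the anchored fraction, with loud assumptions: the grid encoding (`#` for wall, `.` for walkable), the 5 by 5 spec (shrunk from 15 by 15 so the grid fits on screen), the layer weights, and the `judge_score` argument standing in for a bounded model judge are all hypothetical. Layer 1 is a footprint metric, layer 2 is a breadth-first pathfinder from a spawn cell to the door, and layer 3 is capped at a small weight.

```python
from collections import deque

SPEC_FOOTPRINT = (5, 5)  # (width, depth); illustrative, not the article's 15 x 15


def door_reachable(grid, start, door):
    """Layer 2 program check: breadth-first search over walkable cells ('.')."""
    rows, cols = len(grid), len(grid[0])
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == door:
            return True
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == "." and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False


def evaluate(grid, spawn, door, judge_score):
    """Return (total_score, anchored_fraction); judge_score is assumed in [0, 1]."""
    footprint = (len(grid[0]), len(grid))
    layers = [
        # (layer, weight, score)
        (1, 0.5, 1.0 if footprint == SPEC_FOOTPRINT else 0.0),
        (2, 0.4, 1.0 if door_reachable(grid, spawn, door) else 0.0),
        (3, 0.1, judge_score),
    ]
    total = sum(weight * score for _, weight, score in layers)
    anchored = sum(weight for layer, weight, _ in layers if layer in (1, 2))
    return total, anchored


open_cottage = ["#####", "#...#", "#.#.#", "#...#", "##.##"]  # south door at (4, 2)
sealed_cottage = open_cottage[:4] + ["#####"]                 # same cottage, no door

good_total, anchored = evaluate(open_cottage, (1, 1), (4, 2), judge_score=0.8)
sealed_total, _ = evaluate(sealed_cottage, (1, 1), (4, 2), judge_score=1.0)
```

With these weights the doorless cottage cannot score above 0.6 even when the judge rates it perfectly, because the judge only controls a tenth of the total.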
And here is the part that makes evaluation more than a one-time setup. Every time you take a judgment that used to live in the model's head, "this looks too wide," and turn it into a rule, "footprint must equal the spec," you move it from layer 3 to layer 1 permanently. It never goes back. Evaluation built this way is a ratchet: a mechanism for steadily converting accumulated human judgment into deterministic checks, one notch at a time, each notch shrinking the surface and never releasing it. "Let the AI grade the AI" is the opposite, a surface stuck at one hundred percent with no mechanism to ever bring it down.
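One way the ratchet could look in code, as a sketch only: `EvalRatchet`, its method names, and the unweighted notion of surface used here are assumptions for illustration. The design choice worth noticing is the missing method. There is a `promote` and no `demote`, so a judgment that has become a rule stays a rule.

```python
class EvalRatchet:
    """Tracks which criteria are deterministic rules and which still go to a model judge."""

    def __init__(self, judged_criteria):
        self.rules = {}                     # name -> deterministic predicate
        self.judged = set(judged_criteria)  # names still decided by the judge

    def promote(self, name, predicate):
        """Turn a recurring judgment into a rule. There is deliberately no demote()."""
        self.judged.discard(name)
        self.rules[name] = predicate

    def surface(self):
        """Share of criteria still resting on the model judge (unweighted)."""
        total = len(self.rules) + len(self.judged)
        return len(self.judged) / total if total else 0.0

    def run_rules(self, artifact):
        """Names of the deterministic rules the artifact fails."""
        return sorted(name for name, check in self.rules.items() if not check(artifact))


ratchet = EvalRatchet({"looks too wide", "on-brand", "fun"})
before = ratchet.surface()

# "This looks too wide" becomes "footprint must equal the spec".
ratchet.promote("looks too wide", lambda a: a["footprint"] == (15, 15))
after = ratchet.surface()

failures = ratchet.run_rules({"footprint": (17, 15)})
passes = ratchet.run_rules({"footprint": (15, 15)})
```

Each `promote` call is one notch: the judged set shrinks, the rule set grows, and the failure for a too-wide cottage is now reported by name.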
The two sides are one move
Step back and the generation ladder and the evaluation hierarchy are the same diagram viewed from two ends.
On both sides you are drawing a line. On one side of the line is the deterministic part: retrieved examples, narrowed distributions, constraint solvers, metrics, program checks. On the other side is the irreducible stochastic residual: the novel shape, the genuinely subjective call. That line is the stochastic surface. Generation and evaluation are just its two projections, one for producing, one for judging.
Every system I keep writing about is an instance of drawing that line well. Agent memory fails when a model's hedge gets stored as a fact with no source to check it against; the fix anchors memory to a source of truth, shrinking the surface. The advisor strategy spends a cheap model on the bulk and reserves expensive, decisive compute for the few points that must be right; that is surface reduction in the cost dimension. Retrieval versus grep, plan-generate-solve-verify for game content, rules-first evaluation: all of them follow the same law. The probabilistic component proposes, and a deterministic component, as much of one as the problem allows, disposes.
The standard
The point of a law is to use it before you ship, not to explain the wreckage after. So here is the checklist I am going to run on the next probabilistic system I build, and the one after that. Five questions.
1. What is the stochastic surface? Name the exact fraction of the output whose correctness rests on the model alone, unretrieved and unchecked. If you cannot point to it, you do not understand the system yet. Stop here until you can.
2. What did you anchor instead of invent? For every generated part, did you retrieve a real example first (tier 1), narrow the distribution with fine-tuning where volume justifies it (tier 2), and reserve free generation for the residual only (tier 3)? Each part that skipped the ladder is surface you chose not to reduce, and you should be able to say why.
3. What is checkable, and is it checked? Every predicate the output must satisfy that can be verified deterministically, is it? Sizes, counts, schemas, invariants, passability. Anything checkable but unchecked is the worst category: a failure you could have caught for free and did not.
4. Where does the model judge, and is that slice bounded? In evaluation, what share of weight is deterministic (layers 1 and 2) versus model-judged (layer 3)? Is the judged slice small, rubric-bound, calibrated, and auditable? A universal LLM judge is a red flag, not a feature.
5. Is the surface shrinking over time? Is there a ratchet, a path by which recurring human judgments become permanent rules and the surface trends down with use? A system whose surface is fixed will not get more reliable no matter how long it runs.
Score every component against these five and the weak points light up immediately. They are always the same shape: a part handed to the model that could have been retrieved, derived, or checked, and was not.
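A rough sketch of what scoring components might look like, covering the first three questions only. The `Component` fields, the weights, and the example parts are hypothetical; assigning a weight to each part is itself a judgment call, so treat the resulting number as a prompt for discussion, not a measurement.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    weight: float    # rough share of the output this part accounts for
    anchored: bool   # retrieved or derived, not invented by the model
    checked: bool    # verified by a deterministic check
    checkable: bool  # a deterministic check could exist for it


def audit(components):
    """Return (stochastic_surface, weak_points) for a list of components."""
    total = sum(c.weight for c in components)
    # Exposed parts rest on the model alone: neither anchored nor checked.
    exposed = [c for c in components if not (c.anchored or c.checked)]
    surface = sum(c.weight for c in exposed) / total if total else 0.0
    # Weak points are exposed parts that could have been checked and were not.
    weak = [c.name for c in exposed if c.checkable]
    return surface, weak


cottage = [
    Component("footprint", 2, anchored=False, checked=False, checkable=True),
    Component("wall layout", 5, anchored=True, checked=True, checkable=True),
    Component("door placement", 1, anchored=False, checked=False, checkable=True),
    Component("roof ornament", 2, anchored=False, checked=False, checkable=False),
]

surface, weak_points = audit(cottage)
```

In this made-up example half the output sits on the surface, and two of the three exposed parts are the checkable-but-unchecked kind. The roof ornament is exposed too, but it is genuine residual with no predicate to check.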
Where the standard stops
A law worth trusting comes with its boundary, so here is mine. This standard governs correctness tasks, and only those. A correctness task has a spec the output must satisfy: a size, a passable door, a compiling program, a routed circuit, a factual answer. There, a large stochastic surface is pure liability and shrinking it is the whole job.
But some tasks have no spec, and for those the standard inverts. Style exploration. Brainstorming. A first rough draft meant to be thrown away. Concept art whose only requirement is "show me something I would not have thought of." There the stochastic surface should be near one hundred percent, because there is no predicate to anchor to and constraints would only strangle the thing you wanted. Clamping a creative task with a verifier is the same category error as trusting a generator on a correctness task, run backwards.
So the discipline underneath the discipline is telling the two apart. Ask of any output: is there a spec it must satisfy, or am I sampling for novelty? If there is a spec, the standard applies and you shrink the surface as far as the problem allows. If there is not, let the model run, and do not pretend a verifier was missing.
The hard engineering was never the model. An adequate one is already here, and the next one will be better in ways that do not touch this. The hard engineering is drawing the line around it, and making that line as small as the problem allows.
This is the second piece in a pair. The first: Generative AI Builds Shapes, Not Games.
Related, the same law at other layers: Agent Memory Is a Cache Coherence Problem and Agent Architecture Is a Compute Allocation Problem: The Advisor Strategy.