We Stopped Trusting Models. Then We Stopped Trusting Our Own Numbers.

#ai #agents #softwareengineering #devjournal

Nondeterminism isn't a bug to ban — it's a force to place

Part of the ForgeFlow series — building a coding agent that runs its execution loop locally on an M5 Max, and writing down what actually breaks. Planning runs on Claude; code generation runs on a local model via Ollama, test-driven inside a Docker sandbox.

A while back I wrote that we'd stopped chasing better models — that for this project, swapping in a stronger model kept failing to fix problems that turned out to be about the system around the model, not the model itself. That post ended on a tidy note: the model wasn't the bottleneck, the system was.

This is the post where the same suspicion turns inward. Once you stop trusting the model to be the answer, the next thing you lean on is your own measurements: the test counts, the gate statistics, the tallies that tell you whether the system is working. Over a stretch of building, I learned those can't be trusted blindly either. The three previous posts in this run were each a version of that discovery:

A test suite that passed while measuring the wrong environment.
A gate that blocked 198 times while being wrong often enough that the count could no longer serve as evidence of quality.
An agent that counted twelve when the real number was thirteen.

Three different instruments. A recurring failure shape underneath them: the thing I use to verify can itself be wrong, and it tends to be wrong in a way that looks like success. This post is about what I took from that — and, more importantly, about the wrong conclusion I almost drew from it.

The wrong lesson: "ban the uncertainty"

When you get burned three times by measurements you trusted, there's an instinctive reaction: trust nothing that isn't certain. Push every source of uncertainty out of the system. Treat nondeterminism — anything probabilistic, anything that might come out differently twice — as a defect to eliminate. If the language model is the uncertain part, minimize the language model. Make everything deterministic and tell yourself you can finally sleep soundly.

I leaned that way for a while. It's wrong, or at least too blunt to be useful. Taken seriously, "ban all nondeterminism" throws out the single most valuable thing the model brings to this system — its ability to propose, to explore, to suggest a fix I wouldn't have enumerated. You can't get that from a deterministic rule. The uncertainty isn't only a liability; in the right seat, it's the entire point.

So the question stopped being "how do I remove the uncertainty" and became "where does the uncertainty belong?"

The better lesson: place it, don't ban it

Here's the reframing that held up. It took an embarrassingly long time to see, partly because I had assumed all along that nondeterminism was a debt and never checked that assumption.

A system like this has two kinds of seats.

Seats that propose and explore. What should we try next? What might this failure be? What's a candidate fix? These seats want nondeterminism. A probabilistic model generating possibilities is a feature here, not a risk. If it's occasionally wrong, the cost is low: being wrong is fine when you're only suggesting, because something downstream still has to approve the suggestion.

Seats that judge and record. Did the tests actually pass? Is this allowed through? What gets written down as true? These seats can't tolerate an unaccountable input. They need to be deterministic, reproducible, and checkable — because this is exactly where "almost right" becomes indistinguishable from "right," which is the trap the last three posts kept falling into.

Each of the three failures was the same shape: something that shouldn't have been the final authority had crept into a judging seat. The polluted environment let an unpinned variable decide a verdict. The miscalibrated gate let an unchecked heuristic sit in judgment and reject valid work. The agent's self-report nearly let a probabilistic counter be the last word on a number. In every case, the fix wasn't to purge uncertainty from the whole system — it was to get the wrong thing out of the judging seat, and make sure that seat was held by something deterministic, pinned, and witnessable.
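The third case, the agent's count, is the easiest of the three to sketch. What follows is a minimal illustration, not ForgeFlow's actual code: `Tally`, `record_count`, and the idea that the counted things are available as a plain list are all assumptions made for the example. The agent's number is kept as a claim, and the number that gets written down is recomputed from the items themselves.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Tally:
    claimed: int   # what the agent reported (probabilistic source)
    counted: int   # what a deterministic recount found

    @property
    def agrees(self) -> bool:
        return self.claimed == self.counted

    @property
    def recorded(self) -> int:
        # The recount is the number of record; the claim is kept for auditing only.
        return self.counted


def record_count(claimed: int, items: Sequence[str]) -> Tally:
    """Hypothetical helper: recount the items instead of trusting the claim."""
    return Tally(claimed=claimed, counted=len(items))


# Illustrative data only: thirteen items and a self-report of twelve.
items = [f"item-{i}" for i in range(13)]
tally = record_count(claimed=12, items=items)
print(tally.agrees, tally.recorded)   # False 13
```

Keeping the claim next to the recount, rather than discarding it, is a deliberate choice in this sketch: a disagreement between the two is itself worth logging.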

Nondeterminism, it turns out, isn't a quality of the system to be turned up or down. It's a force to be placed. You let it run at the front, where things are proposed and explored. You keep it out of the back, where things are judged and recorded. The model proposes; a deterministic check decides. That one division became the thread that connected the fixes in this run.
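A minimal sketch of that division, under stated assumptions: the candidate table, the `propose`, `judge`, and `first_accepted` names, and the pinned test cases are all hypothetical stand-ins, and `propose` uses a random choice where a real agent would call a model. The point is only the shape: the proposing seat may vary from run to run, while the judging seat returns the same verdict for the same candidate.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical candidate fixes for an `add(a, b)` function; in a real agent
# these would be generated code, not a fixed table.
CANDIDATES: dict[str, Callable[[int, int], int]] = {
    "a + b": lambda a, b: a + b,
    "a - b": lambda a, b: a - b,
    "a * b": lambda a, b: a * b,
}

# Pinned test cases: the judging seat's only source of truth.
CASES = [((1, 2), 3), ((0, 0), 0), ((-1, 5), 4)]


@dataclass(frozen=True)
class Verdict:
    candidate: str
    passed: bool


def propose(rng: random.Random, n: int) -> list[str]:
    """Proposing seat: a stand-in for a model call; output may vary per run."""
    return [rng.choice(list(CANDIDATES)) for _ in range(n)]


def judge(candidate: str) -> Verdict:
    """Judging seat: same candidate in, same verdict out, no model involved."""
    fn = CANDIDATES[candidate]
    passed = all(fn(*args) == expected for args, expected in CASES)
    return Verdict(candidate, passed)


def first_accepted(candidates: list[str]) -> Optional[Verdict]:
    """Return the first candidate the judge accepts, or None if none pass."""
    for name in candidates:
        verdict = judge(name)
        if verdict.passed:
            return verdict
    return None


proposals = propose(random.Random(), n=5)   # unseeded on purpose
print(first_accepted(proposals))            # may be None if no proposal passes
```

Nothing the proposer returns is recorded as true on its own; only a `Verdict` from `judge` is. A run that prints `None` is not a failure of the pattern, just a round in which nothing proposed was good enough.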

Why this isn't just "trust deterministic things more"

It's tempting to read all this as "deterministic good, probabilistic bad." That's not it, and the distinction matters.

A probabilistic suggestion in a proposing seat can be more valuable than a deterministic one, because it can reach things a rule can't. And a deterministic check is only as good as whether it's actually correct and checkable. The gate in the second post was deterministic and still wrong a lot of the time, and a deterministic judge that's consistently wrong can be worse than a coin flip, because it's a steady error you eventually stop questioning. So it isn't that determinism is virtuous and uncertainty is sinful. It's that they belong in different places, and the whole engineering problem is keeping them sorted:

Let the uncertain thing explore. Let deterministic checks judge — but only after the checks themselves have been checked.
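One way to "check the check" is to run the gate against a small set of cases whose correct answer is already known, and count how often it disagrees. The sketch below assumes such a labeled set exists; `LabeledCase`, `GateReport`, `audit_gate`, and the line-count heuristic in `short_patch_gate` are hypothetical names and rules invented for illustration, and the four cases are made-up data, not results from ForgeFlow.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class LabeledCase:
    payload: str
    should_pass: bool   # ground truth from a check someone actually witnessed


@dataclass(frozen=True)
class GateReport:
    total: int
    false_rejects: int   # valid work the gate blocked
    false_accepts: int   # invalid work the gate let through

    @property
    def error_rate(self) -> float:
        if self.total == 0:
            return 0.0
        return (self.false_rejects + self.false_accepts) / self.total


def audit_gate(gate: Callable[[str], bool],
               cases: Sequence[LabeledCase]) -> GateReport:
    """Compare the gate's decisions with the known answers."""
    false_rejects = sum(1 for c in cases if c.should_pass and not gate(c.payload))
    false_accepts = sum(1 for c in cases if not c.should_pass and gate(c.payload))
    return GateReport(len(cases), false_rejects, false_accepts)


def short_patch_gate(patch: str) -> bool:
    """Hypothetical heuristic: allow only patches of at most three lines."""
    return len(patch.splitlines()) <= 3


# Made-up cases: the heuristic is deterministic, and wrong on half of them.
cases = [
    LabeledCase("x = 1", should_pass=True),
    LabeledCase("a\nb\nc\nd\ne", should_pass=True),    # valid but long
    LabeledCase("bad = True", should_pass=False),      # invalid but short
    LabeledCase("a\nb\nc\nd", should_pass=False),
]
report = audit_gate(short_patch_gate, cases)
print(report.false_rejects, report.false_accepts, report.error_rate)   # 1 1 0.5
```

A raw block count would say this gate is busy. The audit says how often that busyness is wrong, which is the number that should decide whether the gate keeps its seat.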

The reason this took four posts and a fair amount of getting it wrong is that the failures don't announce which category they're in. A measurement that's quietly wrong looks exactly like one that's right, until you go and look. The sorting isn't automatic. You do it deliberately, instrument by instrument, and you usually find out you got it wrong only when a number you trusted turns out to have been the problem all along.

What this didn't prove

I want to close the series the way I've tried to write the whole thing — without inflating it.

This is a framing that helped one system: a local AI coding agent with a test-driven loop, where I happen to have the luxury of pinning judging seats to deterministic checks I can watch. I haven't proven it's the right decomposition for every AI system, and I'd be wary of anyone — including me — who turned "place nondeterminism, don't ban it" into a universal law. It's a working lens, not an established result. The evidence behind it is a handful of incidents, honestly reported, not a controlled study.

It also doesn't resolve the hard part: the boundary between "propose" and "judge" is not always obvious. Plenty of real decisions are a blend — a judgment that also has to weigh uncertain evidence, or a proposal that quietly commits you to something before anything else gets a vote. Where exactly to put the deterministic check, and how much to let the probabilistic part inform a judgment without becoming the judgment, is something I'm still working out. The previous post's open question — when a human-witnessed check can graduate to a trusted automated one — is part of the same unsolved area. I'm sharing the lens because it organized a mess for me, not because it's finished.

The takeaway, for the whole run

Four posts, one thread. We stopped trusting that a better model would save us. Then we learned not to trust our own measurements blindly either — not the passing tests, not the busy gate, not the agent's tidy count. What survived all that doubt, for me, was a single discipline that proved sturdy enough to stand on:

In systems like this, every layer that grants trust should be measured before it's trusted — and that measurement should be grounded, wherever possible, in something deterministic you can witness.

That's it. Not a model, not a framework, not a clever architecture. A place to stand. After doubting the model, the gate, the tally, and our own numbers, it's the one thing I found solid enough to build the next thing on.

This is the end of this short run, so I'll ask the broad version of the question. For those building systems with AI in the loop: where do you draw the line between the parts you let be uncertain and the parts you insist be deterministic — and how do you check that you drew it in the right place? I arrived at one answer through a series of mistakes. I'd rather learn the next one from other people's experience than from my own next mistake.

Thanks for following this run. The earlier ForgeFlow posts — on the local agent itself, on why we stopped chasing models, and on what breaks at scale — are linked from my profile.