Why you have to witness the measurement before you trust the instrument
Part of the ForgeFlow series — building a coding agent that runs its execution loop locally on an M5 Max, and writing down what actually breaks. Planning runs on Claude; code generation runs on a local model via Ollama, test-driven inside a Docker sandbox.
I was building the part of the system that measures itself, and I'd reached the point where it felt safe to let the AI agent do the counting. The task was mundane: tally how often a certain check had fired, aggregate it, report a number. The agent ran and reported 12.
I almost took it. It was a believable number, produced confidently, and I was tired of doing this kind of bookkeeping by hand. But something made me run it down myself in the terminal. The real count was 13. One item that belonged in the tally had been silently dropped from the agent's raw count.
One off. Twelve versus thirteen. On its own, trivial. But it landed on a question I'd been circling for a while — who is allowed to be the final witness to a measurement? — and the answer changed how I let AI into this system. This is the third post in a short run about that larger theme: the tools I use to verify my work can themselves be wrong, usually in ways that look fine.
Why I wanted to hand off the counting
The honest reason was fatigue. A self-measuring system generates a lot of small, exact bookkeeping — count the occurrences, filter the relevant ones, divide, summarize. It's beneath an agent's apparent ability and above my patience, so handing it off felt obvious. The agent reads the data, the agent reports the number, I read the report. Clean.
And to be fair to the agent: most of what it did was right. It wasn't hallucinating wildly. It produced a number that was almost correct — off by one, in a way that would have been invisible if I hadn't gone and looked. That's the dangerous kind of wrong: not obviously broken, just plausible enough to accept.
The one that got away
Here's how I know the real number was 13 and not 12, because "the agent was wrong" needs a ground truth. I didn't ask the agent to summarize. I ran a deterministic pass in the terminal that emitted each matching record, one by one, from the same source data, under the same inclusion rule — and then counted those emitted records directly. Run it again, same 13. The missing entry met that same rule; it just arrived in an irregular shape, unlike the other twelve, so the agent's summarizing pass had skipped it. Thirteen by reproducible enumeration, twelve by agent summary. The difference wasn't a matter of my opinion — it was the same data, counted two ways, where only one of the two ways could be re-run to the same answer.
Now the detail that actually unsettled me. The final number I cared about — a summarized metric, produced after a downstream step reduced the raw count to a coarser, normalized figure — came out the same whether the raw count was 12 or 13. It was the kind of step (rounding, bucketing) where a difference of one can simply vanish. So if I'd checked only the headline metric, I'd have seen the agent's report agree with mine and concluded everything was fine. Only the raw count underneath disagreed.
That's the part worth sitting with — and it's where "the final number was the same, so what's the problem?" gets answered. The problem was not that this particular headline metric changed. It didn't. The problem was that the raw measurement layer was already wrong, and the agreement at the normalized layer would have hidden that fact. A different threshold, a different aggregation, or the next consumer downstream might not have absorbed the same error — and once the raw layer is wrong, every later use of it (an audit, a trend line, a regression check) inherits the error. The safety here wasn't something I'd designed. It was luck, and luck doesn't generalize.
So the lesson wasn't "the agent made an arithmetic mistake." Agents make mistakes; that's expected. The lesson was about who I'd let stand as the last check. I'd been about to let the agent's self-report be the final word on a measurement, and the self-report was wrong in a way only a deterministic, re-runnable count would catch.
The instrument can't be the only witness to itself
Let me state the principle the way it finally settled, because it's the part I'd defend.
When you build a measuring system, there's a pull to let the same system that does the work also judge whether it came out right — especially when that system is a capable model that can read logs, count, and summarize. It's convenient. But it quietly inverts the roles. The model's judgment is supposed to be assistance. If it becomes the only thing standing between you and the recorded truth, then the model has become the source of truth — and a probabilistic, occasionally-off-by-one source of truth is sitting in the one seat that should be reproducible and auditable.
The fix isn't to stop using the agent. It's to insist that, at least while I'm still establishing trust, a human deterministically witnesses the output — that I sit at the terminal and watch the real number come out, with my own eyes, before I let the automated measurement stand on its own. The model can do the heavy lifting. It just can't, yet, be the sole witness to whether the heavy lifting was correct — because verifying a measurement is exactly where "almost right" is indistinguishable from "right" until something deterministic says otherwise.
It's a little uncomfortable to write down, because it means the convenient version of the workflow — agent counts, agent confirms, I read the summary — is the one I specifically can't have when the number matters.
What I changed
Two adjustments.
A human witness before the automation is trusted. Before I let any self-measurement run unattended, I make the deterministic version happen in front of me at least once — the actual count, from the actual data, watched as it's produced, and re-runnable to the same result. If the automated path and the watched path disagree, the automated path is wrong until proven otherwise. The agent's confidence carries no weight in that comparison.
Pin the measurement to units I can actually witness. Part of why the slip nearly slid through was a mismatch between what the agent tallied and what was legitimately in scope. So I narrowed the measurement to count only the units that were genuinely instrumented and observable — the units a human can directly observe and enumerate — rather than a looser population the agent gets to interpret. When the population is well-defined, the hand-witnessed number and the machine number can actually be compared. When it's loose, they drift, and you can't even tell.
Both of these slow things down. That's the cost I'm deliberately paying for the measurements I'm willing to bet on.
What this didn't prove
I want to keep this in proportion.
This is one off-by-one, caught once, on one counting task. It doesn't prove AI agents can't count, or that they're untrustworthy in general, or that you should hand-verify every number an agent produces. For many low-stakes tallies, the agent's number may be good enough, and re-counting by hand would waste a perfectly good tool. I'm not arguing for paranoia.
The claim I'd stand behind is narrow and conditional: for measurements you intend to build further conclusions on, the final witness should be deterministic and, at least initially, human. The stakes set the bar. A throwaway internal count doesn't need this. A number you're going to let the system's self-assessment ride on does — because that's the number whose error compounds.
I'd also flag the obvious tension: this doesn't scale by hand forever. The whole point of automating measurement was to not sit and count things. So "a human witnesses it" is a phase, not a permanent state — it's how trust gets established before the automation runs on its own, not a vow to recount reality by hand every day. For now, I only start relaxing the human witness once the deterministic path and the automated path have matched across repeated runs, and once the inclusion rule is pinned tightly enough that both are counting the same population. Where exactly to graduate beyond that is something I'm still working out, and I don't have a clean rule for it yet.
The takeaway, stated honestly
You can let an AI write the code. You can let it run the analysis. But if you also let the same system that produced the answer be the final judge of whether that answer was correct, you've made the examinee the examiner, and the score gets hard to trust on its own. For the numbers that matter, something deterministic — and, while trust is still being earned, something human — has to be the last thing that looks.
The agent was right about almost everything — off by one. The whole question is whether you find out about the one, and that depended entirely on whether I was willing to look myself.
How do others draw this line? When you let an AI agent measure or aggregate something, do you verify it independently — and how do you decide which numbers earn the hand-check and which numbers you allow to pass without it? I leaned on doing it myself this time, but I know that doesn't scale, and I'd be glad to hear how people who've gone further handle the trust handoff.
Next in the series — the last of this run: we stopped chasing better models a while ago. This is about the moment we stopped trusting our own measurements too, and what we kept instead.
Top comments (0)