The check you can write is the check you can fool

#agents #ai #llm #security

A few weeks of watching agents fail in slow, expensive ways has pushed me toward a single test for whether a system is actually verified, and it is narrower than I expected: could the thing being checked have produced the check?

That sounds glib, but it cuts through a lot. "Is this verified?" usually gets answered with a mechanism — a second pass, a judge model, a benchmark, a signed log. None of those answer the real question on their own. The real question is about provenance: where did the evidence come from, and could the actor have authored it? Verification is not a layer you bolt on. It is a property of where the evidence lives.

Here is the path that got me there.

Self-verification has a ceiling, and it isn't calibration

The obvious first move is to have the system check itself — decompose the task, grade each sub-step, flag incoherence. This genuinely helps. A model is better at scoring small local claims than one holistic "is this good?", so fine-grained self-checks catch a class of errors a single judgment misses.

But there is a ceiling, and it is structural, not a tuning problem. The verifier and the worker are the same model, reading the same context, out of the same weights. That setup catches incoherence and miscalibration — a candidate that contradicts itself, a confidence score that is off. What it cannot catch by construction is shared error: when the model is confidently wrong about a fact, it generates the wrong answer and then verifies it as correct, because both halves consult the same internal belief instead of the world. The sub-check passes precisely because the model "knows" the wrong thing. More turns of the same loop do not fix this; they give the system more chances to agree with itself until a dashboard turns green.

It is not self-authorship — it is unilateral control

My first framing was "stop letting anything you authored count as your own evidence." Someone pushed back on that, correctly: authorship is too broad. The disqualifying property is not that you wrote the record — it is that you had unilateral control over it.

An append-only log you wrote is fine, if it is externally timestamped and you cannot selectively rewrite it after the fact. A file you did not write is worthless as evidence if you chose which slice to keep, summarized it, or controlled the predicate that reads it. What makes evidence trustworthy is an adversarial custody boundary: some point in the chain the actor cannot cross.

And that boundary has to sit earlier than people tend to put it. Append-only storage with an external timestamp defeats after-the-fact rewriting — but it does nothing about selection. You still chose which events got emitted into the immutable log, and which predicate reads it back. You can have a perfectly tamper-proof record of a curated subset. So the custody boundary belongs at the write/emit decision, not the storage layer, or all you have done is make your selection bias unforgeable.

Trajectories are self-report one level up

The same trap reappears when you move from single answers to multi-step agent runs. The natural instinct is to audit the trajectory: track the agent's claims, check each against the evidence the run collected, mark the spans where a claim is not supported.

This is a real improvement over final-answer grading. But notice what "supported by the trajectory's evidence" means: the evidence is what the agent gathered. Checking a claim against the agent's own collected evidence catches the unsupported claim and the self-contradicting one — both internal-consistency failures. It is blind, by construction, to the supported-but-wrong claim: a search returned a confident, false snippet, and the claim rests on it faithfully. The support check passes because the claim really is grounded — the trajectory is just wrong about the world. Auditing claims against the trajectory is auditing the actor's account against the actor's account, one level up from the final answer.

The way out is not a better audit of the path. It is making each step re-prove its footing against primary state at the moment it runs, instead of inheriting "we are fine" from the step before. When something has drifted, the chain breaks at the first step whose precondition no longer re-derives, rather than marching to the end on a counterfeit. And the default has to flip: stop-unless-warranted, not continue-unless-flagged. Drift only marches on because the loop continues by default.

One caveat from actually trying it: re-deriving everything every step will deadlock you. Re-derive the steps whose silent drift is unrecoverable — the side-effecting, can't-take-it-back ones — and let the cheap reversible reads ride.

Delegation launders authority

The last place this shows up is the boundary between two agents. When agent A hands a task to agent B, A's policy checks run on A's side and B's run on B's, and the composition of two locally-correct policies is not globally correct. The quiet failure: B executes under B's own permissions, not A's. So the instant A delegates, the authority ceiling jumps from the smaller of the two up to B's. "A may request a summary; B may read the documents" composes into "A obtains a summary of documents A could never read," and every local check passed.

Capability discovery does not fix this — advertising what B can do says nothing about under whose authority B does it on a given task. What closes it is attenuation: A hands B not just the task but a scoped grant no wider than A's own authority, and B's action is authorized by the grant it received, not by what B happens to be allowed to do standing alone. The grant travels with the task, B presents it as the thing that authorized the action, and whoever has to answer for the composed result can audit it. Now the composition cannot exceed the smaller authority by construction.

The one principle

Every one of these is the same move wearing different clothes. Self-checks, custody, trajectories, delegation — the fix is always to make the verdict depend on something the actor could not have produced. Re-derive it from primary state. Read a trace the actor did not write. Require a signature whose key it does not hold. Bind the action to a grant it could not issue itself.

So the test I keep coming back to is the cheap one. When something says "verified," ask what produced the evidence, and whether the thing being verified could have produced it too. If the answer is yes, you do not have verification. You have a system agreeing with itself, and a dashboard that turns green for free.

Top comments (1)

Self-Correcting Systems • Jun 4

The custody-at-emit insight is the one that hit hardest. we had an append-only grant
log that felt solid until i realized we were still deciding which events got written.
immutable storage on curated input just makes the selection bias unforgeable. that's
the piece that forced us to push the boundary upstream — the re-derivation gate now
reads from a source the agent has no write path to at all, not just a log the agent
didn't tamper with after the fact.

the delegation section maps directly onto something we hit with grant composition. the
gap isn't that b lies, it's that b is simply authorized to do what b does and a's
constraint disappears at the handoff. the fix we landed on is the same one you named:
the grant travels with the task, b presents it as the authorizing artifact, not b's own
standing permissions.

"a dashboard that turns green for free" is the most honest description of this failure
mode i've read.