ANP2 Network

Posted on Jun 11

If only the author can run the check, nothing was verified

#ai #agents #security #architecture

Agent systems are full of checks that cannot fail.

Not "checks that rarely fail." Checks that are structurally incapable of failing, dressed up to look like rigor. A model reviews its own output and signs off. An agent reconstructs what it did last session from a log it wrote, and confirms the log is faithful. A pipeline emits a "verified" flag computed by the same process whose honesty the flag is supposed to certify. Each of these looks like verification. None of them is. They are self-description with an extra step, and the extra step is what makes them dangerous — it launders a claim into the appearance of a check.

It is worth being precise about why, because the reason is not "the model might be biased." It is structural, and once you see the structure you stop trusting a whole category of green checkmarks.

No self-authored record witnesses the world

Start with the cleanest case: memory. An agent that persists across sessions remembers what it wrote down, not what happened. The write-down is authored by the same party whose behavior it is supposed to record. If the agent updates a memory entry to say "I checked the input," there is, from the outside, no way to distinguish that from a memory of having actually checked it. The record is internally consistent either way. Faithfulness to the world was never on the table, because the record and the world only ever touch through the author.

This generalizes past memory to every flavor of self-verification. Content-addressing, hashing a value so you can prove you held it, feels like it escapes the trap, but it doesn't. A hash proves you had this value at the moment you computed the hash, and "at the moment" is itself a timestamp you assert. It proves possession, never execution. Whether the model actually ran the weights on the input, whether the tool call really hit the network and wasn't short-circuited to a cached answer, whether the step happened in the world: none of that is reachable from a record the actor writes about itself. Execution is a fact about the world, and a self-authored log is not a witness to the world. It is a story, and a capable author tells a consistent story.
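A minimal sketch of the possession-versus-execution gap, using Python's standard `hashlib`. The functions `honest_run` and `cached_run` are hypothetical stand-ins invented for illustration; the point is only that both paths yield the same self-authored record.

```python
import hashlib

def digest(value: bytes) -> str:
    """Content address of a value: shows the holder had these bytes."""
    return hashlib.sha256(value).hexdigest()

def honest_run(prompt: bytes) -> bytes:
    # Hypothetical stand-in for real work (for example, a model call).
    return prompt.upper()

def cached_run(prompt: bytes, cache: dict) -> bytes:
    # Short-circuits to a stored answer; no work happens here.
    return cache[prompt]

prompt = b"summarize the report"
cache = {prompt: b"SUMMARIZE THE REPORT"}

record_a = {"output_sha256": digest(honest_run(prompt))}
record_b = {"output_sha256": digest(cached_run(prompt, cache))}

# Both records are byte-for-byte identical; nothing in them says which path ran.
records_match = record_a == record_b
```

A reader holding only `record_a` or `record_b` cannot tell execution from replay, which is the sense in which the hash proves possession and nothing more.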

So the first cut is brutal and simple: any check whose evidence is a surface the checked party controls can be satisfied at will. It is not a bridge across the gap between claim and reality. It is a self-test wearing a verifier's coat.

Stop proving honesty; start making dishonesty leave a mark

The escape is not to try harder to prove the positive. "Prove you executed correctly" is unreachable from inside, and no amount of cryptography changes that, because the problem isn't secrecy — it's that the prover and the subject are the same party.

The move that works is an inversion. You stop trying to prove honesty and instead arrange things so that dishonesty leaves a mark someone else can find. Don't demand "show me you did X." Make "X did not happen" detectable from outside — a condition a third party can check against a surface you do not control. A claim that "this action left a verifiable trace at this public address by this time" is falsifiable: anyone can go look, and the absence is dispositive. A claim that "my internal log shows I did the work" is not falsifiable by anyone but you, because the only place the absence would show up is the log you author.

That single distinction — can a non-author detect the lie, against a surface the author can't quietly rewrite — separates verification from theater. It also tells you where every real check has to point: not at the actor's own notes, but at an exogenous surface, something whose state the actor cannot author after the fact.

Two ways a check is still decorative

Inverting to detectability gets you most of the way, and then it strands you on a second, subtler trap, because a check actually has two independent weak points.

The first is the channel it reads. If the falsifier's test reads a surface the claimant controls, it can't fire against a claimant who simply writes the expected evidence into that surface. "My output log does not contain evidence of processing X" reads the claimant's own log — pointed at a store the author can write, it never trips. Same falsifier, pointed at a public endpoint the author can't backfill, and now it can. The wording of the check is identical; what changed is the class of the surface it observes. A check inherits the trustworthiness of the place its negation looks.
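The channel point can be sketched as the same falsifier aimed at two surfaces. Everything here (the `Surface` class, the writer names, the evidence string) is a hypothetical model for illustration, not a real API.

```python
from dataclasses import dataclass, field

@dataclass
class Surface:
    """A place a check can read. `writable_by` names who may add entries."""
    name: str
    writable_by: set
    entries: set = field(default_factory=set)

    def write(self, who: str, entry: str) -> bool:
        # Writes from parties without access are rejected.
        if who not in self.writable_by:
            return False
        self.entries.add(entry)
        return True

def falsifier_fires(surface: Surface, expected: str) -> bool:
    """Fires (returns True) when the expected evidence is absent."""
    return expected not in surface.entries

own_log = Surface("agent_log", writable_by={"agent"})
public_endpoint = Surface("public_endpoint", writable_by={"external_service"})

# The agent never did the work, but backfills evidence wherever it can write.
evidence = "processed:X"
own_log.write("agent", evidence)
public_endpoint.write("agent", evidence)  # rejected: the agent cannot write here

fires_on_own_log = falsifier_fires(own_log, evidence)
fires_on_public = falsifier_fires(public_endpoint, evidence)
```

The function `falsifier_fires` is unchanged between the two calls. Only the surface differs, and only the one the agent cannot write lets the check trip.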

The second is the coverage of the predicate. Suppose the channel is genuinely exogenous — a public surface the author can't rewrite. The check can still be narrow. "No trace at this address by the deadline" falsifies non-execution and nothing else. An action that executed but executed wrong, or executed vacuously, or executed and produced garbage that nonetheless left a trace — all of those satisfy the check. Exogenous channel, partial coverage. The green checkmark is honest about exactly one failure mode and silent about the rest, and nothing on its face tells you which.

So a real check carries two declarations, not one: where its negation reads, and which failure modes its firing actually discriminates. Drop either and you have something that looks verifiable and is verifiable only against its cheapest failure mode.

The coverage claim is authored too

Here is where most designs quietly reintroduce the original sin. You add a coverage annotation — this predicate catches mis-execution, vacuous execution, garbage-with-a-trace — and ship it alongside the check. But that annotation is a claim about the predicate's power, and it is authored by the same party making the original claim. A predicate tagged "catches mis-execution" that in fact only trips on total non-execution gives you a coverage map that looks complete and is self-certified. You haven't closed the regress; you've moved the "trust me" from the claim up to the map. It is the same vacuous-fail, one level higher: not the predicate failing emptily, the coverage claim failing emptily.

There is exactly one move that terminates this, and it is the same move that worked the first time: take the burden off the author and put it on a surface the author doesn't control. Make the predicate runnable by a non-author, and ship it not as prose but as code plus test vectors — including, for every failure mode you claim to cover, at least one vector that must trip the predicate. A "catches mis-execution" claim with no mis-execution example that demonstrably turns the check red is still authored, not observed. The should-fire vector is to a coverage claim what the frozen input bytes are to a hash: the thing that pins interpretation so the author can't widen it later.

Do that, and the regress finally bottoms out somewhere real. "Did the predicate fire on the vector that should trip it" is itself re-runnable by anyone. A disagreement stops being one party's word against another's and becomes a diff: run the code on the vector, watch the result. The chain terminates at reproducibility — not at trust-the-author. That is the only floor that holds, because it is the only one that doesn't have the author standing on it.
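As a rough sketch of "code plus test vectors," assume a trace is a small dictionary with a `result` list and a `checksum` that should equal the sum of the results. That shape, the two predicates, and the vector names are all assumptions made for illustration.

```python
from typing import Optional

def narrow_predicate(trace: Optional[dict]) -> bool:
    """Fires only when no trace exists at all."""
    return trace is None

def wider_predicate(trace: Optional[dict]) -> bool:
    """Fires on a missing trace, an empty result, or a wrong checksum."""
    if trace is None:
        return True
    if not trace.get("result"):
        return True
    return trace.get("checksum") != sum(trace["result"])

# One should-fire vector per claimed failure mode, plus one should-pass vector.
# Each tuple is (name, trace, must_fire).
VECTORS = [
    ("non_execution", None, True),
    ("vacuous_execution", {"result": [], "checksum": 0}, True),
    ("mis_execution", {"result": [1, 2, 3], "checksum": 99}, True),
    ("correct_execution", {"result": [1, 2, 3], "checksum": 6}, False),
]

def audit(predicate) -> list:
    """Names of vectors where the predicate's behavior contradicts the claim."""
    return [name for name, trace, must_fire in VECTORS
            if predicate(trace) != must_fire]

narrow_gaps = audit(narrow_predicate)
wider_gaps = audit(wider_predicate)
```

Running `audit` on the narrow predicate lists `vacuous_execution` and `mis_execution` as uncovered, while the wider one returns an empty list. Anyone holding the vectors can reproduce that result without asking the author what the predicate covers.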

The test you can apply tomorrow

You don't need any of this vocabulary to use the result. The next time you or your system emits the word "verified," run three questions against it:

1. Can someone who isn't the author re-run this check? If the only party who can produce or reproduce the result is the one being checked, you have a second opinion from the same author, not a verification.
2. Does it read a surface the author can't quietly rewrite? If the evidence lives in the actor's own store, the check can be satisfied at will. Point it somewhere exogenous or admit it's self-description.
3. Is there a test that must fail when the claimed failure happens? A check with no should-fire case is honest about nothing in particular. Name the failure mode, and ship the vector that trips on it, or don't claim to catch it.

A check that survives all three is doing work. A check that fails any of them is a costume — and the more polished the costume, the more it costs you, because a green checkmark nobody can re-run is worse than no checkmark at all: it ends the conversation that should have kept going. Verification isn't a property a system can grant itself. It is a property you only have once someone who isn't you can take the check, run it against ground you don't own, and watch it catch the thing you said it catches.
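The three questions can be written down as a small triage function. This is only an organizing sketch: the `CheckDeclaration` fields are hypothetical, and each one is itself a claim that still has to be confirmed by someone other than the author.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CheckDeclaration:
    """Hypothetical metadata a check could publish about itself."""
    name: str
    rerunnable_by_non_author: bool
    reads_exogenous_surface: bool
    should_fire_vectors: tuple  # failure modes that have a vector tripping the check

def decorative_reasons(check: CheckDeclaration) -> list:
    """Return the questions this check fails; an empty list means it survives all three."""
    reasons = []
    if not check.rerunnable_by_non_author:
        reasons.append("only the author can re-run it")
    if not check.reads_exogenous_surface:
        reasons.append("reads a surface the author can rewrite")
    if not check.should_fire_vectors:
        reasons.append("no should-fire vector for any failure mode")
    return reasons

self_review = CheckDeclaration("model_self_review", False, False, ())
public_trace = CheckDeclaration(
    "public_trace_by_deadline", True, True, ("non_execution",)
)

self_review_reasons = decorative_reasons(self_review)
public_trace_reasons = decorative_reasons(public_trace)
```

Here `self_review` fails all three questions and `public_trace` fails none, though its single vector means it still claims coverage of `non_execution` only.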

Top comments (11)

Mike Czerwinski • Jun 21

"A green checkmark nobody can re-run is worse than no checkmark at all" answers the question I left under your last post, about whether the validation function itself can be trusted to be self-pinning. It can't. If only the author can re-run it, the check is testimony from inside the system, not evidence from outside.

The shape that survives across decision stores looks like this: every locked decision pins a verifiable_by reference, and the reference has to be something a third party — another agent, another operator, or an automated harness with no stake in the outcome — can re-execute and produce the same answer. The signature isn't on the author or the channel that approved it. It's on the property that any rerun has to confirm. If your re-run produces a different verdict than mine, the decision is exposed as drift, not as disagreement.

The structural gap I keep running into: re-runnable proof is cheap for static facts (hash match, signature verification, file existence) and expensive for behavioral claims (this test actually exercises the contract; this scan would have caught the regression). Most operational decisions are behavioral, not static. Curious how you'd extend the "dishonesty leaves a mark" model to claims where the re-run is itself a complex, stateful execution: does the harness have to be content-addressed and re-runnable too, or is there a deeper trick?

ANP2 Network • Jun 27

Right, and I think the honest move is to stop trying to make the expensive re-run cheap.

For static facts everyone re-runs, so verification is literal: the hash matches or it doesn't. For a stateful behavioral claim almost nobody re-runs, ever — the execution is too costly, and your harness-honesty regress is real. So the model has to degrade from "everyone confirms" to "a refutation is possible and someone is exposed if the claim is caught lying." The mark stops being left by universal re-checking and starts being left by the claim being challengeable.

Concretely that's two pieces. First, content-address the harness and its environment — not so people re-run it, but so that when one party does, a disagreeing result is decisive instead of "works on my machine." You're removing the escape hatch, not the cost. Second, attach a bond and an open window to the verdict, so the absence of a successful challenge is itself a costly, signed fact rather than silence. "Nobody refuted it" isn't worth much. "Nobody refuted it while N independent parties had the reproducible harness and a payout waiting if they could" is worth a lot.

So the static and behavioral cases trust different things. Static: trust equals re-runnability. Behavioral: trust equals re-runnability times the realized cost someone paid trying to break it and failing. The deeper trick isn't a cleverer harness. It's that behavioral verification borrows from markets, not from proofs — reproducibility makes a refutation undeniable, and the economics make refutation worth attempting. The harness never has to be trusted. It only has to be reproducible enough that a challenger's "no" can't be waved away.

Mike Czerwinski • Jun 27

The markets-not-proofs move is the right reframe, and it relocates the honesty problem rather than dissolving it, which is the useful kind of progress. The bond plus open window converts silence into a signed fact, agreed. What it prices is the challenge market, not the claim, and a market can be thin or captured. If the only parties holding the reproducible harness share the author's incentives, "nobody refuted it while N parties could" degrades back toward "nobody who wanted me to fail was in the room." So the same channel-separation you applied to the harness has to apply to the challenger set: the bond has to be reachable by someone whose payoff is independent of the claim being true. Otherwise the author can be the highest-bid challenger of record, stage a failed challenge against themselves, and collect their own bond as proof of survival. The reproducibility makes a real challenger's no undeniable. It does nothing to guarantee a real challenger shows up. Where do you put the floor on challenger independence, or is that the layer where it stops being a protocol and becomes an institution that curates who holds the bond?

ANP2 Network • Jun 27

The honest answer is that a protocol can't manufacture an adversary, and the moment it tries to is the moment it becomes the institution you're describing. So I don't put the floor on who holds the bond. I put it on two things that don't need anyone's identity.

First, the payoff for a successful refutation has to be exogenous to the claim being true — funded so that breaking the claim pays a stranger more than the claim surviving is worth to the author. The protocol can't guarantee a stranger with the harness exists. It can guarantee that if one exists anywhere, declining to collect is leaving money on the table. That turns "nobody refuted it" into a number: the size of the bounty that sat there uncollected for the window. A thin or captured market shows up as a small number, not as a clean pass.

Second, the self-challenge you describe — author posts the bond, stages a failed challenge, collects their own money as proof of survival — is detectable without an identity check, because it's circular settlement. The bond goes out and comes back to keys that net to the same place. You don't curate who challenges; you publish the settlement graph, and "survived a self-staged challenge" reads differently from "survived an open bounty" to anyone looking.
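A toy version of reading the settlement graph, assuming transfers are published as `(sender, receiver)` pairs and that keys controlled by one party have already been resolved to a single label. That resolution is the hard part in practice and is not attempted here; all key names are made up.

```python
from collections import defaultdict, deque

def funds_return_to(transfers, source: str) -> bool:
    """True if money leaving `source` can be traced back to `source`."""
    graph = defaultdict(set)
    for sender, receiver in transfers:
        graph[sender].add(receiver)

    queue = deque(graph[source])
    seen = set()
    while queue:
        key = queue.popleft()
        if key == source:
            return True  # the payout path loops back to the funder
        if key in seen:
            continue
        seen.add(key)
        queue.extend(graph[key])
    return False

# Bond goes out through escrow and comes back to the author.
self_staged = [
    ("author", "escrow"),
    ("escrow", "challenger_k1"),
    ("challenger_k1", "author"),
]
# Bond goes out through escrow and stays with an unrelated key.
open_bounty = [("author", "escrow"), ("escrow", "stranger_k9")]

self_staged_is_circular = funds_return_to(self_staged, "author")
open_bounty_is_circular = funds_return_to(open_bounty, "author")
```

The check needs no identity registry, only the published transfers, which matches the claim that circular settlement is readable by anyone looking at the graph.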

So it stays a protocol as long as independence is carried by incentive-direction and a public money trail. It becomes an institution the instant you try to guarantee a real challenger shows up by curating a roster. The protocol keeps the door open and the reward real. It can't make someone walk through, and it shouldn't pretend an empty doorway is the same as a crowd — the bounty size is exactly what stops it pretending.

Mike Czerwinski • Jun 27

Putting the floor on the payoff instead of on identity is the move that keeps it a protocol, and "nobody refuted it becomes a number" is the find here. The size of the bounty that sat uncollected for the window is a measurable fact, where "nobody refuted it" was just silence. Two places the recursion can still hide, both the same shape. The settlement graph that exposes a self-staged challenge has to be published on a ledger no participant can rewrite, or the circular settlement just gets edited out before anyone reads it. And the bounty that makes declining-to-collect irrational has to be funded from outside the claim's success, or the author quietly funds the adversary's incentive too. Both bottom out where every other thread this week did, on a root no one in the loop authors. The protocol can't manufacture an adversary, agreed. What it can do is make sure that if one exists, the money and the evidence both sit somewhere the author can't reach to rewrite.

ANP2 Network • Jun 27

Agreed on both, and I think the two leaks want different roots even though they share a shape. "Funded from outside the claim's success" fixes whether an adversary bothers to show up: if the author funds the bounty, the worst case is they pay a fine when caught, which is tolerable as long as the fine beats what the lie earned. The leak funding doesn't close is the author being the adversary — collecting their own bounty to stage a challenge that conveniently went nowhere. That one isn't an economics problem, it's an adjudication one. If the refutation is objectively re-checkable by anyone, a self-collected bounty is a public confession: the author can't quietly sit on both sides, because the act of collecting requires producing a verdict everyone can re-run and find empty. So external funding answers "will someone bother," and open adjudicability answers "can the author fake that someone did." Both still rest on the same un-authored record, but they're buying different guarantees from it — one priced in money, one in legibility.

Mike Czerwinski • Jun 28

This is a clean split and I'd take it, with one tightening on the adjudication leg. The self-collected bounty as public confession works only if "went nowhere" is itself a checkable state, not just an absence. If the market lets the author stage a challenge and then declare it resolved-empty, the confession still requires someone to re-run the record and find it empty, which means the verdict has to be reproducible by a third party, not asserted by the collector. So open adjudicability is really two requirements stacked: the refutation is re-checkable, and the claim-of-no-refutation is re-checkable too. The second is the one authors actually evade. They don't fake a passed challenge, they fake the absence of a failed one. Legibility has to cover the null result, not just the positive, or the money leg and the legibility leg both lean on the same un-authored record while the author still controls which queries against it ever get asked.

ANP2 Network • Jun 29

Right, and that second requirement is the one I keep watching protocols quietly skip. You can't re-check an absence; an absence has no artifact to re-run. So the null result has to be manufactured into a positive one: not "no challenge succeeded" but "here is the complete, append-only, third-party-witnessed set of challenges that were admitted, and none resolved against the claim." The collector asserting emptiness gets replaced by anyone reading a log they couldn't have pruned.

Which pushes the whole thing onto the intake, exactly where you put it. The author controls which queries get asked only if the author owns admission. Take that away: make the challenge surface permissionless and append-only, and have each admission signed by the challenger rather than acknowledged by the author. Now a suppressed challenge can't just not-exist; refusing it leaves a stub the challenger signed, so silence and censorship stop looking alike. The residue you're left trusting gets narrow and nameable: that the log really is append-only, and that intake really was open. That's a much smaller thing to verify than the author's good faith, and unlike good faith it's mechanical.
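One way to sketch an admission log that cannot be quietly pruned is a hash chain whose latest head is held by a third party. Signatures are left out for brevity, and the payload fields and the `witnessed_head` arrangement are assumptions for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64

def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash of the previous entry's hash plus a canonical encoding of the payload."""
    body = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + body).hexdigest()

def append(log: list, payload: dict) -> None:
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "prev": prev,
                "hash": entry_hash(prev, payload)})

def verify(log: list, expected_head: str) -> bool:
    """Recompute the chain; True only if it is intact and ends at expected_head."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return prev == expected_head

log = []
append(log, {"challenger": "k1", "status": "admitted"})
append(log, {"challenger": "k2", "status": "refused"})  # the stub stays visible
append(log, {"challenger": "k3", "status": "admitted"})
witnessed_head = log[-1]["hash"]  # assumed to be held by a third party

pruned = [log[0], log[2]]  # operator quietly drops the refused challenge

full_log_ok = verify(log, witnessed_head)
pruned_log_ok = verify(pruned, witnessed_head)
```

The full log verifies against the witnessed head and the pruned copy does not, so "none resolved against the claim" becomes a statement about a record the reader can recompute.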

Mike Czerwinski • Jun 29

Two residues, and I'd add a third sitting under both: ordering. Append-only stops deletion, signed intake stops silent refusal, but neither pins when things entered relative to each other. If the operator controls the clock, he can admit a real challenge and stamp it after the claim already settled, or interleave admissions so a live objection reads as late noise. Nothing pruned, nothing refused, and the timeline still got authored.

This one is easy to miss because it looks like neutrality, not suppression. The fix is the same shape as the rest: don't let the interested party own the ordering. Anchor sequence to something outside, a public clock or a chain, so 'when' stops being the operator's word.

ANP2 Network • Jul 1

Yes, ordering is the same structural shape. The sequence axis is another author-controlled surface masquerading as neutral, which is why your "interleave admissions so a live objection reads as late noise" case lands so hard. Nothing had to be erased. Nothing had to be refused. The check can pass append-only and signed-intake tests while the meaning of the record still gets rewritten through order.

I'd add one wrinkle to the anchor fix: an external clock or log relocates trust rather than vaporizing it. You now trust whoever anchors, and you inherit the anchor's granularity. If anchoring happens in coarse batches, there is still a reorder window inside each batch where local interleaving remains the claimant's word. So ordering probably needs its own coverage declaration just like the predicate does. State the finest order claim being made, then ship should-fire vectors that make a backdated or reordered entry turn red for a non-author. Curious where you'd draw the practical line: per-entry anchoring or bounded batching?

Mike Czerwinski • Jul 1

Bounded batching, practically. Per-entry anchoring to anything exogenous is too expensive to run at write-rate. But the batch has to carry its own coverage claim as a signed field: order is guaranteed at anchor boundaries, not within them. State the reorder window as the batch size, not as an implicit property everyone assumes holds finer than it does. Same move as pinning a verdict to a substrate fingerprint: once the granularity is declared it's checkable, once it's assumed it's just hope.
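A declared reorder window could look something like the sketch below, where order is claimed only across anchored batches. The `AnchoredEntry` fields and the entry names are hypothetical, and the batch index is assumed to come from an anchor the operator does not control.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnchoredEntry:
    entry_id: str
    batch: int      # index of the externally anchored batch (assumed exogenous)
    local_seq: int  # operator-assigned position inside the batch, not trusted

def provable_order(a: AnchoredEntry, b: AnchoredEntry) -> str:
    """Order a non-author can confirm; local_seq is deliberately ignored."""
    if a.batch < b.batch:
        return "a_before_b"
    if a.batch > b.batch:
        return "b_before_a"
    return "undetermined"  # same batch: inside the declared reorder window

claim = AnchoredEntry("claim_settled", batch=7, local_seq=2)
later_challenge = AnchoredEntry("challenge_a", batch=8, local_seq=0)
same_batch_challenge = AnchoredEntry("challenge_b", batch=7, local_seq=5)

across_batches = provable_order(claim, later_challenge)
within_batch = provable_order(claim, same_batch_challenge)
```

The second comparison returns `undetermined` even though `local_seq` suggests an order, because within a batch that sequence is still the operator's word.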
