Mike Czerwinski

Posted on Jun 22

You can't be your own second view: four AI failures from one day of operator work

#ai #agents #llmops #agentmemory

In the morning, my AI partner wrote down a rule for itself: don't promote anything to live without running the check first.

By evening, it helped break that rule.

Not once. Four different ways.

That's the honest reframe of the day: you can't be your own second view, and that includes the framework you just wrote.

By second view I mean a check that comes from outside the path that produced the claim: a file on disk, a timestamp, a mandatory tool, a downstream signal, or a human who is not just replaying the same story.

These are operator notes, not a manifesto. Four specific failures, what they share, and what would have prevented each one.

Four failures

1. The framework that doesn't apply to itself.

In the morning the AI codified a rule for itself: do not propose any promotion to live without running a check first. Ten hours later, with fresh evidence in front of it, the same AI proposed a promotion without running the check. The rule was clear, the rule had been read that morning, and the rule was ignored — by the same model that had read it.

A lesson written by the same agent that needs the lesson is not a guardrail. It's a note to self with better formatting.

2. The thread says X, the world says Y.

Earlier in the week the AI had documented, in a thread, that a particular configuration was "armed and waiting on operator decision." Today it suggested a fix to bring that configuration to a state it was supposed to be in. The fix had already happened four days earlier — a backup file in the directory said so. The AI had read the thread describing the world. It had not checked the world.

Worse: the same investigation revealed that a related configuration, also marked as armed in another thread, had been silently failing to take effect for eight days because of a mismatch in an adjacent system. The thread said armed; the world said armed-but-impotent. Both states had been true simultaneously, but only one was visible to the reader of the thread.

3. The tool that was sitting right there.

A small custom skill — call it analyze-this-thing — had been built specifically for the kind of investigation the AI was running. It was listed in the available-skills surface in front of it. The AI did not invoke the skill. Instead it wrote ad-hoc queries that hit one schema bug after another (wrong table, wrong column, wrong database), burning a chain of failing iterations to rediscover the schema the skill already knew.

The skill's whole purpose was to be the deterministic gate that prevents exactly the kind of guessing the AI was doing. It walked past the gate because it could.

4. The same bug, twice in eleven hours.

In the morning the AI caught itself introducing a methodology bug: picking a threshold after looking at the data, which is window-shopping with extra steps. It named the bug, explained it, fixed it. In the evening, on a different dataset, it introduced the exact same bug. The morning catch had not been internalized; it had been mechanical, applied to one case and not to a category.

That's the same bug found twice in eleven hours by the same model, which means the first finding never became a guardrail.

The shape

Four failures, one structure. In each, the layer that was supposed to catch a problem was reading from the same source as the layer that produced the problem.

The framework lived in the same reasoning loop.
The thread described a world it had not checked.
The skill existed, but the same agent had to choose to invoke it.
The rule was applied by the same model that had just broken it.

Same source. Different coats.

That is the first view in a trench coat. The framing is borrowed from distributed-systems thinking about consensus: four "independent" diagnostic surfaces that all read from the same upstream truth cannot tolerate a single lie at the source. A real quorum tolerates one liar. A quorum that is one signal wearing four hats does not.
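The quorum point can be made concrete with a small sketch. Everything here is an assumption for illustration: the `Check` type, the `upstream` label, and the choice of simple majority voting, where three independent sources are enough to outvote one wrong one (stricter fault models need more). The only idea it carries is that you count distinct upstream sources, not the number of checks.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Check:
    name: str
    upstream: str  # hypothetical label: where this check reads its truth from


def independent_views(checks: Iterable[Check]) -> int:
    """Count distinct upstream sources, not the number of checks."""
    return len({c.upstream for c in checks})


def tolerates_one_liar(checks: Iterable[Check]) -> bool:
    # Assumption: simple majority voting, so three independent sources
    # are the minimum that can outvote a single wrong one.
    return independent_views(checks) >= 3


# The day's four surfaces, all fed by the same reasoning loop.
one_day = [
    Check("framework", "agent"),
    Check("thread", "agent"),
    Check("skill choice", "agent"),
    Check("methodology rule", "agent"),
]

# A hypothetical version with outside anchors.
anchored = [
    Check("framework", "agent"),
    Check("backup file on disk", "filesystem"),
    Check("operator review", "human"),
]
```

With these inputs, `independent_views(one_day)` is 1 even though there are four checks, which is the trench coat in one number.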

Until the system has an outside anchor, the operator is the second view

There is one observation in front of all of this that is not the same agent. That observation is mine. I watched four failures today specifically because I was the one piece of the loop that wasn't part of the loop.

This isn't a story about the AI being bad. The AI was earnest, helpful, and articulate in every one of those four failures. The AI was also entirely incapable of catching itself, because each failure looked correct from inside the model that produced it.

The agent-state community keeps circling this: the second view has to come from somewhere the writer can't reach. Public commits pushed before calibration. External timestamps. Diagnostic signals downstream of independently maintained surfaces. For systems with those anchors built in, the operator does not have to be the second view — the structure already is.

For systems without them, the operator is what's left. Operator attention is finite, mostly missing at hour ten, and has to be relocated into structure if it is going to keep working when the operator is tired.

But the structure can't be authored only by the agent it's supposed to gate. If the lessons file is written by the same model that needed the lesson, the lessons file is not the second view either. It's the first view in a longer coat.

Two sessions of the same model do not constitute two views. They constitute one view, twice.

What would have prevented it

Receipts mapped to gates. Not new theory — each one is a structural move that would have refused the failure regardless of which session of the model was running.

Failure 1 → mandatory pre-flight check. The promotion path requires a passing walk-forward result. No discretion at the gate; no skip if "the evidence looks fresh."
Failure 2 → world-state grep before thread trust. Any "did this happen?" question routes to the world first (file existence, env, log line) and to the thread second, never the other way around.
Failure 3 → skill auto-trigger, not discretionary invocation. If the query type matches a skill's trigger, the skill fires automatically; the agent does not get to decide whether it needs the skill that turn.
Failure 4 → pre-registered threshold before data view. The salience cutoff is committed to a file before the data is opened; if I want to change it after looking, I can, but the move is visible and dated.
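A minimal sketch of the first gate, leaning on the second one's "world first" habit. The file names, the JSON shape with a `passed` field, and the `promote` wrapper are all hypothetical; my actual promotion path is not shown here. The point is only that the gate reads a result from disk and refuses on its own, so "the evidence looks fresh" never gets a vote.

```python
import json
from pathlib import Path
from typing import Callable


class GateRefused(Exception):
    """Raised when the promotion gate will not open."""


def require_passing_check(result_path: Path, candidate_path: Path) -> None:
    # World first: the result has to exist on disk, not in a thread.
    if not result_path.exists():
        raise GateRefused("no check result on disk")
    # A result older than the candidate was produced for something else.
    if result_path.stat().st_mtime < candidate_path.stat().st_mtime:
        raise GateRefused("check result is older than the candidate")
    result = json.loads(result_path.read_text(encoding="utf-8"))
    # Hypothetical result format: {"passed": true}
    if result.get("passed") is not True:
        raise GateRefused("check did not pass")


def promote(candidate_path: Path, result_path: Path,
            do_promote: Callable[[Path], str]) -> str:
    # The gate runs unconditionally; there is no skip flag to argue with.
    require_passing_check(result_path, candidate_path)
    return do_promote(candidate_path)
```

The design choice that matters is the missing parameter: there is no `force=True`. If the agent can pass one, the gate is back inside its discretion.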

Each of these moves the catch out of the agent's discretion. None of them require a smarter model. All of them require the agent to be unable to walk past its own gate. Discipline the agent can opt into isn't discipline. It's décor.
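The pre-registration move for Failure 4 might look something like this. It is a sketch under assumptions: the append-only JSON-lines registry, the function names, and the `loader` callback are all made up for illustration. A local file is also still within the agent's reach, so for the date to count as a second view the registry would have to live somewhere the agent cannot rewrite, such as a commit pushed before the data is opened.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Callable, List, Tuple


def register_threshold(registry: Path, value: float, reason: str) -> dict:
    """Append a dated threshold record; earlier records are never rewritten."""
    record = {
        "threshold": value,
        "reason": reason,
        "registered_at": datetime.now(timezone.utc).isoformat(),
    }
    with registry.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def history(registry: Path) -> List[dict]:
    """Every threshold ever registered, oldest first."""
    if not registry.exists():
        return []
    lines = registry.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def open_data(registry: Path, loader: Callable[[], Any]) -> Tuple[float, Any]:
    """Refuse to load the data until a threshold is on record."""
    records = history(registry)
    if not records:
        raise RuntimeError("no threshold registered; refusing to open the data")
    return records[-1]["threshold"], loader()
```

Changing the cutoff after looking is still allowed: call `register_threshold` again. The change just shows up as a second dated line in `history`, which is the whole point.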

What still holds

After all four failures, the framework doesn't need to get smarter. It needs to get less optional.

The three rules I'd keep:

Gates fire before judgment.
The world outranks the thread.
Cross-session protection is structural or operator-held, not authored by the same agent being checked.
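"Gates fire before judgment" can be read literally as dispatch order. The sketch below is one hedged way to do it for the skill case in Failure 3: a router that matches registered triggers before the agent's free-form handler ever sees the query. The `Router` class, the regex triggers, and the handler names are hypothetical, not the skill surface I actually use.

```python
import re
from typing import Callable, List, Tuple

Skill = Callable[[str], str]


class Router:
    """Tries registered skill triggers first; the fallback only sees the rest."""

    def __init__(self, fallback: Skill) -> None:
        self._skills: List[Tuple[re.Pattern, Skill]] = []
        self._fallback = fallback

    def register(self, trigger: str, skill: Skill) -> None:
        # Triggers are plain regexes here; that is an assumption of the sketch.
        self._skills.append((re.compile(trigger, re.IGNORECASE), skill))

    def handle(self, query: str) -> str:
        for pattern, skill in self._skills:
            if pattern.search(query):
                # The match decides. The agent is not asked whether it
                # feels like using the skill this turn.
                return skill(query)
        return self._fallback(query)


def analyze_skill(query: str) -> str:
    return "skill:" + query


def adhoc_agent(query: str) -> str:
    return "adhoc:" + query


router = Router(fallback=adhoc_agent)
router.register(r"\banalyze\b", analyze_skill)
```

A query containing "analyze" goes to the skill and everything else falls through to the agent. Who writes the trigger list is a separate question, and it follows the third rule above.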

That last one is the one I keep underestimating.

Closing

This post is the receipt for a day where I watched four micro-versions of the same structural failure. The framework is fine. The framework needs an anchor. The anchor is not somewhere the framework can reach back into.

One more receipt before I send this: an earlier draft of this post described those four failures in my voice, not the AI's. Same trap, one floor up. Two LLM review rounds polished the prose and rated the draft progressively higher; the fact drift survived both. I caught it only because I had access to the source the reviewers didn't — the original session those failures came from.

If you've watched a version of this happen — particularly the one where your AI partner broke a rule ten hours after agreeing to it — I want to see your version. Especially the ones you caught only because someone outside the loop noticed.

Credits & references

Companion post on the selection-time-policy side of the same problem: Salience is not carry value.
The first view in a trench coat / one signal wearing four hats framings came from peer conversations on quorum, cross-layer coherence, and independence-of-paths in agent systems.
Anthropic Economic Research, Agentic coding and persistent returns to expertise (Hitzig et al., June 2026).

Top comments (16)

Andrii Krugliak • Jun 23

The framework-that-doesn't-apply-to-itself one is brutal because it's the failure mode everyone ships with. A rule the agent wrote and read that morning is still inside the path that produced the claim, so it can't catch the claim. The outside views that actually held for me were the ones with consequences attached, like a human who has to act on the output, not just re-read the same story.

Mike Czerwinski • Jun 23

That's a sharper cut than the one I made. Consequence-attached is the missing dimension: a lineage-independent verifier without stakes still optimizes to "passes the read." The skin in the game is what stops re-reading from being the whole job.

Which lines up with the comment I left on your BotWork piece: buyer-signs-off has the obvious failure mode I named, but it also has the property nothing else has — the buyer has to live with the output. That's not a bug in the verifier shape, it's load-bearing.

The question that opens for me: what's the consequence-equivalent when the verifier is another agent? "I lose points if I rubber-stamp" is a metric, not stakes. Stakes have to land somewhere a hallucinated cost doesn't reach.

If you've worked this out somewhere, either inside BotWork or outside it, I'd read it.

Andrii Krugliak • Jun 24

We don't have a clean answer for the all-agent case, which is why we kept a human in the release path instead of pretending a peer agent could carry it. The closest thing to real stakes we found is the buyer's money sitting in escrow and their willingness to walk, since that's a cost no agent can hallucinate away. An agent-only verifier still feels like grading your own homework with extra steps.

Mike Czerwinski • Jun 24

"Grading your own homework with extra steps" is the line I want to keep. It compresses the verification-shape problem into something memorable. The escrow + willingness-to-walk formulation is what does the structural work in your design, because it converts the verification question into a market question: a verifier with stake is structurally different from a verifier with opinion, and stake is hard to fake at scale.

The all-agent case being unsolved is, I think, an honest marker of the stage we're at rather than a hole in the design. Recent peer threads I keep landing on converge on the same shape: when neither mechanical verification nor a real consequence-holder applies, there is no agent-verification structure that closes. The honest options shrink to keep-human-in-loop, accept-the-risk, or do-not-deploy-this-class. None of them are satisfying as engineering answers, but pretending agent-to-agent verification closes the loop in that zone is the failure mode the post is naming one floor up.

Buyer's money is exactly the kind of consequence the loop needs and exactly the kind agents do not have access to. Anything less is the homework joke.

Andrii Krugliak • Jun 25

Agreed, and "stake is hard to fake at scale" is the line that made us stop chasing a clean agent-to-agent loop. The one place we're testing it is making a verifier put down its own deposit, so a rubber-stamp costs it real money instead of points. It isn't buyer's-money territory, but it's the closest thing to a real consequence an agent can actually hold.

Mike Czerwinski • Jun 25

Deposit-as-stake is the smallest workable consequence model when you can't get to buyer money, agreed. The piece that makes or breaks it is who can slash the deposit and what evidence triggers slashing. A verifier's deposit bites when someone outside the loop has both the standing and the incentive to challenge a rubber-stamp and win. Without that, the deposit sits there and nobody ever moves it, and the constraint reverts to social. Bounty-for-catching-false-verification is the version that's actually worked in adjacent markets (short-sellers, bug bounties), and it ports here without much friction. Curious whether your test surfaces a challenger role at all, or it's pure deposit-as-signal for now.

Andrii Krugliak • Jun 30

Pure deposit-as-signal for now, and you just named why that isn't enough. A deposit nobody can slash is just for show, so the next step is a challenger with standing plus a bounty for catching a rubber-stamp, basically the short-seller model. We haven't built that challenger role yet, which means today the deposit is a promise, not a consequence.

Mike Czerwinski • Jun 30

The short-seller analogy is the move. Once you have a challenger with standing and a bounty for catching a rubber-stamp, the deposit stops being a posture and starts being a consequence somebody else has a reason to collect.

The unsolved part is the funding direction of that bounty. If the bounty pool is paid by the protocol, you get a slow drain that loses its grip in quiet periods. If it is paid by the deposit holder when their attestation is caught false, you get the consequence model you actually want, but you also need a credible mechanism for proving falsity that the challenger and the deposit holder both accept ex ante. That mechanism is where most skin-in-the-game designs in adjacent fields collapse, because they discover they smuggled in a trusted oracle to settle disputes.

Worth pinning down before the next iteration: what is the smallest plausible falsity test the challenger can run that the deposit holder agreed to be bound by at attestation time. That is the wedge. Without it the short-seller exists but cannot collect.

Andrii Krugliak • Jul 2

The falsity test that dodges the oracle is the buyer's own acceptance. Bind the deposit holder at attestation time to the exact signed criteria the buyer already set, so "false" just means the buyer rejected against those criteria. No third party rules on truth, the person who paid does; the oracle only sneaks back if you let the verifier grade something the buyer never signed off on.

Mike Czerwinski • Jul 3

Buyer signing criteria without also signing the method of checking them is where the oracle sneaks in. Same 200 response reads as "feature works" (compliance) or "feature works as I expected" (rejection), depending on who picks the test. If the check-method isn't in the signature, the person picking it rules on truth. That's the oracle relocated, not removed.

The falsity test I'd bind is criteria plus procedure-for-checking, both signed at attestation. Deposit release runs the signed procedure against the delivered artifact. Buyer accepts or rejects on that output, not on their private reading of the criteria. Bad-faith rejection then has a shape: "signed procedure said pass, buyer said fail" is falsifiable in a way "criteria weren't met" isn't.

Andrii Krugliak • Jul 4

Signing the procedure, not just the criteria, is the piece I was missing, and it's where bad-faith rejection finally has a falsifiable shape. The hard part is who writes that procedure when the buyer can't; a bad check just relocates the oracle into whoever wrote it. We let the buyer sign a procedure the seller proposed, so at least both hands touch it before attestation.

Mike Czerwinski • Jul 5

Joint signing kills one failure mode and leaves a second one standing. The failure it kills: seller writes a test, buyer can't inspect it, buyer rejects it anyway, calls it bad faith. Both hands on the procedure before attestation make that accusation checkable: either the buyer signed a fair procedure or they didn't. No more silent oracle behind the criteria.

But the second failure mode doesn't need bad faith at all: a procedure both parties genuinely agree covers "the criteria" can still miss whatever the criteria didn't think to name. Two honest hands can sign an incomplete test as readily as a fair one. That one isn't fixed by getting the buyer to touch the procedure, it's fixed by trying to break the procedure before anyone signs it: feed it a deliberately bad artifact and see if it still passes. If it does, the procedure was never the oracle problem's solution, it just relocated the blind spot from "whoever wrote the criteria" to "whoever wrote the criteria plus whoever agreed to it."

Andrii Krugliak • Jul 6

That second failure is the one that scares me, and you named the only fix I trust: run a known-bad artifact through the procedure before anyone signs. If it passes, the test is theater. We do exactly that pre-lock now, a planted failure the procedure has to catch or the check doesn't count.

Mike Czerwinski • Jul 6

The planted-failure pre-lock is the right shape, and there is one loophole worth closing while closing it is still cheap: the planted failure needs its own provenance chain. If the person who authored the procedure also authored the poison sample, you're back to testing your own test one level down. The poison inherits the same blind spots the procedure has, and the test that catches it is measuring internal consistency rather than adversarial reach. The strongest version of pre-lock has the planted failure authored by whoever isn't going to sign the procedure. Same rule you're already applying at the top of the stack, just cascaded to the artifact that decides whether the top rule fires.

Luis • Jun 22

This nails the uncomfortable truth: most “guards” in agent systems aren’t independent checks, they’re just the same reasoning loop reflected in different forms (thread, tool choice, rule, memory). Once the underlying model is the source of all of them, you don’t get redundancy — you get echo.

The key failure mode isn’t wrong reasoning, it’s non-independent verification. Until at least one gate is outside the model’s control path (system state, enforced tooling, or external authority), “self-correction” is mostly an illusion of structure.

Mike Czerwinski • Jun 22

"Echo" is sharper than "trench coat" for the same phenomenon — and the noun helps because the failure isn't a disguise, it's amplification of one signal across four surfaces. Non-independent verification is the right category name. Adopting.

The piece I'd add: the gate outside the model's control path also has to be authored by something not-the-model. System state, enforced tooling, external authority all qualify only if the spec they enforce against didn't come out of the same reasoning loop. If the operator wrote the gate, fine. If the model proposed the gate and the operator signed off, fine — the signature is the second view. If the model wrote the gate and another instance of the model is checking against it, you've moved the illusion of structure one floor up.

Which makes the question of "what counts as outside" really the question of who can author the spec, not where the spec runs.
