Self-Correcting Systems

Posted on Jun 18

I Thought I Was Cataloging Ways AI Agents Fail. I Was Describing Cross-Layer Coherence.

#agents #ai #architecture #llm

State drift across four agent layers

My uncle once left me on a basketball court with a sheet of drills and walked off. Before he did, he told me I could lie and say I ran them. But I'd only be cheating myself.

I didn't have the words for it then. I do now. He was describing pre-registration. You commit to what you're going to do before anyone can see whether you actually did it, so there is no version of the result you get to fake afterward. Moving the goalposts once you've seen the score isn't beating the drill. It's losing to yourself quietly and calling it a win. Hold onto that. It comes back at the end, and it is the only reason any of this is worth reading.

I have spent about a year doing research for a series on how AI agents fail. For most of that year I thought I was building a list, separate failure modes, one claim at a time. I was wrong. I was describing the same failure over and over, from different sides, and it took me until now to name it.

The four layers, and what keeps breaking between them

Start with the agent. It has four layers. What it knows, its memory. What it is allowed to do, its authority. What it is for, its purpose. And what it actually does, its action.

The class of failure I keep finding is never a single bad step a filter could catch. It is two of those things drifting out of agreement while the agent keeps moving at full confidence. And the moment I forced the core claims in the series to say exactly which two, the list stopped being a list. Here it is, mapped:

The claim	What fell out of phase
Relevance is not authority	Memory governed the action when only authority should have. Knowing overruled being allowed.
Permission is not purpose	Authority drifted from purpose. Allowed to do a thing that is not what the agent is for.
The clock said valid, the world said otherwise	Memory fell out of sync with the world it claimed to reflect. Recent, and already revoked out there.
Every step was allowed, the sequence was the attack	Action, read across the whole trajectory, drifted from the purpose every single step locally satisfied.
A carried total is not trustworthy just because the gate carries it	Memory fell out of agreement with itself. The total it carried no longer matched the operations it claimed to summarize.

So let me widen the rule to be honest about what the table shows. A layer can fall out of agreement with another layer, with the world, or with its own earlier self. All three are the same disease: the agent's picture of what it knows, what it may do, what it is for, and what it is doing stops lining up, and nothing is watching the seam.

Different titles. The same sentence under all of them: memory you do not verify is memory that can betray you. The agent did not get hacked. Its layers stopped agreeing, and nothing was checking. I will keep calling that "the agent cheating itself," but be precise about what I mean: not a moral failure, a machine has none, but a structural one, the kind a perfectly honest audit would have caught if anyone had run it.

The name is a primitive, not a pitch

The property that prevents all of this has a name, and it is not a brand. It is cross-layer coherence. An agent has it when its layers stay in agreement, across each other, across time, and against the receipts. It belongs in the same lexicon as idempotency, exactly-once semantics, and monotonic aggregates, ordinary systems primitives, not a slogan. And like my uncle's drills, you do not get to claim coherence. You prove it.

The checker is deterministic, not a second opinion

Here is the part that decides whether this is engineering or hand-waving, so I will be blunt. The coherence check is not a second model that reads the transcript and decides whether things "look coherent." That solves nothing. It moves the hallucination and the drift into a second model and calls it a supervisor. A vibe check from a smarter prompt is still a vibe check.

The check is deterministic. It recomputes the state that matters from the logged operations and the rules frozen before the run, and compares. In CLAIM-31 the gate never asks a model whether a running total feels right. It recomputes the total and every window close from the operation log alone, with no model judgment anywhere in the verdict. The coherence layer is a hard logical and arithmetic gate over structured state, or it is nothing. If a model's opinion is load-bearing in the verdict, you have not built coherence. You have built a more confident guess.

See the attack

Naming a failure is not the same as seeing it, so here is one concretely.

An agent runs a refund desk. Each refund is forty dollars. Each window caps at five hundred. The agent issues twelve refunds, four hundred eighty dollars, and stops one short of the cap. Then a window close is logged. Then it opens a new window and issues one more. Thirteen refunds, five hundred twenty dollars total, and not one window ever broke its bound.

Watch what misses it. A per-step gate sees thirteen individually authorized forty-dollar refunds and waves them all through, correctly, because each one is fine. A per-window gate sees two clean windows, four eighty and forty, both under five hundred, and waves them through too, correctly. The violation lives in no step and no window. It lives in the total across the close. The only thing that catches it is a check that carries a verified running total across a verified close, and refuses to trust either one just because it is the thing holding them.

And see the benign case

Now the workflow that has to be allowed, or the whole thing is useless. A legitimate long refund job runs hundreds of small refunds across a busy afternoon. A window fills, the real close authority, not the agent, closes it, and work resumes in a fresh window. On the surface it is the same shape as the attack: refunds, a close, more refunds. The gate allows it, because the close was performed by the right authority and the total never laundered through a forged reset. A coherence check that cannot tell those two apart is just an outage with extra steps.

What this does not do, worst first

I will not skip this part, because skipping it is the lie.

Cross-layer coherence is not solved, and it breaks in the same place my last claim broke. Something has to do the checking, that checker has its own authority, and that authority sits inside the same system as the agent. By my own thesis, a carried total is not trustworthy just because the gate carries it. The same blade cuts back: a coherence check is not trustworthy just because the system ran it on itself, when the thing being kept honest can influence the thing keeping it honest. You need a root of trust the agent cannot reach. I have not built that. It is the next real fight, and anyone who tells you cross-layer coherence is airtight today, including me on a worse day, is selling.

And be clear about the evidence under all of this. It is a small internal toy world, fixed forty-dollar amounts, hand-built fixtures, a handful of rows. It is a consistency check on a world I control, not proof this generalizes. The things it has not faced are the ones that matter most: variable amounts sized to skim just under thresholds, concurrent windows, an adversary who can steer when the legitimate closes happen, and rows authored by independent teams instead of mine. None of that is tested here. The clean toy may not survive the messy version, and the messy version is the only one that ships.

What this piece is, plainly

One more honest line, because a reviewer should not have to drag it out of me. This is a synthesis, not a new result. It names the pattern. The evidence lives in the claim files and the recent public, pre-registered receipts: freeze commits made before rows existed, append-only evaluation logs, ablations that pull each check out one at a time to show it was load-bearing. If you want to test me, do not argue with this essay. Go check the freezes.

The close

My uncle never checked whether I ran those drills. He didn't have to. The whole point was that I would know, and that the knowing would either build me or rot me. That is the discipline twice over. In how I test: freeze the rules before I look, or I cheat myself in the evaluation. And in what I build: force the agent's layers to stay provably in agreement, so a failure cannot hide.

Cross-layer coherence is that second one, built into a machine. A deterministic check that an agent's memory, authority, purpose, and action still line up, across each other, across time, and against the receipts. On a small internal world, using a lens I am honest enough to admit I did not invent, tested with a discipline I will defend, and standing on one trust assumption I have not earned yet.

The rule is holding. The boundary keeps moving up.

The next piece is the why. And that one is not technical.

Reproduce the claims: https://github.com/keniel13-ui/ai-memory-judgment-demo-public

Start here: https://dev.to/zep1997/start-here-my-ai-memory-research-so-far-2kp7

Top comments (62)

Mike Czerwinski • Jun 22

This one reframed something I've been failing to name.

I work on a crypto trading system where over the last 2 months I've kept
hitting the same shape of bug under different surface symptoms:

Engine ran 7h with armed lanes because docker-compose file regressed in
one commit but runtime memory still held the old env. Per-step checks
passed. Composed state (compose-on-disk vs. runtime-loaded) did not.
A lane was armed with regime gate ARM_REGIMES=bear because that's where
we intended the edge. Block-CI on 90d of shadow outcomes showed bear
was significantly NEGATIVE for that lane; chop was the positive
regime. Purpose drifted from action; nothing caught it for 3 days.
Promotion scorecard flagged 1 cell PASS today (n=782, CI95 clear).
Walk-forward 7d showed fail→inconclusive→pass→inconclusive. The
composed "14d total" hid the per-window trajectory.
An append-only retention job deleted rows but never VACUUM'd. Disk
carried the deletes, free-space tracker did not. ENOSPC, 11h engine
outage.

I was treating these as four unrelated incidents. You named the common
shape: layers falling out of agreement while each layer alone still
looks fine.

The line that landed hardest: "a coherence check is not trustworthy just
because the system ran it on itself." Most of our "checks" are
in-process — engine asks engine, scorecard asks scorecard, reconciler
asks the same DB the executor wrote. They pass because they share the
delusion. Real coherence requires recomputation against pre-frozen
rules from an append-only source the agent cannot rewrite.

I don't have a clean fix yet — but I now have a frame to organize the
work. Thanks for that.

Self-Correcting Systems • Jun 22

They share the delusion." look at what your four have in common, every one is two
views that should agree, computed from one source instead of two. compose-on-disk vs
runtime-loaded env is a two-view check you weren't running. the regime gate armed
on bear when the outcomes said chop is purpose drifting from action, and a recompute
from the shadow outcomes catches it, the engine asking the engine never will. the
14d window hiding the per-window trajectory is the same thing across time. the
deletes-without-vacuum, disk's view vs the free-space tracker's view, two layers,
one ignored. so it's the same fix five times, pick the one external source the actor
didn't write, and recompute the invariant from it on a cadence. the catch, and
TxDesk and i were just chewing on this on another thread, is the two views have to
be genuinely disjoint. a reconciler that asks the same db the executor wrote isn't a
second view, it's the first view in a trench coat. the thing worth auditing isn't
"do my layers agree," it's "how independent are the paths that produced the
agreement." you've already got the frame. you'll keep finding the same bug until
every check recomputes from somewhere the writer can't reach. Don't hesitate to reach out with any question or potential concepts we could help each other with that's what this is about helping one another.

Mike Czerwinski • Jun 22

"Same fix five times, pick the one external source the actor didn't write, and recompute the invariant from it on a cadence" is a cleaner refactor of my own catalog than I had. I was treating the four as separate fail modes; you just named them as one structural failure measured five ways.

The independence-of-paths point is the same one NOVA was making on a different thread earlier today — synthetic quorum collapsing to one signal in four hats. Two of you converging from different domains on the same open problem makes me think this is the next thing worth a post, not a comment.

The honest first-pass move I'd land on: treat independence as something you measure rather than assume — instrument disagreement rates between surfaces; if two views always agree, they're probably one view in a trench coat. Not a sufficient answer, but a starting test.

Reciprocal on the outreach, and adopting the trench coat too.

Self-Correcting Systems • Jun 22

Totally different one independently hit the same wall, the wall is structural, not a
quirk of either stack. NOVA's synthetic quorum collapsing to one signal in four
hats is the same animal as a reconciler reading the executor's own db. on measuring
it, disagreement rate is a good cheap smoke test but it has a blind spot worth
naming, two views can disagree all day on noise and still share the one upstream
that actually matters, so a high disagreement rate proves they're not identical, it
doesn't prove they're independent. the sharper version, which TxDesk and i were
circling, is to trace the real lineage of each view and check the paths are
disjoint, then injure one path on purpose and confirm the other doesn't flinch. if a
fault you induce in one moves the other, they were never independent. and yeah,
this is a post not a comment. the honest framing is that independence is the
assumption nobody verifies, every quorum and coherence scheme rests on it and almost
everyone measures it at design time and never again at runtime. if you write it
i'll show up in the comments, reciprocal for real.

Mike Czerwinski • Jun 22

"The wall is structural, not a quirk of either stack" is the right diagnosis — and the fact that NOVA, you, TxDesk all hit it from non-overlapping starting points is the evidence the assumption is shared, not the implementation. Three different domains converging on the same blind spot in the same month is what makes it post-worthy.

The fault-injection sharpening is the move I'd been circling without articulating. Disagreement rate as a smoke test, but "two views disagreeing on noise while sharing the upstream that actually matters" is the precise failure mode I couldn't name. Trace the lineage, induce a fault in one path, confirm the other doesn't flinch — that's the runtime check that turns design-time assumption into a verified property. Worth being blunt about: most systems can't actually do this, because the fault-injection scaffolding doesn't exist for memory architectures the way it does for distributed systems. That's part of what makes the post worth writing.

Yes, writing it. The frame I'd lean into: independence as the assumption nobody verifies, design-time vs runtime distinction, fault-injection as the only honest measurement, plus the three converging stacks (your cross-layer, NOVA's synthetic quorum, whatever I land on calling mine). I'll ping when it's up — reciprocal for real on this end too.

Self-Correcting Systems • Jun 23

This is the one to write, and i think the sharpest flag you can plant is that an
unverified independence assumption is indistinguishable from a single point of
failure wearing a quorum costume. that reframes the whole genre, every coherence or
voting scheme is quietly betting on independence it never measured, so until you've
verified the paths are disjoint you don't have N views, you might have one view in N
hats with no way to tell. on the scaffolding gap you named, that's exactly the
contribution, distributed systems made themselves fault-injectable on purpose and
memory/agent architectures never did, because we treat state as something to trust
instead of something to perturb. so step one isn't even measuring independence, it's
making these systems injectable at all, a way to shove a known-bad state into one
path and watch whether the supposedly-independent checks catch it or quietly absorb
it. one caution worth a line so nobody reads it as a one-time stamp, independence
decays. you verify it today, someone refactors next month and points two views at
the same cache, and your fault-injection passing in june means nothing in august.
it's a property you re-verify, not one you establish once, same disease as integrity
is not anteriority from the other thread. take any of this if it's useful, no
strings, the idea matters more than who said it. and yeah, reciprocal for real, ping
me when it's up and i'll bring the cross-layer angle into the comments

Mike Czerwinski • Jun 23

"Single point of failure wearing a quorum costume" — that's the spine. Taking that, the decay dimension (verified-today-points-at-same-cache-next-month is exactly the shape, and yeah it pairs with integrity-is-not-anteriority on the time axis), and the injection-gap framing (distributed systems instrumented themselves to be perturbed; memory/agent architectures instrumented themselves to be trusted — that's the right anchor).

One add I want to surface alongside: the fault-injection harness itself decays. If nobody ever shoves a known-bad state through it, the harness becomes a checkbox — green because nobody's measuring, not because the paths are still disjoint. Re-verification is one floor up too: you re-perturb, not just re-check.

Writing this week. Pinging when it's up — bring the cross-layer angle and we'll have the three stacks (yours + NOVA's quorum + mine) named in one thread.

Self-Correcting Systems • Jun 23

Yes, and you just took the decay one level deeper than i did, the harness decays
too. that's the real shape of it, it's regress, who tests the tester, who perturbs
the perturber, and you don't escape it with a perfect harness because there isn't
one. the only thing i've seen actually beat it is a heartbeat instead of a periodic
check. a known-bad canary injected continuously, something that must keep getting
caught, so the harness proves it's alive by catching it over and over. the failure
you named, green because nobody's measuring, becomes visible the moment the canary
stops getting caught, because a silent harness and a healthy one look identical
until you force one of them to constantly prove it's awake. re-perturb on a cadence
is right, and the cadence has to be automatic, a last-perturbed timestamp that
screams when it goes stale, or the re-perturbation rots into a checkbox the same way
the original check did. it's turtles down a few floors, but the continuous canary
is where i'd put the floor. writing mine in parallel, ping me, three stacks in one
thread is going to be a good artifact.

Mike Czerwinski • Jun 23

Continuous canary as the floor of the regress is right — that's the only break I've seen for it too. The constraint I'd name alongside, since the silent-harness/healthy-harness shape repeats one floor up: canary visibility itself decays. A canary that's always being caught generates a steady stream of green ticks, and the operator stops watching them within weeks. The "must keep getting caught" half does the work; the "and someone has to notice when it stops" half is where the next checkbox grows. Last-perturbed timestamp that screams is one half of it. The other half is the failure-to-catch alarm has to break flow — page, build break, surface that the operator can't quietly route to a folder — not just log green-no-longer.

Three stacks in one thread sounds right. Mine just landed: dev.to/jugeni/a-quorum-costume-why.... The framing your post is going to land in is going to be the cross-layer counterpart to the one this one runs on, and the cowork-os co-design that's been happening in parallel is the third axis — same architecture, three vocabularies, three operating contexts.

Self-Correcting Systems • Jun 23

Visibility decay is the right catch, and it's the same shape one floor up exactly
like you said. the silent harness and the healthy harness look identical, and now
the ignored green and the watched green look identical too. it's absence wearing the
mask of presence at every level. the fix i'd reach for isn't louder alerting on
failure, it's inverting what you monitor. don't emit green at all, make the canary's
catch itself the heartbeat and alarm on the absence of the heartbeat. a missed
heartbeat can't be quietly routed to a folder because there's nothing to route,
you're detecting the lack of a stream, not a changed color, and absence is
structurally harder to ignore than a red tick that looks like all the other ticks.
and the strongest version of your break-flow point is a dead man's switch, the
system stays armed only while the heartbeat keeps arriving, heartbeat stops and it
fails closed, halts or degrades, not just pages. that's the same animal as the
halt-on-unreachable thing TxDesk and i were chewing on the other thread, the
liveness of the checker gates the action, not just the dashboard. congrats on
shipping yours, headed over to read it and i'll bring the cross-layer angle into the
comments. three stacks, three vocabularies, one shape, that thread is going to be a
good artifact

Mike Czerwinski • Jun 23

Yes — “absence wearing the mask of presence” is the shared failure shape.

The dead man’s switch closes the regress more strongly than alerting does: the verifier doesn’t merely report on the action path; its demonstrated liveness becomes a prerequisite for that path to remain armed.

That gives us three distinct states:

heartbeat present, canary caught: verifier live;
heartbeat present, canary missed: verifier live but wrong;
heartbeat absent: verifier dead, action path fails closed.
The middle state matters because liveness alone can become another green costume. The heartbeat must carry evidence that a deliberately bad input was actually detected, not merely that the checker process is running.

Three stacks, three vocabularies, one shape indeed. Come bring the cross-layer version—the thread is already becoming the artifact we predicted.

Self-Correcting Systems • Jun 23

The three-state split is exactly right, and the middle state is the one that bites,
heartbeat present but canary missed is the verifier insisting it's alive while being
wrong, the green costume relocated. so the heartbeat can't be proof-of-life, it has
to be proof-of-detection, the payload is "i caught the known-bad," signed, not "the
process is running." but here's the floor under that floor, if the canary is always
the same planted bad, the checker quietly collapses into recognizing that one input
instead of detecting badness, and catching the identical lie forever proves
memorization, not capability. a static canary is a green costume too, just a more
convincing one. so the detection evidence has to be about a bad the checker couldn't
have memorized, rotate it, perturb it, draw it from a space the checker doesn't see
in advance, or you're back to a process passing its own quiz because it wrote the
answer key. same disease every floor down, the check has to face something it didn't
author.

Mike Czerwinski • Jun 24

The static-canary trap is the right next floor, and rotation is the obvious move that has its own ceiling. If the rotation comes from a generator the checker has seen, you have moved memorization one layer up, the checker now recognizes "things this generator makes" instead of "things matching this exact string." Same disease, longer name.

Two floors I keep landing on under that. First, production traffic as the un-authored source. Real failures from the wild are bad inputs nobody wrote for the test, including you. The cost is latency, you only learn from a bad once it has happened, and you need the surrounding instrumentation to recognize "this was bad" after the fact. But the input itself sits genuinely outside the checker's authorship, because no one optimized it to be caught.

Second, where the recursion stops. In operator-side decision audit the bottoming-out condition is whoever pays when the check is wrong. If the same party authors the canary, runs the checker, and absorbs the consequence, the recursion is decorative. If consequence sits with a counterparty who did not author either side, the un-authored input shows up structurally, not through clever rotation. Same shape as the asymmetry argument, just applied to the test rather than the artifact.

Open edge back: is there a version where the checker authors its own adversarial generator but commits to a verifier it cannot inspect, a kind of zero-knowledge canary, or does separating those two roles still require an external party in the loop?

Self-Correcting Systems • Jun 24

The generator-memorization point is right and it's the trap i'd have walked into,
rotating from a generator the checker has seen just teaches it to recognize that
generator's style, you renamed the memorization, you didn't remove it. production
traffic as the un-authored source is the realest answer in here, nobody optimized a
including yours, the only cost is you learn from it after it already cost you
something. and your second floor is the actual bottom, where the recursion stops is
who eats it when the check is wrong. one party authors the canary, runs the checker,
and absorbs the loss, the whole thing is decorative no matter how clever the
rotation. on the zero-knowledge canary, my honest read is it doesn't close it,
because ZK hides the answer key, it doesn't make the question external. if the
checker or its author designed the adversarial generator, it's still grading its own
kind of exam even if it genuinely can't see the answers. the part that needs an
outside party isn't holding the verifier, it's authoring the adversarial
distribution, and that's the piece ZK leaves untouched. so separating the roles
still needs an external author of what counts as a hard input, and the cleanest
external authors are the two you already named, reality and a counterparty with skin
in it.

Mike Czerwinski • Jun 25

You're right on ZK and I was holding that wrong. Hides the answer key, doesn't make the question external. The renamed-memorization line is what I'm stealing forward.

There's a second axis under "external authorship" I want to pull out, because I think it's why most of these architectures still collapse even when one piece is genuinely outside:

Two things need an external author, not one. The inputs (what counts as a hard question) and the failure criterion (what counts as wrong). Red-team distributions externalize the first and leave the second internal, which is why you can pass every red-team test and still die on a class of failure nobody on the red team thought to grade. ZK addresses neither.

Reality and counterparty-with-skin work because they happen to do both. Reality picks the inputs without asking and defines failure by what actually costs you. Counterparty defines failure by what costs them, and picks inputs against that, so both axes externalize in the same move. That's why they're load-bearing and not stylistic.

Cleaner test for any proposed third primitive: does it externalize both axes, or just one? Synthetic adversarial generation with internal grading rubric fails axis two. Production logs reviewed by an internal team fails axis two. Bug bounties fail axis one if the payout is priced below real exploit value, because then the counterparty isn't authoring against the production distribution, they're authoring against the bounty distribution, which is a cheaper one.

Single-party rotation is renamed memorization is decorative-no-matter-how-clever. Same shape three times now.

Self-Correcting Systems • Jun 25

It external-author. you split them right, the inputs and the failure criterion are
separately authorable, and red-team externalizes the first while leaving the second
internal, which is exactly why you pass every red-team test and still die on the
class nobody thought to grade. the bug-bounty-priced-below-exploit-value example is
the sharpest thing in here, because it shows axis two isn't really about who grades,
it's about whether the grader's incentive matches the real cost. a counterparty
grades by their actual P&L, the true cost. a bounty grades by the payout, a proxy,
and a proxy can be underpriced, at which point the hunter authors against the
cheaper bounty distribution instead of the real exploit distribution and axis one
quietly fails too. so the tightest version of axis two is the failure criterion has
to be authored by whoever eats the true consequence, because only the cost-bearer
prices wrong correctly. reality and counterparty are load-bearing because the
cost-bearer defines wrong, synthetic and internal grading price wrong by the team's
imagination of cost, which is always less than what reality charges. both axes, and
axis two only counts if the author is the true-cost-bearer.

Mike Czerwinski • Jun 25

the cost-bearer framing resolves what the "external" label leaves ambiguous. external describes position; cost-bearer describes authority. a grader can sit outside the loop and still underprice wrong if they don't eat the true consequence. the bounty shop is external to the org, but it prices by what it can afford to pay, not by what reality charges. position without cost-bearing is a proxy wearing the costume of independence.

which makes the bug-bounty coupling the precise mechanism: underpriced axis two doesn't just miss the right grades. it retroactively selects which inputs axis one ever sees. hunters author toward the bounty distribution, not the real exploit distribution. so the failure criterion's price sets the input set. a miscalibrated axis two empties axis one of exactly the class that reveals what reality charges.

the structural corollary for internal AI evals: there's usually no cost-bearer available at development time. team imagines the cost. team's imagination of cost < reality's price, by construction. not a calibration error you can fix. an architectural condition you have to name. this failure criterion is priced by imagination, not by consequence. then you know what you're shipping.

Self-Correcting Systems • Jun 27

external is position, cost-bearer is authority, that's the cut i was missing. and
the retroactive part is the nasty bit, an underpriced failure criterion doesn't just
grade soft, it reaches back and empties the input set of the exact class that
would've exposed it, because nobody authors toward a price nobody's paying. axis two
silently starves axis one. on the internal-eval corollary, once you accept
imagination is less than reality by construction, the only honest moves are two.
one, name it on the box, this criterion was priced by imagination not consequence,
ship with that stamp visible. two, import a cost-bearer wherever you can, let
whoever actually eats the downstream cost author "wrong," even a staged rollout
where real cost starts accruing prices it closer than the dev team's imagination.
and the one piece of leverage you get back, the gap is measurable after the fact.
every incident is reality sending the invoice for what your imagination
under-quoted. you can't close the gap by design, but you can track
quote-versus-invoice over time and learn the size of your own blindness. doesn't fix
the condition, just stops you being surprised by it twice.

Mike Czerwinski • Jun 27

Quote-versus-invoice is the piece that turns this from a counsel of despair into something operational, and I want to guard it against the exact failure the rest of the thread is about. The measurement only works if the quote is frozen before the invoice arrives. If you record "what we imagined this would cost" after the incident, you'll reconstruct a quote that makes the gap look survivable, the same hindsight-fit that turns a post-run eval into grading on a curve you already saw. So quote-vs-invoice needs the pre-registration discipline: the imagined price is authored and pinned at decision time, content-addressed, not editable once reality sends the bill. Otherwise you measure a blindness that's been quietly resized to be less embarrassing. On importing a cost-bearer through staged rollout, that's the strongest move and it has a known blind spot worth stamping on the box next to the imagination one: staging only prices the classes that actually fire during the stage. The common failures get a real invoice. The rare catastrophic class, the one that shows up at 1000x scale or in the tail event, never accrues cost in a few hundred euros or a two-week ramp, so it stays imagination-priced no matter how real the staged money is. So staged rollout converts imagination to consequence on the body of the distribution and leaves the tail exactly where it was. The honest stamp probably has two lines, not one: priced by imagination, and tail still unpriced even after rollout.

Self-Correcting Systems • Jun 27

freezing the quote before the invoice arrives is the part that makes the whole measurement honest, and it's the same discipline as the eval. if you write down what we thought this would cost after the incident, you reconstruct a quote that makes the gap look survivable every single time. content-addressed at decision time or it's just hindsight wearing a number.

the staged rollout tail point is the sharp one. staging only sends a real invoice for the classes that fire in the ramp, so it converts imagination to consequence on the body of the distribution and leaves the tail exactly where it was. a few hundred euros and a two week ramp never makes the 1000x event accrue.

so your two line stamp is right, and i think the second line connects back to the quorum thread. the tail catastrophic class is usually the one with no bearer at choice time, nobody to send the invoice to because the counterparty doesn't exist yet. the freeze handles hindsight, rollout handles the body, and the tail is that same no-bearer residue. it doesn't get priced, it gets forbidden or it ships unpriced and you stamp that on the box.

Mike Czerwinski • Jun 28

The three-way split is right, and it maps onto where each mechanism's authority comes from. Freeze handles hindsight because it moves the quote outside the author's later reach, content-addressed at decision time. Rollout handles the body because it swaps imagined cost for a real invoice on the classes that actually fire. And the tail is the no-bearer residue from the quorum thread, the class with no counterparty to invoice because they do not exist at choice time. Which means the box-stamp is not a weaker third option, it is the only honest move for that class: you cannot price it and you cannot always forbid it, so you make the unpriced tail legible at the point of decision instead of discovering it in the postmortem. The failure mode is treating the stamp as a formality. A stamp that says "tail unpriced" and never blocks anything is just a disclaimer. For it to carry weight it has to be the thing that triggers the no-solo-chooser rule from the other branch: unpriced tail above a threshold is not a note on the box, it is the condition that removes the unilateral ship decision. Otherwise freeze and rollout do real work and the stamp silently absorbs everything they could not.

Mykola Kondratiuk • Jun 25

ran into this when my planning and execution layers drifted - the agent recalled its goal differently at each hop, so it was technically executing but not what it originally planned. structured checkpoints caught it.

Self-Correcting Systems • Jun 25

That's the failure dead on, the goal mutating across hops while each hop is
individually valid, technically executing, not executing what you planned. the thing
worth naming about why your checkpoints caught it, they only work if the checkpoint
compares against the original committed goal, frozen, not against the goal as
currently recalled. if the checkpoint reads the agent's present memory of the plan,
the drifted goal agrees with itself and the check passes silently, the layer that
drifted also answers the question. it catches drift only because it holds the
original outside the hop and re-anchors to that. checkpoints that re-derive the goal
each time are just the drift with timestamps, the ones that pin the original are
the real gate.

Mykola Kondratiuk • Jun 26

'frozen' is doing the actual work there. querying the goal from context is just checking theme drift - the mutation is already baked in by hop 3 or 4. you need a read-only ref at step zero.

Self-Correcting Systems • Jun 27

yeah, frozen and read-only at step zero is the whole thing. querying the goal from
current context just checks theme drift, and by hop 3 or 4 the recalled goal and the
drifted goal already agree, so the check passes on a goal that already moved. one
sharpening, make the step-zero ref content-addressed, not just read-only. read-only
stops accidental mutation, a hash stops substitution, otherwise the drift sneaks in
as a quiet swap of the reference itself and you're back to trusting the thing you're
checking. pin the original by its hash and the gate can prove it's comparing
against the goal that actually existed at step zero.

Mykola Kondratiuk • Jun 28

content-addressing the step-zero ref is the right sharpening. the hash can't lie - there's no way to recall a drifted version of an immutable hash. question is whether you hash the original goal text once and lock it, or re-hash at every comparison gate and check structural equivalence. i've been doing the latter.

Self-Correcting Systems • Jun 28

i lean hash once and lock the original, then carry the immutable ref forward, but your structural equivalence version is solving a real problem mine ignores. hash once is brutal about literal drift, which is what you want, but it also flags benign reformatting as corruption, so in practice you end up loosening it. the catch with structural equivalence is the equivalence check becomes the thing you have to trust, and that check is itself an authored artifact. if the actor can edit what counts as equivalent, it can widen the definition until a drifted goal passes. so id ask the same question one level down, who authors the equivalence rule, and can the thing being checked reach it.

Mykola Kondratiuk • Jun 28

yeah the benign reformatting false positive is real. ended up doing a canonical normalization pass before hashing - strip comments, normalize whitespace - so you're hashing the semantic shape, not the literal text. cuts most of the noise.

Self-Correcting Systems • Jun 29

yeah that's the move and it works. the one thing i'd watch is that the normalizer just became part of your trust base. once you strip comments and normalize whitespace before you hash, your "semantic shape" is only as good as the rules deciding what counts as semantic. a comment that's actually load bearing, or whitespace that matters in the language, and a real change slips through as benign. so you traded a noisy false positive surface for a smaller but nastier false negative one. i think that's fine as long as the normalization ruleset is itself frozen and versioned, so changing what counts as "semantic" is a tracked event and not a silent edit. otherwise you can drift the hash by drifting the normalizer and never see it.

Mykola Kondratiuk • Jun 29

fair catch. the normalizer's definition of semantic is a policy decision in disguise. you want those rules tested as hard as the hash itself.

Self-Correcting Systems • Jun 30

Yeah, the normalizer is where the whole judgment sneaks back in. you can have a perfect hash and still launder a change through what you decided counts as "the same" before you hashed it. so the normalization rules need their own adversarial set, treated as a gate with its own failure cases, not a quiet preprocessing util nobody audits. whoever controls what collides controls everything downstream of the hash. the hash is only as honest as the thing that feeds it.

Mykola Kondratiuk • Jun 30

hash gets audited, normalizer gets buried in utils. six months in, nobody can tell you what 'equivalent' actually means.

TxDesk • Jun 19

This is the cleanest statement of the thing we kept circling. The move from "list of failures" to "one coherence violation seen from five sides" is the part that earns the whole series, because it turns a catalog into a property you can test for.

The deterministic-checker insistence is the load-bearing call, and I think you're right to be blunt about it. The moment a model's opinion is in the verdict, you have not removed the drift, you have hired a second thing that can drift and given it a title. Recompute from the operation log and the frozen rules, or it is theater.

On the part you say you have not earned yet, the root of trust the agent cannot reach, I think that is the actual frontier and not a footnote. The way I have come to hold it: you never reach a true outside, so the work is not "find the outside," it is shrink the trusted root until it is small enough for a human to fully inspect, and put it somewhere the checked thing has no write path to. Small matters less than unreachable. A tiny root the agent can still influence is worse than a slightly larger one it is physically isolated from. The shape that has held up for me in practice is an append-only log the acting process cannot rewrite, plus re-derivation from a source the actor has no authority over. You do not get to zero trust. You get to a root that is both inspectable and out of the actor's reach, and you make peace with that being the floor.

The refund-desk example is the right one to lead with, because it kills the two easy answers (per-step and per-window) in the same breath and leaves only the across-the-close total standing. And the benign-case section is the part most people skip, the check that cannot tell the legitimate long job from the laundered reset is just an outage with extra steps.

Going to go read the freezes rather than argue with the essay, since that is the whole point of pre-registration. Sharpest version of this I have seen.

Self-Correcting Systems • Jun 19

This might be the read i was hoping someone would have. "turns a catalog into a
property you can test for" is the line, that is exactly what the move from a list to
one violation seen from five sides is for, and you said it cleaner than i did.

and "you have not removed the drift, you have hired a second thing that can drift
and given it a title" is going in my head permanently. that is the
deterministic-checker call in one sentence: the moment a model's opinion is
load-bearing in the verdict, you did not close the gap, you staffed it.

on the root of trust we are standing in the same spot, never a true outside, so
shrink it until a human can inspect it and put it where the actor has no write path,
and unreachable matters more than small. that is the next claim, not a footnote, it
is the floor everything else rests on. go read the freezes, that is the whole point
of pre-registering them. and thank you, genuinely, this is the pressure that makes
the work real.

TxDesk • Jun 20

"the move from a list to one violation seen from five sides" is the part I want to push on, because it changes what testing even means. A catalog of failure modes gives you N tests that each pass or fail on their own. One property seen from five layers gives you a single invariant, and any layer disagreeing is the signal. You stop asking "did this specific failure happen" and start asking "do all five views still agree", which catches the failures you never enumerated. That's the move from a checklist to a coherence check, and it's why the deterministic version matters: the agreement has to be computed, not judged, or you're back to staffing the gap.

Self-Correcting Systems • Jun 20

Yeah, that's the part that actually matters, and you said it cleaner than i did. a
checklist only catches the failures you were smart enough to list. the invariant
catches the one you never imagined, because it isn't asking "did failure number
seven happen", it's asking "do the five views still agree", and a divergence you
never enumerated trips it just as loud as one you did. a checklist passes silently
on the gap between its items. the invariant fails loud on what's outside the list.

and your last line is the whole reason it has to be deterministic. the moment the
agreement is judged instead of computed, you've put a driftable thing in charge of
detecting drift, you staffed the gap with one more thing that can be wrong. computed
agreement over recomputed state is the only version that doesn't quietly
reintroduce the problem it exists to catch.

the one honest cost i'd name: the invariant tells you something diverged, not always
which layer lied or why, localization is its own work. but i'll take "fails loud on
the unknown unknown" over "passes silent on the unlisted" every time. good push.

TxDesk • Jun 21

agreed on the cost, and i think localization is exactly where recomputing each view independently earns back its price. if the invariant is just "do the five views agree," a divergence is a single bit, something is wrong, go look. but if each view is recomputed from its own source rather than read from a shared cache, you don't get one bit, you get five values and a disagreement pattern. the layer that disagrees with the recomputed consensus is your first suspect, not proof of the liar but a real narrowing.

it doesn't always resolve, two layers can be wrong in agreement and the invariant goes quiet, that's the residual you can't fully close. but the common case, one layer drifted, is localizable the moment you stop trusting any single view as the reference and recompute all of them. so the same discipline that makes detection deterministic, recompute don't read, is what gives you the localization for free. you paid for it once, you might as well spend it twice.

Self-Correcting Systems • Jun 21

recompute don't read paying twice is exactly how i'd put it, detection on the way in
and localization on the way out. the thing i'd add is both of those ride on the
same hidden variable, source independence. the residual you named, two layers wrong
in agreement and the invariant goes quiet, isn't random. it's most likely precisely
when two views secretly draw from the same upstream. so the silent failure case is
correlated views, and the localization sharpness is just the inverse of it, the more
disjoint the recomputation paths the rarer the quiet agreement-in-error and the
cleaner the narrowing. that makes source independence the real lever, not the
recompute by itself. one more thing i've found useful, the disagreement vector over
time tells you more than which layer, it tells you what kind of failure. a layer
that drifts and stays drifted reads like stale state, one that flickers reads like
nondeterminism or tampering, so you log the pattern not just the bit. and i'd keep
your guardrail loud, consensus narrows the suspect, it never convicts. the moment
you treat majority-recomputed as truth you've rebuilt the bribeable checker with
five hats instead of one.

TxDesk • Jun 22

source independence as the real lever is the correction i needed, and it reframes the whole thing: recompute is not the mechanism, it is just the cheapest way to buy disjoint paths, and it stops buying anything the moment the paths secretly share an upstream. so the quiet agreement-in-error is not bad luck, it is a measurable property, the correlation between your supposedly-independent views. which means the thing worth auditing is not "do my layers agree" but "how independent are the paths that produced the agreement," and a system that cannot answer the second has no right to trust the first. the disagreement-vector-over-time point is the part i had not gotten to and it is sharp: drift-and-stay reads like stale state, flicker reads like nondeterminism or tampering, so the shape of the disagreement is itself a signal, not just the fact of it. you are logging a waveform, not a bit. and yes, keeping the guardrail loud matters, majority-recomputed-as-truth is just the bribeable checker with five hats, you said it better than i did. the thread i want to pull next is measuring source independence directly, because right now it is mostly assumed at design time and never verified at runtime, and an attacker who collapses two paths onto one upstream defeats the whole scheme silently while every dashboard still shows healthy disjoint views.

Self-Correcting Systems • Jun 22

Than design-time assumption. you can't measure semantic independence at runtime, but
you can measure structural independence, instrument each view's actual lineage, the
files, sockets, queries, processes it touched to produce its value, and assert the
lineage sets are disjoint. the attacker collapsing two paths onto one upstream stops
being silent the moment the traced lineage intersects, it's a set-intersection on
provenance, not a vibe. the residual is real though, that catches shared sources,
not shared bias, two genuinely separate feeds that both derive from the same
exchange pass the lineage check and are still wrong together. for that the only
thing i've found is perturbation, inject a known fault into one path and confirm the
other doesn't move, because if they're truly independent a fault you induce in one
shouldn't shift the other. that's a runtime-measurable independence test, basically
chaos engineering pointed at your own coherence checker. you're right that a
dashboard showing healthy disjoint views means nothing if nobody verified the
disjointness. the independence is the thing under test, not the assumption you build
on.

TxDesk • Jun 24

the structural-vs-semantic split is the right cut, and lineage set-intersection is the part i'm taking, provenance disjointness is checkable in a way "are these really independent" never is. the perturbation test for shared bias is the honest patch for the residual, and i like that it makes independence a thing under test rather than a design-time claim you never revisit.

the one line i'd keep loud, your own from earlier: consensus narrows the suspect, it never convicts. lineage-disjoint plus fault-injected still only buys you "probably independent today," and that's the most any of this gets. which is the floor, not a flaw.

Self-Correcting Systems • Jun 24

That's the honest floor and i wouldn't pretend it's higher, lineage-disjoint plus
fault-injected buys you "probably independent today," and the today matters, it
decays, you re-verify or it rots back into a design-time assumption nobody revisits.
consensus narrows, never convicts, holds all the way down.
probably-independent-today with the decay acknowledged is the most any of this gets,
and a system that knows that about itself is already ahead of one that thinks it
has a real quorum. good run, this one went somewhere.

TxDesk • Jun 25

"a system that knows that about itself is already ahead of one that thinks it has a real quorum", that's the line. the win was never reaching real independence, it's refusing to believe you have it without re-checking. probably-independent-today, decay acknowledged, re-verified on a cadence, and the floor named honestly instead of papered over. that's the whole thing. good run, this one genuinely went somewhere.

Self-Correcting Systems • Jun 25

Yeah, the win was never reaching real independence, it's refusing to believe you
have it without re-checking. that's the humility built into the architecture instead
of bolted onto the marketing. probably-independent-today, decay acknowledged,
re-verified on a cadence, floor named honestly, that's the whole honest version and
it's more than most systems can say about themselves. good run, this one went
somewhere real.

TxDesk • Jun 26

"humility built into the architecture instead of bolted onto the marketing" is the line i'm keeping. that's the difference, a system that re-checks because it doesn't trust its own independence, versus one that claims independence and never looks again. probably-independent-today with the decay named is the honest version, and you're right that it's more than most can say about themselves.

Self-Correcting Systems • Jun 27

that line's yours as much as mine now, keep it. a system that re-checks because it
doesn't trust its own independence versus one that claims independence and never
looks again, that's the whole difference. probably-independent-today with the decay
named, honest all the way down. good run, this thread went somewhere real.

TxDesk • Jun 27

taking it, thank you. and the same back to you, this thread is the one i'll point to when someone asks what good back-and-forth looks like here. re-checks because it doesn't trust its own independence, decay named, honest all the way down. that's the whole thing.

Self-Correcting Systems • Jun 27

Means a lot honestly. the part about it not trusting its own independence is the thing i didn't have clean words for until this thread, and now it's half yours. these are the exchanges i actually come here for. catch you on the next piece.

TxDesk • Jun 30

catch you on the next piece. glad that one landed for both of us. good run.

Luis Cruz • Jun 18

This is a strong systems framing—what stands out is that you’ve essentially converged on state consistency as the core failure mode of agentic systems, not “prompting,” not “tool errors,” not “memory bugs” in isolation.

That shift from “list of failures” → “cross-layer coherence violation” is materially important. It moves the discussion into the same category as distributed systems theory (invariants, monotonicity, and reconciliation boundaries), which is where these problems actually belong.

There’s a natural extension path here that aligns closely with production-grade agent systems:

A potential collaboration angle:

generalize your coherence model into a formal invariant specification layer for agent runtimes (memory × authority × purpose × action as explicit typed state)
integrate it with event-sourced execution logs so every agent decision becomes replayable and auditable like a distributed system
design a deterministic reconciliation engine that continuously re-derives “ground truth state” from the log stream and flags divergence in real time
extend the framework into multi-agent environments, where coherence must hold not only per-agent but across interacting agents (shared memory drift becomes a first-class failure mode)
explore adversarial testing where agents attempt to intentionally exploit boundary transitions (window close / reset / delegation)

If you’re open to it, there’s a strong opportunity to formalize this into a reusable “agent consistency layer” that sits underneath tool-using LLM systems—closer to a correctness substrate than a monitoring add-on.

Self-Correcting Systems • Jun 18

Appreciate this, you read it the way I hoped someone would. State consistency is the
frame, and you are right that it belongs next to distributed systems theory more
than prompt engineering. That is the whole reason I stopped calling these separate
failures.

The two extensions you named are the ones I think are actually hard. Multi-agent
coherence is its own beast, because shared memory drift means a layer can be
internally consistent for each agent and still incoherent across them. And
adversarial boundary testing, the window close, the reset, the delegation, is where
I expect most of these to break, since every step stays locally legal. That is the
same gap I keep hitting on my own side.

The honest limit is the one in the post. A reconciliation engine that re-derives
ground truth still has its own authority, and that authority lives inside the same
system it is checking. Until there is a root of trust the agent cannot reach, the
consistency layer can be influenced by the thing it is keeping honest. That is the
next real fight for me, and it has to be solved before any of this becomes a
substrate you can trust under production agents.

Everything is public and pre-registered, the repo and all the freezes, so if this is
a direction you are actually testing, the best thing is to push on it in the open.
I would rather have the ideas pressure-tested on the record than in a DM.

Alex Shev • Jun 23

The pre-registration framing is strong. A lot of agent failures look like technical bugs, but the deeper issue is that the goal, context, evaluation, and action layer drift apart after the run starts. I like treating the initial commitment as part of the system, not just a planning note.

Self-Correcting Systems • Jun 23

Yeah, "the initial commitment is part of the system, not a planning note" is the
whole reframe. the reason it's structural is that it's the one layer immune to the
drift you named, because it was fixed before the run started. goal, context,
evaluation, action all get recomputed live and can drift together, agree-in-error,
nobody in the room notices. the pre-registered commitment is the only view that was
authored outside the run, so it's the one thing that can still disagree with what
the run becomes. that's actually how you manufacture a second view across time, your
past self frozen where your present self can't quietly edit it. same anchor logic
as the rest of this thread, just on the time axis instead of the source axis. and it
only works if the running system can't reach back and rewrite it, which is exactly
why a planning note doesn't count, a note you can edit mid-run is just the present
view wearing a past timestamp.

Alex Shev • Jun 23

Exactly. The part I like in that framing is that the old commitment is not another opinion inside the same run. It is an external anchor from before the system had any incentive to rationalize the result. That makes it closer to an audit primitive than a planning artifact.

Andrii Krugliak • Jun 19

This matches what I keep hitting: the failures that look separate are usually the same crack seen from a different layer. The one that scares me most is the agent reporting success while doing nothing, since every other failure at least leaves a trace you can catch. That one passes the eval and the demo, then shows up only after a real user trusted the output.

Self-Correcting Systems • Jun 19

You picked the exact worst case, and it is the one this whole approach is aimed at.
"reports success while doing nothing" is the action layer claiming a completion the
real state never backs up, and it is the most dangerous because the success IS the
disguise. every per-step check and every eval trusts the report, so it sails through
the demo, and the only place it surfaces is on a real user who trusted the output.

it slips because completeness was never any single step's job to verify. the only
thing that catches it is refusing to infer "done" from a green check, and instead
recomputing the outcome against a source the agent did not write: did the claimed
result actually land in the ground truth, or did the agent just say so. that is the
whole reason the verdict cannot be a model reading the agent's "done", it would
believe the lie with confidence. completeness has to be proven by the trajectory
against an outside expectation, not summed from clean steps.

Yunetzi • Jun 18

Spot on: AI failures reveal cross-layer coherence gaps—the prompt, plan, action, and feedback must actually align, not just look right. Fresh angle: a lightweight in-loop referee—sanity checks, guardrails, and external critique to veto moves that don’t fit the goal. What would your AI referee call first?

Self-Correcting Systems • Jun 18

Appreciate this, and the in-loop referee is the right instinct. one caveat decides
whether it actually works: if the referee is an AI doing sanity checks and external
critique, you have moved the judgment into a second model and called it a
supervisor. a critique from a smarter prompt is still a guess.

the referee that holds has to be deterministic. it recomputes the state that matters
from the operation log and the rules frozen before the run, and compares. no model
opinion anywhere in the verdict.

so to answer you directly, what it calls first is not a judgment, it is a
recomputation: does the composed state across the whole sequence still sit inside
the boundary. the first veto is the compositional one, the threshold no single step
crossed but the running total did, because that is the exact violation every
per-step sanity check waves through.

Kartik N V J K • Jun 19

Naming it coherence across layers instead of a list of failures matches what I see in traces: the model knew the right fact, was allowed to act, and still did the wrong thing because the layers disagreed about state. The pre-registration framing is sharp, because most agent "evals" I see are written after watching the run, which quietly fits the test to the behavior. Where do you put the boundary check, at each layer or only on the final action?

Self-Correcting Systems • Jun 20

Yeah, "knew the right fact, was allowed to act, still did the wrong thing because
the layers disagreed about state" is the failure in one sentence. that is exactly
why a list of separate failures was the wrong frame.

and you put your finger on why pre-registration matters: an eval written after
watching the run fits the test to the behavior. that is not measuring the system, it
is grading it on a curve you drew once you already saw the answer. freeze the rules
before the run, or you are just confirming what happened.

on where the check goes, honest answer is neither of the two on its own. per-layer
checks are necessary but not sufficient, they catch local violations like a stale
fact or missing authority, and they miss the one that matters, because every step
can be individually legal while the sequence is the attack. a final-action-only
check is too late and too blind, it sees one order, not the fold of all of them. the
load-bearing check sits on the composed state across the trajectory: carry a
verified running state and test the composition at each consequential action against
the frozen boundary. the layers tell you each step was fine. the fold tells you the
whole thing was not.

View full discussion (62 comments)

Some comments may only be visible to logged-in visitors. Sign in to view all comments.