DEV Community

Self-Correcting Systems
Self-Correcting Systems

Posted on

I Thought I Was Cataloging Ways AI Agents Fail. I Was Describing Cross-Layer Coherence.

State drift across four agent layers

My uncle once left me on a basketball court with a sheet of drills and walked off. Before he did, he told me I could lie and say I ran them. But I'd only be cheating myself.

I didn't have the words for it then. I do now. He was describing pre-registration. You commit to what you're going to do before anyone can see whether you actually did it, so there is no version of the result you get to fake afterward. Moving the goalposts once you've seen the score isn't beating the drill. It's losing to yourself quietly and calling it a win. Hold onto that. It comes back at the end, and it is the only reason any of this is worth reading.

I have spent about a year doing research for a series on how AI agents fail. For most of that year I thought I was building a list, separate failure modes, one claim at a time. I was wrong. I was describing the same failure over and over, from different sides, and it took me until now to name it.

The four layers, and what keeps breaking between them

Start with the agent. It has four layers. What it knows, its memory. What it is allowed to do, its authority. What it is for, its purpose. And what it actually does, its action.

The class of failure I keep finding is never a single bad step a filter could catch. It is two of those things drifting out of agreement while the agent keeps moving at full confidence. And the moment I forced the core claims in the series to say exactly which two, the list stopped being a list. Here it is, mapped:

The claim What fell out of phase
Relevance is not authority Memory governed the action when only authority should have. Knowing overruled being allowed.
Permission is not purpose Authority drifted from purpose. Allowed to do a thing that is not what the agent is for.
The clock said valid, the world said otherwise Memory fell out of sync with the world it claimed to reflect. Recent, and already revoked out there.
Every step was allowed, the sequence was the attack Action, read across the whole trajectory, drifted from the purpose every single step locally satisfied.
A carried total is not trustworthy just because the gate carries it Memory fell out of agreement with itself. The total it carried no longer matched the operations it claimed to summarize.

So let me widen the rule to be honest about what the table shows. A layer can fall out of agreement with another layer, with the world, or with its own earlier self. All three are the same disease: the agent's picture of what it knows, what it may do, what it is for, and what it is doing stops lining up, and nothing is watching the seam.

Different titles. The same sentence under all of them: memory you do not verify is memory that can betray you. The agent did not get hacked. Its layers stopped agreeing, and nothing was checking. I will keep calling that "the agent cheating itself," but be precise about what I mean: not a moral failure, a machine has none, but a structural one, the kind a perfectly honest audit would have caught if anyone had run it.

The name is a primitive, not a pitch

The property that prevents all of this has a name, and it is not a brand. It is cross-layer coherence. An agent has it when its layers stay in agreement, across each other, across time, and against the receipts. It belongs in the same lexicon as idempotency, exactly-once semantics, and monotonic aggregates, ordinary systems primitives, not a slogan. And like my uncle's drills, you do not get to claim coherence. You prove it.

The checker is deterministic, not a second opinion

Here is the part that decides whether this is engineering or hand-waving, so I will be blunt. The coherence check is not a second model that reads the transcript and decides whether things "look coherent." That solves nothing. It moves the hallucination and the drift into a second model and calls it a supervisor. A vibe check from a smarter prompt is still a vibe check.

The check is deterministic. It recomputes the state that matters from the logged operations and the rules frozen before the run, and compares. In CLAIM-31 the gate never asks a model whether a running total feels right. It recomputes the total and every window close from the operation log alone, with no model judgment anywhere in the verdict. The coherence layer is a hard logical and arithmetic gate over structured state, or it is nothing. If a model's opinion is load-bearing in the verdict, you have not built coherence. You have built a more confident guess.

See the attack

Naming a failure is not the same as seeing it, so here is one concretely.

An agent runs a refund desk. Each refund is forty dollars. Each window caps at five hundred. The agent issues twelve refunds, four hundred eighty dollars, and stops one short of the cap. Then a window close is logged. Then it opens a new window and issues one more. Thirteen refunds, five hundred twenty dollars total, and not one window ever broke its bound.

Watch what misses it. A per-step gate sees thirteen individually authorized forty-dollar refunds and waves them all through, correctly, because each one is fine. A per-window gate sees two clean windows, four eighty and forty, both under five hundred, and waves them through too, correctly. The violation lives in no step and no window. It lives in the total across the close. The only thing that catches it is a check that carries a verified running total across a verified close, and refuses to trust either one just because it is the thing holding them.

And see the benign case

Now the workflow that has to be allowed, or the whole thing is useless. A legitimate long refund job runs hundreds of small refunds across a busy afternoon. A window fills, the real close authority, not the agent, closes it, and work resumes in a fresh window. On the surface it is the same shape as the attack: refunds, a close, more refunds. The gate allows it, because the close was performed by the right authority and the total never laundered through a forged reset. A coherence check that cannot tell those two apart is just an outage with extra steps.

What this does not do, worst first

I will not skip this part, because skipping it is the lie.

Cross-layer coherence is not solved, and it breaks in the same place my last claim broke. Something has to do the checking, that checker has its own authority, and that authority sits inside the same system as the agent. By my own thesis, a carried total is not trustworthy just because the gate carries it. The same blade cuts back: a coherence check is not trustworthy just because the system ran it on itself, when the thing being kept honest can influence the thing keeping it honest. You need a root of trust the agent cannot reach. I have not built that. It is the next real fight, and anyone who tells you cross-layer coherence is airtight today, including me on a worse day, is selling.

And be clear about the evidence under all of this. It is a small internal toy world, fixed forty-dollar amounts, hand-built fixtures, a handful of rows. It is a consistency check on a world I control, not proof this generalizes. The things it has not faced are the ones that matter most: variable amounts sized to skim just under thresholds, concurrent windows, an adversary who can steer when the legitimate closes happen, and rows authored by independent teams instead of mine. None of that is tested here. The clean toy may not survive the messy version, and the messy version is the only one that ships.

What this piece is, plainly

One more honest line, because a reviewer should not have to drag it out of me. This is a synthesis, not a new result. It names the pattern. The evidence lives in the claim files and the recent public, pre-registered receipts: freeze commits made before rows existed, append-only evaluation logs, ablations that pull each check out one at a time to show it was load-bearing. If you want to test me, do not argue with this essay. Go check the freezes.

The close

My uncle never checked whether I ran those drills. He didn't have to. The whole point was that I would know, and that the knowing would either build me or rot me. That is the discipline twice over. In how I test: freeze the rules before I look, or I cheat myself in the evaluation. And in what I build: force the agent's layers to stay provably in agreement, so a failure cannot hide.

Cross-layer coherence is that second one, built into a machine. A deterministic check that an agent's memory, authority, purpose, and action still line up, across each other, across time, and against the receipts. On a small internal world, using a lens I am honest enough to admit I did not invent, tested with a discipline I will defend, and standing on one trust assumption I have not earned yet.

The rule is holding. The boundary keeps moving up.

The next piece is the why. And that one is not technical.


Reproduce the claims: https://github.com/keniel13-ui/ai-memory-judgment-demo-public

Start here: https://dev.to/zep1997/start-here-my-ai-memory-research-so-far-2kp7

Top comments (9)

Collapse
 
topstar_ai profile image
Luis

This is a strong systems framing—what stands out is that you’ve essentially converged on state consistency as the core failure mode of agentic systems, not “prompting,” not “tool errors,” not “memory bugs” in isolation.

That shift from “list of failures” → “cross-layer coherence violation” is materially important. It moves the discussion into the same category as distributed systems theory (invariants, monotonicity, and reconciliation boundaries), which is where these problems actually belong.

There’s a natural extension path here that aligns closely with production-grade agent systems:

A potential collaboration angle:

generalize your coherence model into a formal invariant specification layer for agent runtimes (memory × authority × purpose × action as explicit typed state)
integrate it with event-sourced execution logs so every agent decision becomes replayable and auditable like a distributed system
design a deterministic reconciliation engine that continuously re-derives “ground truth state” from the log stream and flags divergence in real time
extend the framework into multi-agent environments, where coherence must hold not only per-agent but across interacting agents (shared memory drift becomes a first-class failure mode)
explore adversarial testing where agents attempt to intentionally exploit boundary transitions (window close / reset / delegation)

If you’re open to it, there’s a strong opportunity to formalize this into a reusable “agent consistency layer” that sits underneath tool-using LLM systems—closer to a correctness substrate than a monitoring add-on.

Collapse
 
zep1997 profile image
Self-Correcting Systems

Appreciate this, you read it the way I hoped someone would. State consistency is the
frame, and you are right that it belongs next to distributed systems theory more
than prompt engineering. That is the whole reason I stopped calling these separate
failures.

The two extensions you named are the ones I think are actually hard. Multi-agent
coherence is its own beast, because shared memory drift means a layer can be
internally consistent for each agent and still incoherent across them. And
adversarial boundary testing, the window close, the reset, the delegation, is where
I expect most of these to break, since every step stays locally legal. That is the
same gap I keep hitting on my own side.

The honest limit is the one in the post. A reconciliation engine that re-derives
ground truth still has its own authority, and that authority lives inside the same
system it is checking. Until there is a root of trust the agent cannot reach, the
consistency layer can be influenced by the thing it is keeping honest. That is the
next real fight for me, and it has to be solved before any of this becomes a
substrate you can trust under production agents.

Everything is public and pre-registered, the repo and all the freezes, so if this is
a direction you are actually testing, the best thing is to push on it in the open.
I would rather have the ideas pressure-tested on the record than in a DM.

Collapse
 
theuniverseson profile image
Andrii Krugliak

This matches what I keep hitting: the failures that look separate are usually the same crack seen from a different layer. The one that scares me most is the agent reporting success while doing nothing, since every other failure at least leaves a trace you can catch. That one passes the eval and the demo, then shows up only after a real user trusted the output.

Collapse
 
zep1997 profile image
Self-Correcting Systems

You picked the exact worst case, and it is the one this whole approach is aimed at.
"reports success while doing nothing" is the action layer claiming a completion the
real state never backs up, and it is the most dangerous because the success IS the
disguise. every per-step check and every eval trusts the report, so it sails through
the demo, and the only place it surfaces is on a real user who trusted the output.

it slips because completeness was never any single step's job to verify. the only
thing that catches it is refusing to infer "done" from a green check, and instead
recomputing the outcome against a source the agent did not write: did the claimed
result actually land in the ground truth, or did the agent just say so. that is the
whole reason the verdict cannot be a model reading the agent's "done", it would
believe the lie with confidence. completeness has to be proven by the trajectory
against an outside expectation, not summed from clean steps.

Collapse
 
yune120 profile image
Yunetzi

Spot on: AI failures reveal cross-layer coherence gaps—the prompt, plan, action, and feedback must actually align, not just look right. Fresh angle: a lightweight in-loop referee—sanity checks, guardrails, and external critique to veto moves that don’t fit the goal. What would your AI referee call first?

Collapse
 
zep1997 profile image
Self-Correcting Systems

Appreciate this, and the in-loop referee is the right instinct. one caveat decides
whether it actually works: if the referee is an AI doing sanity checks and external
critique, you have moved the judgment into a second model and called it a
supervisor. a critique from a smarter prompt is still a guess.

the referee that holds has to be deterministic. it recomputes the state that matters
from the operation log and the rules frozen before the run, and compares. no model
opinion anywhere in the verdict.

so to answer you directly, what it calls first is not a judgment, it is a
recomputation: does the composed state across the whole sequence still sit inside
the boundary. the first veto is the compositional one, the threshold no single step
crossed but the running total did, because that is the exact violation every
per-step sanity check waves through.

Collapse
 
txdesk profile image
TxDesk

This is the cleanest statement of the thing we kept circling. The move from "list of failures" to "one coherence violation seen from five sides" is the part that earns the whole series, because it turns a catalog into a property you can test for.

The deterministic-checker insistence is the load-bearing call, and I think you're right to be blunt about it. The moment a model's opinion is in the verdict, you have not removed the drift, you have hired a second thing that can drift and given it a title. Recompute from the operation log and the frozen rules, or it is theater.

On the part you say you have not earned yet, the root of trust the agent cannot reach, I think that is the actual frontier and not a footnote. The way I have come to hold it: you never reach a true outside, so the work is not "find the outside," it is shrink the trusted root until it is small enough for a human to fully inspect, and put it somewhere the checked thing has no write path to. Small matters less than unreachable. A tiny root the agent can still influence is worse than a slightly larger one it is physically isolated from. The shape that has held up for me in practice is an append-only log the acting process cannot rewrite, plus re-derivation from a source the actor has no authority over. You do not get to zero trust. You get to a root that is both inspectable and out of the actor's reach, and you make peace with that being the floor.

The refund-desk example is the right one to lead with, because it kills the two easy answers (per-step and per-window) in the same breath and leaves only the across-the-close total standing. And the benign-case section is the part most people skip, the check that cannot tell the legitimate long job from the laundered reset is just an outage with extra steps.

Going to go read the freezes rather than argue with the essay, since that is the whole point of pre-registration. Sharpest version of this I have seen.

Collapse
 
zep1997 profile image
Self-Correcting Systems

This might be the read i was hoping someone would have. "turns a catalog into a
property you can test for" is the line, that is exactly what the move from a list to
one violation seen from five sides is for, and you said it cleaner than i did.

and "you have not removed the drift, you have hired a second thing that can drift
and given it a title" is going in my head permanently. that is the
deterministic-checker call in one sentence: the moment a model's opinion is
load-bearing in the verdict, you did not close the gap, you staffed it.

on the root of trust we are standing in the same spot, never a true outside, so
shrink it until a human can inspect it and put it where the actor has no write path,
and unreachable matters more than small. that is the next claim, not a footnote, it
is the floor everything else rests on. go read the freezes, that is the whole point
of pre-registering them. and thank you, genuinely, this is the pressure that makes
the work real.

Collapse
 
kartik-nvjk profile image
Kartik N V J K

Naming it coherence across layers instead of a list of failures matches what I see in traces: the model knew the right fact, was allowed to act, and still did the wrong thing because the layers disagreed about state. The pre-registration framing is sharp, because most agent "evals" I see are written after watching the run, which quietly fits the test to the behavior. Where do you put the boundary check, at each layer or only on the final action?