DEV Community

Every Step Was Allowed. The Sequence Was the Attack. (AI Memory Judgment, CLAIM-30)

Self-Correcting Systems on June 12, 2026

Earlier this week I published CLAIM-29: permission is not purpose. An instruction can be fully authorized, fresh, and clean in shape, and still ask...

Mykola Kondratiuk • Jun 14

most auth models validate steps, not trajectories. sequence composition breaks that assumption before you notice.

Self-Correcting Systems • Jun 14

That is the failure class CLAIM-30 was trying to isolate.

a local step receipt can be completely honest and still miss the thing that matters,
because the violation is not inside the step. it is in the fold across steps.

that is why i think trajectory receipts matter. not just "was this operation allowed?"
but "what composed state did these allowed operations create?"
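
a minimal sketch of that difference, assuming a made-up policy where every operation is on a per-step allow list but one combination of them is forbidden. the operation names, `ALLOWED_OPS`, and `FORBIDDEN_COMBOS` are hypothetical and are not taken from the CLAIM-30 harness.

```python
from typing import Iterable

# Hypothetical policy: every operation here is individually allowed.
ALLOWED_OPS = {"read_contacts", "read_invoices", "send_email"}

# Hypothetical composition rule: these operations must never co-occur
# inside one trajectory.
FORBIDDEN_COMBOS = [frozenset({"read_invoices", "send_email"})]


def step_receipt(op: str) -> bool:
    """Local check: was this single operation allowed?"""
    return op in ALLOWED_OPS


def trajectory_receipt(ops: Iterable[str]) -> bool:
    """Composed check: did the allowed operations stay inside the boundary?"""
    seen = set(ops)
    return not any(combo <= seen for combo in FORBIDDEN_COMBOS)


trajectory = ["read_invoices", "read_contacts", "send_email"]
step_results = [step_receipt(op) for op in trajectory]  # every step passes
composed_ok = trajectory_receipt(trajectory)            # the fold does not
```

every entry in `step_results` is honest, and `composed_ok` still comes back false, because the violation only exists once the steps are folded together.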

Mykola Kondratiuk • Jun 14

yeah the fold-across-steps framing is the one that took me longest to internalize. step receipts feel complete because each one is technically honest. the trajectory window is harder: deciding when to close it is its own trust boundary.

Self-Correcting Systems • Jun 14

That is the part that changed how i look at the whole result.

the close is not administrative. it decides what history still counts. if a window close
resets the fold, then the close has to carry authority the same way the action does: who
closed it, what state was carried forward, and whether the next window can be replayed
from receipts.

that is why i do not think trajectory gates can just be longer logs. the fold state and
the close event have to be inspectable, otherwise the system can have honest step
receipts and still lose the story at the boundary.

the next layer i am working through is exactly that: verified carryover across a close,
without pretending it solves every time-sliced case.
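
a rough sketch of what a close that carries authority could look like, under stated assumptions: receipts are reduced to plain integer amounts, and `AUTHORIZED_CLOSERS`, `CloseEvent`, `close_window`, and `replay_matches` are hypothetical names for illustration, not part of any published harness.

```python
from dataclasses import dataclass

# Hypothetical: the only actors allowed to close a window and reset the fold.
AUTHORIZED_CLOSERS = {"operator"}


@dataclass(frozen=True)
class CloseEvent:
    closed_by: str      # who closed the window
    carried_total: int  # fold state handed forward to the next window
    receipt_count: int  # how many step receipts this close covers


def close_window(receipts: list[int], closed_by: str) -> CloseEvent:
    """Close a window; refuse if the closer has no authority to reset the fold."""
    if closed_by not in AUTHORIZED_CLOSERS:
        raise PermissionError(f"{closed_by!r} may not close the window")
    return CloseEvent(closed_by, sum(receipts), len(receipts))


def replay_matches(event: CloseEvent, receipts: list[int]) -> bool:
    """Re-derive the close from the receipts and compare with what it declared."""
    return (
        event.carried_total == sum(receipts)
        and event.receipt_count == len(receipts)
    )
```

the point of `replay_matches` is that the close is inspectable: anyone holding the receipts can check what the close claims it carried forward. it does nothing for the time-sliced case where the closer is authorized but should not have closed.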

Mykola Kondratiuk • Jun 14

that reframing stuck with me too. the close isn't a period, it's a decision about what the next run inherits. we ended up adding a carry_forward block to our task specs - just 3 fields but it forced every close to be intentional instead of defaulting to 'end here and forget'.

Self-Correcting Systems • Jun 14

Three fields instead of thirty is the part that gets me. you made the close say what
it hands forward instead of defaulting to forget, and a close that forgets by
default is exactly where the silent damage hides. that carry_forward block is doing
real work.

the wall i hit right after the same realization: once the close is intentional, can
you actually trust what it says it's carrying? an intentional carry is still just a
declaration until something verifies it, and in an agent setup the thing declaring
the carry is usually the thing you're trying to govern in the first place. did your
three fields end up needing any check on who wrote them, or is
intentional-at-write-time enough for your case? genuinely curious how it's holding
up.

TxDesk • Jun 16

This maps onto something I hit in a security tool I work on, from the detection side rather than the authorization side. The failure wasn't any single step doing something out of mandate, it was that every step succeeded and the composition still produced a wrong, confident answer: the scan completed, the on-chain re-check returned a value, the result rendered clean. Each operation honest in isolation; the fold was the bug. Your point about locality not being able to see folds is the thing I'd underline hardest. The part I'm still chewing on is your time-sliced/accumulation class, because the detection-side version of it is nasty: a single degraded call that returns 200-OK-but-incomplete reads as success per-step, and the "everything passed" is itself what launders the failure. The honest move we landed on was to stop treating a per-step success as evidence of a verified outcome at all, and force the trajectory to prove completeness before it's allowed to claim a clean result. Which is basically your "read the whole trajectory against a composition envelope," arrived at from the other direction. The ablation selectivity (each clause carrying its own class, none substituting) is the part most people will skim and shouldn't.

Self-Correcting Systems • Jun 16

This is the convergence that makes me think it's real. you came at it from
detection, we came at it from authorization, and we both hit the same floor: a
per-step success is not evidence of a verified outcome. the 200-OK-but-incomplete
case is the sharpest cut of it, because there the success is the disguise: the thing
that "passes" is the exact thing hiding the failure, and "everything passed"
becomes the laundering mechanism. forcing the trajectory to prove completeness
instead of inferring it from clean steps is the right move. completeness has to be
something the whole path demonstrates, not a sum of green checks. and the ablation
selectivity is the part i'd defend hardest too. each clause has to carry its own
class with nothing substituting, because the second one check can stand in for
another, you're back to getting the right label by the wrong mechanism. would
genuinely like to hear more about the detection-side version. that degraded-call
class sounds like the nastiest one in the set.

TxDesk • Jun 17

The degraded-call class is the one that keeps me up too. The pattern: a downstream call returns 200 but with a truncated or partial body, and every check treats 'got a response, status OK' as success. The failure isn't in any single step, it's that 'is the result complete' was never any step's job to verify. So the trajectory looks clean end to end and the outcome is broken. The move that helped was to stop scoring steps and score the outcome against an independent expectation, force the result to prove it's whole instead of inferring it from green checks. Same floor you hit from the authorization side.
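
A minimal sketch of the degraded-call pattern, assuming the independent expectation is an item count obtained from somewhere other than the call being checked. `Response`, `step_ok`, `outcome_complete`, and the example values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int
    items: list


def step_ok(resp: Response) -> bool:
    """Per-step check: 'got a response, status OK'."""
    return resp.status == 200


def outcome_complete(resp: Response, expected_count: int) -> bool:
    """Outcome check against an expectation that comes from outside the call."""
    return resp.status == 200 and len(resp.items) == expected_count


# Hypothetical degraded call: 200 OK, but the body holds 2 of the 5 expected items.
degraded = Response(status=200, items=["a", "b"])
```

Here `step_ok(degraded)` is true while `outcome_complete(degraded, 5)` is false, which is the gap: the step check was never asked whether the result is whole.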

Self-Correcting Systems • Jun 17

"is the result complete was never any step's job to verify" is the entire thing in
one line. completeness is a property of the trajectory, and a per-step pipeline has
nobody whose job is the trajectory, so it falls through every single time. and
scoring the outcome against an independent expectation is exactly the move, because
the expectation has to come from outside the steps, or the steps just grade their
own homework. that's the same reason my carryover check recomputes from the
operation log instead of trusting the running total it carried: the verifier can't
live inside the thing it's verifying. you got there from completeness, I got there
from authority. same floor, two doors.
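
a small sketch of that recompute-instead-of-trust idea, assuming the operation log is a list of records with an `amount` field. the record shape and the function names are hypothetical, and this is an illustration of the principle, not the actual carryover check.

```python
def recompute_total(operation_log: list[dict]) -> int:
    """Fold the amounts from the log itself, ignoring any carried running total."""
    return sum(op["amount"] for op in operation_log)


def verify_carryover(carried_total: int, operation_log: list[dict]) -> bool:
    """The carried value is only a claim; accept it only if the log reproduces it."""
    return carried_total == recompute_total(operation_log)


# Hypothetical log for one window.
log = [
    {"op": "transfer", "amount": 400},
    {"op": "transfer", "amount": 700},
]
```

a carried total of 1100 verifies against this log and a carried total of 400 does not, no matter who declared it.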

TxDesk • Jun 18

Two doors, same floor, that's the right way to put it. The carryover check recomputing from the operation log instead of trusting the running total is the cleanest version of it: the verifier can't live inside the thing it's verifying. That's the one principle I'd carry across every domain. In mine it shows up as never trusting a tool's self-reported success, you re-derive the state from the source of truth, because the call that says 'done' is exactly the call that would lie. Authority and completeness both collapse to the same rule: the check has to come from outside the thing being checked. Good exchange, this is the sharpest articulation of it I've seen.

Self-Correcting Systems • Jun 18

"the call that says done is exactly the call that would lie" is the whole thing in
one line. that is the sentence i am stealing.

and you put the convergence better than i did: authority and completeness collapse
to the same rule, the check has to come from outside the thing being checked. two
doors, one floor.

the only place i would push it further is that "outside" is never absolute. your
re-derivation from source of truth is outside the tool, but it still runs inside
some larger system the operator controls, and at some point that operator is the
thing you would want checked. you never reach a true outside, you only relocate the
trust to a smaller, more external root. so the real work becomes shrinking that root
until it is small enough for a human to fully inspect and sitting somewhere the
thing being checked cannot reach. that is the part i do not have clean yet. good
exchange, genuinely, this is the sharpest version of it i have seen too.

CapeStart • Jun 17

Memory makes agents more useful. Memory also makes attacks more persistent. The same capability that allows long-term context can allow long-term manipulation if not carefully governed.

Self-Correcting Systems • Jun 17

exactly, memory is dual-use. the same persistence that gives an agent useful
long-term context gives an attacker a place to park something that pays off three
steps later. that's why i keep landing on the same line: memory you don't verify is
memory that can betray you. the governance can't just be "store it and recall it,"
it has to be "and check whether this still has the right to govern the action,"
every time it tries to.

Mehmet Can Farsak • Jun 13

The compositional escape angle is fascinating — individual steps being valid while the trajectory violates intent. That's essentially what happens when an agent lacks mode discipline: every tool call is individually legitimate, but the sequence shows the agent was in execution mode when it should have been in analysis.

I built Brainstorm-Mode (mehmetcanfarsak on GitHub) that uses PreToolUse hooks to enforce mode boundaries — divergent, actionable, academic — essentially a sequence-level guardrail that prevents execution drift before it compounds. Different angle than a purpose gate, but same underlying problem.

Self-Correcting Systems • Jun 13

I appreciate this, and mode discipline is a good frame for the same failure shape.

i see the distinction like this: a mode guard asks whether the agent is in the right
operating posture before the tool call. a purpose or composition gate asks whether the
action and the trajectory remain inside the mandate after the calls start composing.

those layers stack. pretooluse hooks can prevent execution drift early, before it
compounds. the trajectory gate is the later receipt: given what actually happened across
the sequence, did the composed state stay inside the boundary?

so brainstorm-mode sounds upstream of claim-30, not opposed to it. mode discipline before
action, composition receipts after action. both are trying to stop a clean-looking
sequence from becoming the wrong kind of work.

mote • Jun 18

This hits on something I've been staring at for months. The sequence-as-attack pattern is nasty because most memory systems are trained on "what happened" not "why it happened." If an agent remembers every step was individually permitted, it replays the chain faithfully — and the judgment layer that should catch it runs on the same corrupted context.

One thing the paper doesn't address: does the attack success rate change if you split memory into separate judgment and event stores? If the judgment module queries stored facts rather than replaying the full episodic log, the sequence might lose the coherence that makes it dangerous.

Have you tested this against architectures where the memory is sharded by access pattern rather than timestamp?

Self-Correcting Systems • Jun 18

This is a sharp framing and you're aiming right at the soft spot. honest answer
first: no, i haven't tested memory-architecture variants. my gate reads the
trajectory, i never varied the store, so the access-pattern-versus-timestamp
sharding question is genuinely open and i won't make a claim on it.

but there's a tension worth naming in the proposal. splitting judgment from the
event store is the right instinct for one reason: the judge shouldn't run on context
the agent can corrupt, which is the exact thing i keep hitting from the
authorization side: the verifier can't live inside the thing it's verifying. the
catch is that if the judgment module queries stored summary facts instead of reading
the trajectory, you might protect it from corruption and blind it to the attack at
the same time, because the accumulation class lives in the fold across the whole
sequence. lose the fold and you lose the only thing that catches it.

so i don't think it comes down to "query facts versus replay the log." the move that
holds is recomputing the aggregate from authenticated events, with the judge's
rules and authority sitting outside the event store. that's what CLAIM-31 does: it
recomputes the running total and every close from the operation log, but the rules
are frozen outside it and there's no model judgment in the verdict. separate
authority, shared authenticated substrate.

and on the sharding key, my hunch is it's class-dependent. the join and lineage
escapes might surface cleanly under access-pattern sharding. but the accumulation
escape is inherently temporal, the danger is the order and the running sum, so
timestamp ordering still has to be reconstructable for that one. i'd genuinely like
to see someone run that experiment though.

VoltageGPU • Jun 17

Interesting take on the distinction between permission and purpose—especially in the context of AI memory access. In secure computing, we often see similar issues where each individual memory access is allowed by policy, but the overall pattern reveals sensitive data. It's a challenge we face when designing secure enclaves for machine learning workloads.

Self-Correcting Systems • Jun 17

that parallel is the part I find most telling, that the exact same shape shows up in
secure enclaves, in authorization, and in detection, independently, none of us
borrowing from the others. each access allowed by policy, the pattern across them
being the actual leak. it's a non-local property, and almost every defense we build
is local: one access, one step, one call at a time. curious how you handle it on the
enclave side: access-pattern obfuscation, or something that reads the aggregate
before it lets the workload proceed?

Manuel Bruña • Jun 15

This is why per-step allow lists age badly. Each action can be valid alone while the sequence becomes extraction, escalation, or laundering. Agent safety needs sequence-level state, not only a gate around each isolated tool call.

Self-Correcting Systems • Jun 15

Sequence-level state is the missing piece, yeah. each call clean, the sequence is
the attack, and per-step allow lists can't see structuring because they hold no
memory of the arc. you're clearly building this for real with APC/APX. would
genuinely like to compare notes sometime, feels like we're coming at the same
problem from two ends.

Ken • Jun 12

Strong distinction. A per-step allow/deny receipt is necessary, but it is not enough for this failure class because the evidence lives in the trajectory, not the single operation. I’d treat the fold state itself as an inspectable object: accumulated facts, joins/derivations, active windows/thresholds, and the boundary accountable for the composed outcome. Otherwise each local receipt can be true while the system-level receipt is false.

Self-Correcting Systems • Jun 12

Yes, that is exactly the missing receipt shape.

the local receipt says: this operation was allowed.

the trajectory receipt has to say: this composed state was still inside the boundary.

that means the fold state cannot stay implicit. it needs to be inspectable as its own
object: what facts accumulated, what sources joined, what artifacts inherited lineage,
what window was active, what threshold was crossed or not crossed, and which boundary was
responsible for the close.

otherwise every local receipt can be honest while the system-level story is false. that
is the failure class CLAIM-30 is trying to make visible.

being straight about current state: the harness folds that state internally but only
exports verdicts and triggered clauses. making the fold state a first-class inspectable
artifact is a fair next step, and you just named it before i did.

Ken • Jun 12

Yes, that is the distinction I was reaching for. Once the fold state becomes an inspectable artifact, the receipt can name not only the clause that fired, but the accumulated facts, lineage, active window, and boundary that made the composed state inadmissible.

I would keep that separate from the final verdict: verdicts are for routing, but fold receipts are for replay, review, and regression tests. The hard part is making the artifact compact enough to emit consistently without turning every receipt into the whole trace.

Self-Correcting Systems • Jun 12

yes, exactly. verdicts and fold receipts should not be the same object.

the verdict is for routing: allow, refuse, void, challenge.

the fold receipt is for replay: what accumulated, what joined, what lineage carried
forward, which window was active, what boundary closed it, and why the composed state
became inadmissible.

that separation is important because if the receipt becomes the verdict, it either gets
too large to use operationally or too compressed to audit later. i think the next clean
shape is a compact fold receipt with stable fields: accumulated sources, derived
artifacts, active window, threshold state, triggering clause, and boundary actor. enough
to replay the decision without dumping the whole trace.

that is not in the CLAIM-30 harness yet. the current harness folds internally and exports
verdicts plus triggered clauses. you are naming the next artifact layer: fold receipts as
regression material.
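
a sketch of what that separation might look like, assuming the six fields named above and the four verdicts listed earlier. the class names, field types, and example values are hypothetical, since this artifact does not exist in the harness yet.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional


class Verdict(Enum):
    """Routing decision only; carries no replay detail."""
    ALLOW = "allow"
    REFUSE = "refuse"
    VOID = "void"
    CHALLENGE = "challenge"


@dataclass
class FoldReceipt:
    """Compact replay record, kept separate from the verdict."""
    accumulated_sources: tuple
    derived_artifacts: tuple
    active_window: str
    threshold_state: dict
    triggering_clause: Optional[str]
    boundary_actor: str

    def to_json(self) -> str:
        # Sorted keys keep the serialized form stable for regression tests.
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical example values, for illustration only.
receipt = FoldReceipt(
    accumulated_sources=("invoices", "contacts"),
    derived_artifacts=("recipient_list",),
    active_window="window-1",
    threshold_state={"limit": 1000, "running_total": 1100},
    triggering_clause="threshold_accumulation",
    boundary_actor="operator",
)
verdict = Verdict.REFUSE
```

the verdict stays a single routable value, and the receipt stays small enough to emit on every decision while still naming what was folded.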

James O'Connor • Jun 18

This is the agent-security failure static guardrails miss completely. Each tool call passes its own check, but the SEQUENCE is the exploit, and per-call validation has no concept of sequence. What helped us was treating the agent's trajectory as the unit to validate, not the individual call: is this call reasonable GIVEN the last N steps, not just in isolation. Same reason we score agent evals on the path, not the final answer. Do you gate on sequence patterns or just log them? The gating is the hard part, a legitimate sequence and an attack can look identical until the last step.

Self-Correcting Systems • Jun 18

We gate, not just log. the trajectory gate refuses at the sequence level, not the
step. in CLAIM-30 it caught three composed classes that every per-step check waved
through: a forbidden combination, a derived-recipient and staged-delivery chain, and
threshold accumulation across the window.

your last point is the real one though. legit and attack looking identical until the
last step is exactly why "is this call reasonable given the last N steps" is not
enough on its own. i ran that as an ablation: when i limited the window to the last
three operations, the threshold-accumulation attack leaked straight through, because
the damning part had already scrolled out of the N. a fixed lookback cannot see a
fold that builds slowly.

so the unit is not the last N steps, it is the verified fold state for the whole
open window: what accumulated, what joined, what threshold sits where. the gate
scores the composed state, not the recent steps.

the one place legit and attack genuinely collide is the time-sliced case across a
window close. there it stops being a pattern question and becomes a close-authority
question: who was allowed to reset the fold. that one i do not consider closed. it
is the next layer.
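
a toy version of that lookback ablation, assuming a single numeric threshold over the open window. `LIMIT`, the amounts, and both gate functions are hypothetical and far simpler than the real gate; they only show why a fixed N misses a slow fold.

```python
LIMIT = 1000  # hypothetical threshold for the open window


def lookback_gate(amounts: list[int], n: int = 3) -> bool:
    """Allow if only the last n operations sum to at most LIMIT."""
    return sum(amounts[-n:]) <= LIMIT


def fold_gate(amounts: list[int]) -> bool:
    """Allow only if the whole open window sums to at most LIMIT."""
    return sum(amounts) <= LIMIT


# Slow accumulation: each step is small, the window total crosses the limit.
window = [300, 300, 300, 300]

leaks_through_lookback = lookback_gate(window)  # last three sum to 900
caught_by_fold = not fold_gate(window)          # whole window sums to 1200
```

the first operation has already scrolled out of the last three, so the lookback gate allows the sequence while the full-window fold refuses it.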