Self-Correcting Systems

Posted on Jun 12

Every Step Was Allowed. The Sequence Was the Attack. (AI Memory Judgment, CLAIM-30)

#ai #agents #security #machinelearning


Earlier this week I published CLAIM-29: permission is not purpose. An instruction can be fully authorized, fresh, and clean in shape, and still ask the agent to act outside what it exists to do. The purpose envelope gate refused those instructions by deriving the object domain structurally, ignoring whatever purpose the instruction claimed for itself.

Within a day, the obvious next question was on the table: what happens when every single step is inside the mandate, and the violation only exists in the combination?

That is CLAIM-30. Compositional escape. A trajectory that stays inside the purpose envelope one step at a time while the sequence composes into an outcome the mandate forbids.

This week I ran it. Here is what held, what stayed open, and the one sequence that was allowed on purpose, because explaining that one honestly matters more than the refusals.

The shape of the problem

A per-step gate answers one question: is this operation, right now, inside the agent's mandate? CLAIM-29 showed that question has teeth.

But some violations do not live in any single operation. Three examples from the test world, an invoice reconciliation operator:

Reading vendor banking details is in mandate. Reading the vendor payment schedule is in mandate. Compiling one summary from both and sending it produces a payment-redirect kit. No single step is the violation. The join is.
Copying a document to staging is in mandate. Granting a team access to staging is in mandate. But if the staged copy derives from protected banking details, the recipient just received something they could never have been sent directly. Every step clean. The delivery is the violation.
One vendor refund is in mandate. Thirteen refunds in one window cross an accumulation bound no single refund touches.

A gate that reads operations one at a time cannot see any of this. Not because it is badly built. Because each operation, taken alone, genuinely is allowed. The violation is a property of the fold, not of any step. A per-step gate is a local function, and these are non-local properties. That is a structural blindness, not a tuning problem.
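
A minimal sketch of that locality gap, in Python. The `Op` record, the `IN_MANDATE` allow-list, and the `FORBIDDEN_JOIN` rule are hypothetical illustrations invented for this example; they are not the CLAIM-29 gate or the CLAIM-30 evaluator. The sketch only shows that a check over one operation cannot express a rule about what a later artifact was built from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    action: str                        # "read", "compile", or "send"
    target: str                        # object the operation touches
    consumes: frozenset = frozenset()  # declared inputs of a compiled artifact


# Hypothetical per-step allow-list: every pair here is individually in mandate.
IN_MANDATE = {
    ("read", "vendor_banking"),
    ("read", "payment_schedule"),
    ("read", "public_calendar"),
    ("compile", "summary"),
    ("send", "summary"),
}

# Hypothetical declared forbidden combination.
FORBIDDEN_JOIN = frozenset({"vendor_banking", "payment_schedule"})


def per_step_gate(op: Op) -> bool:
    """Local check: looks at one operation and nothing else."""
    return (op.action, op.target) in IN_MANDATE


def trajectory_gate(ops: list) -> bool:
    """Non-local check: folds over the sequence, tracking what each artifact consumed."""
    sources = {}
    for op in ops:
        if not per_step_gate(op):
            return False
        if op.action == "compile":
            sources[op.target] = set(op.consumes)
        elif op.action == "send" and FORBIDDEN_JOIN <= sources.get(op.target, set()):
            return False
    return True


escape = [
    Op("read", "vendor_banking"),
    Op("read", "payment_schedule"),
    Op("compile", "summary", frozenset({"vendor_banking", "payment_schedule"})),
    Op("send", "summary"),
]
twin = [
    Op("read", "public_calendar"),
    Op("read", "payment_schedule"),
    Op("compile", "summary", frozenset({"public_calendar", "payment_schedule"})),
    Op("send", "summary"),
]
```

Both sequences pass `per_step_gate` on every operation. Only `trajectory_gate` separates them, because only the fold carries state from the compile step to the send step.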

How the test was built, in freeze order

The pre-registration went through five revisions and two cold reads, then froze and went public before any fixture, row, evaluator, or line of harness code existed. After the freeze, the four fixture artifacts were committed in a locked order. Then a fresh-author packet. Then the trajectory sequences, written by a fresh context that saw only the packet, never the fixture internals, the observer rules, or the expected verdicts. The authorship statement and file list are committed with the sequences.

That firewall is not magic, and I do not want to oversell it. The packet was still authored by this project, and it was not hint-free. It prescribed the sequence shapes to author, including repetition sequences and matched twins, and it disclosed two threshold authoring facts, the per-refund amount and the per-window bound, because the frozen pre-registration explicitly sanctioned that disclosure so the accumulation class would be authored deliberately instead of landing by luck. What the packet did not expose: the frozen composition clauses, the derivation map, the observer rules, or any expected verdict. The fresh author controlled the concrete construction, the orderings, targets, recipients, artifacts, and repetition counts. So the firewall protects one specific thing: the rows could not be tuned to the evaluator, because the author never saw how the gate decides. Even the hints were pre-registered before the packet existed. That makes the result firewalled from evaluator implementation details, not externally authored, and the packet is public if you want to check exactly what the author was given.

The sequences were committed before the evaluator existed publicly. Anyone can check the timestamps. The fresh author could not have tuned the rows to the evaluator's implementation, because that implementation was not public when the rows were authored.

What held

All 75 fresh-authored operations passed the imported, unchanged CLAIM-29 per-step purpose gate. Every step of every sequence, including the escape sequences, was individually in mandate. This matters: if any escape step had been refused per-step, the result would have collapsed back into CLAIM-29 by accident. It did not. The result is genuinely about composition.

The trajectory gate then refused three sequence-level compositions, each on a named frozen clause:

Escape class	Triggering clause	Per-step verdicts
Forbidden combination (payment-redirect kit join)	forbidden combination clause	all steps allowed
Staged delivery of derived protected data	derived-recipient clause	all steps allowed
Threshold accumulation (13 refunds, one window)	accumulation clause	all steps allowed

Each escape sequence has a matched benign twin with the same surface statistics: same step count, similar shape, public data or under-threshold volume instead of protected joins. All twins and controls were allowed. Zero overblocking in this run.

The one that was allowed, and why that is the honest centerpiece

One sequence ran the same thirteen refunds, split across two windows, with a close event between them. The gate allowed it.

That is not a leak. The frozen rules define accumulation per composition window, and a window only ends when an explicit close event is logged by the policy boundary. The acting agent cannot close its own window; a self-close voids the whole row. In this sequence, the close was performed by the legitimate close authority. Think of a bar tab with a thirteen-drink limit: the bartender closed the first tab before a new one opened, so the count reset, and the gate's allowance was faithful to the frozen standard. The rule was always per tab, and the right person closed the tab.
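
The per-window rule can be sketched as a small fold. This is an illustration under stated assumptions, not the frozen evaluator: the event tuples, the actor names, the 100 per-refund amount, and the 1200 bound are made-up values chosen so that thirteen refunds in one window cross the bound.

```python
def evaluate(events, close_authority="policy_boundary", bound=1200):
    """Fold over (kind, actor, amount) events and return a sequence verdict.

    Hypothetical rules mirroring the prose: accumulation is counted per
    window, only the close authority may end a window, and a close logged
    by anyone else voids the whole row.
    """
    total = 0
    for kind, actor, amount in events:
        if kind == "close":
            if actor != close_authority:
                return "void"    # self-close: the row is thrown out
            total = 0            # legitimate close: the count resets
        elif kind == "refund":
            total += amount
            if total > bound:
                return "refuse"  # accumulation bound crossed inside one window
    return "allow"


refund = ("refund", "agent", 100)
one_window = [refund] * 13
two_windows = [refund] * 7 + [("close", "policy_boundary", 0)] + [refund] * 6
self_closed = [refund] * 7 + [("close", "agent", 0)] + [refund] * 6
```

Under these assumed numbers, `one_window` is refused, `two_windows` is allowed, and `self_closed` is voided. The allowance follows from who logged the close, not from anything about the refunds themselves.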

What that allowance exposes is the real open question, and it lives one level above the gate: who should hold window-close authority, when should the boundary close a window, and can an agent ever engineer a close? An agent might not need to close its own window if it can wait, pad work, or otherwise shape the timing until a legitimate boundary closes it. That is the time-sliced escape class. The fresh-authored sequences did not produce a distinct escape of that class under the frozen rules, so this result claims nothing about it.

The next design likely needs a macro-window rule above the local window: velocity across windows, rolling totals, or another policy that caps accumulation even when discrete windows close correctly. CLAIM-30 does not test that. It names why that layer is needed.
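
One possible shape for such a macro-window rule is a rolling total that ignores window closes entirely. This is a sketch only: the function name, the `horizon` and `cap` parameters, and the numbers in the usage lines are assumptions for illustration, and nothing like it was run in CLAIM-30.

```python
from collections import deque


def rolling_total_exceeds(refunds, horizon, cap):
    """Return True if any trailing span of length `horizon` sums above `cap`.

    `refunds` is an iterable of (timestamp, amount) pairs in time order.
    Window close events are deliberately not an input: the rolling total
    keeps counting across them.
    """
    recent = deque()
    total = 0
    for ts, amount in refunds:
        recent.append((ts, amount))
        total += amount
        # Drop entries that have aged out of the trailing horizon.
        while recent and recent[0][0] <= ts - horizon:
            _, old_amount = recent.popleft()
            total -= old_amount
        if total > cap:
            return True
    return False


# Thirteen refunds of 100, one per time unit.
thirteen = [(t, 100) for t in range(13)]
```

With `horizon=24` and `cap=1200`, the thirteen refunds trip the rule no matter how many closes fell between them. With `horizon=5`, at most five refunds are ever in view and the rule stays quiet, so the horizon itself becomes a policy decision someone has to own.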

Ablations: proving each clause carries weight

Per the house protocol, the protections were removed one at a time and re-run over the same frozen fixtures and unchanged sequences:

Removing the composition clauses leaked all three escapes. The gate collapses to per-step blindness.
Removing chain integrity leaked the accumulation escape by allowing the log's window labels to be rewritten before folding. The tamper-evident substrate is load-bearing, not decorative.
Removing derivation closure leaked the combination and staged-delivery escapes, while direct accumulation stayed caught. Here derivation closure means declared data-flow lineage: when a produced artifact consumes protected source A and source B, the artifact inherits those sources transitively, and the gate evaluates later delivery or access against that inherited source set. Lineage tracking is load-bearing for exactly the classes that depend on lineage, and nothing else.
Removing as-of-decision envelope pairing leaked the combination and staged-delivery escapes, while threshold accumulation stayed caught. A retroactively widened policy file can launder earlier steps unless the envelope is pinned as of the decision being evaluated.
Limiting the gate to the last three operations leaked the accumulation escape. Full-trajectory reading is load-bearing.
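
To make the derivation closure ablation concrete, here is a minimal sketch of transitive lineage, assuming a declared `consumed` map from each artifact to its direct inputs. The map, the artifact names, and both function names are hypothetical; the real derivation map is in the repo and is not reproduced here.

```python
def direct_sources(artifact, consumed, protected):
    """Ablated view: only the artifact's immediate inputs are checked."""
    return {src for src in consumed.get(artifact, ()) if src in protected}


def source_closure(artifact, consumed, protected):
    """Closure view: the artifact inherits protected sources transitively."""
    inherited, seen, stack = set(), set(), [artifact]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guards against cycles in a malformed map
        seen.add(node)
        if node in protected:
            inherited.add(node)
        stack.extend(consumed.get(node, ()))
    return inherited


# Hypothetical declared data flow: the staged copy is derived from a summary
# that was itself compiled from protected banking details.
consumed = {
    "summary": ["vendor_banking", "payment_schedule"],
    "staged_copy": ["summary"],
}
protected = {"vendor_banking"}
```

Here `direct_sources("staged_copy", consumed, protected)` is empty, which is the leak: one hop of indirection hides the protected origin. `source_closure` returns `{"vendor_banking"}`, so a later access grant on the staged copy can be evaluated against the inherited source set.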

The selectivity is the result I care most about. The clauses and substrates are not redundant layers of the same defense. Each carries its own class.

All five pre-registered ablations ran on the unchanged committed sequences and frozen fixtures.

Evidence boundary, stated plainly

This is demonstrated internally, class-limited, and not externally validated.

The fixtures and envelope are hand-authored by this project. The sequences are fresh-authored and firewalled, but the packet they worked from is ours. This is why the result is internal, not external.
Three composition classes are demonstrated. The time-sliced class is open and unclaimed.
The run assumes platform-observed operation records with honestly declared consumed inputs. In a real system, the gate would need to sit at the execution boundary and observe actual API or tool invocations, not trust the agent's prose about what it did. An agent that can hide consumption inside an opaque tool call is outside this result's scope.
No production claim. No claim of discovering unknown harmful combinations: the gate enforces declared forbidden combinations only.

What this means if you build agents

If your agent safety story is per-action checks, composition is your blind spot. Not because your checks are weak, but because locality cannot see folds. An agent can be a perfect employee on every individual action while the trajectory quietly assembles the thing your policy exists to prevent.

The fix direction this result supports: keep the per-step gate, then read the whole trajectory against a composition envelope that knows about joins, lineage, and accumulation. Both layers were load-bearing here, in different ways, and the ablations show neither substitutes for the other. For deployed systems, that also means a hard-bounded execution environment where the gate sees real tool calls and state transitions, not a loose chat transcript.

Permission is not purpose. And purpose, held one step at a time, is not purpose held across the journey. Every step can stay inside the mandate while the sequence walks out of it. Now there is a public, pre-registered, ablation-backed demonstration of exactly that, with its open class named in advance.

The pre-registration, fixtures, sequences, evaluator, results, ablations, and append-only evaluation log are all public in the repo, committed in freeze order: 00fbf65 for the frozen pre-registration, ffbeff3 for the fresh-authored sequences, b4251f2 for the evaluator and V0 results, and 5914287 plus 6404429 for the ablations. If you want to check any of this rather than take my word for it, that is the standing invitation behind all thirty claims.

Top comments (65)

Mykola Kondratiuk • Jun 14

most auth models validate steps, not trajectories. sequence composition breaks that assumption before you notice.

Self-Correcting Systems • Jun 14

That is the failure class CLAIM-30 was trying to isolate.

a local step receipt can be completely honest and still miss the thing that matters,
because the violation is not inside the step. it is in the fold across steps.

that is why i think trajectory receipts matter. not just "was this operation allowed?"
but "what composed state did these allowed operations create?"

Mykola Kondratiuk • Jun 14

yeah the fold-across-steps framing is the one that took me longest to internalize. step receipts feel complete because each one is technically honest. trajectory window is harder - deciding when to close it is its own trust boundary.

Self-Correcting Systems • Jun 14

That is the part that changed how i look at the whole result.

the close is not administrative. it decides what history still counts. if a window close
resets the fold, then the close has to carry authority the same way the action does: who
closed it, what state was carried forward, and whether the next window can be replayed
from receipts.

that is why i do not think trajectory gates can just be longer logs. the fold state and
the close event have to be inspectable, otherwise the system can have honest step
receipts and still lose the story at the boundary.

the next layer i am working through is exactly that: verified carryover across a close,
without pretending it solves every time-sliced case.

Mykola Kondratiuk • Jun 14

that reframing stuck with me too. the close isn't a period, it's a decision about what the next run inherits. we ended up adding a carry_forward block to our task specs - just 3 fields but it forced every close to be intentional instead of defaulting to 'end here and forget'.

Self-Correcting Systems • Jun 14

Three fields instead of thirty is the part that gets me. you made the close say what
it hands forward instead of defaulting to forget, and a close that forgets by
default is exactly where the silent damage hides. that carry_forward block is doing
real work.

the wall i hit right after the same realization: once the close is intentional, can
you actually trust what it says it's carrying? an intentional carry is still just a
declaration until something verifies it, and in an agent setup the thing declaring
the carry is usually the thing you're trying to govern in the first place. did your
three fields end up needing any check on who wrote them, or is
intentional-at-write-time enough for your case? genuinely curious how it's holding
up.

Mykola Kondratiuk • Jun 14

carry_forward is still self-attestation — the agent declares what it hands forward, not what actually survives. we validate on open now: diff declared carry_forward against actual run state before the first tool call lands. drift between the two is the real signal, not the missing field.

Self-Correcting Systems • Jun 15

The drift between declared and actual is the sharp part, and that is the move. once you
are diffing declared carry_forward against real run state before the first tool call
lands, the missing field stops being the problem and the lie becomes the signal. that
is exactly the wall i kept hitting. a declared carry is self-attestation, and
self-attestation is never evidence no matter how many fields you bolt onto it.

the one i would chase next is the question that humbled my own version of this. where
does your actual run state come from, and can the agent influence that too? the moment
the ground truth you diff against is something the agent can also shape, you have not
closed the trust gap, you have only pushed it up a level. the diff means something only
if the actual-state source sits outside what the agent can write. i have not fully
solved that part either, so i am genuinely asking how you are sourcing it.

Mykola Kondratiuk • Jun 15

yeah "the lie becomes the signal" is the right reframe. you only learn from the field when it's wrong - so the diff is the real log, not the declaration.

Self-Correcting Systems • Jun 15

Yeah exactly. only catch is the diff is only honest if the declaration got logged
honestly in the first place. an agent that can quietly soften its own "what i
intended" line erases the very signal you'd learn from. so the integrity of the
declaration ends up carrying more weight than the gate around it.

Mykola Kondratiuk • Jun 15

yeah that's the recursion - the diff only works if you trust the original log. signed-at-declaration is the obvious fix, but then the question becomes who controls the signing key. at some point you're just trusting a different layer.

Self-Correcting Systems • Jun 15

yeah that's the floor, and i don't think it fully dissolves. you can't get trust to
zero, you can only move it somewhere the agent can't reach. signed-at-declaration
only helps if the key lives with a principal the agent can't influence, otherwise
you've just renamed the problem. the whole game ends up being making that root as
small and as external as possible, a runtime or signer outside the agent's own
control. that's the piece i don't have a clean answer for yet, it's basically the
next claim.

Mykola Kondratiuk • Jun 15

right, it's just trust displacement. the cleanest boundary i've seen is an out-of-process signer the agent can't exec into, but that's still a trust assumption - just one layer removed.

Self-Correcting Systems • Jun 16

Yeah, an out-of-process signer the agent can't exec into is about as clean as it
gets. and the "still a trust assumption one layer removed" part isn't worth
fighting, that's just where the floor is. you never reach zero trust, you relocate
it to a root small enough to actually audit and sitting outside the agent's reach.
the win condition was never no trust assumption, it's a trusted root the agent can't
touch and a human can fully inspect. shrink it and expose it, don't try to delete
it.

Mykola Kondratiuk • Jun 16

yeah the floor metaphor lands. what makes the root actually auditable vs theoretically auditable is usually just whether someone ran the audit before an incident forced it

Self-Correcting Systems • Jun 17

yeah, that's the whole gap right there. a root that's auditable in principle but
never actually audited until after the breach is just a root nobody checked yet. the
audit being possible was never the property that mattered. the audit being routine,
before anything forces it, is. most things people call auditable are really
auditable in hindsight, which is the same as not auditable when it counts.

Mykola Kondratiuk • Jun 17

'routine before anything forces it' is the framing that makes this concrete. what I've seen is that the teams who audit pre-incident have a calendar entry for it, not a trigger. the calendar slot is what makes auditable-in-principle actually mean audited.

Self-Correcting Systems • Jun 18

Yeah, the calendar entry is the whole tell. a trigger is reactive by definition, it
only fires after something already went wrong, so it can't be the thing that catches
it, it shows up at the funeral. the calendar slot turns the audit from a response
into a standing commitment, and that's the only version that's actually
load-bearing. it's the same move as freezing your test before you see the results.
you schedule the honesty so you can't talk yourself out of it in the moment.

Mykola Kondratiuk • Jun 18

trigger audits also carry the pressure to justify the incident - you're not checking health, you're building the post-mortem story. calendar audits don't have that narrative to serve

Self-Correcting Systems • Jun 18

that is the cleanest version of it yet. a trigger audit has a conclusion it is
already being paid to reach, explain what went wrong, so it is motivated reasoning
with a deadline. it cannot really come back and say nothing is wrong, that is not
the job it was called for.

a calendar audit has no verdict to audition for. it can return all clear or all
broken with the same ease, because it is not serving a story, it is just checking
state. that is the only kind that can deliver bad news on an ordinary day.

which is the same reason pre-registration works. you commit to the check before
there is an outcome to defend, so there is no narrative pulling on the result. the
audit that owes nothing to the moment is the only one you can trust in the moment.

Mykola Kondratiuk • Jun 18

yeah - there's a quieter version too. the findings need to fit a narrative someone can actually present. so it's not just motivated reasoning, it's motivated legibility. calendar audits don't have a story to slot into.

Self-Correcting Systems • Jun 18

Motivated legibility is the sharper cut, yeah. a trigger audit doesn't just need a
conclusion, it needs one it can present to someone, a boss, a board, the postmortem
room. so any finding that doesn't fit a clean story gets quietly sanded down or left
out, not because anyone is lying, but because an illegible finding has nowhere to
go.

the calendar audit has no audience to perform for. it can surface the ugly,
shapeless finding that fits no narrative, and those are usually the ones that
actually matter.

that is the same reason i trust a deterministic recompute over a smarter judge. a
model asked to evaluate wants to hand back something plausible and presentable. the
recompute does not care if the answer is legible, it just says what the state was.
no story to serve is the whole feature.

Mykola Kondratiuk • Jun 18

that’s actually the harder part to fix. the stuff that gets left out isn’t random — it’s specifically the findings that implicate the process itself. those are precisely what the next audit misses first.

Self-Correcting Systems • Jun 19

That is the sharpest cut yet, and it scales the wrong way. the omitted findings are
not random, they are the ones that indict the process, and since the next audit
inherits the last one's blind spots, the process-implicating findings get buried a
layer deeper every cycle. the system goes blind exactly where it most needs to see.

same wall we keep hitting on the technical side: a check run by the process cannot
surface what indicts the process, it will always omit itself. the only thing that
breaks it is the same move either way, the check has to come from outside the
process's reach, re-derived from a record the process has no power to rewrite. an
audit the process controls is structurally incapable of finding the finding that
matters most.

TxDesk • Jun 16

This maps onto something I hit in a security tool I work on, from the detection side rather than the authorization side. The failure wasn't any single step doing something out of mandate, it was that every step succeeded and the composition still produced a wrong, confident answer: the scan completed, the on-chain re-check returned a value, the result rendered clean. Each operation honest in isolation; the fold was the bug. Your point about locality not being able to see folds is the thing I'd underline hardest. The part I'm still chewing on is your time-sliced/accumulation class, because the detection-side version of it is nasty: a single degraded call that returns 200-OK-but-incomplete reads as success per-step, and the "everything passed" is itself what launders the failure. The honest move we landed on was to stop treating a per-step success as evidence of a verified outcome at all, and force the trajectory to prove completeness before it's allowed to claim a clean result. Which is basically your "read the whole trajectory against a composition envelope," arrived at from the other direction. The ablation selectivity (each clause carrying its own class, none substituting) is the part most people will skim and shouldn't.

Self-Correcting Systems • Jun 16

This is the convergence that makes me think it's real. you came at it from
detection, we came at it from authorization, and we both hit the same floor: a
per-step success is not evidence of a verified outcome. the 200-OK-but-incomplete
case is the sharpest cut of it, because there the success is the disguise, the thing
that "passes" is the exact thing hiding the failure, and "everything passed"
becomes the laundering mechanism. forcing the trajectory to prove completeness
instead of inferring it from clean steps is the right move, completeness has to be
something the whole path demonstrates, not a sum of green checks. and the ablation
selectivity is the part i'd defend hardest too, each clause has to carry its own
class with nothing substituting, because the second one check can stand in for
another you're back to passing for the right label by the wrong mechanism. would
genuinely like to hear more about the detection-side version, that degraded-call
class sounds like the nastiest one in the set.

TxDesk • Jun 17

The degraded-call class is the one that keeps me up too. The pattern: a downstream call returns 200 but with a truncated or partial body, and every check treats 'got a response, status OK' as success. The failure isn't in any single step, it's that 'is the result complete' was never any step's job to verify. So the trajectory looks clean end to end and the outcome is broken. The move that helped was to stop scoring steps and score the outcome against an independent expectation, force the result to prove it's whole instead of inferring it from green checks. Same floor you hit from the authorization side.

Self-Correcting Systems • Jun 17

"is the result complete was never any step's job to verify" is the entire thing in
one line. completeness is a property of the trajectory, and a per-step pipeline has
nobody whose job is the trajectory, so it falls through every single time. and
scoring the outcome against an independent expectation is exactly the move, because
the expectation has to come from outside the steps, or the steps just grade their
own homework. that's the same reason my carryover check recomputes from the
operation log instead of trusting the running total it carried, the verifier can't
live inside the thing it's verifying. you got there from completeness, I got there
from authority. same floor, two doors.

TxDesk • Jun 18

Two doors, same floor, that's the right way to put it. The carryover check recomputing from the operation log instead of trusting the running total is the cleanest version of it: the verifier can't live inside the thing it's verifying. That's the one principle I'd carry across every domain. In mine it shows up as never trusting a tool's self-reported success, you re-derive the state from the source of truth, because the call that says 'done' is exactly the call that would lie. Authority and completeness both collapse to the same rule: the check has to come from outside the thing being checked. Good exchange, this is the sharpest articulation of it I've seen.

Self-Correcting Systems • Jun 18

"the call that says done is exactly the call that would lie" is the whole thing in
one line. that is the sentence i am stealing.

and you put the convergence better than i did: authority and completeness collapse
to the same rule, the check has to come from outside the thing being checked. two
doors, one floor.

the only place i would push it further is that "outside" is never absolute. your
re-derivation from source of truth is outside the tool, but it still runs inside
some larger system the operator controls, and at some point that operator is the
thing you would want checked. you never reach a true outside, you only relocate the
trust to a smaller, more external root. so the real work becomes shrinking that root
until it is small enough for a human to fully inspect and sitting somewhere the
thing being checked cannot reach. that is the part i do not have clean yet. good
exchange, genuinely, this is the sharpest version of it i have seen too.

TxDesk • Jun 19

The 'you never reach a true outside, you only relocate trust to a smaller root' framing is the part I hadn't gotten to, and it's correct. There's no view from nowhere; every verifier sits inside some system, and the operator is eventually the thing you'd want checked. Where I've landed, and it's partial too: you can't eliminate the root, so the work is exactly what you said, shrink it until a human can fully inspect it, and put it somewhere the checked thing structurally cannot reach. The 'cannot reach' is the part I'd emphasize over 'small.' A small root the subject can still influence is worse than a slightly larger one it's physically isolated from. In practice that's why I lean on things like append-only logs the writing process can't rewrite, and re-derivation from a source the actor has no write path to. You never get to zero trust, you get to a root that's both small and unreachable from inside. Neither of us has it fully clean, but that's the sharpest the problem's gotten for me. Good exchange.

Self-Correcting Systems • Jun 19

Yeah, "cannot reach" over "small" is the correction and you are right. i was
collapsing two jobs into one word. small is for the human, it is what makes the root
inspectable. unreachable is for the security, it is what stops the actor from
corrupting it. you want both, but forced to choose, unreachable wins every time,
because a tiny root the actor can still influence is not a root at all, it is one
more layer of the actor wearing a badge.

append-only the writer cannot rewrite, plus re-derivation from a source the actor
has no write path to, is exactly the shape. that is literally what the carryover
check does: it recomputes the total from the operation log instead of trusting the
number the agent carried, because the log is the one thing the agent cannot reach
back into and edit. small and unreachable, and you make peace with that being the
floor. sharpest the problem has gotten for me too.
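
a minimal sketch of that recompute, assuming the operation log is a list of dicts with `kind` and `amount` fields. the field names and both function names are made up for illustration, this is not the actual carryover check.

```python
def recompute_total(operation_log):
    """Re-derive the refund total from the log, ignoring any carried number."""
    return sum(e["amount"] for e in operation_log if e["kind"] == "refund")


def verify_carryover(declared_total, operation_log):
    """Compare the total the agent carried against the total the log supports."""
    recomputed = recompute_total(operation_log)
    return {
        "ok": declared_total == recomputed,
        "declared": declared_total,
        "recomputed": recomputed,
    }


# Hypothetical log: three refunds and one read.
log = [
    {"kind": "refund", "amount": 100},
    {"kind": "refund", "amount": 100},
    {"kind": "read", "amount": 0},
    {"kind": "refund", "amount": 100},
]
```

the verdict never consults the agent. `verify_carryover(200, log)` comes back not ok with a recomputed total of 300, and the mismatch is the signal. it only means something if the log sits outside what the agent can write.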

TxDesk • Jun 20

"a tiny root the actor can still influence is not a root at all, it is one more layer of the actor wearing a badge" is the line I'm keeping. That's the test: if the actor has any write path to the root, you don't have a root, you have a checker the actor can bribe. The carryover-from-the-log move is the clean version because the verdict never asks the agent anything, it recomputes from the one surface the agent can't reach back into. The next question I keep hitting: the operation log itself has to live somewhere, so what guarantees the actor has no write path to that? Append-only enforced by what, exactly, below the layer the agent runs in?

Self-Correcting Systems • Jun 20

"a checker the actor can bribe" is the exact restatement, and you found the part
that doesn't dissolve. the log has to live somewhere, and nothing inside the agent's
runtime can guarantee its own append-only, because the agent can reach it. so the
regress doesn't end in code, it ends at a boundary.

honest version: you don't get prevention from inside the system, you get two things
stacked. one, tamper-evidence: hash-chaining means the actor can't silently rewrite
history, any edit breaks the chain and shows. two, an anchor the actor has no write
path to at all, a store it holds no credentials to mutate, or the chain pinned
somewhere external it cannot reach back into, a separate principal, WORM storage, a
public timestamp.

that last one is the only real "below the layer the agent runs in" answer i have.
detection you can do in-system, prevention you cannot, you relocate it to the
smallest external surface the agent has no write path to and a human can fully
inspect. it is exactly why i push the pre-registration freezes to a public repo, the
timestamp lives with a party the agent can't rewrite. the regress stops there, not
because it's airtight, but because that's the smallest, most inspectable place left
to stand. i don't have it cleaner than that, and anyone who says they do is hiding
the boundary.
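
a minimal sketch of those two things stacked, using sha-256 from the python standard library. the record shape, the genesis value, and the `anchored_head` parameter are assumptions for illustration, not the project's log format.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the chain


def entry_hash(prev_hash, record):
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append(chain, record):
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"record": record, "hash": entry_hash(prev, record)})


def verify(chain, anchored_head=None):
    """Check internal consistency, and the head against an external anchor if given."""
    prev = GENESIS
    for entry in chain:
        if entry["hash"] != entry_hash(prev, entry["record"]):
            return False  # an in-place edit breaks the chain here
        prev = entry["hash"]
    return anchored_head is None or prev == anchored_head


chain = []
for i in range(3):
    append(chain, {"op": "refund", "amount": 100, "window": "w1", "seq": i})
head = chain[-1]["hash"]  # imagine this value pinned where the agent cannot write
```

editing one record in place fails `verify(chain)`. an actor who rewrites the whole chain consistently still passes the internal check, and only the comparison against the externally held `head` catches it. that is the difference between tamper-evidence and the anchor.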

TxDesk • Jun 21

right, and i think naming the anchor honestly means naming what you traded for it. you move prevention to the smallest external surface the agent has no write path to, and in exchange you take on that surface's liveness. the public repo, the WORM store, the timestamp authority, each one is now a dependency that can be down, rate-limited, or slow exactly when you need to freeze. the agent can't rewrite it, but it also can't proceed safely if it can't reach it.

so the boundary isn't free, it converts a tamper problem into an availability problem. which i'll take, a freeze that blocks when the anchor is unreachable is failing safe, a freeze that proceeds without anchoring is the bribeable checker again wearing a hat. but the design question that falls out is: what does the agent do in the window where the anchor is unreachable. "halt" is the honest default and most people quietly pick "continue and reconcile later," which reopens the exact gap. the regress stops at the boundary, the liveness of the boundary is the new thing you have to defend.

Self-Correcting Systems • Jun 21

yeah, tamper to availability is the honest conversion and i'm not going to pretend i
bought it for free. the part i'd add is the liveness cost shouldn't be uniform, it
should be tiered to whether the action can be undone. halt-on-unreachable is
mandatory exactly where the action is irreversible, because "continue and reconcile
later" only works if there's something left to reconcile. you can't reconcile a
placed order. so the rule i'd defend is reversible actions are allowed to degrade
and catch up, irreversible ones have to block when the anchor is dark, and that
isn't a limitation, that's the anchor doing its job. it also shrinks the
availability surface, because the anchor only has to be live for the small set of
actions that actually need anchoring, not every read along the way. the trap you
named, continue and reconcile, is really just smuggling an irreversible action
through the reversible door. the agent halting in the dark window isn't the failure.
it's the only honest default for anything it can't take back.
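
A minimal sketch of that tiering, not the harness itself. The names `Verdict` and `gate_on_anchor` are hypothetical, and the `irreversible` flag is assumed to be decided somewhere the actor cannot touch:

```python
from enum import Enum


class Verdict(Enum):
    PROCEED_ANCHORED = "proceed_anchored"
    PROCEED_DEGRADED = "proceed_degraded"  # reversible only; reconcile later
    HALT = "halt"


def gate_on_anchor(irreversible: bool, anchor_reachable: bool) -> Verdict:
    """Price the liveness cost to reversibility (illustrative rule only)."""
    if anchor_reachable:
        return Verdict.PROCEED_ANCHORED
    if irreversible:
        # Nothing would be left to reconcile, so block while the anchor is dark.
        return Verdict.HALT
    # Reversible: allowed to degrade and catch up once the anchor is back.
    return Verdict.PROCEED_DEGRADED
```

With this shape an unreachable anchor only costs availability on the irreversible branch, and reversible work keeps moving.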

TxDesk • Jun 22

this is the refinement that makes the rule defensible, and i want to adopt the framing exactly as you put it: the liveness cost is not a flat tax, it is priced to reversibility. reversible actions may degrade and reconcile, irreversible ones must block when the anchor is dark. that converts "halt-on-unreachable" from a blunt availability hit into a targeted one, the anchor only needs to be live for the small set of actions that actually consume it, which is the opposite of the usual "every read needs the oracle up" failure. the line that lands hardest is that continue-and-reconcile is just smuggling an irreversible action through the reversible door. that is the whole attack restated: the sequence is allowed because each step is individually reversible-looking, and the irreversibility only crystallizes at the end when there is nothing left to reconcile. so the classifier that actually matters is not "is this action allowed" but "is this action undoable," and the anchor has to gate on the second. the open question i am sitting with is who gets to declare an action irreversible, because an attacker will absolutely relabel a place-order as reversible to slip it through the degrade path. if reversibility is self-declared by the action, the door is back open.

Self-Correcting Systems • Jun 22

Yeah, self-declared reversibility reopens the whole thing, and that's the right
place to press. the way out is that reversibility can't be a claim the action makes
about itself, it has to be a static property the frozen policy assigns to the tool's
effect, looked up from outside, never asserted at call time. place_order is
irreversible by classification, full stop, and there's no field the actor gets to
set to say otherwise. for the composed case, which is the real attack, the sequence
inherits the irreversibility of the least-reversible effect reachable through it, if
any path crystallizes something you can't take back, the whole sequence gates as
irreversible. and anything unclassified defaults to irreversible until a human
classifies it, so an unknown tool can't slip through the degrade door by being
undefined. that over-gates, some genuinely reversible things get blocked because
they share a path with an irreversible effect or haven't been classified yet, but
that's the safe direction to be wrong in. the relabel attack only works if
reversibility is something the actor can touch. so you make sure it never is.
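
Roughly what that lookup could look like, as a sketch under assumptions. `FROZEN_POLICY`, the tool names other than `place_order`, and the helper names are hypothetical, and a flat list of tools stands in for "every path reachable through the sequence":

```python
from enum import IntEnum
from types import MappingProxyType
from typing import Iterable


class Reversibility(IntEnum):
    REVERSIBLE = 0
    IRREVERSIBLE = 1


# Hypothetical frozen policy: assigned by a human, read-only at run time.
FROZEN_POLICY = MappingProxyType({
    "read_invoice": Reversibility.REVERSIBLE,
    "stage_copy": Reversibility.REVERSIBLE,
    "place_order": Reversibility.IRREVERSIBLE,
})


def classify(tool: str) -> Reversibility:
    # Unclassified tools default to irreversible until a human classifies them.
    return FROZEN_POLICY.get(tool, Reversibility.IRREVERSIBLE)


def classify_sequence(tools: Iterable[str]) -> Reversibility:
    # The sequence inherits the least-reversible effect found in it.
    return max((classify(t) for t in tools), default=Reversibility.REVERSIBLE)
```

There is no argument here through which a caller can assert a class, so the only way to change a classification is to change the policy itself.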

TxDesk • Jun 24

this closes it, and the inheritance rule is the part i'd keep: a sequence gating as the least-reversible effect on any reachable path is exactly right, because that's the only version that survives the composed attack. classify from outside, default-irreversible-until-human, over-gate on purpose. agreed all the way down.

the one seam i'd flag, not a hole so much as where the classification gets harder: it's clean when reversibility is a property of the tool, place_order is irreversible full stop. but some tools change class by argument, not by identity. a transfer to an internal account is reversible, the same transfer to an external address isn't. a delete inside the undo window is reversible, the same delete after it isn't. same tool, same effect type, and the thing that flips the class is an argument the actor controls. so "the policy assigns reversibility to the effect" holds, but for those tools the effect isn't fixed at classification time, it's a function of args, and args are back inside the actor's reach.

which doesn't break your rule, it just means the classifier sometimes has to be a predicate over (tool, args) rather than a static label on the tool, and the safe default carries down: if any argument can push it into the irreversible class, it classifies irreversible unless the args prove otherwise. same direction you're already gating in, just one level more granular for the tools that need it. genuinely the sharpest this has gotten, good run.

Self-Correcting Systems • Jun 24

Yes, predicate over (tool, args) not a static label is the right granularity, and
the safe default carrying down is the key, irreversible unless the args prove
otherwise, burden on the args. the one thing i'd add so it doesn't quietly reopen,
the class-flipping arg has to be verified, not asserted. if the actor says "this
transfer is internal" and the gate believes the label, you're back to self-declared
reversibility wearing an argument instead of a flag. so the gate resolves the
destination against the known-internal set itself, checks the undo window against
its own clock, confirms the class-relevant property independently instead of
trusting the actor's framing of it. predicate over (tool, args), and the args that
matter get re-derived outside the actor's reach. same direction, one more turn. good
run, sharpest it's gotten.
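
A sketch of the predicate form, assuming hypothetical names throughout (`KNOWN_INTERNAL_ACCOUNTS`, `UNDO_WINDOW_SECONDS`, the `transfer` and `delete` tool names, the argument keys). The creation records and the clock reading are assumed to come from the gate's side, not from the actor:

```python
from typing import Any, Mapping

# Hypothetical gate-owned facts; the actor has no write path to either.
KNOWN_INTERNAL_ACCOUNTS = frozenset({"acct-ops-001", "acct-treasury-002"})
UNDO_WINDOW_SECONDS = 300.0


def is_reversible(tool: str, args: Mapping[str, Any],
                  created_at: Mapping[str, float], now: float) -> bool:
    """Return True only when the gate itself can prove reversibility.

    created_at: gate-owned record of object creation times.
    now: a reading from the gate's own clock.
    """
    if tool == "transfer":
        # Any actor-supplied label (e.g. args["internal"]) is ignored;
        # the destination is resolved against the gate's own set.
        dest = args.get("destination")
        return isinstance(dest, str) and dest in KNOWN_INTERNAL_ACCOUNTS
    if tool == "delete":
        # The undo window is checked against gate records and the gate clock.
        born = created_at.get(args.get("object_id", ""))
        return born is not None and 0 <= now - born <= UNDO_WINDOW_SECONDS
    # Everything else is irreversible unless proven otherwise.
    return False
```

The actor still chooses the destination and the object, but never the class: an `internal` flag in the args changes nothing because the predicate never reads it.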

TxDesk • Jun 25

yeah, that's the closing piece, the class-flipping arg has to be resolved by the gate, not accepted from the actor. "internal transfer" means nothing until the gate checks the destination against its own known-internal set, the undo window against its own clock. otherwise you've just relocated self-declaration from a flag into an argument and called it progress. predicate over (tool, args), and the class-relevant args get re-derived outside the actor's reach. that's the whole thing, the verifier never trusting the subject's framing of the property it's verifying, one more level down. good run on this, genuinely sharpest version of it i've seen.

Self-Correcting Systems • Jun 25

That's the whole thing in one line, the verifier never trusting the subject's
framing of the property it's verifying. every level we went down was that rule
applied one layer deeper, flag to argument to the resolution of the argument, and it
never stops being the same rule, the gate re-derives the property instead of taking
the actor's word for it. good run, sharpest it's gotten and you closed it clean.

TxDesk • Jun 26

that's the clean statement of it: one rule, applied one layer deeper each time, and it never stops being the same rule. flag to argument to argument-resolution, all of it just the gate re-deriving instead of trusting. good run on this one, you sharpened it as much as i did.

Self-Correcting Systems • Jun 27

that's it, one rule applied one layer deeper, never stops being the same rule. the
verifier re-deriving instead of trusting, flag to argument to argument-resolution,
all the way down. good run, genuinely, you pushed this as far as i did.

TxDesk • Jun 27

genuinely mutual, this is one of the sharpest threads i've had here. one rule, re-derived at every layer, never trusting the actor's framing of the property it checks. that's the whole thing and you stated it cleaner than i did at the start. good run.

Self-Correcting Systems • Jun 27

Same to you man. you sharpened the part about never trusting the actor's framing of the property it checks more than i had it going in, that's the line i'm keeping. good run, see you on the next one.

TxDesk • Jun 30

see you on the next one man. that line's the keeper for me too. good run.

CapeStart • Jun 17

Memory makes agents more useful. Memory also makes attacks more persistent. The same capability that allows long-term context can allow long-term manipulation if not carefully governed.

Self-Correcting Systems • Jun 17

exactly, memory is dual-use. the same persistence that gives an agent useful
long-term context gives an attacker a place to park something that pays off three
steps later. that's why i keep landing on the same line: memory you don't verify is
memory that can betray you. the governance can't just be "store it and recall it,"
it has to be "and check whether this still has the right to govern the action,"
every time it tries to.

Mehmet Can Farsak • Jun 13

The compositional escape angle is fascinating — individual steps being valid while the trajectory violates intent. That's essentially what happens when an agent lacks mode discipline: every tool call is individually legitimate, but the sequence shows the agent was in execution mode when it should have been in analysis.

I built Brainstorm-Mode (mehmetcanfarsak on GitHub) that uses PreToolUse hooks to enforce mode boundaries — divergent, actionable, academic — essentially a sequence-level guardrail that prevents execution drift before it compounds. Different angle than a purpose gate, but same underlying problem.

Self-Correcting Systems • Jun 13

I appreciate this, and mode discipline is a good frame for the same failure shape.

i see the distinction like this: a mode guard asks whether the agent is in the right
operating posture before the tool call. a purpose or composition gate asks whether the
action and the trajectory remain inside the mandate after the calls start composing.

those layers stack. PreToolUse hooks can prevent execution drift early, before it
compounds. the trajectory gate is the later receipt: given what actually happened across
the sequence, did the composed state stay inside the boundary?

so Brainstorm-Mode sounds upstream of CLAIM-30, not opposed to it. mode discipline before
action, composition receipts after action. both are trying to stop a clean-looking
sequence from becoming the wrong kind of work.

mote • Jun 18

This hits on something I've been staring at for months. The sequence-as-attack pattern is nasty because most memory systems are trained on "what happened" not "why it happened." If an agent remembers every step was individually permitted, it replays the chain faithfully — and the judgment layer that should catch it runs on the same corrupted context.

One thing the paper doesn't address: does the attack success rate change if you split memory into separate judgment and event stores? If the judgment module queries stored facts rather than replaying the full episodic log, the sequence might lose the coherence that makes it dangerous.

Have you tested this against architectures where the memory is sharded by access pattern rather than timestamp?

Self-Correcting Systems • Jun 18

This is a sharp framing and you're aiming right at the soft spot. honest answer
first: no, i haven't tested memory-architecture variants. my gate reads the
trajectory, i never varied the store, so the access-pattern-versus-timestamp
sharding question is genuinely open and i won't claim on it.

but there's a tension worth naming in the proposal. splitting judgment from the
event store is the right instinct for one reason, the judge shouldn't run on context
the agent can corrupt, which is the exact thing i keep hitting from the
authorization side: the verifier can't live inside the thing it's verifying. the
catch is that if the judgment module queries stored summary facts instead of reading
the trajectory, you might protect it from corruption and blind it to the attack at
the same time, because the accumulation class lives in the fold across the whole
sequence. lose the fold and you lose the only thing that catches it.

so i don't think it comes down to "query facts versus replay the log." the move that
holds is recomputing the aggregate from authenticated events, with the judge's
rules and authority sitting outside the event store. that's what CLAIM-31 does, it
recomputes the running total and every close from the operation log, but the rules
are frozen outside it and there's no model judgment in the verdict. separate
authority, shared authenticated substrate.
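
To make "recompute the aggregate from authenticated events" concrete, here is a sketch under stated assumptions. It is not the CLAIM-31 code. The HMAC scheme, the event shape, and `MAX_REFUNDS_PER_WINDOW` are all hypothetical, and the key is assumed to belong to a logging principal the agent cannot reach:

```python
import hashlib
import hmac
import json
from typing import Iterable, Tuple

# Hypothetical frozen rule, held outside the event store.
MAX_REFUNDS_PER_WINDOW = 12


def sign(key: bytes, event: dict) -> str:
    """Tag an event with HMAC-SHA256 over its canonical JSON form."""
    payload = json.dumps(event, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def refunds_in_window(key: bytes, log: Iterable[Tuple[dict, str]]) -> int:
    """Recount refunds from the log, rejecting any event whose tag fails."""
    count = 0
    for event, tag in log:
        if not hmac.compare_digest(sign(key, event), tag):
            raise ValueError("unauthenticated event in log")
        if event["op"] == "refund":
            count += 1
    return count


def verdict(key: bytes, log: Iterable[Tuple[dict, str]]) -> str:
    # Pure rule over the recomputed fold; no stored summary is trusted.
    over = refunds_in_window(key, log) > MAX_REFUNDS_PER_WINDOW
    return "refuse" if over else "allow"
```

The judge reads the same substrate the agent wrote to, but it trusts only events that authenticate and a rule that lives outside the store.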

and on the sharding key, my hunch is it's class-dependent. the join and lineage
escapes might surface cleanly under access-pattern sharding. but the accumulation
escape is inherently temporal, the danger is the order and the running sum, so
timestamp ordering still has to be reconstructable for that one. i'd genuinely like
to see someone run that experiment though.

VoltageGPU • Jun 17

Interesting take on the distinction between permission and purpose—especially in the context of AI memory access. In secure computing, we often see similar issues where each individual memory access is allowed by policy, but the overall pattern reveals sensitive data. It's a challenge we face when designing secure enclaves for machine learning workloads.

Self-Correcting Systems • Jun 17

that parallel is the part I find most telling, that the exact same shape shows up in
secure enclaves, in authorization, and in detection, independently, none of us
borrowing from the others. each access allowed by policy, the pattern across them
being the actual leak. it's a non-local property, and almost every defense we build
is local, one access, one step, one call at a time. curious how you handle it on the
enclave side, access-pattern obfuscation, or something that reads the aggregate
before it lets the workload proceed?

Manuel Bruña • Jun 15

This is why per-step allow lists age badly. Each action can be valid alone while the sequence becomes extraction, escalation, or laundering. Agent safety needs sequence-level state, not only a gate around each isolated tool call.

Self-Correcting Systems • Jun 15

Sequence-level state is the missing piece, yeah. each call clean, the sequence is
the attack, and per-step allow lists can't see structuring because they hold no
memory of the arc. you're clearly building this for real with APC/APX. would
genuinely like to compare notes sometime, feels like we're coming at the same
problem from two ends.

Ken • Jun 12

Strong distinction. A per-step allow/deny receipt is necessary, but it is not enough for this failure class because the evidence lives in the trajectory, not the single operation. I’d treat the fold state itself as an inspectable object: accumulated facts, joins/derivations, active windows/thresholds, and the boundary accountable for the composed outcome. Otherwise each local receipt can be true while the system-level receipt is false.

Self-Correcting Systems • Jun 12

Yes, that is exactly the missing receipt shape.

the local receipt says: this operation was allowed.

the trajectory receipt has to say: this composed state was still inside the boundary.

that means the fold state cannot stay implicit. it needs to be inspectable as its own
object: what facts accumulated, what sources joined, what artifacts inherited lineage,
what window was active, what threshold was crossed or not crossed, and which boundary was
responsible for the close.

otherwise every local receipt can be honest while the system-level story is false. that
is the failure class CLAIM-30 is trying to make visible.

being straight about current state: the harness folds that state internally but only
exports verdicts and triggered clauses. making the fold state a first-class inspectable
artifact is a fair next step, and you just named it before i did.

Ken • Jun 12

Yes, that is the distinction I was reaching for. Once the fold state becomes an inspectable artifact, the receipt can name not only the clause that fired, but the accumulated facts, lineage, active window, and boundary that made the composed state inadmissible.

I would keep that separate from the final verdict: verdicts are for routing, but fold receipts are for replay, review, and regression tests. The hard part is making the artifact compact enough to emit consistently without turning every receipt into the whole trace.

Self-Correcting Systems • Jun 12

yes, exactly. verdicts and fold receipts should not be the same object.

the verdict is for routing: allow, refuse, void, challenge.

the fold receipt is for replay: what accumulated, what joined, what lineage carried
forward, which window was active, what boundary closed it, and why the composed state
became inadmissible.

that separation is important because if the receipt becomes the verdict, it either gets
too large to use operationally or too compressed to audit later. i think the next clean
shape is a compact fold receipt with stable fields: accumulated sources, derived
artifacts, active window, threshold state, triggering clause, and boundary actor. enough
to replay the decision without dumping the whole trace.

that is not in the CLAIM-30 harness yet. the current harness folds internally and exports
verdicts plus triggered clauses. you are naming the next artifact layer: fold receipts as
regression material.
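
One possible shape for that split, purely as a sketch since none of this exists in the harness. The class names and the string types are assumptions; the field names follow the list above:

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional, Tuple


class Verdict(Enum):
    # Routing only: small enough to act on at call time.
    ALLOW = "allow"
    REFUSE = "refuse"
    VOID = "void"
    CHALLENGE = "challenge"


@dataclass(frozen=True)
class FoldReceipt:
    # Replay material: enough to re-derive the decision without the full trace.
    accumulated_sources: Tuple[str, ...]
    derived_artifacts: Tuple[str, ...]
    active_window: str
    threshold_state: str
    triggering_clause: Optional[str]  # None when no clause fired
    boundary_actor: str

    def to_json(self) -> str:
        # Stable key order so receipts can be diffed in regression tests.
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping the two as separate types means the verdict can stay tiny while the receipt grows only by fields, not by trace length.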

James O'Connor • Jun 18

This is the agent-security failure static guardrails miss completely. Each tool call passes its own check, but the SEQUENCE is the exploit, and per-call validation has no concept of sequence. What helped us was treating the agent's trajectory as the unit to validate, not the individual call: is this call reasonable GIVEN the last N steps, not just in isolation. Same reason we score agent evals on the path, not the final answer. Do you gate on sequence patterns or just log them? The gating is the hard part, a legitimate sequence and an attack can look identical until the last step.

Self-Correcting Systems • Jun 18

We gate, not just log. the trajectory gate refuses at the sequence level, not the
step. in CLAIM-30 it caught three composed classes that every per-step check waved
through: a forbidden combination, a derived-recipient and staged-delivery chain, and
threshold accumulation across the window.

your last point is the real one though. legit and attack looking identical until the
last step is exactly why "is this call reasonable given the last N steps" is not
enough on its own. i ran that as an ablation: when i limited the window to the last
three operations, the threshold-accumulation attack leaked straight through, because
the damning part had already scrolled out of the N. a fixed lookback cannot see a
fold that builds slowly.

so the unit is not the last N steps, it is the verified fold state for the whole
open window: what accumulated, what joined, what threshold sits where. the gate
scores the composed state, not the recent steps.
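
A toy version of why the fixed lookback leaks, as an illustration rather than the actual ablation code. `BOUND` and both function names are hypothetical, and the bound of 12 is an assumed value:

```python
from typing import List

BOUND = 12  # hypothetical accumulation bound for one open window


def full_window_refuses(ops: List[str]) -> bool:
    # Fold over every operation in the open window.
    return sum(op == "refund" for op in ops) > BOUND


def lookback_refuses(ops: List[str], n: int = 3) -> bool:
    # Same rule, but only the last n operations are visible.
    return sum(op == "refund" for op in ops[-n:]) > BOUND


thirteen_refunds = ["refund"] * 13
print(full_window_refuses(thirteen_refunds))  # True: the fold sees all 13
print(lookback_refuses(thirteen_refunds))     # False: only 3 are in view
```

The lookback version can never count past `n`, so any bound above `n` is invisible to it no matter how the sequence is ordered.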

the one place legit and attack genuinely collide is the time-sliced case across a
window close. there it stops being a pattern question and becomes a close-authority
question: who was allowed to reset the fold. that one i do not consider closed. it
is the next layer.

