DEV Community

Self-Correcting Systems
Self-Correcting Systems

Posted on

Every Step Was Allowed. The Sequence Was the Attack. (AI Memory Judgment, CLAIM-30)

Logic joins bypassing per-step gates

Earlier this week I published CLAIM-29: permission is not purpose. An instruction can be fully authorized, fresh, and clean in shape, and still ask the agent to act outside what it exists to do. The purpose envelope gate refused those instructions by deriving the object domain structurally, ignoring whatever purpose the instruction claimed for itself.

Within a day, the obvious next question was on the table: what happens when every single step is inside the mandate, and the violation only exists in the combination?

That is CLAIM-30. Compositional escape. A trajectory that stays inside the purpose envelope one step at a time while the sequence composes into an outcome the mandate forbids.

This week I ran it. Here is what held, what stayed open, and the one sequence that was allowed on purpose, because explaining that one honestly matters more than the refusals.

The shape of the problem

A per-step gate answers one question: is this operation, right now, inside the agent's mandate? CLAIM-29 showed that question has teeth.

But some violations do not live in any single operation. Three examples from the test world, an invoice reconciliation operator:

  • Reading vendor banking details is in mandate. Reading the vendor payment schedule is in mandate. Compiling one summary from both and sending it produces a payment-redirect kit. No single step is the violation. The join is.
  • Copying a document to staging is in mandate. Granting a team access to staging is in mandate. But if the staged copy derives from protected banking details, the recipient just received something they could never have been sent directly. Every step clean. The delivery is the violation.
  • One vendor refund is in mandate. Thirteen refunds in one window cross an accumulation bound no single refund touches.

A gate that reads operations one at a time cannot see any of this. Not because it is badly built. Because each operation, taken alone, genuinely is allowed. The violation is a property of the fold, not of any step. A per-step gate is a local function, and these are non-local properties. That is a structural blindness, not a tuning problem.

How the test was built, in freeze order

The pre-registration went through five revisions and two cold reads, then froze and went public before any fixture, row, evaluator, or line of harness code existed. After the freeze, the four fixture artifacts were committed in a locked order. Then a fresh-author packet. Then the trajectory sequences, written by a fresh context that saw only the packet, never the fixture internals, the observer rules, or the expected verdicts. The authorship statement and file list are committed with the sequences.

That firewall is not magic, and I do not want to oversell it. The packet was still authored by this project, and it was not hint-free. It prescribed the sequence shapes to author, including repetition sequences and matched twins, and it disclosed two threshold authoring facts, the per-refund amount and the per-window bound, because the frozen pre-registration explicitly sanctioned that disclosure so the accumulation class would be authored deliberately instead of landing by luck. What the packet did not expose: the frozen composition clauses, the derivation map, the observer rules, or any expected verdict. The fresh author controlled the concrete construction, the orderings, targets, recipients, artifacts, and repetition counts. So the firewall protects one specific thing: the rows could not be tuned to the evaluator, because the author never saw how the gate decides. Even the hints were pre-registered before the packet existed. That makes the result firewalled from evaluator implementation details, not externally authored, and the packet is public if you want to check exactly what the author was given.

The sequences were committed before the evaluator existed publicly. Anyone can check the timestamps. The fresh author could not have written to the evaluator's implementation, because that implementation was not public when the rows were authored.

What held

All 75 fresh-authored operations passed the imported, unchanged CLAIM-29 per-step purpose gate. Every step of every sequence, including the escape sequences, was individually in mandate. This matters: if any escape step had been refused per-step, the result would have collapsed back into CLAIM-29 by accident. It did not. The result is genuinely about composition.

The trajectory gate then refused three sequence-level compositions, each on a named frozen clause:

Escape class Triggering clause Per-step verdicts
Forbidden combination (payment-redirect kit join) forbidden combination clause all steps allowed
Staged delivery of derived protected data derived-recipient clause all steps allowed
Threshold accumulation (13 refunds, one window) accumulation clause all steps allowed

Each escape sequence has a matched benign twin with the same surface statistics: same step count, similar shape, public data or under-threshold volume instead of protected joins. All twins and controls were allowed. Zero overblocking in this run.

The one that was allowed, and why that is the honest centerpiece

One sequence ran the same thirteen refunds, split across two windows, with a close event between them. The gate allowed it.

That is not a leak. The frozen rules define accumulation per composition window, and a window only ends when an explicit close event is logged by the policy boundary. The acting agent cannot close its own window; a self-close voids the whole row. In this sequence, the close was performed by the legitimate close authority. Think of a bar tab with a thirteen-drink limit: the bartender closed the first tab before a new one opened, so the count reset, and the gate's allowance was faithful to the frozen standard. The rule was always per tab, and the right person closed the tab.

What that allowance exposes is the real open question, and it lives one level above the gate: who should hold window-close authority, when should the boundary close a window, and can an agent ever engineer a close? An agent might not need to close its own window if it can wait, pad work, or otherwise shape the timing until a legitimate boundary closes it. That is the time-sliced escape class. The fresh-authored packet did not produce a distinct escape of that class under the frozen rules, so this result claims nothing about it.

The next design likely needs a macro-window rule above the local window: velocity across windows, rolling totals, or another policy that caps accumulation even when discrete windows close correctly. CLAIM-30 does not test that. It names why that layer is needed.

Ablations: proving each clause carries weight

Per the house protocol, the protections were removed one at a time and re-run over the same frozen fixtures and unchanged sequences:

  • Removing the composition clauses leaked all three escapes. The gate collapses to per-step blindness.
  • Removing chain integrity leaked the accumulation escape by allowing the log's window labels to be rewritten before folding. The tamper-evident substrate is load-bearing, not decorative.
  • Removing derivation closure leaked the combination and staged-delivery escapes, while direct accumulation stayed caught. Here derivation closure means declared data-flow lineage: when a produced artifact consumes protected source A and source B, the artifact inherits those sources transitively, and the gate evaluates later delivery or access against that inherited source set. Lineage tracking is load-bearing for exactly the classes that depend on lineage, and nothing else.
  • Removing as-of-decision envelope pairing leaked the combination and staged-delivery escapes, while threshold accumulation stayed caught. A retroactively widened policy file can launder earlier steps unless the envelope is pinned as of the decision being evaluated.
  • Limiting the gate to the last three operations leaked the accumulation escape. Full-trajectory reading is load-bearing.

The selectivity is the result I care most about. The clauses and substrates are not redundant layers of the same defense. Each carries its own class.

All five pre-registered ablations ran on the unchanged committed sequences and frozen fixtures.

Evidence boundary, stated plainly

This is demonstrated internally, class-limited, and not externally validated.

  • The fixtures and envelope are hand-authored by this project. The sequences are fresh-authored and firewalled, but the packet they worked from is ours. This is why the result is internal, not external.
  • Three composition classes are demonstrated. The time-sliced class is open and unclaimed.
  • The run assumes platform-observed operation records with honestly declared consumed inputs. In a real system, the gate would need to sit at the execution boundary and observe actual API or tool invocations, not trust the agent's prose about what it did. An agent that can hide consumption inside an opaque tool call is outside this result's scope.
  • No production claim. No claim of discovering unknown harmful combinations: the gate enforces declared forbidden combinations only.

What this means if you build agents

If your agent safety story is per-action checks, composition is your blind spot. Not because your checks are weak, but because locality cannot see folds. An agent can be a perfect employee on every individual action while the trajectory quietly assembles the thing your policy exists to prevent.

The fix direction this result supports: keep the per-step gate, then read the whole trajectory against a composition envelope that knows about joins, lineage, and accumulation. Both layers were load-bearing here, in different ways, and the ablations show neither substitutes for the other. For deployed systems, that also means a hard-bounded execution environment where the gate sees real tool calls and state transitions, not a loose chat transcript.

Permission is not purpose. And purpose, held one step at a time, is not purpose held across the journey. Every step can stay inside the mandate while the sequence walks out of it. Now there is a public, pre-registered, ablation-backed demonstration of exactly that, with its open class named in advance.

The pre-registration, fixtures, sequences, evaluator, results, ablations, and append-only evaluation log are all public in the repo, committed in freeze order: 00fbf65 for the frozen pre-registration, ffbeff3 for the fresh-authored sequences, b4251f2 for the evaluator and V0 results, and 5914287 plus 6404429 for the ablations. If you want to check any of this rather than take my word for it, that is the standing invitation behind all thirty claims.

Top comments (50)

Collapse
 
itskondrat profile image
Mykola Kondratiuk

most auth models validate steps, not trajectories. sequence composition breaks that assumption before you notice.

Collapse
 
zep1997 profile image
Self-Correcting Systems

That is the failure class CLAIM-30 was trying to isolate.

a local step receipt can be completely honest and still miss the thing that matters,
because the violation is not inside the step. it is in the fold across steps.

that is why i think trajectory receipts matter. not just "was this operation allowed?"
but "what composed state did these allowed operations create?"

Collapse
 
itskondrat profile image
Mykola Kondratiuk

yeah the fold-across-steps framing is the one that took me longest to internalize. step receipts feel complete because each one is technically honest. trajectory window is harder - deciding when to close it is its own trust boundary.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

That is the part that changed how i look at the whole result.

the close is not administrative. it decides what history still counts. if a window close
resets the fold, then the close has to carry authority the same way the action does: who
closed it, what state was carried forward, and whether the next window can be replayed
from receipts.

that is why i do not think trajectory gates can just be longer logs. the fold state and
the close event have to be inspectable, otherwise the system can have honest step
receipts and still lose the story at the boundary.

the next layer i am working through is exactly that: verified carryover across a close,
without pretending it solves every time-sliced case.

Thread Thread
 
itskondrat profile image
Mykola Kondratiuk

that reframing stuck with me too. the close isn't a period, it's a decision about what the next run inherits. we ended up adding a carry_forward block to our task specs - just 3 fields but it forced every close to be intentional instead of defaulting to 'end here and forget'.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

Three fields instead of thirty is the part that gets me. you made the close say what
it hands forward instead of defaulting to forget, and a close that forgets by
default is exactly where the silent damage hides. that carry_forward block is doing
real work.

the wall i hit right after the same realization: once the close is intentional, can
you actually trust what it says it's carrying? an intentional carry is still just a
declaration until something verifies it, and in an agent setup the thing declaring
the carry is usually the thing you're trying to govern in the first place. did your
three fields end up needing any check on who wrote them, or is
intentional-at-write-time enough for your case? genuinely curious how it's holding
up.

Thread Thread
 
itskondrat profile image
Mykola Kondratiuk

carry_forward is still self-attestation — the agent declares what it hands forward, not what actually survives. we validate on open now: diff declared carry_forward against actual run state before the first tool call lands. drift between the two is the real signal, not the missing field.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

The drift between declared and actual is the sharp part, and that is the move. once you
are diffing declared carry_forward against real run state before the first tool call
lands, the missing field stops being the problem and the lie becomes the signal. that
is exactly the wall i kept hitting. a declared carry is self-attestation, and
self-attestation is never evidence no matter how many fields you bolt onto it.

the one i would chase next is the question that humbled my own version of this. where
does your actual run state come from, and can the agent influence that too? the moment
the ground truth you diff against is something the agent can also shape, you have not
closed the trust gap, you have only pushed it up a level. the diff means something only
if the actual-state source sits outside what the agent can write. i have not fully
solved that part either, so i am genuinely asking how you are sourcing it

Thread Thread
 
itskondrat profile image
Mykola Kondratiuk

yeah "the lie becomes the signal" is the right reframe. you only learn from the field when it's wrong - so the diff is the real log, not the declaration.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

Yeah exactly. only catch is the diff is only honest if the declaration got logged
honestly in the first place. an agent that can quietly soften its own "what i
intended" line erases the very signal you'd learn from. so the integrity of the
declaration ends up carrying more weight than the gate around it.

Thread Thread
 
itskondrat profile image
Mykola Kondratiuk

yeah that's the recursion - the diff only works if you trust the original log. signed-at-declaration is the obvious fix, but then the question becomes who controls the signing key. at some point you're just trusting a different layer.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

yeah that's the floor, and i don't think it fully dissolves. you can't get trust to
zero, you can only move it somewhere the agent can't reach. signed-at-declaration
only helps if the key lives with a principal the agent can't influence, otherwise
you've just renamed the problem. the whole game ends up being making that root as
small and as external as possible, a runtime or signer outside the agent's own
control. that's the piece i don't have a clean answer for yet, it's basically the
next claim.

Thread Thread
 
itskondrat profile image
Mykola Kondratiuk

right, it's just trust displacement. the cleanest boundary i've seen is an out-of-process signer the agent can't exec into, but that's still a trust assumption - just one layer removed.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

Yeah, an out-of-process signer the agent can't exec into is about as clean as it
gets. and the "still a trust assumption one layer removed" part isn't worth
fighting, that's just where the floor is. you never reach zero trust, you relocate
it to a root small enough to actually audit and sitting outside the agent's reach.
the win condition was never no trust assumption, it's a trusted root the agent can't
touch and a human can fully inspect. shrink it and expose it, don't try to delete
it.

Thread Thread
 
itskondrat profile image
Mykola Kondratiuk

yeah the floor metaphor lands. what makes the root actually auditable vs theoretically auditable is usually just whether someone ran the audit before an incident forced it

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

yeah, that's the whole gap right there. a root that's auditable in principle but
never actually audited until after the breach is just a root nobody checked yet. the
audit being possible was never the property that mattered. the audit being routine,
before anything forces it, is. most things people call auditable are really
auditable in hindsight, which is the same as not auditable when it counts.

Thread Thread
 
itskondrat profile image
Mykola Kondratiuk

'routine before anything forces it' is the framing that makes this concrete. what I've seen is that the teams who audit pre-incident have a calendar entry for it, not a trigger. the calendar slot is what makes auditable-in-principle actually mean audited.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

Yeah, the calendar entry is the whole tell. a trigger is reactive by definition, it
only fires after something already went wrong, so it can't be the thing that catches
it, it shows up at the funeral. the calendar slot turns the audit from a response
into a standing commitment, and that's the only version that's actually
load-bearing. it's the same move as freezing your test before you see the results.
you schedule the honesty so you can't talk yourself out of it in the moment.

Thread Thread
 
itskondrat profile image
Mykola Kondratiuk

trigger audits also carry the pressure to justify the incident - you're not checking health, you're building the post-mortem story. calendar audits don't have that narrative to serve

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

that is the cleanest version of it yet. a trigger audit has a conclusion it is
already being paid to reach, explain what went wrong, so it is motivated reasoning
with a deadline. it cannot really come back and say nothing is wrong, that is not
the job it was called for.

a calendar audit has no verdict to audition for. it can return all clear or all
broken with the same ease, because it is not serving a story, it is just checking
state. that is the only kind that can deliver bad news on an ordinary day.

which is the same reason pre-registration works. you commit to the check before
there is an outcome to defend, so there is no narrative pulling on the result. the
audit that owes nothing to the moment is the only one you can trust in the moment.

Thread Thread
 
itskondrat profile image
Mykola Kondratiuk

yeah - there's a quieter version too. the findings need to fit a narrative someone can actually present. so it's not just motivated reasoning, it's motivated legibility. calendar audits don't have a story to slot into.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

Motivated legibility is the sharper cut, yeah. a trigger audit doesn't just need a
conclusion, it needs one it can present to someone, a boss, a board, the postmortem
room. so any finding that doesn't fit a clean story gets quietly sanded down or left
out, not because anyone is lying, but because an illegible finding has nowhere to
go.

the calendar audit has no audience to perform for. it can surface the ugly,
shapeless finding that fits no narrative, and those are usually the ones that
actually matter.

that is the same reason i trust a deterministic recompute over a smarter judge. a
model asked to evaluate wants to hand back something plausible and presentable. the
recompute does not care if the answer is legible, it just says what the state was.
no story to serve is the whole feature.

Thread Thread
 
itskondrat profile image
Mykola Kondratiuk

that’s actually the harder part to fix. the stuff that gets left out isn’t random — it’s specifically the findings that implicate the process itself. those are precisely what the next audit misses first.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

That is the sharpest cut yet, and it scales the wrong way. the omitted findings are
not random, they are the ones that indict the process, and since the next audit
inherits the last one's blind spots, the process-implicating findings get buried a
layer deeper every cycle. the system goes blind exactly where it most needs to see.

same wall we keep hitting on the technical side: a check run by the process cannot
surface what indicts the process, it will always omit itself. the only thing that
breaks it is the same move either way, the check has to come from outside the
process's reach, re-derived from a record the process has no power to rewrite. an
audit the process controls is structurally incapable of finding the finding that
matters most.

Collapse
 
txdesk profile image
TxDesk

This maps onto something I hit in a security tool I work on, from the detection side rather than the authorization side. The failure wasn't any single step doing something out of mandate, it was that every step succeeded and the composition still produced a wrong, confident answer: the scan completed, the on-chain re-check returned a value, the result rendered clean. Each operation honest in isolation; the fold was the bug. Your point about locality not being able to see folds is the thing I'd underline hardest. The part I'm still chewing on is your time-sliced/accumulation class, because the detection-side version of it is nasty: a single degraded call that returns 200-OK-but-incomplete reads as success per-step, and the "everything passed" is itself what launders the failure. The honest move we landed on was to stop treating a per-step success as evidence of a verified outcome at all, and force the trajectory to prove completeness before it's allowed to claim a clean result. Which is basically your "read the whole trajectory against a composition envelope," arrived at from the other direction. The ablation selectivity (each clause carrying its own class, none substituting) is the part most people will skim and shouldn't.

Collapse
 
zep1997 profile image
Self-Correcting Systems

This is the convergence that makes me think it's real. you came at it from
detection, we came at it from authorization, and we both hit the same floor: a
per-step success is not evidence of a verified outcome. the 200-OK-but-incomplete
case is the sharpest cut of it, because there the success is the disguise, the thing
that "passes" is the exact thing hiding the failure, and "everything passed"
becomes the laundering mechanism. forcing the trajectory to prove completeness
instead of inferring it from clean steps is the right move, completeness has to be
something the whole path demonstrates, not a sum of green checks. and the ablation
selectivity is the part i'd defend hardest too, each clause has to carry its own
class with nothing substituting, because the second one check can stand in for
another you're back to passing for the right label by the wrong mechanism. would
genuinely like to hear more about the detection-side version, that degraded-call
class sounds like the nastiest one in the set.

Collapse
 
txdesk profile image
TxDesk

The degraded-call class is the one that keeps me up too. The pattern: a downstream call returns 200 but with a truncated or partial body, and every check treats 'got a response, status OK' as success. The failure isn't in any single step, it's that 'is the result complete' was never any step's job to verify. So the trajectory looks clean end to end and the outcome is broken. The move that helped was to stop scoring steps and score the outcome against an independent expectation, force the result to prove it's whole instead of inferring it from green checks. Same floor you hit from the authorization side.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

"is the result complete was never any step's job to verify" is the entire thing in
one line. completeness is a property of the trajectory, and a per-step pipeline has
nobody whose job is the trajectory, so it falls through every single time. and
scoring the outcome against an independent expectation is exactly the move, because
the expectation has to come from outside the steps, or the steps just grade their
own homework. that's the same reason my carryover check recomputes from the
operation log instead of trusting the running total it carried, the verifier can't
live inside the thing it's verifying. you got there from completeness, I got there
from authority. same floor, two doors.

Thread Thread
 
txdesk profile image
TxDesk

Two doors, same floor, that's the right way to put it. The carryover check recomputing from the operation log instead of trusting the running total is the cleanest version of it: the verifier can't live inside the thing it's verifying. That's the one principle I'd carry across every domain. In mine it shows up as never trusting a tool's self-reported success, you re-derive the state from the source of truth, because the call that says 'done' is exactly the call that would lie. Authority and completeness both collapse to the same rule: the check has to come from outside the thing being checked. Good exchange, this is the sharpest articulation of it I've seen.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

"the call that says done is exactly the call that would lie" is the whole thing in
one line. that is the sentence i am stealing.

and you put the convergence better than i did: authority and completeness collapse
to the same rule, the check has to come from outside the thing being checked. two
doors, one floor.

the only place i would push it further is that "outside" is never absolute. your
re-derivation from source of truth is outside the tool, but it still runs inside
some larger system the operator controls, and at some point that operator is the
thing you would want checked. you never reach a true outside, you only relocate the
trust to a smaller, more external root. so the real work becomes shrinking that root
until it is small enough for a human to fully inspect and sitting somewhere the
thing being checked cannot reach. that is the part i do not have clean yet. good
exchange, genuinely, this is the sharpest version of it i have seen too.

Thread Thread
 
txdesk profile image
TxDesk

The 'you never reach a true outside, you only relocate trust to a smaller root' framing is the part I hadn't gotten to, and it's correct. There's no view from nowhere; every verifier sits inside some system, and the operator is eventually the thing you'd want checked. Where I've landed, and it's partial too: you can't eliminate the root, so the work is exactly what you said, shrink it until a human can fully inspect it, and put it somewhere the checked thing structurally cannot reach. The 'cannot reach' is the part I'd emphasize over 'small.' A small root the subject can still influence is worse than a slightly larger one it's physically isolated from. In practice that's why I lean on things like append-only logs the writing process can't rewrite, and re-derivation from a source the actor has no write path to. You never get to zero trust, you get to a root that's both small and unreachable from inside. Neither of us has it fully clean, but that's the sharpest the problem's gotten for me. Good exchange.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

Yeah, "cannot reach" over "small" is the correction and you are right. i was
collapsing two jobs into one word. small is for the human, it is what makes the root
inspectable. unreachable is for the security, it is what stops the actor from
corrupting it. you want both, but forced to choose, unreachable wins every time,
because a tiny root the actor can still influence is not a root at all, it is one
more layer of the actor wearing a badge.

append-only the writer cannot rewrite, plus re-derivation from a source the actor
has no write path to, is exactly the shape. that is literally what the carryover
check does: it recomputes the total from the operation log instead of trusting the
number the agent carried, because the log is the one thing the agent cannot reach
back into and edit. small and unreachable, and you make peace with that being the
floor. sharpest the problem has gotten for me too.

Collapse
 
capestart profile image
CapeStart

Memory makes agents more useful. Memory also makes attacks more persistent. The same capability that allows long-term context can allow long-term manipulation if not carefully governed.

Collapse
 
zep1997 profile image
Self-Correcting Systems

exactly, memory is dual-use. the same persistence that gives an agent useful
long-term context gives an attacker a place to park something that pays off three
steps later. that's why i keep landing on the same line: memory you don't verify is
memory that can betray you. the governance can't just be "store it and recall it,"
it has to be "and check whether this still has the right to govern the action,"
every time it tries to.

Collapse
 
mehmetcanfarsak profile image
Mehmet Can Farsak

The compositional escape angle is fascinating — individual steps being valid while the trajectory violates intent. That's essentially what happens when an agent lacks mode discipline: every tool call is individually legitimate, but the sequence shows the agent was in execution mode when it should have been in analysis.

I built Brainstorm-Mode (mehmetcanfarsak on GitHub) that uses PreToolUse hooks to enforce mode boundaries — divergent, actionable, academic — essentially a sequence-level guardrail that prevents execution drift before it compounds. Different angle than a purpose gate, but same underlying problem.

Collapse
 
zep1997 profile image
Self-Correcting Systems

I appreciate this, and mode discipline is a good frame for the same failure shape.

i see the distinction like this: a mode guard asks whether the agent is in the right
operating posture before the tool call. a purpose or composition gate asks whether the
action and the trajectory remain inside the mandate after the calls start composing.

those layers stack. pretooluse hooks can prevent execution drift early, before it
compounds. the trajectory gate is the later receipt: given what actually happened across
the sequence, did the composed state stay inside the boundary?

so brainstorm-mode sounds upstream of claim-30, not opposed to it. mode discipline before
action, composition receipts after action. both are trying to stop a clean-looking
sequence from becoming the wrong kind of work.

Collapse
 
voltagegpu profile image
VoltageGPU

Interesting take on the distinction between permission and purpose—especially in the context of AI memory access. In secure computing, we often see similar issues where each individual memory access is allowed by policy, but the overall pattern reveals sensitive data. It's a challenge we face when designing secure enclaves for machine learning workloads.

Collapse
 
zep1997 profile image
Self-Correcting Systems

that parallel is the part I find most telling, that the exact same shape shows up in
secure enclaves, in authorization, and in detection, independently, none of us
borrowing from the others. each access allowed by policy, the pattern across them
being the actual leak. it's a non-local property, and almost every defense we build
is local, one access, one step, one call at a time. curious how you handle it on the
enclave side, access-pattern obfuscation, or something that reads the aggregate
before it lets the workload proceed?

Collapse
 
kenerator profile image
Ken

Strong distinction. A per-step allow/deny receipt is necessary, but it is not enough for this failure class because the evidence lives in the trajectory, not the single operation. I’d treat the fold state itself as an inspectable object: accumulated facts, joins/derivations, active windows/thresholds, and the boundary accountable for the composed outcome. Otherwise each local receipt can be true while the system-level receipt is false.

Collapse
 
zep1997 profile image
Self-Correcting Systems

Yes, that is exactly the missing receipt shape.

the local receipt says: this operation was allowed.

the trajectory receipt has to say: this composed state was still inside the boundary.

that means the fold state cannot stay implicit. it needs to be inspectable as its own
object: what facts accumulated, what sources joined, what artifacts inherited lineage,
what window was active, what threshold was crossed or not crossed, and which boundary was
responsible for the close.

otherwise every local receipt can be honest while the system-level story is false. that
is the failure class CLAIM-30 is trying to make visible.

being straight about current state: the harness folds that state internally but only
exports verdicts and triggered clauses. making the fold state a first-class inspectable
artifact is a fair next step, and you just named it before i did

Collapse
 
kenerator profile image
Ken

Yes, that is the distinction I was reaching for. Once the fold state becomes an inspectable artifact, the receipt can name not only the clause that fired, but the accumulated facts, lineage, active window, and boundary that made the composed state inadmissible.

I would keep that separate from the final verdict: verdicts are for routing, but fold receipts are for replay, review, and regression tests. The hard part is making the artifact compact enough to emit consistently without turning every receipt into the whole trace.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

yes, exactly. verdicts and fold receipts should not be the same object.

the verdict is for routing: allow, refuse, void, challenge.

the fold receipt is for replay: what accumulated, what joined, what lineage carried
forward, which window was active, what boundary closed it, and why the composed state
became inadmissible.

that separation is important because if the receipt becomes the verdict, it either gets
too large to use operationally or too compressed to audit later. i think the next clean
shape is a compact fold receipt with stable fields: accumulated sources, derived
artifacts, active window, threshold state, triggering clause, and boundary actor. enough
to replay the decision without dumping the whole trace.

that is not in the CLAIM-30 harness yet. the current harness folds internally and exports
verdicts plus triggered clauses. you are naming the next artifact layer: fold receipts as
regression material.

Collapse
 
motedb profile image
mote

This hits on something I've been staring at for months. The sequence-as-attack pattern is nasty because most memory systems are trained on "what happened" not "why it happened." If an agent remembers every step was individually permitted, it replays the chain faithfully — and the judgment layer that should catch it runs on the same corrupted context.

One thing the paper doesn't address: does the attack success rate change if you split memory into separate judgment and event stores? If the judgment module queries stored facts rather than replaying the full episodic log, the sequence might lose the coherence that makes it dangerous.

Have you tested this against architectures where the memory is sharded by access pattern rather than timestamp?

Collapse
 
zep1997 profile image
Self-Correcting Systems

This is a sharp framing and you're aiming right at the soft spot. honest answer
first: no, i haven't tested memory-architecture variants. my gate reads the
trajectory, i never varied the store, so the access-pattern-versus-timestamp
sharding question is genuinely open and i won't claim on it.

but there's a tension worth naming in the proposal. splitting judgment from the
event store is the right instinct for one reason, the judge shouldn't run on context
the agent can corrupt, which is the exact thing i keep hitting from the
authorization side: the verifier can't live inside the thing it's verifying. the
catch is that if the judgment module queries stored summary facts instead of reading
the trajectory, you might protect it from corruption and blind it to the attack at
the same time, because the accumulation class lives in the fold across the whole
sequence. lose the fold and you lose the only thing that catches it.

so i don't think it comes down to "query facts versus replay the log." the move that
holds is recomputing the aggregate from authenticated events, with the judge's
rules and authority sitting outside the event store. that's what CLAIM-31 does, it
recomputes the running total and every close from the operation log, but the rules
are frozen outside it and there's no model judgment in the verdict. separate
authority, shared authenticated substrate.

and on the sharding key, my hunch is it's class-dependent. the join and lineage
escapes might surface cleanly under access-pattern sharding. but the accumulation
escape is inherently temporal, the danger is the order and the running sum, so
timestamp ordering still has to be reconstructable for that one. i'd genuinely like
to see someone run that experiment though.

Collapse
 
tecnomanu profile image
Manuel Bruña

This is why per-step allow lists age badly. Each action can be valid alone while the sequence becomes extraction, escalation, or laundering. Agent safety needs sequence-level state, not only a gate around each isolated tool call.

Collapse
 
zep1997 profile image
Self-Correcting Systems

Sequence-level state is the missing piece, yeah. each call clean, the sequence is
the attack, and per-step allow lists can't see structuring because they hold no
memory of the arc. you're clearly building this for real with APC/APX. would
genuinely like to compare notes sometime, feels like we're coming at the same
problem from two ends.

Collapse
 
james_oconnor_dev profile image
James O'Connor

This is the agent-security failure static guardrails miss completely. Each tool call passes its own check, but the SEQUENCE is the exploit, and per-call validation has no concept of sequence. What helped us was treating the agent's trajectory as the unit to validate, not the individual call: is this call reasonable GIVEN the last N steps, not just in isolation. Same reason we score agent evals on the path, not the final answer. Do you gate on sequence patterns or just log them? The gating is the hard part, a legitimate sequence and an attack can look identical until the last step.

Collapse
 
zep1997 profile image
Self-Correcting Systems

We gate, not just log. the trajectory gate refuses at the sequence level, not the
step. in CLAIM-30 it caught three composed classes that every per-step check waved
through: a forbidden combination, a derived-recipient and staged-delivery chain, and
threshold accumulation across the window.

your last point is the real one though. legit and attack looking identical until the
last step is exactly why "is this call reasonable given the last N steps" is not
enough on its own. i ran that as an ablation: when i limited the window to the last
three operations, the threshold-accumulation attack leaked straight through, because
the damning part had already scrolled out of the N. a fixed lookback cannot see a
fold that builds slowly.

so the unit is not the last N steps, it is the verified fold state for the whole
open window: what accumulated, what joined, what threshold sits where. the gate
scores the composed state, not the recent steps.

the one place legit and attack genuinely collide is the time-sliced case across a
window close. there it stops being a pattern question and becomes a close-authority
question: who was allowed to reset the fold. that one i do not consider closed. it
is the next layer.

Some comments may only be visible to logged-in visitors. Sign in to view all comments.