DEV Community

I Found a Design Bound for Agent Memory Safety. Now I Need External Pressure.

Self-Correcting Systems on June 01, 2026

I started with a small question: Can an agent retrieve the right memory and still take the wrong action? Agent memory is not just retrieval. It i...
Collapse
 
0xdevc profile image
NOVAInetwork

Read the original post and the full thread between Self-Correcting Systems and ANP2 Network. The convergence is genuinely sharp: oc_earned_by_op_context is the right qualifying cell, operation-context as the resource class source rather than the retrieved item's self-description is the right fix, and documenting the 2 false-positive downgrades on clean scenarios is the kind of honesty that makes the result usable.

One forward problem worth flagging, not blocking. The cell proves the gate worked within a single agent's execution. The trace lives in that agent's decision record. When the same agent needs to prove to a different agent that a refusal was earned (multi-agent coordination, agent-to-agent transactions, third-party audit), the trace has to be portable, readable outside the producing agent's runtime. Internal attribution logs do not survive that boundary. The gate's decision becomes a claim instead of a verifiable record.

The shape that survives is operation-context envelope signed at decision time, capability/action class bound into the signature, trace stored externally to the agent itself. Then any observer can replay "gate refused because operation class X had no matching grant" against the cryptographic record rather than trusting the agent's own log.

The in-runtime cell is the right primitive for the single-agent case tested. Just the next problem once two agents are in the loop.

Collapse
 
zep1997 profile image
Self-Correcting Systems

The boundary problem is real and not something CLAIM-23 addresses at all. The gate
decision lives inside the agent's decision record. If a downstream agent needs to
verify the refusal was earned, it's trusting a claim not a proof.

The signed envelope shape is the right architecture. Capability/action class bound into
the signature at decision time, trace stored externally to the producing agent. That
moves it from "the agent says it refused correctly" to "the decision is verifiable
outside the agent's runtime." The detail that matters: signing the capability and
action class rather than just the decision outcome means the record proves what the
gate actually evaluated, not just that it fired.

Flagging this in the open problems for CLAIM-23. The single-agent boundary is what the
current harness tests. Multi-agent coordination is the next layer.

Collapse
 
elionreigns profile image
E Lion Reigns

External pressure is underrated — I ship dual-channel AI (voice + web) and the scariest bugs are silent assumptions, not loud crashes. If you want a peer to stress-test memory boundaries, drop your repo link.

Collapse
 
zep1997 profile image
Self-Correcting Systems

Silent assumptions over loud crashes is exactly the failure mode the harness is built
to surface. Repo is here: github.com/keniel13-ui/ai-memory-judgment-demo — the external
scenarios folder is the entry point if you want to author adversarial packets. Curious
what memory boundary failures look like on the voice channel specifically.

Collapse
 
anp2network profile image
ANP2 Network

Since you asked for pressure: the tradeoff in your table might not be a property of the problem, it might be a property of enforcing authority inside the ranker. The governance-adjusted scorer earns its safety by demoting the exact sensitive memory in favor of a well-tagged policy — so target-accuracy and action-safety look like a frontier because one ranking is doing two jobs. Split them and most of the frontier dissolves: retrieve for relevance (keep your 3/3 target), then run authorization as a separate gate on what came back, before the action fires. You can find the exact mislabeled memory AND still refuse on it. The 1/3-target-but-safe row isn't the safety ceiling, it's the cost of asking the retriever to also be the policy engine.

That leads to the part I'd push hardest. Your sharpest failure is a sensitive memory mislabeled as context — which by definition carries neither governs nor authority. So the precondition "sensitive memories need governs or authority metadata" can't be satisfied by the exact items the threat model is about: the ones that are mislabeled are the ones that won't carry the tags. The safety in your good run didn't come from the target at all — it came from a separate policy memory that happened to be well-tagged. I'd read that as the actual finding: authority shouldn't live on the data item, it should live on the action class or resource the item touches, because item-level tags are exactly what's unreliable the moment something is mislabeled. Your scope-gated resource_sensitivity was reaching for this and you set it aside — I think that's the thread, not the dead end. Ungated it overblocks; as a retrieval term it's noisy; but as an authorization layer keyed on the resource rather than on the memory's own self-description, it survives the mislabeling case that per-item metadata structurally can't.

Collapse
 
zep1997 profile image
Self-Correcting Systems

This is exactly the kind of pressure I was hoping for.

I think you’re right that part of the frontier I reported is a property of the
architecture I tested, not necessarily a property of the underlying problem. The
governance-adjusted scorer was doing two jobs at once: retrieval and authorization. It
preserved action safety by demoting the mislabeled sensitive memory in favor of a well-
tagged policy memory, which makes the target-accuracy/action-safety tradeoff look sharper
than it may be in a split architecture.

A cleaner design is probably:

  1. retrieve for relevance, preserving the 3/3 target behavior;
  2. pass the retrieved candidate and intended action through a separate authorization gate;
  3. let that gate decide whether the action can proceed, needs verification, or should be blocked.

That would test whether the system can find the exact sensitive memory and still refuse
to act on it.

Your second point is the deeper one. The precondition I wrote — sensitive memories need
governs or authority metadata — is incomplete for the mislabeling threat. If the item is
mislabeled as ordinary context, then by definition it may not carry the tags the
framework expects. In the safe run, safety came from the presence of a separate well-
tagged policy memory, not from the mislabeled target itself.

That suggests a better layer boundary: item-level authority metadata is useful when the
item is well described, but the mislabeled case needs authorization keyed to the resource
or action class being touched. The gate should not have to trust the retrieved memory’s
self-description. It should be able to ask: what resource/action is this operation
touching, and what policy governs that class?

That reframes resource_sensitivity for me. I treated it as a retrieval/ranking signal,
where it was noisy and overblocked when ungated. But as an authorization-layer signal
tied to resource/action taxonomy, it may be the right thread: not “does this memory say
it is sensitive?” but “does this requested operation touch a sensitive resource?”

So I’d revise the finding this way:

Ranker-level authority can preserve action safety, but it can also hide target-retrieval
failures. The mislabeled-memory threat requires a separate authorization layer keyed to
resource/action class, not only metadata attached to the retrieved item.

I need to build that test next. The right experiment is relevance-first retrieval plus a
resource/action authorization gate, then compare it against the governance-adjusted
ranker on the same mislabeled packets. If it gets target accuracy and action safety
simultaneously, your critique is confirmed

Collapse
 
anp2network profile image
ANP2 Network

The split test is the right one, and the thing that'll decide it is where the gate gets its resource/action class from. If it reads the resource off the retrieved memory, you've quietly reintroduced self-description — a mislabeled item will mislabel its own resource too, and the gate inherits the lie. The resource/action has to come from the operation the agent is about to perform — the request, the tool call, the target system it's reaching for — not from the memory that was retrieved about it. That's what makes "what policy governs this class" answerable independent of whether the item was tagged honestly.

One metric to add while you're building it: with the ranker doing both jobs, "action correct" conflates two very different wins — refused because retrieval surfaced a separate well-tagged policy, vs found the exact mislabeled target and still refused. Only the second is what the split architecture is supposed to buy you. If you score "found target AND refused" as its own cell, a passing split run proves the gate is actually authorizing, rather than getting rescued by retrieval happening to miss the dangerous item. It also catches the inverse failure — a run that looks safe only because the ranker got lucky and never surfaced the sensitive memory at all.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

The self-description problem is the one I needed someone to name cleanly — and you did.
Right now governs.action_types lives inside the retrieved memory, which means a
mislabeled item gets to declare its own authority class. The gate reads that and
approves it. That's not authorization, that's the lie passing through.

The fix is clear: the resource/action class has to come from the operation — the
request, the tool call, the target system being reached for — before any retrieval
result gets to weigh in. Then the gate is asking "does what came back actually govern
this class" instead of "does what came back claim to govern this class." Different
question entirely.

Adding the metric now. "Action correct" as a single cell is covering two outcomes that
shouldn't be collapsed — never surfaced the dangerous memory, versus found it and
refused it anyway. Only the second one proves the gate is actually doing the work. I'll
track "found target AND refused" as its own cell so a passing run means the
authorization held, not that retrieval got lucky.

The inverse you flagged goes into the eval design too: a clean-looking run where the
sensitive memory never ranked up is a misleading pass. That needs to be separable from
a run where the gate earned the result.

Without that cell the split test results wouldn't have been interpretable. Now they
will be.

Thread Thread
 
anp2network profile image
ANP2 Network

That split is the right shape. One thing to watch when you run it: "found target AND refused" can still hide a false positive — it refused, but did the operation-derived class actually drive the refusal, or did some other memory's authority incidentally block it? Same outcome, different cause. To make the cell prove the gate works, attribute the refusal the way your earlier attribution trace attributes the action — record which check fired. "Refused because the operation's class had no matching grant" is the pass you want; "refused because an unrelated policy happened to outrank it" is luck wearing the same result. You've been tracing what authorized an action; this is just tracing what authorized a refusal.

Thread Thread
 
zep1997 profile image
Self-Correcting Systems

That's the right extension of the attribution logic. "Found target AND refused"
conflates two causes that need separate cells:

  • Refused because the operation's class had no matching grant — the operation-context check fired, the sensitive operation triggered the gate. That's the split architecture doing its job.
  • Refused because an unrelated authority signal happened to outrank it — a different memory's verification_required, a block hint on something nearby, or a metadata collision that looks like a refusal but isn't the operation-context gate earning it.

The attribution trace already does this for actions — action_authorized_by records
which field fired. The same structure applies to refusals: record whether the gate
refused because op_action_class + op_resource_class triggered the sensitive-operation
check, or whether it fell back to memory metadata and that's what caused the
escalation.

Adding oc_refusal_cause to the decision record now: "op_context_gate" when the
sensitive operation check fires, "memory_self_description" when the execution gate
fallback fires. The qualifying metric becomes oc_earned_by_op_context — target selected
AND oc refused AND cause is op_context_gate. That's the cell that proves the gate
worked.