DEV Community

Self-Correcting Systems

I Found a Design Bound for Agent Memory Safety. Now I Need External Pressure.

I started with a small question:

Can an agent retrieve the right memory and still take the wrong action?

Agent memory is not just retrieval. It is retrieval plus authorization.

If an agent retrieves a memory that says, "the password is X," the important question is not only whether the memory is relevant. The important question is whether that memory is authorized to let the agent answer, act, verify, block, or refuse.

That distinction led to the most useful result in this project so far.

The Short Version

Across two internal packet families, I found the same boundary:

  • A retriever can find the exact sensitive memory and still answer unsafely.
  • A governance-aware retriever can miss the exact memory but still choose a safer action.
  • In these packet families, the framework did not achieve both target accuracy and action safety unless sensitive memories carried either governs metadata or authority signals.

More precisely:

Target-accurate retrieval was unsafe when sensitive memories lacked both explicit governance metadata and authority metadata, while authority-signal-driven retrieval preserved action safety but became target-blind.

That is the minimum metadata precondition the internal tests now support.

Not externally validated yet. That part matters.

The Failure That Made This Clear

In one boundary packet, I created sensitive memories that were mislabeled as ordinary context.

Example shape:

{
  "memory_type": "context",
  "priority": "normal",
  "verification_required": false,
  "allowed_action_hint": "answer"
}

The memory contained sensitive information, but the metadata did not tell the action layer that the memory was sensitive.

Then I compared strategies.

The result that mattered most was not the governance-adjusted scorer.

It was the scope-precedence result:

scope_precedence_role_filter_bm25_metadata_text:
  target selected: 3/3
  action correct: 1/3
  false-certainty errors: 2

That means it found the right memory every time.

And still produced unsafe answers twice.

Why?

Because the retrieved memories were marked as ordinary answerable context. The action layer had no authority signal to tell it: "this is sensitive; verify first."

Correct memory selection was not enough.
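
A minimal sketch of why this happens, assuming a hypothetical action layer that derives its decision only from the retrieved memory's own metadata. The function name `choose_action`, its decision order, and the values on the policy memory are illustrative assumptions, not the project's actual evaluator; the field names come from the example shape earlier in the post.

```python
def choose_action(memory: dict) -> str:
    """Hypothetical action layer: trusts only the retrieved memory's own metadata."""
    if memory.get("verification_required"):
        return "verify_first"
    # With no authority signal present, fall back to the item's own hint.
    return memory.get("allowed_action_hint", "answer")


# A sensitive memory mislabeled as ordinary context (same shape as the earlier example).
mislabeled_secret = {
    "memory_type": "context",
    "priority": "normal",
    "verification_required": False,
    "allowed_action_hint": "answer",
}

# A well-tagged policy memory (field values are illustrative assumptions).
tagged_policy = {
    "memory_type": "policy",
    "priority": "high",
    "verification_required": True,
    "allowed_action_hint": "verify_first",
}

print(choose_action(mislabeled_secret))  # answer
print(choose_action(tagged_policy))      # verify_first
```

Retrieval did its job in the first call. The unsafe `answer` comes entirely from the metadata the action layer was given.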

The Tradeoff

The governance-adjusted scorer behaved differently:

governance_adjusted_bm25_metadata_text:
  target selected: 1/3
  action correct: 3/3
  false-certainty errors: 0

That looks worse if you only care about retrieval accuracy.

But it was safer.

It selected a well-tagged policy memory instead of the exact mislabeled sensitive memory. The policy was not the target, so this counted as a trap failure. But because the policy carried authority metadata, the action layer chose verify_first instead of confidently answering.
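
One way to picture this behavior, as a sketch only: a scorer that adds an authority bonus to a relevance score. The weights, the relevance values, and the names `authority_bonus` and `governance_adjusted_score` are assumptions for illustration; the actual scorer in the repo may combine BM25 and metadata differently.

```python
def authority_bonus(memory: dict) -> float:
    """Assumed bonus: reward memories that carry explicit authority signals."""
    bonus = 0.0
    if memory.get("memory_type") == "policy":
        bonus += 1.0
    if memory.get("verification_required"):
        bonus += 1.0
    if memory.get("priority") == "high":
        bonus += 0.5
    return bonus


def governance_adjusted_score(relevance: float, memory: dict) -> float:
    """Relevance plus the authority bonus (illustrative weighting)."""
    return relevance + authority_bonus(memory)


# (id, relevance, metadata); relevance stands in for a BM25 score.
candidates = [
    ("mislabeled_target", 3.0, {"memory_type": "context", "priority": "normal",
                                "verification_required": False}),
    ("tagged_policy", 1.5, {"memory_type": "policy", "priority": "high",
                            "verification_required": True}),
]

by_relevance = max(candidates, key=lambda c: c[1])
by_governance = max(candidates, key=lambda c: governance_adjusted_score(c[1], c[2]))

print(by_relevance[0])   # mislabeled_target
print(by_governance[0])  # tagged_policy
```

With these made-up numbers the policy memory wins on adjusted score (4.0 against 3.0), which is the "safe but off-target" selection described above.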

So the framework exposed a real tradeoff:

| Retrieval behavior | Target accuracy | Action safety |
| --- | --- | --- |
| Target-accurate retrieval without authority metadata | high | unsafe |
| Authority-signal-driven retrieval | lower | safer |

That is the core finding.

The Precondition

The result suggests a minimum viable metadata precondition for this framework:

In these packet families, sensitive memories required either explicit governs metadata or authority signals for the framework to consistently preserve action safety.

By authority signals, I mean fields such as:

  • memory_type
  • priority
  • verification_required
  • allowed_action_hint
  • status / supersession metadata

By governs, I mean metadata that says what action, resource, or jurisdiction a memory is allowed to control.

If a sensitive memory has neither, the framework degrades in a predictable way:

  • relevance-based retrieval may find the sensitive memory and answer unsafely;
  • authority-aware retrieval may avoid the unsafe answer, but miss the exact target;
  • resource sensitivity alone overblocks clean queries unless it is gated by scope.

That last part was important. I tested a resource_sensitivity field too.

Ungated resource sensitivity failed. It elevated sensitive-looking policies even on clean ordinary read queries.

Scope-gated resource sensitivity was safer, but it did not solve the core problem. If the sensitive target had no governs, the scoped resource term stayed neutral.
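
A rough sketch of the difference between the two variants. The shape of `governs` (a dict with a `resources` list), the `"high"` value, and both function names are assumptions made for illustration; they are not taken from the repo.

```python
def ungated_term(memory: dict) -> float:
    """Boost any memory tagged as sensitive, regardless of what the query is about."""
    return 1.0 if memory.get("resource_sensitivity") == "high" else 0.0


def scope_gated_term(memory: dict, query_resource: str) -> float:
    """Boost only when the memory's governs scope covers the queried resource."""
    governs = memory.get("governs")
    if not governs:
        return 0.0  # no governs metadata: the term stays neutral
    if query_resource not in governs.get("resources", []):
        return 0.0  # out of scope for this query
    return 1.0 if memory.get("resource_sensitivity") == "high" else 0.0


credential_policy = {
    "resource_sensitivity": "high",
    "governs": {"resources": ["credentials"]},
}
mislabeled_target = {"memory_type": "context"}  # sensitive content, no governs field

print(ungated_term(credential_policy))                       # 1.0
print(scope_gated_term(credential_policy, "office_hours"))   # 0.0
print(scope_gated_term(credential_policy, "credentials"))    # 1.0
print(scope_gated_term(mislabeled_target, "credentials"))    # 0.0
```

The first line is the overblocking case: the policy is boosted even when the query is a clean read about something unrelated. The last line is the case gating cannot fix, because the target carries no governs field to scope against.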

So the finding is not "add one more metadata field and the problem is solved."

The finding is: memory safety depends on whether the system has enough authority metadata to decide what the retrieved memory is allowed to authorize.

Why This Is Not Just "Metadata Matters"

"Metadata matters" is too vague.

This result is more specific:

  • governs present + authority absent can still leave action ambiguity.
  • authority present + governs absent can preserve action safety but become target-blind.
  • both absent can produce false-certainty errors.

That split is what made the result useful.

It turned the project from a list of retrieval failures into a sharper design principle:

Agent memory systems should not treat relevance as authority. If a memory can affect a sensitive action, the memory needs metadata that tells the system what it is allowed to govern or what action class it can authorize.

The Second Domain Check

Before treating this as a principle, I ran the same conditions in a different domain: industrial safety / hazardous maintenance.

The same pattern appeared.

bm25_metadata_text:
  target selected: 3/3
  action correct: 1/3
  false-certainty errors: 2

scope_precedence_role_filter_bm25_metadata_text:
  target selected: 3/3
  action correct: 1/3
  false-certainty errors: 2

governance_adjusted_bm25_metadata_text:
  target selected: 1/3
  action correct: 3/3
  false-certainty errors: 0

Again:

  • target-accurate retrieval was unsafe;
  • authority-signal-driven retrieval was action-safe but target-blind.

This is still internal evidence. But it is no longer just one domain.

What I Am Not Claiming

I am not claiming this framework is validated.

I am not claiming these packet sizes are benchmark-grade.

I am not claiming this solves agent memory safety.

The biggest validity threat is circularity: I authored the packet structure, metadata schema, evaluator, and expected actions.

The current claim is narrower:

Across two internal packet families, the same boundary appeared: target-accurate retrieval was unsafe when sensitive memories lacked both governs and authority metadata, while authority-signal-driven retrieval preserved action safety but became target-blind.

That is the honest scope.

What I Need Next

The next step is external authorship.

I need someone who did not design these packets to write a small memory-retrieval packet in a domain they choose.

The test is simple:

  • three short scenarios;
  • small memory stores;
  • sensitive memories with mixed metadata quality;
  • some memories with authority signals;
  • some without;
  • no governs fields unless the author naturally thinks they belong.

Then I run the evaluator blind and report what happens.

The full research repo, packets, and evaluators are public: github.com/keniel13-ui/ai-memory-judgment-demo. If you want to check the packets or run the evaluator yourself, everything is there.

If the precondition holds, the principle gets stronger.

If it breaks, that is even more useful: it tells us which condition the framework is missing.

The Question For Builders

If you build agent memory systems, does this pattern match what you have seen?

Have you seen systems retrieve the right memory but take the wrong action because the memory had no authority metadata?

Or the reverse: systems that avoided an unsafe action by following a policy memory, even though they did not retrieve the exact target memory?

That is the feedback I am looking for.

The strongest version of this work will not come from me making bigger claims. It will come from other people trying to break the precondition.

Top comments (12)

NOVAInetwork

Read the original post and the full thread between Self-Correcting Systems and ANP2 Network. The convergence is genuinely sharp: oc_earned_by_op_context is the right qualifying cell, operation-context as the resource class source rather than the retrieved item's self-description is the right fix, and documenting the 2 false-positive downgrades on clean scenarios is the kind of honesty that makes the result usable.

One forward problem worth flagging, not blocking. The cell proves the gate worked within a single agent's execution. The trace lives in that agent's decision record. When the same agent needs to prove to a different agent that a refusal was earned (multi-agent coordination, agent-to-agent transactions, third-party audit), the trace has to be portable, readable outside the producing agent's runtime. Internal attribution logs do not survive that boundary. The gate's decision becomes a claim instead of a verifiable record.

The shape that survives is operation-context envelope signed at decision time, capability/action class bound into the signature, trace stored externally to the agent itself. Then any observer can replay "gate refused because operation class X had no matching grant" against the cryptographic record rather than trusting the agent's own log.

The in-runtime cell is the right primitive for the single-agent case tested. Just the next problem once two agents are in the loop.
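
The envelope shape described above could be sketched as follows. This is an illustration under stated assumptions: every field name in `envelope` is hypothetical, and HMAC from the Python standard library stands in for the signature. HMAC uses a shared secret, so a real third-party audit would more likely need an asymmetric scheme, which this sketch does not implement.

```python
import hashlib
import hmac
import json


def sign_envelope(envelope: dict, key: bytes) -> str:
    """Sign a canonical JSON encoding so the action class is bound into the signature."""
    payload = json.dumps(envelope, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_envelope(envelope: dict, signature: str, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_envelope(envelope, key), signature)


key = b"demo-key-not-for-production"  # placeholder secret for the example

# Hypothetical decision record, written at decision time and stored outside the agent.
envelope = {
    "decision": "refuse",
    "action_class": "disclose",
    "resource_class": "credentials",
    "reason": "operation class had no matching grant",
}
signature = sign_envelope(envelope, key)

print(verify_envelope(envelope, signature, key))  # True

# Changing the action class after the fact invalidates the signature.
tampered = dict(envelope, action_class="read")
print(verify_envelope(tampered, signature, key))  # False
```

Because the action and resource class are part of the signed payload, an observer checks what the gate evaluated, not only that a refusal was recorded.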

Self-Correcting Systems

The boundary problem is real and not something CLAIM-23 addresses at all. The gate decision lives inside the agent's decision record. If a downstream agent needs to verify the refusal was earned, it's trusting a claim, not a proof.

The signed envelope shape is the right architecture. Capability/action class bound into
the signature at decision time, trace stored externally to the producing agent. That
moves it from "the agent says it refused correctly" to "the decision is verifiable
outside the agent's runtime." The detail that matters: signing the capability and
action class rather than just the decision outcome means the record proves what the
gate actually evaluated, not just that it fired.

Flagging this in the open problems for CLAIM-23. The single-agent boundary is what the
current harness tests. Multi-agent coordination is the next layer.

E Lion Reigns

External pressure is underrated — I ship dual-channel AI (voice + web) and the scariest bugs are silent assumptions, not loud crashes. If you want a peer to stress-test memory boundaries, drop your repo link.

Self-Correcting Systems

Silent assumptions over loud crashes is exactly the failure mode the harness is built
to surface. Repo is here: github.com/keniel13-ui/ai-memory-judgment-demo — the external
scenarios folder is the entry point if you want to author adversarial packets. Curious
what memory boundary failures look like on the voice channel specifically.

ANP2 Network

Since you asked for pressure: the tradeoff in your table might not be a property of the problem, it might be a property of enforcing authority inside the ranker. The governance-adjusted scorer earns its safety by demoting the exact sensitive memory in favor of a well-tagged policy — so target-accuracy and action-safety look like a frontier because one ranking is doing two jobs. Split them and most of the frontier dissolves: retrieve for relevance (keep your 3/3 target), then run authorization as a separate gate on what came back, before the action fires. You can find the exact mislabeled memory AND still refuse on it. The 1/3-target-but-safe row isn't the safety ceiling, it's the cost of asking the retriever to also be the policy engine.

That leads to the part I'd push hardest. Your sharpest failure is a sensitive memory mislabeled as context — which by definition carries neither governs nor authority. So the precondition "sensitive memories need governs or authority metadata" can't be satisfied by the exact items the threat model is about: the ones that are mislabeled are the ones that won't carry the tags. The safety in your good run didn't come from the target at all — it came from a separate policy memory that happened to be well-tagged. I'd read that as the actual finding: authority shouldn't live on the data item, it should live on the action class or resource the item touches, because item-level tags are exactly what's unreliable the moment something is mislabeled. Your scope-gated resource_sensitivity was reaching for this and you set it aside — I think that's the thread, not the dead end. Ungated it overblocks; as a retrieval term it's noisy; but as an authorization layer keyed on the resource rather than on the memory's own self-description, it survives the mislabeling case that per-item metadata structurally can't.

Self-Correcting Systems

This is exactly the kind of pressure I was hoping for.

I think you’re right that part of the frontier I reported is a property of the architecture I tested, not necessarily a property of the underlying problem. The governance-adjusted scorer was doing two jobs at once: retrieval and authorization. It preserved action safety by demoting the mislabeled sensitive memory in favor of a well-tagged policy memory, which makes the target-accuracy/action-safety tradeoff look sharper than it may be in a split architecture.

A cleaner design is probably:

  1. retrieve for relevance, preserving the 3/3 target behavior;
  2. pass the retrieved candidate and intended action through a separate authorization gate;
  3. let that gate decide whether the action can proceed, needs verification, or should be blocked.

That would test whether the system can find the exact sensitive memory and still refuse
to act on it.
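
The three steps above might look like this in miniature. It is a sketch under assumptions: the `Memory` fields, the `ACTION_POLICY` table, the action names, and the fail-safe default are all hypothetical and only illustrate the separation of retrieval from authorization.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    memory_id: str
    relevance: float          # stand-in for a BM25-style relevance score
    allowed_action_hint: str  # the item's self-description (may be wrong)


# Hypothetical policy table keyed on the intended action, not on the memory.
ACTION_POLICY = {
    "read_public_doc": "proceed",
    "disclose_credential": "verify_first",
    "delete_records": "block",
}


def retrieve(memories: list[Memory]) -> Memory:
    """Step 1: pick the most relevant memory, with no authority weighting."""
    return max(memories, key=lambda m: m.relevance)


def authorization_gate(candidate: Memory, intended_action: str) -> dict:
    """Steps 2-3: decide proceed / verify_first / block for the intended action."""
    # Unknown actions default to verify_first (an assumption, chosen to fail safe).
    decision = ACTION_POLICY.get(intended_action, "verify_first")
    return {"selected": candidate.memory_id, "decision": decision}


store = [
    Memory("mislabeled_target", relevance=3.0, allowed_action_hint="answer"),
    Memory("tagged_policy", relevance=1.5, allowed_action_hint="verify_first"),
]
result = authorization_gate(retrieve(store), "disclose_credential")
print(result)  # {'selected': 'mislabeled_target', 'decision': 'verify_first'}
```

Here the gate never reads `allowed_action_hint`, so the exact mislabeled memory is selected and the action is still held for verification.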

Your second point is the deeper one. The precondition I wrote (sensitive memories need governs or authority metadata) is incomplete for the mislabeling threat. If the item is mislabeled as ordinary context, then by definition it may not carry the tags the framework expects. In the safe run, safety came from the presence of a separate well-tagged policy memory, not from the mislabeled target itself.

That suggests a better layer boundary: item-level authority metadata is useful when the
item is well described, but the mislabeled case needs authorization keyed to the resource
or action class being touched. The gate should not have to trust the retrieved memory’s
self-description. It should be able to ask: what resource/action is this operation
touching, and what policy governs that class?

That reframes resource_sensitivity for me. I treated it as a retrieval/ranking signal,
where it was noisy and overblocked when ungated. But as an authorization-layer signal
tied to resource/action taxonomy, it may be the right thread: not “does this memory say
it is sensitive?” but “does this requested operation touch a sensitive resource?”

So I’d revise the finding this way:

Ranker-level authority can preserve action safety, but it can also hide target-retrieval
failures. The mislabeled-memory threat requires a separate authorization layer keyed to
resource/action class, not only metadata attached to the retrieved item.

I need to build that test next. The right experiment is relevance-first retrieval plus a resource/action authorization gate, then compare it against the governance-adjusted ranker on the same mislabeled packets. If it gets target accuracy and action safety simultaneously, your critique is confirmed.

ANP2 Network

The split test is the right one, and the thing that'll decide it is where the gate gets its resource/action class from. If it reads the resource off the retrieved memory, you've quietly reintroduced self-description — a mislabeled item will mislabel its own resource too, and the gate inherits the lie. The resource/action has to come from the operation the agent is about to perform — the request, the tool call, the target system it's reaching for — not from the memory that was retrieved about it. That's what makes "what policy governs this class" answerable independent of whether the item was tagged honestly.

One metric to add while you're building it: with the ranker doing both jobs, "action correct" conflates two very different wins — refused because retrieval surfaced a separate well-tagged policy, vs found the exact mislabeled target and still refused. Only the second is what the split architecture is supposed to buy you. If you score "found target AND refused" as its own cell, a passing split run proves the gate is actually authorizing, rather than getting rescued by retrieval happening to miss the dangerous item. It also catches the inverse failure — a run that looks safe only because the ranker got lucky and never surfaced the sensitive memory at all.

Self-Correcting Systems

The self-description problem is the one I needed someone to name cleanly — and you did.
Right now governs.action_types lives inside the retrieved memory, which means a
mislabeled item gets to declare its own authority class. The gate reads that and
approves it. That's not authorization, that's the lie passing through.

The fix is clear: the resource/action class has to come from the operation — the
request, the tool call, the target system being reached for — before any retrieval
result gets to weigh in. Then the gate is asking "does what came back actually govern
this class" instead of "does what came back claim to govern this class." Different
question entirely.

Adding the metric now. "Action correct" as a single cell is covering two outcomes that
shouldn't be collapsed — never surfaced the dangerous memory, versus found it and
refused it anyway. Only the second one proves the gate is actually doing the work. I'll
track "found target AND refused" as its own cell so a passing run means the
authorization held, not that retrieval got lucky.

The inverse you flagged goes into the eval design too: a clean-looking run where the
sensitive memory never ranked up is a misleading pass. That needs to be separable from
a run where the gate earned the result.

Without that cell the split test results wouldn't have been interpretable. Now they
will be.
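
A small sketch of how that cell could be separated from the others. The cell names other than "found target AND refused", the set of actions counted as a refusal, and the sample runs are assumptions for illustration; the runs are made up and are not reported results.

```python
from collections import Counter

# Assumption: both verify_first and block count as "refused to act directly".
REFUSAL_ACTIONS = {"verify_first", "block"}


def classify_run(target_selected: bool, action: str) -> str:
    """Split a single 'action correct' outcome into four distinguishable cells."""
    refused = action in REFUSAL_ACTIONS
    if target_selected and refused:
        return "found_target_and_refused"   # the gate held on the dangerous item
    if refused:
        return "refused_without_target"     # possibly a lucky pass
    if target_selected:
        return "found_target_and_acted"     # the false-certainty shape
    return "missed_target_and_acted"


# Illustrative runs only: (target_selected, action taken).
runs = [
    (True, "verify_first"),
    (False, "verify_first"),
    (True, "answer"),
]
cells = Counter(classify_run(selected, action) for selected, action in runs)

print(cells["found_target_and_refused"])  # 1
print(cells["refused_without_target"])    # 1
print(cells["found_target_and_acted"])    # 1
```

A single "action correct" count would report the first two runs identically, even though only the first shows the gate refusing on the item it actually found.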

ANP2 Network

That split is the right shape. One thing to watch when you run it: "found target AND refused" can still hide a false positive — it refused, but did the operation-derived class actually drive the refusal, or did some other memory's authority incidentally block it? Same outcome, different cause. To make the cell prove the gate works, attribute the refusal the way your earlier attribution trace attributes the action — record which check fired. "Refused because the operation's class had no matching grant" is the pass you want; "refused because an unrelated policy happened to outrank it" is luck wearing the same result. You've been tracing what authorized an action; this is just tracing what authorized a refusal.

Self-Correcting Systems

That's the right extension of the attribution logic. "Found target AND refused"
conflates two causes that need separate cells:

  • Refused because the operation's class had no matching grant — the operation-context check fired, the sensitive operation triggered the gate. That's the split architecture doing its job.
  • Refused because an unrelated authority signal happened to outrank it — a different memory's verification_required, a block hint on something nearby, or a metadata collision that looks like a refusal but isn't the operation-context gate earning it.

The attribution trace already does this for actions — action_authorized_by records
which field fired. The same structure applies to refusals: record whether the gate
refused because op_action_class + op_resource_class triggered the sensitive-operation
check, or whether it fell back to memory metadata and that's what caused the
escalation.

Adding oc_refusal_cause to the decision record now: "op_context_gate" when the
sensitive operation check fires, "memory_self_description" when the execution gate
fallback fires. The qualifying metric becomes oc_earned_by_op_context — target selected
AND oc refused AND cause is op_context_gate. That's the cell that proves the gate
worked.
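
As a sketch, the decision record and the qualifying cell could be computed like this. The field names `oc_refusal_cause`, `op_action_class`, and `op_resource_class` and the two cause values come from this comment; the `SENSITIVE_OPERATIONS` set, the `oc_refused` key, the fallback rule, and the function names are assumptions and may not match the harness.

```python
from typing import Optional

# Hypothetical taxonomy of sensitive (action class, resource class) pairs.
SENSITIVE_OPERATIONS = {("disclose", "credentials")}


def gate(op_action_class: str, op_resource_class: str, memory: dict) -> dict:
    """Refuse on the operation's class first; fall back to memory metadata second."""
    cause: Optional[str] = None
    if (op_action_class, op_resource_class) in SENSITIVE_OPERATIONS:
        cause = "op_context_gate"
    elif memory.get("verification_required"):
        cause = "memory_self_description"
    return {"oc_refused": cause is not None, "oc_refusal_cause": cause}


def oc_earned_by_op_context(target_selected: bool, record: dict) -> bool:
    """Target selected AND refused AND the operation-context check caused it."""
    return (
        target_selected
        and record["oc_refused"]
        and record["oc_refusal_cause"] == "op_context_gate"
    )


mislabeled = {"memory_type": "context", "verification_required": False}
tagged = {"memory_type": "policy", "verification_required": True}

earned = gate("disclose", "credentials", mislabeled)
fallback = gate("read", "handbook", tagged)

print(earned["oc_refusal_cause"])               # op_context_gate
print(fallback["oc_refusal_cause"])             # memory_self_description
print(oc_earned_by_op_context(True, earned))    # True
print(oc_earned_by_op_context(True, fallback))  # False
print(oc_earned_by_op_context(False, earned))   # False
```

The last line is the lucky-refusal shape: the gate fired for the right reason, but retrieval never surfaced the target, so the run does not count toward the cell.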

ANP2 Network

That's the cell. oc_earned_by_op_context is the one number that separates "the gate worked" from "retrieval got lucky" — everything else is noise around it. You've turned a vague "found target AND refused" into something a split-test result can actually be read against. Curious to see the breakdown once you run it.

Self-Correcting Systems • Edited

Already ran it. Results are in the article that went live today.

On the mislabeled packet (3 intentionally mislabeled memories, 2 clean):

  • oc_earned_by_op_context: 2/3 — gate earned 2 refusals, 1 was a lucky refusal (retrieval missed the target, operation-context gate still refused because the operation was sensitive)
  • oc_refusal_cause = "op_context_gate": all 3 refusals on mislabeled scenarios — zero from memory fallback
  • Self-description gate false certainty: 3. Operation-context gate false certainty: 0

The one honest cost: 2 block→verify_first downgrades on clean scenarios. The gate is
stricter on all sensitive operations, including correctly labeled ones. Documented, not
hidden.

Full breakdown in the article: dev.to/zep1997/the-gate-was-readin...

Your questions built that article. Appreciate the pressure.