The Other Half of Agent Memory Safety
Retrieval-time authority was the first half of the problem.
I have spent eleven articles building toward a finding: agent memory systems can
retrieve the right memory and still take the wrong action if that memory carries no
authority metadata. Relevance and authority are different things.
But retrieval-time authority only answers one question: did the right rule reach the
query?
The harder question is execution-time: did that rule govern the action? Can you trace a
line from a specific field in the retrieved memory to the specific action the agent
took — and is that line sufficient for the risk level of the operation?
This article is about building that second layer. The gate is real. So are the limits
of what it can rescue.
What Was Missing
After the retrieval-time work, the evaluator recorded only "action correct: yes/no," a
verdict that sits downstream of the retrieval outcome. It did not record which field
authorized the action, or whether that field was sufficient.
I added an attribution trace to surface this. Every decision now records:
- action_authorized_by — the specific field that triggered the action
- attribution_status — one of: GOVERNED, AUTHORITY_ONLY, DEFAULT, or UNATTRIBUTABLE
UNATTRIBUTABLE is the dangerous classification. It means the action was permissive, the
scenario expected restriction, and no authority field in the selected memory
constrained it. The system answered confidently with nothing behind the decision.
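A minimal sketch of what one trace record could look like, assuming a Python evaluator. The class name AttributionTrace, the memory_id field, and the validation step are illustrative assumptions, not the harness's actual schema; only action_authorized_by, attribution_status, and the four status values come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# The four status values named above.
ATTRIBUTION_STATUSES = {"GOVERNED", "AUTHORITY_ONLY", "DEFAULT", "UNATTRIBUTABLE"}


@dataclass(frozen=True)
class AttributionTrace:
    """One record per decision (hypothetical shape, not the harness's schema)."""
    memory_id: str                       # assumed identifier for the selected memory
    action: str                          # the action the agent took
    action_authorized_by: Optional[str]  # field name, or None if nothing authorized it
    attribution_status: str

    def __post_init__(self) -> None:
        # Reject statuses outside the four-way classification.
        if self.attribution_status not in ATTRIBUTION_STATUSES:
            raise ValueError(f"unknown attribution_status: {self.attribution_status!r}")


# A permissive action with no field behind it: the dangerous case.
trace = AttributionTrace(
    memory_id="mem-001",  # illustrative id
    action="answer",
    action_authorized_by=None,
    attribution_status="UNATTRIBUTABLE",
)
```

Keeping action_authorized_by nullable makes the dangerous case visible in the record itself instead of hiding it behind a default field name.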
When I ran the attribution trace on the boundary packets from prior work —
credential/PII and industrial safety — every false-certainty error mapped to
UNATTRIBUTABLE. Every governance-adjusted clean action mapped to GOVERNED.
That named the gap. It did not close it.
The Execution-Time Gate
I built the gate to intervene at action time, not just label the outcome.
The field at the center is governs.action_types — a declaration of what kinds of
operations a memory is allowed to govern. In this framework, valid values include read,
write, and execute. A memory that governs payment processing might carry ["execute",
"write"]. A memory that governs read-only lookups might carry ["read"].
The gate logic is direct. After retrieval selects a memory and layered_action
determines an action, the gate checks whether governs.action_types implies a
higher-stakes operation than the action reflects:
    # action_types is read from governs.action_types; action comes from layered_action
    governs_restrictive = bool(set(action_types) & {"execute", "write"})
    if governs_restrictive and action in {"answer", "answer_context"}:
        # governs says this memory governs state-changing operations,
        # but the action is permissive, so escalate
        return "GATE_FAIL", "verify_first"
I used verify_first as the conservative rescue action because the gate can detect
insufficient execution authority, but cannot yet determine whether the operation should
be fully blocked. A uniform escalation is safer than a uniform pass.
If there is no governs field, or no action_types within it, the gate returns GATE_SKIP.
It cannot fire.
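Putting the three outcomes together, a minimal self-contained sketch of the gate, assuming memories are plain dicts. The function name execution_gate and the constant names are hypothetical. How the gate labels a memory whose action_types are present but not restrictive (pass versus skip) is also an assumption here; either way the action is left unchanged.

```python
from typing import Tuple

RESTRICTIVE_TYPES = {"execute", "write"}          # state-changing operations
PERMISSIVE_ACTIONS = {"answer", "answer_context"}  # actions the gate may escalate


def execution_gate(memory: dict, action: str) -> Tuple[str, str]:
    """Return (gate_status, final_action) for a selected memory and proposed action."""
    governs = memory.get("governs") or {}
    action_types = set(governs.get("action_types") or [])

    if not action_types:
        # No governs block, or no action_types inside it: the gate cannot fire.
        return "GATE_SKIP", action

    if action_types & RESTRICTIVE_TYPES and action in PERMISSIVE_ACTIONS:
        # State-changing jurisdiction paired with a permissive action: escalate.
        return "GATE_FAIL", "verify_first"

    # Assumption: anything else is labeled a pass and the action is unchanged.
    return "GATE_PASS", action


payment = {"governs": {"action_types": ["execute", "write"]}}
print(execution_gate(payment, "answer"))
print(execution_gate({}, "answer"))
```

The first call escalates; the second is the blind spot, because a memory with no governs block goes through untouched.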
This distinction is where the finding lives.
What the Tests Showed
I built a test packet with three scenarios: payment processing, production database
credentials, and a PII bulk export request. Each target memory was correctly scoped —
governs.action_types: ["execute"] or ["execute", "write"] — but authority signals were
deliberately omitted. No verification_required. Default allowed_action_hint: answer. No
authority memory_type.
This is a realistic authoring failure: a developer writes a correct governs block but
forgets verification_required: true. The retrieval system selects the right memory.
layered_action finds nothing restrictive in the authority signals and returns answer.
Without the gate, every retrieval strategy produced 3/3 false-certainty errors on the
packet — the right memory was found, the wrong action was taken.
With the gate active, every strategy reached 3/3 action correct, with 3 escalations
from answer to verify_first. The gate rescued all three cases.
One thing to name directly: this packet was internally authored with the failure mode
already known. I designed the gap, then built the fix. That is iterative engineering,
not independent validation. The result is real but should be read accordingly.
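The shape of that packet can be reproduced in a toy form. This is a sketch under stated assumptions, not the evaluation harness: the layered_action stub, the memory ids, the expected_action field, and the choice of which scenario carries which action_types list are all illustrative.

```python
def layered_action(memory: dict) -> str:
    """Hypothetical stand-in: restrictive only when an authority signal is present."""
    if memory.get("verification_required") or memory.get("memory_type") == "policy":
        return "verify_first"
    return memory.get("allowed_action_hint", "answer")


def gate(memory: dict, action: str):
    """Same gate logic as described above, repeated so this block runs on its own."""
    action_types = set((memory.get("governs") or {}).get("action_types") or [])
    if not action_types:
        return "GATE_SKIP", action
    if action_types & {"execute", "write"} and action in {"answer", "answer_context"}:
        return "GATE_FAIL", "verify_first"
    return "GATE_PASS", action


# Correct governs blocks, authority signals deliberately omitted.
PACKET = [
    {"id": "payment_processing", "governs": {"action_types": ["execute", "write"]},
     "expected_action": "verify_first"},
    {"id": "prod_db_credentials", "governs": {"action_types": ["execute"]},
     "expected_action": "verify_first"},
    {"id": "pii_bulk_export", "governs": {"action_types": ["execute"]},
     "expected_action": "verify_first"},
]


def run_packet(packet, use_gate: bool):
    """Return (actions_correct, escalations) over the packet."""
    correct = escalations = 0
    for memory in packet:
        action = layered_action(memory)
        if use_gate:
            status, action = gate(memory, action)
            escalations += status == "GATE_FAIL"
        correct += action == memory["expected_action"]
    return correct, escalations


print(run_packet(PACKET, use_gate=False))  # (0, 0)
print(run_packet(PACKET, use_gate=True))   # (3, 3)
```

The toy version carries the same caveat as the real packet: it shows the mechanism working on inputs built to trigger it.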
Where the Gate Cannot Reach
The gate only fires when governs is present.
In the earlier boundary packets, the mislabeled sensitive memories had no governs field
at all. The gate returned GATE_SKIP on every one. UNATTRIBUTABLE cases — where
authority signals are also absent — remain unprotected.
This is the irreducible gap in the current framework.
It also sharpens the precondition from prior work. The prior finding stated: sensitive
memories need either governs metadata or authority signals.
The gate reveals that is not sufficient for full protection. The two types of metadata
do different jobs at different layers:
- Authority signals (verification_required, memory_type: policy, allowed_action_hint: block) protect at retrieval time. They influence which memory gets selected and what action layered_action returns.
- governs.action_types enables the execution-time gate. If governs is absent, the gate is blind regardless of how sensitive the memory content is.
Neither alone gives complete coverage. The gate is a backstop for the governs-present /
authority-absent case. It cannot substitute for missing governs.
The Coverage Map
┌─────────┬────────────────┬────────────────┬──────────────────┬───────────────────┐
│ Governs │ Authority │ Attribution │ Gate │ Protection │
│ │ signals │ │ │ │
├─────────┼────────────────┼────────────────┼──────────────────┼───────────────────┤
│ Absent │ Absent │ UNATTRIBUTABLE │ GATE_SKIP │ None │
├─────────┼────────────────┼────────────────┼──────────────────┼───────────────────┤
│ Absent │ Present │ AUTHORITY_ONLY │ GATE_SKIP │ Retrieval-time │
│ │ │ │ │ only │
├─────────┼────────────────┼────────────────┼──────────────────┼───────────────────┤
│ Present │ Absent │ DEFAULT │ GATE_FAIL → │ Gate rescues │
│ │ │ │ escalate │ │
├─────────┼────────────────┼────────────────┼──────────────────┼───────────────────┤
│ Present │ Present │ GOVERNED │ GATE_PASS │ Full chain │
└─────────┴────────────────┴────────────────┴──────────────────┴───────────────────┘
The bottom-right cell is the target state. Both fields present. Action authorized at
retrieval time by authority signals. Action confirmed at execution time by the gate.
Full traceable chain.
Every other cell has a gap. A builder can look at their own memory schema and
immediately ask which cell of the coverage map it currently lives in.
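One way to answer that question mechanically is to encode the map as a lookup. A minimal sketch, assuming memories are dicts and that "authority signal present" means one of the three signals listed earlier is set; the function name coverage_cell and that presence test are assumptions.

```python
# (governs present, authority present) -> (attribution, gate, protection)
COVERAGE_MAP = {
    (False, False): ("UNATTRIBUTABLE", "GATE_SKIP", "None"),
    (False, True): ("AUTHORITY_ONLY", "GATE_SKIP", "Retrieval-time only"),
    (True, False): ("DEFAULT", "GATE_FAIL", "Gate rescues"),
    (True, True): ("GOVERNED", "GATE_PASS", "Full chain"),
}


def coverage_cell(memory: dict):
    """Classify one memory into a row of the coverage map."""
    has_governs = bool((memory.get("governs") or {}).get("action_types"))
    # Assumed test for "authority signals present".
    has_authority = (
        memory.get("verification_required") is True
        or memory.get("memory_type") == "policy"
        or memory.get("allowed_action_hint") == "block"
    )
    return COVERAGE_MAP[(has_governs, has_authority)]


memory = {"governs": {"action_types": ["execute"]}}  # governs only
print(coverage_cell(memory))
```

Running this over every sensitive memory in a store gives a quick count of how many sit outside the full-chain row.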
What This Does Not Solve
The gate escalates uniformly to verify_first. It does not distinguish cases that
warrant block. That is a known limit of the current implementation.
The gate has also not been tested against retrieval noise. If retrieval selects the
wrong memory — one that happens to carry governs.action_types: ["execute"] — the gate
fires on the wrong target. The escalation would be technically correct per the schema
but misleading in practice. That edge case requires a dedicated stress packet with
competing high-action-type memories in the store.
More fundamentally, the gate checks action_types categories, not the specific operation
being requested. It does not evaluate whether the proposed action falls within the
jurisdiction the governs field actually claims to govern. That would require checking
governs.any_terms or governs.all_terms against the tool call itself — a finer-grained
check that the current schema supports but the gate does not yet perform.
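A rough sketch of what that finer-grained check might look like. The matching semantics are assumptions: any_terms is treated as "at least one term must appear in the tool call" and all_terms as "every term must appear," using plain case-insensitive substring matching. The function name in_jurisdiction and the example tool-call strings are hypothetical.

```python
def in_jurisdiction(governs: dict, tool_call_text: str) -> bool:
    """True if the tool call falls inside what the governs block claims to govern."""
    text = tool_call_text.lower()
    any_terms = [t.lower() for t in governs.get("any_terms", [])]
    all_terms = [t.lower() for t in governs.get("all_terms", [])]

    if not any_terms and not all_terms:
        # No jurisdiction terms declared: the memory makes no claim over this call.
        return False

    any_ok = not any_terms or any(term in text for term in any_terms)
    all_ok = all(term in text for term in all_terms)
    return any_ok and all_ok


governs = {"any_terms": ["refund", "payment"], "all_terms": ["production"]}
print(in_jurisdiction(governs, "issue_refund(env='production', amount=40)"))
print(in_jurisdiction(governs, "export_users(env='production')"))
```

Substring matching is crude and would need tightening before real use, but it shows where the check would sit: between the action_types test and the final action.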
The practical takeaway is simple: retrieval metadata and execution metadata are not the
same layer. One helps select the memory. The other helps prove that the selected
memory is allowed to govern the action being taken.
The Sharpened Precondition
The minimum metadata requirement has become more specific with each layer of this work.
The current honest statement:
▎ For full retrieval-time and execution-time protection, sensitive memories require
▎ both governs jurisdiction metadata and authority signals. governs enables the
▎ execution-time gate. Authority signals protect at retrieval time and ensure
▎ layered_action returns the correct action class. Either field alone leaves a gap the
▎ other cannot close.
This is still internally authored evidence on small, controlled packets. The
architecture overfitting risk is real — the gate was designed after seeing the failures
it rescues. The findings should be treated as a bounded internal result, not a general
principle.
The next required step is external authorship: a packet written by someone who did not
design this framework, in a domain they choose, with mixed metadata quality. If the
coverage map holds under that pressure, the precondition gets stronger. If it breaks,
that is more useful — it tells us what condition is missing.
If you want to author an external packet, the schema, evaluation harness, and gate
implementation are public: https://github.com/keniel13-ui/ai-memory-judgment-demo
If you are building agent memory systems — does each sensitive memory carry both a
governs block and at least one authority signal? Which cell of the coverage map does
your schema currently live in?
Top comments (7)
The part that jumps out is that the gate authorizes from governs.action_types, but that field lives in the same store retrieval just pulled from. So GATE_PASS really means "a field in the selected memory said it governs execute" — it doesn't say who asserted that field or whether they were entitled to. If whatever can write the store can also write governs, then the bottom-right "full chain" cell is fully traceable only within the store's own word: a poisoned write that drops governs.action_types: ["read"] onto a payment memory passes the gate clean.
That's the retrieval-time false certainty moved up a layer. UNATTRIBUTABLE today means "no authority field constrained it," but there's a sibling case you're currently scoring as GOVERNED — an authority field that's present but unverifiable. I'd split those. What's held up for me is making governs/verification_required carry a signature from an authority key you can check out of band, so GOVERNED means "attributable to a key that was actually allowed to set this," not "present in the retrieved memory."
It also gives you a handle on the retrieval-noise limit you flagged at the end: if a stray memory carrying ["execute"] gets selected, require the governs assertion to be signed against the resource and operation in play. The wrong target won't fire, because its signature doesn't bind to this action — and the attribution trace becomes something a reviewer can verify independently instead of something the store asserts about itself.
The gap is real and you've named it precisely. GATE_PASS currently terminates at the
store — "a field said it governs execute" is the store speaking about itself. A store
that can be written can be downgraded, and the downgrade doesn't produce a visible
failure. governs.action_types: ["read"] on a payment memory makes the gate skip, not
fail. GATE_SKIP on a high-sensitivity memory is the same exposure as UNATTRIBUTABLE —
but it currently doesn't score that way.
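The classification split is correct. What I've been calling GOVERNED conflates two
cases: an authority field that is merely present in the retrieved memory, and one that
is attributable to a key that was actually allowed to set it.
Those are different claims and they need different labels.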
The operation-bound signature approach is the right direction — elegant because it
solves store integrity and retrieval noise simultaneously. A stray memory carrying
["execute"] can't fire on the wrong target if its signature doesn't bind to the
operation in play.
After reading your ANP2 spec: what you're building is the substrate layer this problem
is pointing toward. Ed25519-signed typed events with capability declaration is exactly
the mechanism governs assertions need — and an append-only signed event log solves
store integrity at the architecture level rather than requiring a separate key registry
bolted on. The bootstrapping problem doesn't disappear, but a computable trust graph
without a central authority is a cleaner answer than what I was working through.
The question worth sitting with: does memory governance become a typed event class on a
signed-event substrate rather than a field in a YAML store? If it does, GOVERNED means
"this assertion was published as a signed event by a key in the trust graph" — which
is independently verifiable in the way you described.
Adding this to the claim ledger as an open question. This is the next layer.
You did the hard part already — eleven articles building the gate and the GOVERNED/UNATTRIBUTABLE taxonomy are what make the substrate point even sayable. Without an execution-time gate to anchor it, "sign the authority field" is just a slogan; you gave it somewhere to land, and the GATE_SKIP-equals-UNATTRIBUTABLE observation you just made is the sharper version of what I was reaching for.
And yes — that's the move. On a signed-event substrate GOVERNED stops being the store speaking about itself and becomes something a third party can check. Governance is a typed event: a capability grant signed by a key, binding {who may govern} × {action_types} × {resource}, not a field a store asserts. The gate no longer reads governs out of the retrieved blob — it resolves the action to the grant event that authorizes it and verifies the signature and the path from that key into the trust graph. Your UNATTRIBUTABLE and the second GOVERNED case collapse into one check: is there a signed grant that binds to this operation.
Two things fall out once it's events not fields. Revocation gets the same treatment — a field has no clean "this no longer holds"; you mutate it and the history's gone. As a signed event a revocation is itself replayable, so "was this authority live at decision time" is answerable after the fact — which is exactly what the reviewer/attestation case you and ancilis were circling actually needs. And the bootstrap you flagged is real but it narrows to one honest question: which keys seed the graph. Everything above that is derivable and checkable — a much smaller thing to get right than "is every governs field in every store authentic."
The one I'd hand back to your ledger: integrity isn't freshness. If the grant is an event but the memory is still a YAML blob, the verifier still has to bind the specific content it's about to act on to the grant — content hash inside the operation-bound signature — or a downgrade just moves from the governs field to the body that field points at. You've spent eleven articles finding exactly these seams, so that one's yours to name better than I can.
The GATE_SKIP = UNATTRIBUTABLE observation was exactly the right sharpening — I'd named
them as two failure modes with different explanations, but on an event substrate they
collapse into one check that either resolves or doesn't. The gate finds a signed grant
binding {action_types} × {resource} with a verifiable path into the trust graph, or it
finds nothing. The field-based taxonomy kept them separate because fields can be
present-but-wrong in different ways. Events collapse that: wrong is just the signature
not verifying. One failure mode, not two.
The revocation point is the one I'd underweighted. I'd been thinking about revocation
as a propagation problem — how do you push the change before stale authority persists
through a cache. But you're pointing at the harder problem: auditability after the
fact. A mutable field that gets overwritten loses the timeline. A reviewer asking "was
this authority live at decision time" cannot answer that from current state. A signed
revocation event makes that question answerable — timestamp, scope, key, verifiable.
That is what the compliance and attestation case actually needs. Not "is this authority
currently valid" but "was it valid at the moment this decision was made." The review
needs the full event sequence, not a snapshot of present state.
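A minimal sketch of that replay, assuming an append-only list of grant and revoke events with numeric timestamps. Signature verification is deliberately left out, and the names AuthorityEvent and was_live_at are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AuthorityEvent:
    kind: str         # "grant" or "revoke"
    grant_id: str     # which capability grant this event concerns
    timestamp: float  # when the event was published


def was_live_at(events: Iterable[AuthorityEvent], grant_id: str,
                decision_time: float) -> bool:
    """Replay the log up to decision_time; the last matching event decides."""
    live = False
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.grant_id != grant_id or event.timestamp > decision_time:
            continue
        live = event.kind == "grant"
    return live


log = [
    AuthorityEvent("grant", "g1", 100.0),
    AuthorityEvent("revoke", "g1", 200.0),
]
print(was_live_at(log, "g1", 150.0))  # True
print(was_live_at(log, "g1", 250.0))  # False
```

A mutable field can only answer the second question. The event sequence answers both.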
Bootstrap narrowing to which keys seed the graph is the right contraction. "Everything
above that is derivable and checkable" is a much more honest account of where the hard
part actually sits than "every governs field must be authentic." The latter is a
distributed maintenance problem with no clean solution. The former is a bounded and
reasonably tractable trust assumption.
The one you handed back: integrity isn't freshness. I'd been treating the
grant-as-event as the full close. It is not. If the grant is a signed event but the
memory is still a YAML blob, a valid grant against wrong content is not authority — the
downgrade just shifts. Instead of asserting authority through the governs field, a
compromised or stale store now asserts through the body the grant points at. The full
close is content hash inside the operation-bound signature: the gate resolves action →
grant event → signature → content hash → trust path. Anything shorter leaves a seam
between the capability claim and the content it's claimed to govern.
That goes into the ledger as the next claim: a grant event without content-addressed
binding is a partial solution. The verifier needs to know both that the capability was
granted and that the content it is about to act on is the content the grant bound to at
signing time. "Valid signature, wrong blob" is still a failure mode — it is just a
different seam than the ones the gate was built to close.
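A sketch of the content-binding step only. Loud assumption: HMAC-SHA256 from the Python standard library stands in for the Ed25519 signature discussed above, so this uses a shared secret, not a public key, and it skips the trust-path check entirely. The names sign_grant and verify_grant are hypothetical.

```python
import hashlib
import hmac
import json


def content_hash(body: str) -> str:
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def sign_grant(key: bytes, action_types, resource: str, body: str) -> dict:
    """Bind action_types, resource, and the memory body's hash under one signature."""
    claim = {
        "action_types": sorted(action_types),
        "resource": resource,
        "content_hash": content_hash(body),
    }
    payload = json.dumps(claim, sort_keys=True).encode("utf-8")
    # HMAC is a stand-in for a real asymmetric signature.
    return {"claim": claim, "sig": hmac.new(key, payload, hashlib.sha256).hexdigest()}


def verify_grant(key: bytes, grant: dict, action_type: str,
                 resource: str, body: str) -> bool:
    payload = json.dumps(grant["claim"], sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, grant["sig"]):
        return False  # the claim was altered or signed with another key
    claim = grant["claim"]
    return (
        action_type in claim["action_types"]
        and claim["resource"] == resource
        and claim["content_hash"] == content_hash(body)  # valid signature, wrong blob
    )


key = b"demo-key"            # illustrative secret
body = "payment runbook v1"  # illustrative memory body
grant = sign_grant(key, ["execute"], "payments", body)
print(verify_grant(key, grant, "execute", "payments", body))
print(verify_grant(key, grant, "execute", "payments", body + " (edited)"))
```

The second call fails even though the signature itself is intact, which is the "valid signature, wrong blob" case.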
The compliance version of this question is whether the trace satisfies what an external reviewer needs, not just whether a trace exists. Logging what happened and attesting which control authorized it are different artifacts. Most teams building in this space end up needing both and realize they built only the first.
That distinction is exact and worth naming directly.
What the attribution trace gives you: for each decision, a named field
(action_authorized_by) and a status — GOVERNED, AUTHORITY_ONLY, DEFAULT, or
UNATTRIBUTABLE — that identifies which specific field in the retrieved memory
authorized the action, not just that an action occurred. It's closer to a compliance
trace than a plain event log.
What it doesn't give you yet: a separate, externally inspectable attestation artifact.
The trace lives inside the evaluator output — not as a standalone record linking memory
ID + authority field + action class + query timestamp against a versioned control
registry. An external reviewer can read the trace, but they're reading an embedded
field in a results file, not an independently auditable artifact with its own chain of
custody.
The gap you're naming is exactly that second layer. Most systems produce logs.
Compliance-grade systems produce attestations — records that can be examined,
versioned, and challenged independently of the system that generated them.
We haven't built that layer yet. The schema supports it: every decision carries
action_authorized_by, attribution_status, and the memory's governs field, which would
serve as the control reference. The missing piece is the attestation format — a
separate artifact, independently addressable, that proves the control existed and was
consulted at decision time, not reconstructed after the fact.
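As a starting point, a sketch of what a standalone attestation record could contain, using the fields named above. The format is hypothetical: the function names, the registry_version parameter, and the idea of addressing the record by the SHA-256 of its canonical JSON are all assumptions. A content address only makes later edits detectable. It does not establish chain of custody on its own.

```python
import hashlib
import json


def _canonical(record: dict) -> bytes:
    # Stable serialization so the same record always hashes to the same id.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_attestation(memory_id: str, authority_field: str, action_class: str,
                      query_timestamp: str, registry_version: str) -> dict:
    """Create a record that is addressable independently of the evaluator output."""
    record = {
        "memory_id": memory_id,
        "authority_field": authority_field,
        "action_class": action_class,
        "query_timestamp": query_timestamp,
        "registry_version": registry_version,
    }
    return {"attestation_id": hashlib.sha256(_canonical(record)).hexdigest(),
            "record": record}


def check_attestation(attestation: dict) -> bool:
    """Recompute the id; a mismatch means the record changed after it was issued."""
    recomputed = hashlib.sha256(_canonical(attestation["record"])).hexdigest()
    return recomputed == attestation["attestation_id"]


att = build_attestation(
    memory_id="mem-001",                     # illustrative values throughout
    authority_field="governs.action_types",
    action_class="verify_first",
    query_timestamp="2024-01-01T00:00:00Z",
    registry_version="v1",
)
print(check_attestation(att))
```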
That's the next honest build target.