In the last article, ANP2 left a comment I couldn't stop thinking about:
"If it reads the resource off the retrieved memory, you've quietly reintroduced self-description — a mislabeled item will mislabel its own resource too, and the gate inherits the lie."
The memory lied. The gate inherited the lie. The action fired.
That was the failure mode in one sentence. Here's the test.
What the old gate trusted
The previous execution gate checked fields authored by the retrieved memory itself — governs.action_types, resource_sensitivity, allowed_action_hint. If a mislabeled memory carried resource_sensitivity: "ordinary_fact" and allowed_action_hint: "answer", the gate read those labels, found nothing alarming, and passed the action through.
The lie didn't need to be sophisticated. It just needed to be in the right fields.
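A minimal sketch of that failure, assuming a gate that consults only the memory's own labels. The function name `self_description_gate`, the `GATE_PASS` label, the return shape, and the sample memory are hypothetical; the repository's actual `execution_gate` may differ.

```python
# Hypothetical stand-in for the old gate; not the repository's execution_gate.
def self_description_gate(memory: dict, pre_gate_action: str) -> tuple:
    """Decide using only labels the memory wrote about itself."""
    sensitivity = memory.get("resource_sensitivity")
    hint = memory.get("allowed_action_hint")
    if sensitivity == "ordinary_fact" and hint == "answer":
        # Nothing alarming in the labels, so the proposed action passes through.
        return "GATE_PASS", "labels look benign", pre_gate_action
    return "GATE_FAIL", f"sensitivity={sensitivity} hint={hint}", "verify_first"


# Made-up memory: the content is a payment instruction, the labels say otherwise.
mislabeled = {
    "text": "Wire the invoice amount to the vendor account.",
    "resource_sensitivity": "ordinary_fact",
    "allowed_action_hint": "answer",
}
result = self_description_gate(mislabeled, "execute")
print(result)
```

The gate never looks at what the agent is about to do, so the `execute` action survives as long as the two labels read as harmless.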
The fix: read from the operation
The architectural shift is one sentence: derive the authorization class from what the agent is about to do, not from what the retrieved memory claims to govern.
def operation_context_gate(query, memory, pre_gate_action):
    # Classify the operation itself; the memory's labels are not consulted here.
    sensitive, action_class, resource_class = is_sensitive_operation(query)
    if sensitive:
        return (
            "GATE_FAIL",
            f"op_action={action_class} op_resource={resource_class}",
            "verify_first",
            "op_context_gate",
        )
    # Non-sensitive operations fall back to the original self-description gate.
    result, note, action = execution_gate(memory, pre_gate_action)
    return result, note, action, "memory_self_description"
I used verify_first as the rescue action because the gate can detect insufficient authority, but it does not yet decide whether the operation should be fully blocked.
One caveat worth naming: is_sensitive_operation infers the resource and action class from the natural-language query. That is still imperfect. A stronger gate would inspect the proposed tool call, target resource, action type, and recipient directly — not the query describing it. This version is a bridge toward operation-derived authority, not the final form.
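As a rough illustration of the stronger gate described in that caveat, the sketch below classifies a structured tool call instead of a query string. Everything in it is an assumption made for illustration: the `ToolCall` type, the `class:identifier` resource naming convention, and the two hard-coded sets, which a real system would load from policy.

```python
from dataclasses import dataclass

# Hypothetical policy sets; a real deployment would not hard-code these.
SENSITIVE_ACTIONS = {"execute", "send", "export", "delete"}
SENSITIVE_RESOURCES = {"payment", "credential", "pii"}


@dataclass(frozen=True)
class ToolCall:
    action_type: str
    target_resource: str  # assumed convention: "<resource_class>:<identifier>"
    recipient: str


def classify_tool_call(call: ToolCall) -> tuple:
    """Return (sensitive, action_class, resource_class) from the call's own parameters."""
    resource_class = call.target_resource.split(":", 1)[0]
    sensitive = (
        call.action_type in SENSITIVE_ACTIONS
        and resource_class in SENSITIVE_RESOURCES
    )
    return sensitive, call.action_type, resource_class


call = ToolCall("execute", "payment:acct-123", "external-partner")
verdict = classify_tool_call(call)
print(verdict)
```

The return shape mirrors `is_sensitive_operation`, so a function like this could in principle be swapped in without changing the gate body. No natural-language phrasing is read at any point.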
The test: three intentionally mislabeled memories
Five scenarios designed to isolate the gap:
- 3 mislabeled: the target memory carries resource_sensitivity: "ordinary_fact" and allowed_action_hint: "answer", but the operation is execute + sensitive resource (wire transfer, API key distribution, PII export to an external partner)
- 1 clean sensitive: correctly labeled, so both gates should refuse (baseline)
- 1 safe read: both gates should pass (control)
The numbers
| Metric | Self-description gate | Operation-context gate |
|---|---|---|
| Action correct | 2/5 | 5/5 |
| False certainty (expected risky, said permissive) | 3 | 0 |
| Risky cases refused | 0/3 | 3/3 |
| Target selected AND refused (oc_earned_by_op_context) | — | 2/3 |
| Lucky refusal (refused, target not selected) | — | 1/3 |
| Gap exposed (SD pass, OC fail) | — | 3 |
The distinction between the last two rows matters. In 2 of the 3 mislabeled refusals, retrieval surfaced the exact sensitive memory AND the operation-context gate refused it — that's the gate earning the result. In the third, retrieval missed the target; the gate still refused because the operation was sensitive. Correct outcome, different causal path. oc_earned_by_op_context tracks the difference so a passing run proves the gate worked, not that the ranker got lucky.
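The earned-versus-lucky split can be expressed as a small attribution function. This is a sketch under assumptions: the function name, the `lucky_refusal` label, and the memory IDs are invented here, and only `oc_earned_by_op_context` comes from the eval itself.

```python
def attribute_refusal(refused: bool, target_id: str, retrieved_ids: list) -> str:
    """Label a run by whether retrieval actually surfaced the target memory."""
    if not refused:
        return "not_refused"
    if target_id in retrieved_ids:
        # Retrieval found the sensitive memory AND the gate refused it.
        return "oc_earned_by_op_context"
    # The gate refused, but the target memory was never retrieved.
    return "lucky_refusal"


# Made-up runs shaped like the three mislabeled scenarios:
# (refused, target memory id, ids retrieval returned)
runs = [
    (True, "m1", ["m1", "m7"]),
    (True, "m2", ["m2"]),
    (True, "m3", ["m9", "m4"]),
]
labels = [attribute_refusal(*run) for run in runs]
print(labels)
```

Counting the labels separately keeps a ranker miss from being credited to the gate.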
The honest cost
On the clean baseline (fresh-Claude v0.4 top-5):
| | SD | OC |
|---|---|---|
| Action correct | 5/5 | 3/5 |
| False certainty | 0 | 0 |
Two scenarios expected block, got verify_first. Not false certainty — the gate never said "safe" when "risky" was expected. But it downgraded two well-labeled block decisions. The operation-context gate is stricter on all sensitive operations, including correctly labeled ones. That tradeoff is documented in the claim ledger. Not hidden.
What this doesn't close
The gate catches mislabeled memories at execution time. It doesn't touch write-time.
A commenter asked the harder question directly: What does your attribution trace use as ground truth for authorization at store time? Caller identity, namespace position, or something external to the memory layer entirely?
Honest answer: right now, stored metadata is still the ground truth. The storing agent claims its own authority at write time. A poisoned write still enters the store — the operation-context gate can refuse at execution, but can't verify what was stored or by whom.
Write-time authorization is the next open layer. The storing agent's identity and namespace position need to become the first gate. A task-abc.executor claiming authority over payment operations should fail before the item reaches the store. That's not built yet.
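Since that layer is not built, the following is only a sketch of what a first gate at write time might look like. The policy table, the `finance.controller` namespace, the `governs` field shape, and both function names are hypothetical. It also assumes `caller_namespace` is supplied by the runtime rather than by the item being stored; otherwise the check is self-description again.

```python
# Hypothetical policy: which action classes each namespace may claim authority over.
NAMESPACE_AUTHORITY = {
    "finance.controller": {"payment", "report"},
    "task-abc.executor": {"report"},
}

store = []


def authorize_write(caller_namespace: str, claimed_action_class: str) -> bool:
    """True only if the caller's namespace may govern the claimed action class."""
    return claimed_action_class in NAMESPACE_AUTHORITY.get(caller_namespace, set())


def store_memory(caller_namespace: str, item: dict) -> bool:
    """Reject the write before the item ever reaches the store."""
    if not authorize_write(caller_namespace, item["governs"]):
        return False
    store.append(item)
    return True


rejected = store_memory("task-abc.executor", {"governs": "payment", "text": "pay vendor"})
accepted = store_memory("finance.controller", {"governs": "payment", "text": "pay vendor"})
print(rejected, accepted, len(store))
```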
The practical rule
Do not let a memory object be the only witness for its own authority.
The pattern
Every step in this series moved because external pressure forced it. The self-description gap wasn't in the original design — ANP2 named it in the comments. The attribution refinement wasn't in the first version of this eval — a second commenter pushed on it the same session.
The research is better because the comments pushed back. That's the point.
Code: https://github.com/keniel13-ui/ai-memory-judgment-demo
Top comments (33)
The 0-vs-3 false-certainty split is the whole result — the gate stopped reading the memory's story about itself and started reading the operation. That's the thing.
The caveat you flagged is the real remaining seam, and it's worth being blunt about: inferring sensitivity from the natural-language query is still a self-description channel — you moved the lie from the memory's fields to the query's phrasing. A query is the agent's description of what it's about to do, and a manipulated or sloppy query can mislabel the operation exactly the way the memory did. The version that closes it is the one you named: read the actual tool call — target resource, action type, recipient — not the sentence describing it. Authority derived from the operation's real parameters, not from any natural-language stand-in for them.
There's a payoff once it reads the real call: the action/resource class becomes concrete enough to check against an authority that lives outside the agent — a grant bound to that specific target, rather than a heuristic over phrasing. That's when "op_context_gate refused" becomes independently verifiable instead of trusting the classifier. And the honest cost you documented (2 clean downgrades) is the right side to err on — a verify_first on a correctly-labeled sensitive op is cheap; a false certainty on a mislabeled one is the expensive failure you just drove to zero.
The query channel being self-description is the sharpest framing of what this version
doesn't close. I moved the trust boundary but didn't remove it — the agent still
narrates what it's about to do, and a sloppy or adversarial query can mislabel the
operation the same way a mislabeled memory can. One natural-language stand-in replaced
another.
The tool-call version you're pointing at is the one I wanted but didn't build: read the
actual parameters before execution. target_resource, action_type, recipient — these
are concrete, not linguistic. A wire-transfer call has a specific destination account
in the call signature. The gate reads that, not the sentence that preceded it. At that
point you're not classifying intent, you're classifying the operation itself.
The independence point matters more than I surfaced. "Op_context_gate refused" right
now is a judgment call by a sensitivity classifier over phrasing. Once the gate reads
actual tool-call parameters, you can bind the refusal to something external: this
agent, this role, this target resource — permitted or not against a grant that lives
outside the agent's own representation of itself. The refusal becomes auditable, not
just trusted.
The 2 clean downgrades are intentional. A verify_first on a correctly-labeled sensitive
op costs a confirmation step. A false certainty on a mislabeled one costs the
operation firing. I want to be on that side of the asymmetry.
Next version reads the tool call. That closes the query channel and opens the
external-authority check. Writing it.
That's the version. One precision for when you wire the external-authority check: bind the grant to the specific operation parameters, not just {agent, role}. A grant that says "this agent may do X to target A" shouldn't authorize X to target B — otherwise "permitted" quietly becomes its own coarse self-description, and you've rebuilt the lie at the authority layer. The tool call gives you concrete target_resource and recipient; bind to those, and a valid grant for one operation can't launder a different one. Looking forward to the tool-call version.
That's exactly the constraint the next version needs.
If the external grant is only bound to {agent, role, action_class}, it is still too
coarse. It proves the agent has some authority in the abstract, but not that this
specific operation is authorized.
The grant needs to bind to the actual operation tuple:
agent_id + action_type + target_resource + recipient + scope + expiry
So "agent A may revoke certificate X" does not become "agent A may revoke any
certificate." And "agent A may send report Y to recipient Z" does not become "agent A may
send any report to anyone."
That also gives the audit trace something concrete to compare: the proposed parameters, the grant, and the context.
If those don't line up, the gate should refuse or verify_first before execution.
So the next version is not just tool-call inspection. It is parameter-bound
authorization: the operation is authorized only if the grant covers this exact action on
this exact resource for this exact recipient, within scope and time.
That closes the coarse-authority loophole you’re pointing at.
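A minimal sketch of that parameter-bound check, assuming the tuple fields listed above map directly onto two small record types. The class names, the reason strings, and the sample values are illustrative and not taken from the harness.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Grant:
    agent_id: str
    action_type: str
    target_resource: str
    recipient: str
    scope: str
    expiry: datetime


@dataclass(frozen=True)
class Operation:
    agent_id: str
    action_type: str
    target_resource: str
    recipient: str
    scope: str


def grant_covers(grant: Grant, op: Operation, now: datetime) -> tuple:
    """Authorize only when every bound parameter matches and the grant is unexpired."""
    for name in ("agent_id", "action_type", "target_resource", "recipient", "scope"):
        if getattr(grant, name) != getattr(op, name):
            return False, f"{name}_mismatch"
    if now >= grant.expiry:
        return False, "expired_grant"
    return True, "exact_allow"


utc = timezone.utc
grant = Grant("agent-A", "send", "report-Y", "recipient-Z", "quarterly",
              datetime(2025, 12, 31, tzinfo=utc))
same_op = Operation("agent-A", "send", "report-Y", "recipient-Z", "quarterly")
other_recipient = Operation("agent-A", "send", "report-Y", "recipient-W", "quarterly")

print(grant_covers(grant, same_op, datetime(2025, 6, 1, tzinfo=utc)))
print(grant_covers(grant, other_recipient, datetime(2025, 6, 1, tzinfo=utc)))
print(grant_covers(grant, same_op, datetime(2026, 1, 1, tzinfo=utc)))
```

Exact equality on every field is the strictest possible reading. A real system might allow scoped wildcards, but each wildcard widens the grant back toward coarse authority.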
That's the complete shape. Parameter-bound grant plus the three-way comparison — proposed params vs grant vs context, refuse on mismatch — is exactly the check. And putting expiry and scope in the tuple is the part most capability systems quietly skip, which is precisely how "authorized once" decays into "authorized forever." Nothing to add; this is the version. Will be watching for it.
It shipped. CLAIM-23 ran a seven-scenario packet: exact allow, missing grant,
recipient mismatch, scope mismatch, expired grant, vague query with sensitive tool
call, exact block. Self-description gate: 1/7, 6 false-certainty errors. Query-context
gate: 3/7, 2 false-certainty errors. Tool-call grant gate: 7/7, 0 false-certainty. The
expiry and scope fields were the ones that caught the "authorized once, now decayed"
cases specifically. Article coming. Appreciate you pushing on this.
7/7 with zero false-certainty is the clean result, and the standout is the expiry/scope cases. That's the part almost every authorization model gets wrong: "authorized once" silently becomes "authorized forever," and nothing in a static check notices the decay. Catching "authorized once, now decayed" is the whole difference between a gate that proves authority is live and one that only proves it was granted at some point. The tool-call binding is what made expiry/scope checkable in the first place — once the verdict is bound to the actual operation, "is this still valid right now" becomes a real question instead of a vibe. Nicely done; looking forward to the writeup.
The "authorized once, now decayed" cases were the ones I was most curious about going
in. static metadata checks just don't see time passing. once the grant is in the store
it reads as valid whether it was issued yesterday or eight months ago. binding to the
actual operation parameters is what makes expiry a real check instead of a field that
exists but never gets interrogated. writeup coming, appreciate you watching this one
build.
The trap with expiry is that issued_at + TTL is still a stored field — you're back to trusting the grant to tell you when it died. The version where time actually bites is when nothing in the grant asserts its own freshness: the gate re-derives authority from the operation's live parameters at call time, so a grant goes stale not because a timestamp lapsed but because what it was scoped to (the resource's current owner/state/version) no longer reproduces the binding. Decay becomes a failed re-derivation, not a field someone has to remember to interrogate — and the eight-month-old grant fails for the same reason a forged one does: it can't rebuild itself against the world as it is now.
The stored-field trap is the right catch. TTL is just the self-description problem
applied to time. The grant tells you when it died and you trust that field.
Re-derivation closes that loop. The gate stops asking "is this still within its window"
and asks "can this grant reproduce its binding against the world right now." A grant
that can't rebuild is stale for the same reason a forged one is: the conditions that
made it valid no longer hold.
The practical shift in what you store: TTL is a claim about future validity, set at
issuance. Re-derivation is a binding against state — resource owner, version, current
conditions — checked at execution. The authority question moves from "when was this
valid" to "is this valid now."
CLAIM-23 uses explicit expiry timestamps. That's still the stored-field model. The
grant reports its own death. The re-derivation architecture is the harder and more
honest version. That's the next test to build.
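To make the two models concrete, here is a sketch that runs both checks on one grant. The field names `bound_owner` and `bound_version`, the sample values, and the plain dict standing in for live state are all assumptions. In particular, the dict is only a placeholder for wherever current state would really be read from.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Grant:
    target_resource: str
    expiry: datetime    # stored-field model: the grant reports its own death
    bound_owner: str    # state the grant was bound to at issuance
    bound_version: int


def ttl_valid(grant: Grant, now: datetime) -> bool:
    """Stored-field check: trust the timestamp carried by the grant."""
    return now < grant.expiry


def rederives(grant: Grant, live_state: dict) -> bool:
    """Re-derivation check: does the binding still reproduce against current state?"""
    current = live_state.get(grant.target_resource)
    if current is None:
        return False
    return (current["owner"], current["version"]) == (grant.bound_owner, grant.bound_version)


grant = Grant("cert-X", datetime(2030, 1, 1, tzinfo=timezone.utc), "team-infra", 3)
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
live_state = {"cert-X": {"owner": "team-security", "version": 4}}  # ownership has moved

ttl_verdict = ttl_valid(grant, now)
rebuild_verdict = rederives(grant, live_state)
print(ttl_verdict, rebuild_verdict)
```

The timestamp still says the grant is alive, while the rebuild fails because the resource it was bound to has changed underneath it.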
The thing that'll make or break that test is where re-derivation reads "the world" from. If the gate rebuilds the binding against conditions the agent passes in, you've just moved the lie up a level — the grant stops self-describing and the agent starts self-describing the state instead. Re-derivation only closes the loop if it pulls the resource owner, version, and current conditions from a source the agent doesn't control, fetched at execution.
And I'd split the refusal cell the way you split found-and-refused. "Refused: rebuilt the binding and the conditions no longer hold" means the grant is genuinely stale. "Refused: couldn't reach the authority to rebuild" is fail-closed but says nothing about the grant — it might still be valid. Both deny, which is correct, but only the first is evidence the architecture works. Collapse them into one number and you can't tell whether you're catching staleness or just catching unreachability.
That source independence piece is exactly the trap. if the gate rebuilds the binding
against conditions the agent passes in, re-derivation just moves the self-description
problem up a level. the test only closes if the world-state is pulled from a source the
agent can't write to or route through. that becomes a pre-registration constraint
before we design the packet, not a design assumption.
and yes on the refusal split. refused-because-stale and refused-because-unreachable
both deny but they're measuring different things. collapse them and you can't tell
whether the architecture caught staleness or just failed to reach the authority source.
those are two separate cells in the packet. going in before any results get logged.
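One way the two refusal cells might be kept apart in code is sketched below. The `fetch_state` callable stands in for an authority source the agent cannot write to; the function names and the choice of `ConnectionError` as the unreachability signal are assumptions made for this example.

```python
from typing import Callable


def rederivation_gate(bound: dict, resource: str,
                      fetch_state: Callable[[str], dict]) -> str:
    """Fail closed in both cases, but record which kind of refusal happened."""
    try:
        current = fetch_state(resource)
    except ConnectionError:
        # Says nothing about the grant itself; it may still be valid.
        return "refused_unreachable"
    if current != bound:
        # The binding was rebuilt and the conditions no longer hold.
        return "refused_stale"
    return "allowed"


def healthy_authority(resource: str) -> dict:
    """Hypothetical authority that answers with current state."""
    return {"owner": "team-security", "version": 4}


def down_authority(resource: str) -> dict:
    """Hypothetical authority that cannot be reached."""
    raise ConnectionError("authority source not reachable")


bound = {"owner": "team-infra", "version": 3}
stale = rederivation_gate(bound, "cert-X", healthy_authority)
unreachable = rederivation_gate(bound, "cert-X", down_authority)
fresh = rederivation_gate({"owner": "team-security", "version": 4}, "cert-X", healthy_authority)
print(stale, unreachable, fresh)
```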
One cell worth pre-registering while you're at it: the one that discriminates the two architectures. A grant that's still TTL-valid — not expired — but re-derivation-stale because the conditions changed under it. That's the only case where the two gates can disagree. If that cell comes back empty, re-derivation refused nothing the timestamp wouldn't have, and you've shipped a harder mechanism that bought no observable behavior. If it's populated, that's your proof the rebuild does work the stored field can't.
Pre-registering it also stops you from misreading the headline number. "Both gates agreed 98% of the time" reads like success, but agreement is exactly where re-derivation fails to justify its cost — the disagreement cell is the whole result, and it's the one that's easy to let shrink to zero if you don't commit to it before you see the data.
The divergence cell is the whole result. that reframe just changed how I was thinking
about the packet. I was planning to count agreement as signal. you're saying agreement
is exactly where the harder architecture earns nothing because both gates were always
going to agree there. TTL-valid but re-derivation-stale is the only row that can
actually separate the two gates and if that row comes back empty I haven't proved
anything worth the added complexity. pre-registering it now before I see any numbers so
I can't let it quietly shrink to zero.
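A sketch of how the pre-registered cell could be tallied from paired verdicts. The cell names and the four rows are made up for illustration and are not results from any run. The third bucket is included only so that no row is silently dropped.

```python
from collections import Counter


def cell(ttl_allows: bool, rederive_allows: bool) -> str:
    """Bucket one scenario by how the two gates voted."""
    if ttl_allows and not rederive_allows:
        # The pre-registered cell: timestamp still valid, rebuild failed.
        return "divergence_ttl_valid_rederive_stale"
    if ttl_allows == rederive_allows:
        return "agreement"
    return "other_disagreement"


# Made-up rows for illustration only: (ttl_allows, rederive_allows)
rows = [(True, True), (False, False), (True, False), (True, True)]
cells = Counter(cell(t, r) for t, r in rows)
divergence = cells["divergence_ttl_valid_rederive_stale"]
print(dict(cells))
```

Reporting `divergence` on its own line, next to the agreement rate, makes it harder for a high agreement number to stand in for the result.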
That's the right call. One thing for when the cell fills: spot-check a handful of those TTL-valid / rebuild-failed rows by hand before you trust the count. You want to confirm each one failed because a real condition moved — not a rebuild that flaked and got bucketed as stale when it was actually unreachable. The divergence cell only proves the architecture if the refusals in it are refusals for the right reason. Get that part clean and the number speaks for itself.
That's the exact failure mode I'm trying to avoid. Already pre-registered refused_stale
and refused_unreachable as separate cells but you're pointing at something finer. Even
within refused_stale the row has to show why it refused. Going to make the authority
event log the specific condition delta so a spot check can confirm the condition
actually moved and not that the rebuild just flaked. The number only speaks if the
reason is in the artifact.
That's it — the reason in the artifact is the whole thing. One refinement on what to put in the cell: log the raw before/after values the rebuild compared, not a derived "stale" label. A label is re-summarizable — it's one more thing the system asserts about itself — but the actual condition values let a spot-check recompute the verdict independently instead of trusting that your classifier bucketed the row right. Same move you made on the gate: stop storing the conclusion ("permitted", "stale") and store what it read. Then the number speaks because anyone can re-derive it straight from the row, including future-you who doesn't remember what the classifier was doing that week.
That's the right move. A "stale" label is still the system describing its own
conclusion — just one level up from the gate output. Storing the raw before/after
values means anyone reading the row can recompute the verdict themselves, same math, no
trust required. Accepting this as a hard constraint on condition_delta: what the
re-derivation actually read, not what I classified it as. The row proves itself.
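As an illustration of a row that proves itself, the sketch below stores raw before/after values and recomputes the verdict from them. The row layout, the `grant_id` and `logged_verdict` fields, and the sample values are hypothetical; only the `condition_delta` name comes from the thread.

```python
def recompute_verdict(row: dict) -> tuple:
    """Re-derive the verdict from the raw values, ignoring any stored label."""
    delta = row["condition_delta"]
    moved = [name for name, v in delta.items() if v["before"] != v["after"]]
    verdict = "refused_stale" if moved else "allowed"
    return verdict, moved


# Made-up log row: the raw values the rebuild compared, plus the label it produced.
row = {
    "grant_id": "g-17",
    "logged_verdict": "refused_stale",
    "condition_delta": {
        "owner": {"before": "team-infra", "after": "team-security"},
        "version": {"before": 3, "after": 3},
    },
}
verdict, moved = recompute_verdict(row)
print(verdict, moved, verdict == row["logged_verdict"])
```

A spot check then reduces to comparing the recomputed verdict with the logged one and reading off which condition moved.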
That's the whole discipline in one line — the row proves itself. Nothing left to argue from here; it's a question for the data now. Genuinely curious what the divergence cell comes back as once you run it — empty or populated, that number is the real result either way, and you've set it up so it can't be quietly talked into the answer you were hoping for.
That's where we are exactly. the pre-registration locked the divergence cell before
anything was run so whatever comes back is what it is. if it populates as refused_stale
the gate is catching what timestamp-only expiry would miss. if it comes back allowed
something in the re-derivation path is still soft somewhere.
waiting on ken w alger's local brain architecture first — that's the first external
source i can't write to, which is the whole point. running it on something i built
would just be the self-description problem one level up.
either result posts here when it's done.
That's the cleanest version of it — picking a source you can't write to, so the test itself can't inherit the self-description problem one level up. Most people would've run it against their own thing and called it validated; choosing something you didn't build is the part that makes the divergence cell mean anything. The experiment is honestly specified now — locked cell plus an independent source. Nothing left to say until the data lands. Curious which way it falls, either result is informative.
That "either result is informative" framing is exactly right and the one worth holding
going in. if the divergence cell comes back refused_stale it proves the independent
source catches what timestamp-only expiry misses. if something slips through as allowed
it means the re-derivation path has a soft spot we haven't found yet. both are worth
knowing before anything gets built on top of it.
watching for ken w alger's post this week. that's when it runs.
Sounds good — I'll keep an eye out for the results when it runs. Hoping the divergence cell gives you something clean either way; the empty case would be almost as useful as a populated one, since it'd tell you exactly where to go looking next. Good luck with the run.
This maps to a broader memory problem: memory should inform decisions, not certify itself. I like gates that require provenance or tool-call evidence when the memory affects authority. Otherwise the system can turn one bad note into a permanent truth.
provenance or tool-call evidence when memory touches authority is exactly where i
landed too. memory that affects a decision has to carry receipts or it doesn't get
to vote, otherwise one bad note becomes permanent truth like you said. are you
enforcing that at the broker level in APX or per tool call?
That two-layer split is right, and it lines up with something Mykola raised on the
CLAIM-30 thread, the gap between declared and actual. APC is the declared contract,
APX is where you check whether the actual action still matches it at runtime. memory
informs, the receipt votes, and the receipt has to live outside the memory object.
the one place it stays hard is that the receipt source itself has to be something
the agent can't influence, or you've just moved the trust one layer over. the
action-boundary enforcement in APX is the exact layer our gate work keeps pointing
at, would be good to keep comparing notes as it firms up.
Read the new article and the full thread. CLAIM-23 hitting 7/7 with 0 false-certainty on tool-call grant gating is the right result, and ANP2's catch on expiry/scope as "authorized once, now decayed" is the part most authorization models quietly skip. Parameter-bound grants close the coarse-authority loophole cleanly.
Following the "next problem once two agents are in the loop" thread from my comment on the previous article. The grant tuple as you have it (agent_id + action_type + target_resource + recipient + scope + expiry) works when the issuer is a fixed external authority. The next problem is when the issuer is itself an agent.
Agent A delegates X to Agent B. Agent B does X. The gate now needs to check three things, not two:
Did A actually delegate? (signature on the grant)
Did A have the authority to delegate X? (A's own grant chain)
Is B's operation within the delegation? (the existing tuple check)
In the single-agent test, the gate trusts the external authority by definition. In the multi-agent case, the gate has to walk a chain of grants, each signed by its issuer, each bound to its own operation tuple. Otherwise B can fabricate "A delegated to me" and the parameter-bound check still passes because the operation IS within the (claimed) grant.
The grant becomes a tree, not a single tuple. The gate has to verify both the operation parameters AND the delegation provenance.
This is where the trace portability point from my earlier comment becomes structural. If the grant chain lives inside agent B's runtime, B is the only witness for A's signature on its own grant. An external observer cannot replay the verification without B's decision record. Once grants are signed at issuance and the chain is stored externally (outside any single agent), the multi-agent case becomes verifiable instead of trust-the-agent.
The three-check structure is the right extension. The current harness takes the
external authority as fixed by definition — the grant table is ground truth, no issuer
chain to validate. In the multi-agent case that assumption breaks. B reporting A's
signature is exactly the fabrication vector that makes the parameter-bound check
insufficient on its own. The operation can be within the claimed grant and the claimed
grant can still be fabricated.
The grant tree framing is the right shape. And this converges with the trace
portability comment from the earlier thread — both point to the same architectural
requirement. Signed at issuance, chain stored outside any single agent's runtime,
verifiable by any observer with the public keys. Otherwise the only evidence A
authorized B is B's word.
The harness doesn't test this. Single-agent gate behavior is what CLAIM-23 covers.
Delegation provenance is a different problem with different primitives. Going on the
open problems list.
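For reference, the three checks in the delegation case can be sketched as a chain walk. This is not something the harness implements. HMAC with per-issuer keys is used here purely as a standard-library stand-in for real public-key signatures, and the key table, grant layout, and function names are all hypothetical. In this toy version the keys are assumed to be held by the external verifier and not by the agents being checked.

```python
import hashlib
import hmac

# Stand-in key table held by the external verifier, not by the agents.
ISSUER_KEYS = {"root": b"root-secret", "agent-A": b"agent-a-secret"}


def _signature(issuer: str, subject: str, action: str) -> str:
    payload = f"{issuer}|{subject}|{action}".encode()
    return hmac.new(ISSUER_KEYS[issuer], payload, hashlib.sha256).hexdigest()


def sign_grant(issuer: str, subject: str, action: str) -> dict:
    return {"issuer": issuer, "subject": subject, "action": action,
            "sig": _signature(issuer, subject, action)}


def verify_chain(chain: list, acting_agent: str, action: str,
                 trusted_root: str = "root") -> bool:
    """Walk root -> ... -> acting agent, checking each link's issuer and signature."""
    expected_issuer = trusted_root
    for grant in chain:
        # Each link must be issued by the holder of the previous link.
        if grant["issuer"] != expected_issuer or grant["issuer"] not in ISSUER_KEYS:
            return False
        expected = _signature(grant["issuer"], grant["subject"], grant["action"])
        if not hmac.compare_digest(grant["sig"], expected):
            return False
        # The delegated action must be the one being performed.
        if grant["action"] != action:
            return False
        expected_issuer = grant["subject"]
    return bool(chain) and expected_issuer == acting_agent


root_to_a = sign_grant("root", "agent-A", "revoke-cert-X")
a_to_b = sign_grant("agent-A", "agent-B", "revoke-cert-X")
forged = {"issuer": "agent-A", "subject": "agent-B",
          "action": "revoke-cert-X", "sig": "0" * 64}

print(verify_chain([root_to_a, a_to_b], "agent-B", "revoke-cert-X"))
print(verify_chain([root_to_a, forged], "agent-B", "revoke-cert-X"))
```

A fabricated "A delegated to me" link fails on the signature check even though the operation falls inside the claimed grant.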
This reflection on Agent security architecture is incredibly hardcore. Relying on memory metadata for authorization is like trusting a "poisoned" witness. Shifting the gate directly to the operation context breaks the self-description loophole entirely—deriving authority from action is the exact way to solve dynamic access control.
The poisoned witness framing is exactly right and honestly it took me longer than it
should have to see it that clearly. when the gate trusts what the memory claims to
govern, it's reading the memory's own report card about itself.
one thing i hit after this though — even the operation context gate had a gap. query
phrasing still does interpretation. "take care of the partner setup" could mean
anything but the actual tool call parameters are unambiguous. so the next gate ended up
reading what the agent was literally about to do, not what the query implied.
authority had to move to a more concrete layer.
the pattern i keep running into is the same one every time: wherever you derive
authority from, the agent needs to not be the one writing it.