Self-Correcting Systems

Posted on May 31

In This Memory Test, Relevance Wasn't Authority

#ai #machinelearning #rag #agents

I ran the next set of tests on my AI memory judgment demo, and the result changed the shape of the problem.

The earlier result was:

Retrieval accuracy and action correctness can diverge.

The new result is sharper:

Some memory failures are not representation failures. They are authority-arbitration failures.

In plain language:

The system did not only need to find the memory most related to the query.

It needed to find the memory that was allowed to govern the action.

Those are not the same objective.

In this work, authority means the memory that is permitted to determine the action outcome, even when another memory looks more semantically relevant.

This is not a validation claim. It is a research note about a small failure-to-architecture progression: find a failure, build the smallest architecture that addresses it, then keep the boundaries visible.

Repo:

https://github.com/keniel13-ui/ai-memory-judgment-demo

Key files:

run_memory_store_eval.py
run_fresh_governs_eval.py
MEMORY_STORE_FINDINGS.md
FAILURE_FAMILY_INSPECTION.md
CLAIM_LEDGER.md
CLAIM_14_PRECEDENCE_PLAN.md

This is still a small research artifact, not a benchmark. The scenario count is small. The packets are controlled. Some later stress cases are internally designed. The fresh authors here are separate fresh model instances, not statistically independent samples and not a human reviewer panel.

But the failure pattern is real enough to study.

With n=5, each scenario moves a result by 20 percentage points. The tables below are not effect-size estimates. They are diagnostic traces of failure modes.

Quick metric glossary:

target selected: the selected memory was the hidden expected governing memory.
action correct: the selected memory produced the expected action class.
trap failure: the selected memory was one of the known distractors that should not govern.
downgrade miss: the action was less protective than required.
overblocking: the action was more restrictive than required.

The Harder Test

The new evaluator uses scenario-local memory stores.

Each scenario carries its own small memory set:

one memory that should govern the action,
several tempting distractors,
hidden target labels,
expected action class,
trap mechanics.

The question is:

Does retrieval select the memory that should govern the action, or a memory that merely sounds closer to the user query?

The most important packet was a five-scenario fresh-authored store:

external_scenarios/fresh_claude_v0_4_v2_2_external_stores.json

It included cases like:

medication dosage recall before filling a pillbox,
a historical invoice-total lookup,
a rotated Wi-Fi credential,
donor data release based on relayed approval,
contractor access to payment-capable systems.

The failures had a shared shape.

The correct memory often used abstract governance language:

current prescription label
verifiable named authorization
current access matrix
rotated credential

The distractor used concrete operational language:

heart pill
25mg
grant consultant
donor list
admin-ish reach
Wi-Fi password

The user also speaks operationally. So ordinary retrieval rewards the distractor.

Lexical Retrieval Failed In The Expected Way

On the fresh-authored packet, the best lexical strategy was bm25_metadata_text.

It selected the target memory in 3 out of 5 scenarios and got the action right in 4 out of 5.

That looks decent until you inspect the misses.

Strategy	Target selected	Action correct	Trap failures	Downgrade misses
`bm25_metadata_text`	3/5	4/5	2	1

One row exposed an important measurement problem.

In the dosage scenario, bm25_metadata_text selected the wrong memory, but still produced the right action class.

The selected memory was not the governing policy. It was the concrete dose fact. Because that fact still carried verification metadata, the action came out as verify_first.

So action correctness alone would mark the row as passing.

But the system acted on the wrong memory.

That matters because in a real memory system, the wrong memory can keep producing the right action by accident until a nearby case breaks it.

That is why I track both:

target selected,
action correct.

One Embedding Model Did Not Fix It

The next hypothesis was obvious:

Maybe lexical retrieval is the problem. Maybe embeddings will connect the concrete user query to the abstract policy.

So I ran nomic-embed-text on the same five stores.

On this packet, it did not fix the failure family.

It performed worse than the best lexical strategy:

Strategy	Target selected	Action correct	Trap failures	Downgrade misses
`bm25_metadata_text`	3/5	4/5	2	1
`nomic_embed_metadata_text`	1/5	3/5	4	2

This one embedding model regressed on two cases where lexical retrieval passed.

My interpretation is that the model found memories that answered the surface question, while safety required the memory that governed the action. That is an interpretation of these row-level failures, not a general claim about embedding retrieval.

In the Wi-Fi scenario, the user asked:

what's the Wi-Fi password you have saved?

The stale password memory is semantically close to that query because it directly answers the question.

The correct memory says the password was rotated and the current value lives with IT.

The embedding run selected the memory that answered the surface question.

Safety required the memory that governed the action.

Same thing happened in the contractor-access scenario. The loose-talk distractor was semantically close to "what reach does this seat get?" The governing memory was the policy requiring confirmation against the current access matrix.

This is why I do not think this packet's failure is only a representation problem.

I only tested one embedding model here. I did not test a second dense model, a cross-encoder reranker, hybrid retrieval, threshold tuning, or a model trained for policy retrieval. A retrieval researcher should read this as a counterexample to "just use embeddings" on this packet, not as a category-level result about semantic retrieval.

The retriever is doing what retrievers do:

Find what is relevant.

The safety problem asks a different question:

Which memory has authority over this action?

Direction B: Separate Authority From Relevance

The next strategy was a role filter:

role_filter_bm25_metadata_text

Instead of blending authority into one relevance score, it creates an authority lane.

Active policy, credential, and correction memories with authority signals get considered before ordinary relevance ranking.

On the same five-scenario packet:

Strategy	Target selected	Action correct	Trap failures
`bm25_metadata_text`	3/5	4/5	2
`nomic_embed_metadata_text`	1/5	3/5	4
`role_filter_bm25_metadata_text`	5/5	5/5	0

That is the first clean architecture result on this packet.

But it is not a solved-problem result.

The role filter depends on metadata quality. If a governing policy is not tagged as a policy, credential, correction, high-priority memory, or verification-required memory, the filter has nothing reliable to grab.

There is also a confound here: the winning strategies use structured governance metadata that the vanilla BM25 and embedding baselines do not exploit in the same way. So the result may be partly "governance metadata helps," not only "this particular authority-lane architecture is best."

So the next question was:

How brittle is this under metadata noise?

The First Quality Floor

I generated metadata-noise variants from the same packet.

The role filter stayed clean when only one target signal was missing or wrong:

missing memory_type,
wrong memory_type,
missing priority.

But when all target authority signals were corrupted, it collapsed back toward ordinary retrieval.

And when the authority lane was polluted with broad or competing policies, the unscoped role filter overblocked.

That identified the next requirement:

Authority memories need jurisdiction, not just role.

So I added explicit governs metadata:

{
  "any_terms": ["donor", "release", "export", "consultant"],
  "all_terms": [],
  "excluded_terms": ["auditor"]
}

This says what territory the memory governs.

In controlled noise tests, scope-aware filtering removed the unrelated-policy and competing-policy overblocks.

But controlled scope fields are not enough. I wrote those after seeing the failures.

That means this stage is explicitly post-hoc architecture work, not held-out validation.

The honest next test was whether fresh authors could write useful governs metadata without knowing the evaluator result.

Fresh-Authored Scope Worked On The First Packet

I ran three separate fresh-author passes on the same five-scenario packet.

The authors saw the request and the memory stores, but not hidden target labels or expected actions.

All three passes preserved the clean result:

Strategy	Target selected	Action correct	Trap failures	Overblocking
`scope_role_filter_bm25_metadata_text`	5/5	5/5	0	0

This supports one narrow claim:

On this packet, the concept of "what does this memory govern?" was authorable by fresh model instances.

It does not prove fresh authors can reliably write scope metadata in general.

The packet was still small. The metadata was visible. The scenario families were only five.

So I built a harder clutter packet.

Clutter Broke Scope Alone

The clutter packet added semantically tempting competing policies.

Not random noise. Real adjacent policies.

For example, the donor scenario had policies about:

donor data release to consultants,
donor data release to auditors,
non-sensitive project data to consultants,
aggregate donor counts,
a provisional expectation that the consultant engagement would need donor data.

That is the kind of clutter real memory systems accumulate.

Baseline pressure looked like this:

Strategy	Target selected	Action correct	Trap failures	Overblocking
`bm25_metadata_text`	1/5	4/5	4	0
`role_filter_bm25_metadata_text`	2/5	3/5	3	1

Fresh-authored scope metadata recovered a lot, but not everything.

Author A:

Strategy	Target selected	Action correct	Trap failures	Overblocking
`scope_role_filter_bm25_metadata_text`	4/5	5/5	1	0

Author B:

Strategy	Target selected	Action correct	Trap failures	Overblocking
`scope_role_filter_bm25_metadata_text`	3/5	4/5	2	1

Two failures mattered.

First, both authors repeated the Wi-Fi/device ambiguity.

A query about connecting a new laptop to the office network matched both:

the Wi-Fi credential policy,
the device enrollment policy.

Both were reasonable. But only one was the target for "what is the Wi-Fi password?"

Second, one author let a bank-reconciliation policy govern a read-only invoice total lookup.

The user was not asking to mark reconciliation complete. They were asking for a historical total.

Scope alone needed two more concepts:

specificity precedence,
action type.

CLAIM-14: Specificity + Action Types

The next strategy was:

scope_precedence_role_filter_bm25_metadata_text

It keeps the role lane and scope filter, then adds:

specificity precedence: if multiple scoped memories match, prefer the one whose governed terms overlap the query more specifically;
optional action_types: read, write, or execute.

This directly targets the two clutter failures:

Wi-Fi credential beats general device enrollment for a Wi-Fi password query;
a write/execute reconciliation policy does not govern a read-only invoice total lookup.

Then I ran two separate fresh action-type authoring passes.

Both were clean:

Pass	Target selected	Action correct	Trap failures	Overblocking
Fresh action-types pass 1	5/5	5/5	0	0
Fresh action-types pass 2	5/5	5/5	0	0

That is the current strongest result.

The safe claim is:

On a five-scenario clutter packet, adding specificity precedence and fresh-authored action-type tags restored 5/5 target selection and 5/5 action correctness in two separate fresh model passes.

The unsafe claim would be:

This solves authority arbitration.

It does not.

It removes the observed failure modes in this packet.

The biggest remaining validity threat is architecture-overfitting to the clutter packet family. The clutter packet exposed the scope failures, and scope_precedence plus action_types was designed after seeing those failures. The next meaningful test is not a larger packet from the same design process, but a genuinely new packet authored without knowledge of the observed failures or the resulting architecture.

Why I Separated Fresh Authoring From Audit

One more rigor point matters. After seeing the results, it is easy to ask a model for a stronger annotation pass. The output may be smarter, more detailed, and more convincing, but if that model has prior failure context, it is no longer fresh-author evidence.

So I separated the workflows:

fresh authoring: counts as evidence for authorability,
skeptical audit: useful for improving the schema, but not counted as independent evidence.

That protocol is now documented:

GOVERNS_AUDIT_PROTOCOL.md

This distinction matters because otherwise the work quietly becomes contaminated. If a model knows the prior failure, it can optimize around it. That may improve the system, but it no longer supports the claim that the metadata is naturally authorable from the packet alone.

This is also why I treat "separate fresh model instances" carefully. They reduce immediate chat-context leakage, but they may share model priors and systematic habits. They are not statistically independent human annotators.

What This Work Shows

The current evidence supports a narrow progression:

Fresh-authored memory stores exposed concrete distractors beating abstract policies.
Embedding retrieval did not fix the failure family and regressed on two cases.
A role-filter authority lane reached 5/5 on the clean packet but depended on metadata quality.
Scope metadata removed controlled authority-lane pollution in the noise harness.
Fresh authors could write useful scope metadata on the first packet.
A harder clutter packet exposed two new failures: jurisdiction-adjacent policies and read-vs-process overblocking.
Specificity precedence plus action_types removed those two failures in two fresh-author passes on the clutter packet.

That is progress, but only within the boundary of these packets.

But the boundary is just as important:

five-scenario packets,
deterministic evaluator,
no generation scoring,
no large mixed memory base,
no human external panel yet,
token-based scope matching,
keyword-based action type detection,
no held-out clutter packet yet after designing the CLAIM-14 strategy.

This is not validation.

It is a disciplined failure-to-architecture progression.

The Takeaway

The lesson is not "BM25 is bad."

The lesson is not "embeddings are bad."

The lesson is:

Relevance is not authority.

A memory can be highly relevant to a query and still have no right to govern the action.

A different memory can be less similar on the surface and still be the one the agent must obey.

That is why memory evaluation needs more than retrieval accuracy.

It needs to ask:

What did the retrieved memory authorize the agent to do?

And in harder cases:

Was that memory actually allowed to govern this action?

That is the problem I am trying to measure.

Top comments (6)

Harjot Singh • May 31

Relevance wasn't authority is a sharp distinction and it's the failure mode most retrieval systems bake in without noticing. Vector search ranks by semantic similarity, which answers is this on-topic, but on-topic and correct are different axes: a chunk can be a perfect embedding match and still be outdated, wrong, or from a low-trust source, and naive RAG happily hands the model the most relevant wrong thing with full confidence. Similarity is necessary, not sufficient. The fix is treating authority as a separate signal you rank on alongside relevance: source trust, recency, and corroboration, so a stale-but-topically-perfect chunk loses to a slightly-less-similar but authoritative one. This matters even more for memory than for document RAG, because memories accrete and a relevant memory from twenty sessions ago may have been superseded, relevant but no longer true is the exact trap. The mental model I'd push: retrieve by relevance, but rank by relevance times authority times freshness, and when the authoritative source contradicts the merely-relevant one, trust authority. Being on-topic is not the same as being right. That separate-relevance-from-authority instinct is core to how I think about retrieval and memory in Moonshift. In your test, what ended up being the best authority signal, source trust, recency, or agreement across multiple memories?

Self-Correcting Systems • May 31

Exactly. “Relevant but no longer true” is one of the traps this work is trying to make
visible.

In my test, the strongest authority signal was not agreement across memories. I haven’t
really tested corroboration yet.

The best signals were structured memory metadata:

memory role/type: policy, credential, correction vs fact/context/preference
status: active vs superseded/provisional/reported/floated
verification_required
allowed_action_hint: answer / verify_first / block
priority: high / critical
later, explicit governs scope and action_types

The most important architectural move was keeping authority separate from relevance
instead of blending them too early. A role-filter lane beat both lexical and embedding
retrieval on the adversarial stores because it asked: “which memory is allowed to govern
this action?” before ranking by ordinary relevance inside that lane.

Recency mattered in specific cases, especially the stale Wi-Fi credential scenario, where
an old password was highly relevant but superseded. But recency alone would not solve the
donor authorization or contractor access cases. Those needed policy authority and scope.

Agreement across multiple memories is a next-layer question. I think it matters, but only
if the agreeing memories are themselves authoritative. Three low-authority memories
agreeing should not outrank one active policy that governs the action.

So my current read is:

relevance finds candidates,
authority decides which candidates are allowed to govern,
freshness/status prevents stale memories from winning,
scope/action_type prevents broad policies from overblocking.

That’s the distinction the newer tests pushed me toward.

Harjot Singh • May 31

This is a great breakdown, and the role-filter lane asking which memory is allowed to govern this action before ranking by relevance is the sharpest part. That's the move most systems miss, they blend authority into the similarity score and let three low-authority memories outvote one active policy. Your ordering (relevance finds candidates, authority decides who may govern, freshness/status kills stale winners, scope prevents overblocking) is almost exactly the layering I'd argue for. The stale-Wi-Fi-credential case is the perfect example of why status beats recency alone, superseded is a fact about the memory, not just its age. The next-layer corroboration question you raise is the right caution too: agreement should only count among already-authoritative memories, or you've just rebuilt the popularity contest one level up. Authority-gates-then-relevance-ranks is the order that holds. This is genuinely good work, the kind of memory architecture I keep coming back to in Moonshift. Are you storing the role/status/scope as explicit metadata at write time, or inferring it at retrieval?

Self-Correcting Systems • May 31

Appreciate this. Yes, that “agreement only counts among already-authoritative memories”
point is exactly the trap I’m trying to avoid.

Right now the framework stores role/status/scope as explicit metadata at write or
annotation time, not inferred only at retrieval.

The core fields are things like:

memory type / role: policy, correction, credential, context, fact
status: active, superseded, provisional, reported, floated, etc.
scope: what this memory governs
action type: read, write, execute
exclusions: adjacent territory it should not govern

The retrieval-time system can still score relevance, but it is not allowed to treat
relevance as authority.

That was the lesson from the stale Wi-Fi and contractor-access cases. A stale credential
or floated access idea can be extremely relevant to the query, but its status
disqualifies it from governing the action.

The current open question is authoring quality: can humans or fresh model instances write
those metadata fields reliably without seeing the hidden evaluator answer? Early small-
packet results were promising, but I’m treating that as authorability evidence, not
validation.

So the short answer is: explicit metadata first, retrieval second. I don’t want the
system guessing authority from semantic similarity after the fact.

ancilis • Jun 1 • Edited

Isn’t authority arbitration at retrieval only half the problem? The other half would be/is at execution time. Even if the right memory governed retrieval, you need evidence that the authorized memory is what triggered the specific action. Coming from a regulatory mindset and compliance question: did authorization govern the tool call, not just the query?

Self-Correcting Systems • Jun 1

Sharp distinction and you're right to separate them.

The current framework operates at retrieval time — authority metadata shapes which
memory is selected, and we measure whether the resulting action is correct. But there's
no execution-time trace: evidence that the authorized memory causally triggered the
specific tool call, not just that a correct action happened to follow retrieval.

From a compliance standpoint that gap is real. Retrieval-time authority says the right
rule was in scope. Execution-time authority says that rule was the reason the agent
acted. Those are different claims.

Our evaluator treats action correctness as downstream of retrieval outcome. We don't
track: memory M7 (governs: payment_verification) → triggered verify_payment() →
timestamp T. That action attribution log is the execution-time provenance problem — and
it's open.

Current finding: authority metadata at retrieval time changes action outcomes across
packet families. Proving the authorized memory governed the specific tool call is the
next design question. Your framing is the right way to name it.