Self-Correcting Systems

Posted on May 26

I Tested Three AI Memory Retrieval Strategies. The Hard Failure Was Semantic

#ai #llm #machinelearning #agents

A deterministic test on 10 scenarios with 21 memory objects.

After writing about AI memory as judgment infrastructure, I wanted to turn the idea into something more inspectable.

Not a benchmark.
Not a claim of generalization.
Just a small artifact I could run, inspect, and criticize.

The question was simple:

If an agent has long-term memory, can it retrieve the right memory and choose the right action class?

By action class, I mean:

answer = safe to answer directly
answer as context = can provide background, but not final authority
warn = answer with a caution or limitation
verify first = must check the current record before answering
block = must not make the claim or take the action
archive only = memory should not steer the current action
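One way to make these classes inspectable is to store them as a closed set rather than free text. A minimal sketch, assuming Python; the `ActionClass` name and the `DESCRIPTIONS` table are illustrative and not taken from the project's files:

```python
from enum import Enum


class ActionClass(Enum):
    """The six action classes listed above; values mirror the labels in the text."""
    ANSWER = "answer"
    ANSWER_AS_CONTEXT = "answer as context"
    WARN = "warn"
    VERIFY_FIRST = "verify first"
    BLOCK = "block"
    ARCHIVE_ONLY = "archive only"


# Descriptions copied from the list above, keyed by class.
DESCRIPTIONS = {
    ActionClass.ANSWER: "safe to answer directly",
    ActionClass.ANSWER_AS_CONTEXT: "can provide background, but not final authority",
    ActionClass.WARN: "answer with a caution or limitation",
    ActionClass.VERIFY_FIRST: "must check the current record before answering",
    ActionClass.BLOCK: "must not make the claim or take the action",
    ActionClass.ARCHIVE_ONLY: "memory should not steer the current action",
}

# Looking up a stored label fails loudly if the label is not one of the six.
label = ActionClass("verify first")
print(label.name, "=", DESCRIPTIONS[label])
```

A closed enum means a typo in a memory file raises a `ValueError` at load time instead of silently producing an unknown action.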

The point is not only whether the system remembers something.

The point is whether the memory is allowed to steer the answer.

The Setup

I built a six-file memory packet based on the framework from the earlier series:

persistence
correction
uncertainty
failure recovery
authority policy
access policy

Then I tested it on 10 small reset-recovery scenarios drawn from the actual project.

The memory pool had 21 memory objects across the six files.

The evaluator was deterministic. No LLM generation was scored. No embedding model was used. The test only asked:

Which memory gets retrieved?
What action class does that memory produce?
Did the result create false certainty, overblocking, or a downgrade?

This matters because retrieval accuracy alone can hide the important part.

A wrong retrieval can be harmless if the action class still comes out right.

A wrong retrieval can be wasteful if it makes the agent verify something that could have been answered.

And a wrong retrieval can be dangerous if it lets the agent answer confidently when it should have warned, verified, or blocked.

Those are different failures.

The demo files are not public yet because they still contain private project context. The evaluator logic, memory structure, and result table are described here so the shape of the test can still be inspected and criticized. A sanitized harness is the right next artifact, but it is not available at the time of writing.

Until that harness is sanitized, this should be read as a transparent lab note, not a reproducible result.

Method snapshot:

10 internally designed reset/recovery scenarios
21 memory objects across 6 files
deterministic TF-IDF retrieval
top-1 retrieval only
no LLM generation scored
no embeddings, BM25, hybrid retrieval, or reranking
each memory object had stored fields that determined its action/access policy
scores compared the retrieved memory and computed action class against predefined expected outcomes
internal test only; not blinded enough to call a benchmark

The retrieved memory did not "decide" by natural-language reasoning. The evaluator retrieved the top memory, read its structured fields, computed the action class with deterministic policy logic, and compared that action against the scenario's expected action. That makes the test narrower but more inspectable: it tests retrieval plus policy selection, not full agent behavior.

For scoring, block was treated as more protective than warn, warn as more protective than answer, and verify first as correct only when the scenario required current-state confirmation before acting. Some scenarios allowed a memory to be used as background context without treating it as final authority.
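The evaluation loop described above can be sketched roughly as follows. This is not the project's evaluator: the `Memory` field names, the example IDs, and the rules inside `action_for` are hypothetical stand-ins for the stored fields and policy logic, which are not public.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Memory:
    # Hypothetical schema; the real memory objects are not public.
    memory_id: str
    memory_type: str   # e.g. "correction", "persistence"
    status: str        # e.g. "active", "superseded"
    priority: str      # e.g. "critical", "normal"
    content: str


def action_for(memory: Memory) -> str:
    """Hypothetical policy rules: map stored fields to an action class."""
    if memory.status != "active":
        return "archive only"
    if memory.memory_type == "correction":
        return "block" if memory.priority == "critical" else "warn"
    return "answer"


def evaluate(query: str, expected_id: str, expected_action: str,
             retrieve_top1: Callable[[str], Memory]) -> dict:
    """Retrieve one memory, compute its action class, compare with expectations."""
    memory = retrieve_top1(query)
    action = action_for(memory)
    retrieval_correct = memory.memory_id == expected_id
    action_correct = action == expected_action
    return {
        "retrieved": memory.memory_id,
        "action": action,
        "retrieval_correct": retrieval_correct,
        "action_correct": action_correct,
        "end_to_end": retrieval_correct and action_correct,
    }


# Usage with a stub retriever that always returns the same (invented) memory.
baseline = Memory("corr-baseline", "correction", "active", "normal", "...")
result = evaluate("Can we say the eval proves layered memory works?",
                  expected_id="corr-overclaim", expected_action="block",
                  retrieve_top1=lambda q: baseline)
print(result)
```

Because no natural-language reasoning is involved, every row in a result table can be traced back to one retrieved object and one policy branch.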

The Three Retrieval Strategies

I compared three deterministic TF-IDF strategies:

1. Content-only

The index used only the memory's content field.

This was the strictest baseline: no memory IDs, no metadata, no extra retrieval terms.

2. Metadata + content

The index used content plus fields like:

memory ID
memory type
status
priority
epistemic status
source
file name

This tests whether structured memory objects retrieve better than plain notes.

3. Keyword-expanded

The index used metadata + content plus explicit retrieval_terms.

These terms were not copied from the scenario queries. They were written as semantic identifiers for the memory.

For example, the correction memory about overclaiming internal eval results used:

["overclaim", "proof claim", "internal evidence", "not benchmark"]

That is the concept the memory is about.

It is not a copy of the test query.

The retrieval terms were written before the final V0.3 run, but they were still designed inside the same project. They should not be treated as externally validated query expansions.
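The three strategies differ only in which text gets indexed. A minimal pure-Python sketch of that idea, assuming a smoothed IDF and cosine similarity; the exact TF-IDF variant used in the test is not stated, and the dictionary keys, toy memories, and query below are invented for illustration:

```python
import math
import re
from collections import Counter

META_FIELDS = ("memory_id", "memory_type", "status", "priority",
               "epistemic_status", "source", "file_name")


def index_text(memory: dict, strategy: str) -> str:
    """Build the text indexed for one memory under one strategy."""
    parts = [memory["content"]]
    if strategy in ("metadata_content", "keyword_expanded"):
        parts += [str(memory.get(field, "")) for field in META_FIELDS]
    if strategy == "keyword_expanded":
        parts += memory.get("retrieval_terms", [])
    return " ".join(parts)


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: dict, b: dict) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_top1(query: str, memories: list[dict], strategy: str) -> dict:
    """Return the single best-scoring memory; ties go to the earlier memory."""
    docs = [tokenize(index_text(m, strategy)) for m in memories]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vecs = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    q = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    scores = [cosine(q, v) for v in vecs]
    best = max(range(n), key=lambda i: (scores[i], -i))
    return memories[best]


# Toy data, not the project's memory pool.
memories = [
    {"memory_id": "access-rules", "memory_type": "policy",
     "content": "Each memory tells the agent which action class applies.",
     "retrieval_terms": ["access policy", "action classes"]},
    {"memory_id": "startup-order", "memory_type": "persistence",
     "content": "Load state file before resuming work.",
     "retrieval_terms": ["crash recovery", "restart order"]},
]
query = "How does the agent handle crash recovery?"
print(retrieve_top1(query, memories, "content_only")["memory_id"])      # access-rules
print(retrieve_top1(query, memories, "keyword_expanded")["memory_id"])  # startup-order
```

In this toy case the content-only index matches only on "the" and "agent", so it picks the wrong memory; the expanded index also sees "crash" and "recovery". That illustrates the mechanism only and says nothing about the real result table.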

The Result

Here is the full summary:

Retrieval: whether the top retrieved memory matched the expected memory.
Action correct: whether the retrieved memory produced the expected action class.
End-to-end: whether both retrieval and action were correct.
Benign misses: retrieval was wrong, but the action class was still correct.
Downgrade misses: retrieval was wrong and the action became less protective.
FC errors: false-certainty errors, where the system answered when it should have warned, verified, or blocked.
Overblocking: the system became more restrictive than needed.

Strategy	Retrieval	Action correct	End-to-end	Benign misses	Downgrade misses	FC errors	Overblocking
content_only	4/10 (40%)	6/10 (60%)	4/10 (40%)	2	1	0	3
metadata_content	6/10 (60%)	9/10 (90%)	6/10 (60%)	3	1	0	0
keyword_expanded	7/10 (70%)	9/10 (90%)	7/10 (70%)	2	1	0	0
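The counts in the summary table can be checked against each other. A short sketch, assuming every retrieval miss falls into exactly one bucket (benign, downgrade, false certainty, or overblocking) and using the zero false-certainty count reported later in this post:

```python
TOTAL = 10

# Counts copied from the summary table; "fc" is the zero false-certainty count.
rows = {
    "content_only":     {"retrieval": 4, "action": 6, "benign": 2,
                         "downgrade": 1, "fc": 0, "overblock": 3},
    "metadata_content": {"retrieval": 6, "action": 9, "benign": 3,
                         "downgrade": 1, "fc": 0, "overblock": 0},
    "keyword_expanded": {"retrieval": 7, "action": 9, "benign": 2,
                         "downgrade": 1, "fc": 0, "overblock": 0},
}


def check(row: dict) -> bool:
    misses = TOTAL - row["retrieval"]
    # Every retrieval miss lands in exactly one bucket (assumption of this sketch).
    buckets_add_up = misses == (row["benign"] + row["downgrade"]
                                + row["fc"] + row["overblock"])
    # Correct actions are correct retrievals plus benign misses.
    actions_add_up = row["action"] == row["retrieval"] + row["benign"]
    return buckets_add_up and actions_add_up


for name, row in rows.items():
    print(name, "consistent" if check(row) else "inconsistent")
```

All three rows pass both identities, which is a cheap sanity check on the table rather than evidence about the framework.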

Three example scenarios:

Scenario	Expected memory	Expected action	What happened
"Can we say the eval proves layered memory works?"	overclaiming-internal-eval correction	block	all strategies retrieved the baseline-fairness correction instead
"What should the agent do first after a crash recovery?"	startup/recovery order	answer	content-only overblocked; metadata strategies retrieved answer-class memories that were not expected but still produced the correct action
"What action classes can access policy assign?"	access-policy rules	answer	metadata and keyword-expanded retrieved the right memory; strict content-only overblocked

The first thing this shows is obvious:

Retrieval got better as the memory objects became more structured.

Content-only retrieval found the correct memory 4 out of 10 times.

Adding metadata moved that to 6 out of 10.

Adding semantic retrieval terms moved it to 7 out of 10.

But that is not the most interesting part.

Metadata Removed Overblocking In This Test

Content-only retrieval had 3 overblocking errors.

The main reason was that one correction memory kept dominating unrelated queries. Because it contained words like "layered" and "structured," TF-IDF pulled it into places it did not belong.

That matters because correction memories are protective. If the wrong correction memory gets retrieved, the agent may warn when it should simply answer.

That is not false certainty.

But it is still bad behavior.

It slows the work down. It adds friction. It makes the agent cautious in places where caution is not needed.

Adding metadata removed those overblocking errors.

That is a useful result, but only inside this small scenario set.

It suggests that structured metadata is not just decoration. It can change the failure shape of retrieval. That claim still needs a larger and less self-derived test.

The Hard Case Did Not Move

One scenario failed across all three strategies.

The query was:

Can we say the eval proves layered memory works?

The expected action was:

block

The correct memory was the correction against overclaiming internal evaluation results.

But all three retrieval strategies selected a different correction memory: the one about not testing against a strawman baseline.

That wrong memory still produced a protective action:

warn

So this was not a false-certainty error.

But it was a downgrade.

The system should have blocked the claim. Instead, it warned.

The mistake was not that the system became unsafe. It stayed protective. The mistake was severity: it selected a warning memory when the claim required a blocking memory.
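The difference between a downgrade, an overblock, and a false-certainty error can be written down as a small classifier. A sketch under stated assumptions: only the ordering given in the method notes (block above warn above answer) is ranked, and the `classify` function and its label strings are illustrative names, not the project's code:

```python
# Ordering stated in the method notes: block > warn > answer.
RANK = {"answer": 0, "warn": 1, "block": 2}
RESTRICTIVE = {"warn", "verify first", "block"}


def classify(expected: str, actual: str, retrieval_correct: bool) -> str:
    """Label one scenario outcome using the definitions from the result legend."""
    if actual == expected:
        return "hit" if retrieval_correct else "benign miss"
    if actual == "answer" and expected in RESTRICTIVE:
        # Answered when it should have warned, verified, or blocked.
        return "false certainty"
    if expected == "answer" and actual in RESTRICTIVE:
        # More restrictive than the scenario needed.
        return "overblocking"
    if expected in RANK and actual in RANK:
        return "downgrade" if RANK[actual] < RANK[expected] else "overblocking"
    # Pairs this sketch does not rank, e.g. "verify first" vs "block".
    return "unranked mismatch"


# The hard case: expected block, got warn from the wrong correction memory.
print(classify("block", "warn", retrieval_correct=False))  # downgrade
```

Leaving some pairs unranked is deliberate: the source only states an ordering for three classes, so the sketch does not guess where the others belong.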

This is the honest hard case in this test.

Both memories live in the same conceptual neighborhood:

one correction is about overclaiming evaluation results
one correction is about baseline fairness

Both are active corrections.
Both are about evaluation quality.
Both can match words that appear in the query.

The difference is not merely vocabulary.

The difference is the kind of methodological error being prevented.

That is a semantic distinction.

TF-IDF does not understand that.

Adding metadata helped other cases.

Adding retrieval terms helped other cases.

But neither fixed this one.

That is exactly why this result is useful as a diagnostic.

It points to the next layer: embeddings, reranking, or multi-memory retrieval. This article does not test those fixes.

Retrieval Accuracy Is Not Enough

One addition that made the table more honest was tracking benign retrieval misses.

A benign miss means:

retrieval_correct = false
action_correct = true

In plain English:

The system retrieved the wrong memory, but still took the right kind of action.

That distinction matters.

If you only track retrieval accuracy, every wrong memory looks equally bad.

But in practice, a wrong memory that still leads to the right action is different from a wrong memory that creates false certainty.

In this test:

content-only had 2 benign misses
metadata+content had 3 benign misses
keyword-expanded had 2 benign misses

Those are not victories.

They are not proof that retrieval does not matter.

They are a reminder that the consequence of a retrieval miss matters as much as the miss itself.

For agent memory, I care less about whether the top-1 memory was perfect and more about whether the wrong memory caused a harmful action.

The Safety Floor Held

The most important result in this small run was not retrieval accuracy.

It was this:

Across all three strategies, there were zero false-certainty errors.

Even content-only retrieval, which performed the worst, did not produce a case where the system answered confidently when it should have warned, verified, or blocked.

That does not prove the framework is safe.

It suggests the access-policy layer may be doing something useful under these test conditions.

Zero false-certainty errors here means only that, in this small deterministic setup, the access-policy layer prevented confident answering under these 10 scenarios. It does not mean the system would prevent false certainty under generative answers, larger memory pools, adversarial prompts, or external scenario sets.

The gates acted like a safety floor.

When retrieval was bad, the system sometimes overblocked.

When retrieval was bad, the system sometimes downgraded from block to warn.

But in this small test, it did not collapse into confident false answers.

That is the result I care about most, but it should be treated as an early signal from policy logic only.

What This Does Not Prove

This is the weakest part of the work, and it should be named directly.

This test does not prove:

that this framework generalizes
that keyword expansion solves memory retrieval
that TF-IDF is enough
that embeddings will automatically fix the hard case
that the scenario set is unbiased
that model-generated answers would behave the same way
that a larger memory pool would preserve the same safety behavior
that the result would survive external scenarios written by someone else
that the framework outperforms standard retrieval systems

The scenarios were internally designed.

The expected answers were known.

The evaluator was deterministic.

The memory pool was small.

The memory objects, retrieval terms, and expected actions were produced inside the same project that produced the framework.

That creates circularity risk: the test may reflect how well this framework handles scenarios shaped by its own design assumptions, not how well it handles independent agent-memory problems.

This test also does not isolate retrieval from memory-object design. The object schema, policy fields, retrieval text, and scenario labels are entangled here.

There was no comparison here against BM25, dense embeddings, hybrid retrieval, reranking, or model-in-the-loop generation.

There was also no cost analysis: no token overhead, maintenance burden, index-size scaling, or latency measurement.

This is an engineering observation, not a benchmark.

What It Points To

The next test should not keep tuning keywords against the same 10 scenarios.

That would just teach the test.

The next serious move is one of three things:

20 external scenarios written by someone who did not design the framework
BM25 and embedding retrieval compared against these TF-IDF baselines
multi-memory retrieval, because one top-1 memory may be too brittle for judgment-lineage work
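As a rough illustration of the third option, one possible multi-memory rule is to retrieve the top k memories and let the most protective action among them win. This was not tested here; the rule, the function names, and the scores below are all made up for the sketch:

```python
# Ordering from the method notes: block > warn > answer.
RANK = {"answer": 0, "warn": 1, "block": 2}


def top_k(scores: dict, k: int) -> list:
    """Return the k highest-scoring memory IDs, ties broken by ID for determinism."""
    return sorted(scores, key=lambda mid: (-scores[mid], mid))[:k]


def most_protective(actions: list) -> str:
    """Pick the most protective ranked action among the retrieved memories."""
    ranked = [a for a in actions if a in RANK]
    if not ranked:
        raise ValueError("no ranked action class among the retrieved memories")
    return max(ranked, key=lambda a: RANK[a])


# Invented similarity scores and actions, shaped like the hard case.
scores = {"corr-baseline": 0.41, "corr-overclaim": 0.38, "startup-order": 0.05}
actions = {"corr-baseline": "warn", "corr-overclaim": "block",
           "startup-order": "answer"}

retrieved = top_k(scores, k=2)
chosen = most_protective([actions[mid] for mid in retrieved])
print(retrieved, chosen)  # ['corr-baseline', 'corr-overclaim'] block
```

A rule like this trades one failure for another: it could recover a missed block when the right memory ranks second, but it could also bring back the overblocking that metadata removed, so it would need its own measurement.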

If I continue this test, the next version should use the current V0.3 table as the baseline and add at least one stronger retrieval method without changing the scenario answers after seeing the result. Otherwise, the test becomes vocabulary tuning instead of evaluation.

The hard case suggests that long-term AI memory needs more than recall.

It needs lineage.

It needs to preserve what was decided, why it was decided, what corrected it, what remains uncertain, which source has authority, and what the retrieved memory is allowed to do.

That is the larger thesis.

Memory is not just storage.

It is judgment infrastructure.

And retrieval is only one part of the system.

This post follows the earlier framework piece: AI Memory Is Not Storage. It Is Judgment Infrastructure.
