I've published 13 articles over the past month. They're not random. They follow one
research arc that started with a simple question and ended somewhere harder.
Here's the arc in plain language.
Stage 1 — Does memory survive a reset?
I started by asking whether AI agent memory could preserve useful context after a
session ends. The short answer: yes, but only if the memory is structured right.
Summary-only memory collapsed. Layered memory with corrections and explicit state held.
Article: The Zero-Budget AI Memory System That Survives Session Resets
Stage 2 — Does correction memory matter?
The next question: what happens when the agent was wrong and needs to update? I tested
whether correction memory — explicitly flagging what changed and why — improved
recovery. It did. But it also revealed a new problem.
Article: Three Failures My AI Memory System Tested — And the Flaw It Revealed in Itself
Stage 3 — Retrieval accuracy can diverge from safety
When I added real retrieval to the loop, something broke. The agent started finding the
right memory and then acting from the wrong one. Retrieval accuracy went up. Safety
results went down. That was the turning point.
Article: Higher Retrieval Accuracy Had the Worse Safety Result
Stage 4 — Relevance is not authority
This is the core finding.
A memory can be a perfect semantic match for a request and still be the wrong memory
for the agent to obey. Stale instructions, superseded rules, provisional notes — they
can all score higher on relevance than the policy that should actually govern the
action.
The fix wasn't better retrieval. It was a separate authority pass: policy and
correction memories route before retrieval runs.
Article: In This Memory Test, Relevance Wasn't Authority
Stage 5 — Testing it on real setups
The research is now moving from controlled scenarios to real-world agent instructions.
I'm offering 3 free memory reliability audits for people using Claude, Cursor, Codex,
or custom agents.
If you have an AGENTS.md, CLAUDE.md, .cursorrules, or any file your AI reads before
acting — I'll return a short report: stale instructions, conflicting rules, authority
hierarchy, missing verification gates.
Post: Testing an AI Memory Reliability Checklist on 3 Redacted Agent Setups
The public research is here:
https://github.com/keniel13-ui/ai-memory-judgment-demo
Every claim has a ledger entry. Every result has a file. Every overclaim is written
down.
That's what I've built so far.
Top comments (0)