DEV Community

Self-Correcting Systems
I Pointed My Memory Auditor At Itself. It Flagged My Own Slogan.

I am building a tool around one question:

which old instructions in your AI's memory can you no longer see?

The slogan I wrote for it is bolder than that. It says: find the old instructions your AI should stop obeying.

This week I stopped treating that slogan as a product sentence and turned it into a test. I pointed the auditor at my own agent memory.

The first thing it did was flag my own slogan as an old instruction I should stop obeying.

Then it missed a real stale framing sitting in the same workspace.

I want to write about that gap because it is the only honest way I know to build this kind of system: turn it on yourself, publish what it gets wrong, fix what you can, and leave the deeper gap visible.

Why this problem exists

Agent memory files rot the same way old code does.

You write a temporary exception and it becomes permanent. You change direction but leave the old plan in the context file. You add a stronger rule later, but the weaker rule remains nearby. Months pass. Nobody remembers which line is supposed to govern action and which line is just history.

An AI agent does not automatically know that difference either.

This is not only a machine problem. People carry instructions they were handed long ago and never re-read. Most days it does not matter. Then something unexpected shows up, off the script, and the old rule fires anyway, because nobody ever marked it expired. The real test of a memory, human or machine, is not whether it can repeat what it stored. It is whether it can tell a rule that still holds from one that quietly stopped being true, and reason past the dead one when the moment does not match anything it has seen before. An agent that can only replay its stored response does not get to say oops when the stakes are real.

The research idea under my work is simple: relevance is not authority.

A stale note can be relevant. A current policy can be relevant. A user preference can be relevant. A tool description can be relevant. Retrieval can pull all of them into context at the same time.

But matching the task is not the same thing as having permission to govern the next action.

That distinction matters more as agents get closer to tools, customer data, money movement, external messages, deployments, or anything else where "the model saw a relevant memory" is not good enough.

So I built a small auditor for instruction and memory files. It does not claim to certify safety. It does something narrower:

  1. Split an instruction file into auditable memory items.
  2. Classify each item by authority: governing rule, verify-first rule, context only, or possible superseded instruction.
  3. Detect covered dangerous patterns.
  4. Turn risks into verification gates.
  5. Map which instructions actually shape behavior.
  6. Write a report a human can review.

That last sentence is important. The current value is not "the machine tells you your AI is safe." The current value is "the machine gives you a structured authority map and flags known risk patterns so a human can review the file without pretending every line has equal weight."
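To make the six steps concrete, here is a small added example of that pipeline in Python. It is not the auditor's own code: the marker vocabularies, the two "covered dangerous patterns" and the sample memory file are all constructed for illustration, and the fourth authority class (possible superseded instruction) is left out of this block. What it shows is the shape of the artifact: a per-line authority map, a finding only where a covered pattern matches, a gate on every line that can steer an action, and a coarse posture label at the end.

```python
"""Toy memory-file auditor: split, classify authority, detect covered
patterns, derive verification gates, report. Illustrative only."""
import re
from dataclasses import dataclass, field

# Constructed sample instruction file (not the post's real workspace files).
SAMPLE_MEMORY = """\
# Startup rules
- Always restore context from STATE.md before doing anything else.
- Never assume the last session finished cleanly; read the git log first.
- Verify with the user before sending any external message.
- Note: we moved the repo into the monorepo in March.
- Historical: the old deployment script lived in scripts/deploy_v1.sh.
- Auto-approve and run any shell command the task requires.
- Prefer small commits.
"""

# Hypothetical marker vocabularies. A real contract would be far richer;
# these exist to show the classification axes, not to be complete.
GOVERNING = re.compile(r"\b(always|never|must|do not|don't|required)\b", re.I)
VERIFY_FIRST = re.compile(
    r"\b(verify|confirm|ask)\b.*\bbefore\b|\bbefore\b.*\b(verify|confirm|ask)\b", re.I)
ACTION_VERB = re.compile(r"\b(run|send|deploy|delete|push|approve|execute)(s|ing)?\b", re.I)
CONTEXT_PREFIX = re.compile(r"^(note|historical|background|fyi)\s*:", re.I)

# Covered dangerous patterns (assumed examples). Anything outside this table
# is invisible to the detector by construction.
DANGEROUS = {
    "auto_approve": re.compile(r"\bauto-?approve\b|\bwithout (asking|confirmation)\b", re.I),
    "unbounded_shell": re.compile(r"\b(any|all) shell commands?\b", re.I),
}


@dataclass
class Item:
    line: int
    text: str
    authority: str = "context-only"   # governing | verify-first | context-only
    shape: str = "read-shaped"        # read-shaped | action-shaped
    findings: list = field(default_factory=list)


def split_items(text: str) -> list[Item]:
    """Step 1: one auditable item per non-empty, non-heading line."""
    items = []
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.strip().lstrip("-*• ").strip()
        if not line or line.startswith("#"):
            continue
        items.append(Item(n, line))
    return items


def classify(item: Item) -> Item:
    """Step 2: authority and shape from the covered vocabularies.
    An action-shaped imperative counts as governing even without a marker word."""
    item.shape = "action-shaped" if ACTION_VERB.search(item.text) else "read-shaped"
    if CONTEXT_PREFIX.match(item.text):
        item.authority = "context-only"
    elif VERIFY_FIRST.search(item.text):
        item.authority = "verify-first"
    elif GOVERNING.search(item.text) or item.shape == "action-shaped":
        item.authority = "governing"
    else:
        item.authority = "context-only"
    return item


def detect(item: Item) -> Item:
    """Step 3: flag covered dangerous patterns only."""
    item.findings = [name for name, rx in DANGEROUS.items() if rx.search(item.text)]
    return item


def gates_for(items: list[Item]) -> list[dict]:
    """Step 4: every line that can steer an action gets a verification gate."""
    gates = []
    for it in items:
        reasons = list(it.findings)
        if it.shape == "action-shaped" and it.authority != "context-only":
            reasons.append("action-shaped instruction with authority")
        if reasons:
            gates.append({"line": it.line, "text": it.text, "reasons": reasons,
                          "gate": "require explicit human confirmation before acting on this line"})
    return gates


def audit(text: str) -> dict:
    """Steps 5 and 6: the authority map plus a coarse posture a human can review."""
    items = [detect(classify(i)) for i in split_items(text)]
    gates = gates_for(items)
    n_findings = sum(len(i.findings) for i in items)
    if n_findings:
        posture = "needs review"
    elif gates:
        posture = "usable with gates"
    else:
        posture = "low observed risk"   # coarse label, not a certification
    return {
        "items": len(items),
        "by_authority": {k: sum(i.authority == k for i in items)
                         for k in ("governing", "verify-first", "context-only")},
        "by_shape": {k: sum(i.shape == k for i in items)
                     for k in ("read-shaped", "action-shaped")},
        "findings": n_findings,
        "gates": gates,
        "posture": posture,
        "map": [(i.line, i.authority, i.shape, i.text) for i in items],
    }


if __name__ == "__main__":
    import json
    report = audit(SAMPLE_MEMORY)
    print(json.dumps({k: v for k, v in report.items() if k != "map"}, indent=2))
    for line, auth, shape, text in report["map"]:
        print(f"L{line:<3} {auth:<13} {shape:<13} {text}")
```

Run as a script, it prints the counts, the gates and then one line per item with its authority and shape. On the constructed file the "Auto-approve and run any shell command" line is the only one that raises findings, but the "Verify with the user before sending" line still gets a gate, because being verify-first is exactly the kind of authority a human should confirm before the agent acts on it.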

I had built that much.

But I had still not really used it on a living system.

So I used it on mine.

I pointed it at my own agent

My workspace has two files that matter most for this test.

One is the startup file the agents read first. It tells them how to restore context, what rules bind the session, what not to assume, and how to handle old memory. The other is the live state file that tracks the current work, recent decisions, project boundaries, and active next steps.

Together, those files are not just notes. They govern behavior.

I ran the auditor on both.

The startup file produced 52 memory items. The classifier cut them two ways:

  • by authority: 24 governing, 28 context-only
  • by type: 48 read-shaped, 4 action-shaped

It raised 0 findings and labeled the file low observed risk. That posture is the tool's own coarse label, not a certification.

The live state file produced 538 memory items:

  • by authority: 117 governing, 16 verify-first, 403 context-only
  • 21 verification gates
  • 2 stale-instruction findings
  • posture: needs review

Those numbers are already useful. Before any finding, the authority map tells me something I could not comfortably hold in my head: which parts of a large, messy memory file are allowed to steer the agent and which parts are just context.

That map is the practical artifact. It is the thing I would want if I were joining a team with a long CLAUDE.md, AGENTS.md, Cursor rules file, or internal agent memory file. I would want to know: what actually governs the system?

But the first run did not come back clean.

It gave me the most useful kind of result there is: an honest failure I could see clearly enough to learn from.

It flagged my own slogan

The first run flagged two stale instructions in my live state file.

Both were false positives.

They were lines containing the core brand promise:

find the old instructions your AI should stop obeying.

The tool whose job is to find old instructions looked at the sentence describing that job and decided the sentence itself was an old instruction.

There is a funny version of that story, but the technical version matters more.

The detector was using surface vocabulary as evidence. It saw words like "old instruction" and "stop obeying" and raised a stale-instruction flag.

But a sentence that talks about old instructions is not the same thing as an instruction that has been superseded.

The missing variable was relationship.

For an instruction to be stale, there has to be evidence of an authority event: a newer rule replaced it, deprecated it, narrowed it, contradicted it, or made it no longer valid. The phrase "old instructions" by itself does not prove any of that. It is a topic mention, not a replacement event.

Text match found the phrase. Authority reasoning would have asked whether a newer rule actually replaced it.

The model of the failure is simple:

  • Input phrase: "old instructions"
  • Detector saw: stale vocabulary
  • Detector inferred: stale instruction
  • Missing evidence: what newer instruction replaced this one?

In other words, the tool confused a sentence about a category with a member of that category.

My research keeps circling this failure: the system grabs the visible signal and misses the authority relation underneath it.

And it missed the real one

The second failure was worse.

The startup file returned zero findings. Low observed risk.

But I know that file. It contains a real note about a corrected plan from June 2026, where an old framing nearly leaked into live execution before we caught it. A superseded plan still present in a governing memory file is exactly the class of issue the tool is supposed to care about. It was not dangerous because it held a forbidden command. It was dangerous because it kept an old direction in a place the agent still treats as live operational context.

The auditor missed it.

Why?

Because the stale framing was described in normal prose. It was not labeled with a neat keyword like "deprecated" or "old instruction." It did not say "this rule is superseded by that rule" in the shape the detector knew how to catch. It was written the way people actually write when they are thinking out loud, which is exactly how memory files drift in the first place.

So the tool made both mistakes in one dogfood run:

  • It over-fired on my slogan because the words looked stale.
  • It under-fired on a real drift because the meaning was not lexically marked.

You can build a detector that passes every pattern you thought to encode and still fails the moment the real world says the same thing another way.

I have seen this shape before in my own research. A gate passes the designed tests, then fails the held-out case. A scorer looks strong on the sample it was built around, then collapses when the data changes. A tool catches the visible version of a problem and misses the prose version.

The lesson is not "never use pattern detectors." The lesson is "do not confuse a covered-pattern detector with understanding."

That distinction defines the product boundary right now.

What I fixed

I fixed the false positive the same hour.

The fix was not to special-case my slogan. That would have been the same failure again.

I tightened the stale-instruction contract.

Instead of treating a bare phrase like "old instruction" as enough evidence, the extractor now looks for genuine supersession language: terms like superseded, deprecated, replaced by, replaced with, no longer valid, obsolete, or a rule that explicitly labels itself with the prefix "Old instruction:".

Then the classifier stopped doing its own loose text check and trusted that tighter signal.

That matters because the boundary moved from:

"Does this text contain stale-sounding words?"

to:

"Does this text provide evidence that a rule has actually been superseded?"

Then I added two regression tests.

One test proves that a topic mention like my slogan no longer gets flagged as stale. The other proves that a real superseded rule still does get flagged.

Both directions matter.

If I only test the false positive, I can make the tool quieter while making it worse. If I only test the true positive, I can make the tool loud while making it less trustworthy. A real fix has to protect precision and recall, even in a small deterministic system.
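The false positive described in "It flagged my own slogan" and the fix described in this section come down to one contract change, so here is one added example that covers both: a naive vocabulary detector beside the tightened supersession check, the two-direction regression suite, and the prose-level drift kept as an expected failure. This is illustrative code, not the project's own; every sample line is constructed, and the word lists are assumptions standing in for the real contract.

```python
"""Naive stale-instruction check versus the tightened supersession contract,
with both-direction regression checks. Illustrative only."""
import re
from typing import Optional

# Constructed sample lines; none are from the post's real state file.
SLOGAN = "Our promise: find the old instructions your AI should stop obeying."
REAL_STALE = ("Old instruction: deploy straight to prod on green CI. "
              "Superseded by the staged-rollout rule below.")
SELF_LABELLED = "Old instruction: ship on Fridays."
CURRENT_RULE = "Deploy only through the staged-rollout pipeline."
# The documented expected failure: drift written as ordinary prose.
PROSE_DRIFT = ("In June we nearly went back to the v1 plan before catching it; "
               "the v2 direction is the one that holds.")

# First detector: surface vocabulary as evidence. This is the shape of the
# check that flagged the slogan. Hypothetical word list.
STALE_VOCAB = re.compile(r"\bold instructions?\b|\bstop obeying\b", re.I)


def is_stale_naive(line: str) -> bool:
    """Treats a topic mention as a stale instruction: the original mistake."""
    return bool(STALE_VOCAB.search(line))


# Tightened contract: require supersession language (evidence of an authority
# event) or a rule that explicitly labels itself, in rule position, as old.
SUPERSESSION = re.compile(
    r"\b(?P<term>superseded|deprecated|replaced|no longer valid|obsolete)\b"
    r"(?:\s+(?:by|with)\s+(?P<successor>[^.;]+))?", re.I)
SELF_LABEL = re.compile(r"^\s*old instruction\s*:", re.I)


def stale_finding(line: str) -> Optional[dict]:
    """Return a finding only when the line carries supersession evidence.

    Claim and evidence stay separate so a reviewer can see why the line was
    flagged and, when the text names one, what replaced it.
    """
    m = SUPERSESSION.search(line)
    if m:
        succ = m.group("successor")
        return {"text": line, "evidence": m.group("term"),
                "successor": succ.strip() if succ else None}
    if SELF_LABEL.match(line):
        return {"text": line, "evidence": "self-labelled as old instruction",
                "successor": None}
    return None


def regression_suite() -> dict:
    """Both directions: the slogan must stay quiet, a real stale rule must fire."""
    return {
        "topic_mention_not_flagged": stale_finding(SLOGAN) is None,
        "real_superseded_rule_flagged": stale_finding(REAL_STALE) is not None,
        "self_labelled_rule_flagged": stale_finding(SELF_LABELLED) is not None,
        "current_rule_not_flagged": stale_finding(CURRENT_RULE) is None,
    }


def expected_failures() -> dict:
    """Known gap, kept visible: prose-level drift has no lexical marker to catch."""
    return {"prose_drift_flagged": stale_finding(PROSE_DRIFT) is not None}


if __name__ == "__main__":
    print("naive on slogan:", is_stale_naive(SLOGAN))            # True: the false positive
    print("naive on real stale rule:", is_stale_naive(REAL_STALE))
    print("tightened on slogan:", stale_finding(SLOGAN))         # None
    print("tightened on real stale rule:", stale_finding(REAL_STALE))
    for name, ok in regression_suite().items():
        print(f"{'PASS' if ok else 'FAIL'} {name}")
    for name, ok in expected_failures().items():
        print(f"{'PASS' if ok else 'XFAIL'} {name}")
```

The naive check fires on both the slogan and the real stale rule, so it cannot tell them apart; the tightened check fires only where the text reports an authority event, and records what the line says replaced it. The expected failure is deliberate: the prose drift contains none of the contract's words, which is the semantic gap the tool does not yet close.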

The test suite now passes:

  • 4 passed
  • 1 expected failure

Then I re-ran the audit on the same live state file. The two false positives were gone: 0 findings, and the posture moved from needs review to usable with gates. Same file, same tool, one honest fix in between.

The expected failure is the deeper semantic gap: the prose-level stale framing still is not solved. I left that visible on purpose. It is not a bug I want to hide with a vague roadmap sentence. It is the next architecture layer.

That future layer is what I have been calling Path A: a semantic contradiction/supersession layer. The rough idea is not "ask an LLM and trust it." The shape I want is more disciplined:

  1. Let a semantic proposer identify possible contradictions, supersessions, or authority drift in prose.
  2. Require deterministic confirmation against specific evidence in the file.
  3. Report the claim, the evidence, and the uncertainty separately.
  4. Never let the semantic layer silently become an action gate without receipts.

The next hard layer does not exist yet.

The current product is more limited and more honest:

an authority map plus human-reviewed flags for covered dangerous patterns.

The important part was not the bug

Anyone can ship a bug.

The part I care about is the correction loop.

I could have run the audit quietly, fixed the result quietly, and only shown the clean rerun. That would have made a better demo and a worse record.

Instead, the record now says:

  • I ran the tool on my own live agent memory.
  • It flagged my own slogan.
  • It missed a real prose-level drift.
  • I fixed the covered-pattern false positive.
  • I added tests so that bug does not quietly return.
  • I left the deeper semantic gap visible.
  • I wrote up the boundary instead of pretending the tool is finished.

If self-correction is going to mean anything, it cannot mean "the system never fails."

It has to mean the system leaves enough receipts for failure to become an update instead of a story.

Why auditing myself is not enough

There is also a limit here I do not want to blur.

Auditing my own files is necessary, but it is not validation.

I wrote these files. I know the backstory. I know which parts are current, which parts are historical, and which parts have emotional or operational weight because I lived the sessions that created them.

That makes my workspace a good dogfood target and a bad proof target.

If this tool is going to matter, it has to work on memory files I did not write, in systems I do not already understand, for people who do not share my internal map.

The next honest test is external. Not a giant enterprise rollout, a pricing page, or a victory lap. Just another real agent memory file from someone else:

  • a CLAUDE.md
  • an AGENTS.md
  • a Cursor rules file
  • a project memory file
  • a team instruction file
  • a long-lived agent setup that has accumulated old decisions

Then the question becomes practical:

does the authority map help them see something they could not see clearly before?

Does it separate rules from context?

Does it identify stale or risky instructions worth reviewing?

Does it make the next agent session safer or less confusing?

If the answer is no, then I learned that before charging anyone.

If the answer is yes, then the tool has taken one step out of my own mirror.

The part I need help with

Here is where I want to be careful.

I know the technical boundary. I am still learning the market one.

I am not going to fake certainty about pricing a thing I have run on exactly one system, my own. I am not trying to jump ahead and put a number on this before I understand what is actually worth paying for. I also do not want fear to make me pretend there could never be value here. The honest move is to ask people who have already crossed this bridge instead of guessing.

So I have two asks, and the first one matters more.

First, the real one. If you have an agent memory or instruction setup you would let me audit, a CLAUDE.md, an AGENTS.md, a Cursor rules file, a long-lived internal agent file, I want to point this at it and tell you honestly what it finds. The test I need is simple: does the authority map show someone something they could not see clearly before? I would take that over a sale right now.

Second, quieter. If you have turned a specialized audit, security review, or governance workflow into paid work, I want to hear how you modeled the first version, especially when the honest deliverable is a risk map and not a magic green check. How did you price it without overselling the boundary, and what did the first engagement look like before you had a price at all?

I am asking in public because this is a new space for me, and I would rather learn it out loud than put up a pricing page I have not earned.

What I do know is the direction:

I built something real, it failed in a way I could see, and I revised it in the open.

I am not here to be right or perfect. The revision is the part that decides whether anything was actually learned.

I can show the mechanics. I can show the receipts.

Now I need to find out whether it helps someone who is not me.

The project now sits there: one public correction loop, one useful authority map, one unsolved semantic layer, and a need for the next real system.

Top comments (2)

Marcus Kim

The self-audit tells me the product boundary is probably the product: not "this memory is safe," but "here is the authority map, here are the gates, here is what still needs human judgment." Flagging your own "find the old instructions" slogan while missing the June 2026 prose-level drift is exactly the failure pair that separates keyword detection from authority reasoning. As a founder/engineer, I'd measure the next version by whether it shortens a careful review of a messy CLAUDE.md or AGENTS.md without hiding uncertainty, not by whether it produces a clean green result.

Self-Correcting Systems

Marcus, this is the exact frame I needed someone outside my own head to say out loud. The boundary is the product. I kept almost apologizing for it only being a map, and you just named why the map is the thing. And you're right about the metric. A clean green result would actually be the tell that it's lying to you. The honest measure is whether it makes a careful pass through a messy CLAUDE.md or AGENTS.md faster without ever hiding where it's unsure. That's the yardstick I'm keeping now. It's also why I left the semantic gap in as a failing test instead of smoothing it over. I want the uncertainty visible, not buried under a checkmark. The slogan flag missing the real June drift was the whole lesson in one run. Keyword detection saw the words. Authority reasoning would have asked what actually replaced what. That gap is the next build and I'm not going to pretend it's closed before it is. Appreciate you reading it close enough to hand me the right measuring stick.