Self-Correcting Systems

Posted on Jul 1

I Pointed My Memory Auditor At Itself. It Flagged My Own Slogan.

#ai #agents #machinelearning #career

Distinguishing active instructions from history

I am building a tool around one question:

which old instructions in your AI's memory can you no longer see?

The slogan I wrote for it is bolder than that. It says: find the old instructions your AI should stop obeying.

This week I stopped treating that slogan as a product sentence and turned it into a test. I pointed the auditor at my own agent memory.

The first thing it did was flag my own slogan as an old instruction I should stop obeying.

Then it missed a real stale framing sitting in the same workspace.

I want to write about that gap because it is the only honest way I know to build this kind of system: turn it on yourself, publish what it gets wrong, fix what you can, and leave the deeper gap visible.

Why this problem exists

Agent memory files rot the same way old code does.

You write a temporary exception and it becomes permanent. You change direction but leave the old plan in the context file. You add a stronger rule later, but the weaker rule remains nearby. Months pass. Nobody remembers which line is supposed to govern action and which line is just history.

An AI agent does not automatically know that difference either.

This is not only a machine problem. People carry instructions they were handed long ago and never re-read. Most days it does not matter. Then something unexpected shows up, off the script, and the old rule fires anyway, because nobody ever marked it expired. The real test of a memory, human or machine, is not whether it can repeat what it stored. It is whether it can tell a rule that still holds from one that quietly stopped being true, and reason past the dead one when the moment does not match anything it has seen before. An agent that can only replay its stored response does not get to say oops when the stakes are real.

The research idea under my work is simple: relevance is not authority.

A stale note can be relevant. A current policy can be relevant. A user preference can be relevant. A tool description can be relevant. Retrieval can pull all of them into context at the same time.

But matching the task is not the same thing as having permission to govern the next action.

That distinction matters more as agents get closer to tools, customer data, money movement, external messages, deployments, or anything else where "the model saw a relevant memory" is not good enough.

So I built a small auditor for instruction and memory files. It does not claim to certify safety. It does something narrower:

Split an instruction file into auditable memory items.
Classify each item by authority: governing rule, verify-first rule, context only, or possible superseded instruction.
Detect covered dangerous patterns.
Turn risks into verification gates.
Map which instructions actually shape behavior.
Write a report a human can review.

That last step is important. The current value is not "the machine tells you your AI is safe." The current value is "the machine gives you a structured authority map and flags known risk patterns so a human can review the file without pretending every line has equal weight."
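To make the first two steps concrete, here is a minimal Python sketch of what an auditable memory item could look like. The names Authority, MemoryItem, and split_items are hypothetical, not the auditor's real API, and treating every non-empty line as one item is an assumption made only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Authority(Enum):
    """The four authority classes named in the list above."""
    GOVERNING = "governing"
    VERIFY_FIRST = "verify-first"
    CONTEXT_ONLY = "context-only"
    POSSIBLY_SUPERSEDED = "possibly-superseded"


@dataclass(frozen=True)
class MemoryItem:
    line_no: int  # 1-based line number in the source file
    text: str
    # Assumed default: an item has no authority until a classifier grants it.
    authority: Authority = Authority.CONTEXT_ONLY


def split_items(source: str) -> list[MemoryItem]:
    """Split a file into one item per non-empty line (assumed granularity)."""
    return [
        MemoryItem(line_no=i, text=line.strip())
        for i, line in enumerate(source.splitlines(), start=1)
        if line.strip()
    ]
```

Defaulting to context-only is a design choice in this sketch: a line has to earn the right to govern, rather than losing it.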

I had built that much.

But I had still not really used it on a living system.

So I used it on mine.

I pointed it at my own agent

My workspace has two files that matter most for this test.

One is the startup file the agents read first. It tells them how to restore context, what rules bind the session, what not to assume, and how to handle old memory. The other is the live state file that tracks the current work, recent decisions, project boundaries, and active next steps.

Together, those files are not just notes. They govern behavior.

I ran the auditor on both.

The startup file produced 52 memory items. The classifier cut them two ways:

by authority: 24 governing, 28 context-only
by type: 48 read-shaped, 4 action-shaped

It raised 0 findings and gave the file a posture of low observed risk. That posture is the tool's own coarse label, not a certification.

The live state file produced 538 memory items:

by authority: 117 governing, 16 verify-first, 403 context-only
21 verification gates
2 stale-instruction findings
posture: needs review

Those numbers are already useful. Before any finding, the authority map tells me something I could not comfortably hold in my head: which parts of a large, messy memory file are allowed to steer the agent and which parts are just context.

That map is the practical artifact. It is the thing I would want if I were joining a team with a long CLAUDE.md, AGENTS.md, Cursor rules file, or internal agent memory file. I would want to know: what actually governs the system?
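As a rough illustration of what an authority map is, the sketch below groups line numbers by authority label so a reviewer can read off which lines are allowed to steer the agent. The function name authority_map and the toy input are hypothetical; the real report format is not shown in this post.

```python
from collections import defaultdict


def authority_map(items):
    """Group (line_no, label) pairs into {label: [line_no, ...]}."""
    grouped = defaultdict(list)
    for line_no, label in items:
        grouped[label].append(line_no)
    return dict(grouped)


# Toy input, invented for illustration only.
toy_items = [
    (1, "governing"),
    (2, "context-only"),
    (3, "governing"),
    (4, "verify-first"),
]
amap = authority_map(toy_items)
print(amap["governing"])  # [1, 3]
```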

But the first run did not come back clean.

It gave me the most useful kind of result there is: an honest failure I could see clearly enough to learn from.

It flagged my own slogan

The first run flagged two stale instructions in my live state file.

Both were false positives.

They were lines containing the core brand promise:

find the old instructions your AI should stop obeying.

The tool whose job is to find old instructions looked at the sentence describing that job and decided the sentence itself was an old instruction.

There is a funny version of that story, but the technical version matters more.

The detector was using surface vocabulary as evidence. It saw words like "old instruction" and "stop obeying" and raised a stale-instruction flag.

But a sentence that talks about old instructions is not the same thing as an instruction that has been superseded.

The missing variable was relationship.

For an instruction to be stale, there has to be evidence of an authority event: a newer rule replaced it, deprecated it, narrowed it, contradicted it, or made it no longer valid. The phrase "old instructions" by itself does not prove any of that. It is a topic mention, not a replacement event.

Text match found the phrase. Authority reasoning would have asked whether a newer rule actually replaced it.

The model of the failure is simple:

Input phrase: "old instructions"
Detector saw: stale vocabulary
Detector inferred: stale instruction
Missing evidence: what newer instruction replaced this one?

In other words, the tool confused a sentence about a category with a member of that category.
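The failure is easy to reproduce in a few lines. This is a sketch of a surface-vocabulary detector, not the auditor's actual code: the word list in STALE_VOCAB is an assumption based on the two phrases named above.

```python
import re

# Assumed surface vocabulary; the real detector's word list is not shown here.
STALE_VOCAB = re.compile(r"old instructions?|stop obeying", re.IGNORECASE)


def naive_is_stale(text: str) -> bool:
    """Flag any text that merely contains stale-sounding words."""
    return STALE_VOCAB.search(text) is not None


slogan = "find the old instructions your AI should stop obeying."
print(naive_is_stale(slogan))  # True: a topic mention flagged as a stale rule
```

Nothing in this check asks which newer instruction replaced the flagged one, which is exactly the missing evidence.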

My research keeps circling this failure: the system grabs the visible signal and misses the authority relation underneath it.

And it missed the real one

The second failure was worse.

The startup file returned zero findings. Low observed risk.

But I know that file. It contains a real note about a corrected plan from June 2026, where an old framing nearly leaked into live execution before we caught it. A superseded plan still present in a governing memory file is exactly the class of issue the tool is supposed to care about. It was not dangerous because it held a forbidden command. It was dangerous because it kept an old direction in a place the agent still treats as live operational context.

The auditor missed it.

Why?

Because the stale framing was described in normal prose. It was not labeled with a neat keyword like "deprecated" or "old instruction." It did not say "this rule is superseded by that rule" in the shape the detector knew how to catch. It was written the way people actually write when they are thinking out loud, which is exactly how memory files drift in the first place.

So the tool made both mistakes in one dogfood run:

It over-fired on my slogan because the words looked stale.
It under-fired on a real drift because the meaning was not lexically marked.

You can build a detector that passes every pattern you thought to encode and still fails the moment the real world says the same thing another way.

I have seen this shape before in my own research. A gate passes the designed tests, then fails the held-out case. A scorer looks strong on the sample it was built around, then collapses when the data changes. A tool catches the visible version of a problem and misses the prose version.

The lesson is not "never use pattern detectors." The lesson is "do not confuse a covered-pattern detector with understanding."

That distinction defines the product boundary right now.

What I fixed

I fixed the false positive the same hour.

The fix was not to special-case my slogan. That would have been the same failure again.

I tightened the stale-instruction contract.

Instead of treating a bare phrase like "old instruction" as enough evidence, the extractor now looks for genuine supersession language: terms like superseded, deprecated, replaced by, replaced with, no longer valid, obsolete, or a rule that explicitly labels itself with an "Old instruction:" prefix.

Then the classifier stopped doing its own loose text check and trusted that tighter signal.
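A minimal sketch of that tightened contract, assuming a single regular expression over one memory item at a time. The term list comes from the paragraph above; the function name has_supersession_evidence and the exact regex are assumptions, not the shipped implementation.

```python
import re

# Explicit lifecycle markers, plus an "Old instruction:" label at the start of an item.
SUPERSESSION = re.compile(
    r"\b(?:superseded|deprecated|replaced (?:by|with)|no longer valid|obsolete)\b"
    r"|^\s*old instruction:",
    re.IGNORECASE,
)


def has_supersession_evidence(text: str) -> bool:
    """True only when the item itself states that a rule was replaced or retired."""
    return SUPERSESSION.search(text) is not None


print(has_supersession_evidence("find the old instructions your AI should stop obeying."))  # False
print(has_supersession_evidence("Old instruction: deploy from staging."))  # True
```

This is still a vocabulary check. It is stricter about which vocabulary counts as evidence, so it stays blind to drift written in ordinary prose.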

That matters because the boundary moved from:

"Does this text contain stale-sounding words?"

to:

"Does this text provide evidence that a rule has actually been superseded?"

Then I added two regression tests.

One test proves that a topic mention like my slogan no longer gets flagged as stale. The other proves that a real superseded rule still does get flagged.

Both directions matter.

If I only test the false positive, I can make the tool quieter while making it worse. If I only test the true positive, I can make the tool loud while making it less trustworthy. A real fix has to protect precision and recall, even in a small deterministic system.

The test suite now passes:

4 passed
1 expected failure

Then I re-ran the audit on the same live state file. The two false positives were gone: 0 findings, and the posture moved from needs review to usable with gates. Same file, same tool, one honest fix in between.

The expected failure is the deeper semantic gap: the prose-level stale framing still is not solved. I left that visible on purpose. It is not a bug I want to hide with a vague roadmap sentence. It is the next architecture layer.
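One way to keep a known gap visible in a test suite is an expected-failure test. The sketch below uses Python's standard unittest module with a stand-in detector; the stand-in, the test names, and the prose example are all hypothetical, and the real suite has more tests than this.

```python
import unittest


def has_supersession_evidence(text: str) -> bool:
    """Stand-in keyword detector: explicit lifecycle markers only."""
    markers = ("superseded", "deprecated", "replaced by", "no longer valid", "obsolete")
    lowered = text.lower()
    return any(marker in lowered for marker in markers)


class StaleInstructionTests(unittest.TestCase):
    def test_topic_mention_is_not_flagged(self):
        # Regression test for the false positive on the slogan.
        slogan = "find the old instructions your AI should stop obeying."
        self.assertFalse(has_supersession_evidence(slogan))

    def test_explicit_supersession_is_flagged(self):
        # Regression test in the other direction: a real marker must still fire.
        rule = "Deploy from staging. Superseded by the release-branch rule."
        self.assertTrue(has_supersession_evidence(rule))

    @unittest.expectedFailure
    def test_prose_level_drift_is_flagged(self):
        # Hypothetical prose drift: the plan changed, but no keyword says so.
        note = "We planned to lead with the weekly digest, then switched to the audit report."
        self.assertTrue(has_supersession_evidence(note))
```

Run with python -m unittest, this should report two passes and one expected failure. If the prose case ever starts passing, unittest reports an unexpected success, so the gap cannot close silently either.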

That future layer is what I have been calling Path A: a semantic contradiction/supersession layer. The rough idea is not "ask an LLM and trust it." The shape I want is more disciplined:

Let a semantic proposer identify possible contradictions, supersessions, or authority drift in prose.
Require deterministic confirmation against specific evidence in the file.
Report the claim, the evidence, and the uncertainty separately.
Never let the semantic layer silently become an action gate without receipts.
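The four constraints above can be sketched without building the proposer at all. In this hedged example the Proposal fields and the confirm function are hypothetical: the proposer is assumed to exist elsewhere, and the deterministic check here only verifies that the cited line exists and contains the quoted evidence. It does not check scope overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    target_line: int   # line the proposer claims is stale
    cited_line: int    # line the proposer claims replaces it
    cited_quote: str   # exact text the proposer says appears on cited_line
    relation: str      # e.g. "supersedes" or "contradicts"
    confidence: float  # the proposer's own uncertainty, reported but never trusted alone


def confirm(proposal: Proposal, lines: list[str]) -> dict:
    """Check the receipt deterministically: the cited line must exist and contain the quote."""
    in_range = (
        1 <= proposal.target_line <= len(lines)
        and 1 <= proposal.cited_line <= len(lines)
    )
    quote_found = (
        in_range
        and bool(proposal.cited_quote)
        and proposal.cited_quote in lines[proposal.cited_line - 1]
    )
    # Claim, evidence, and uncertainty are kept as separate fields.
    return {
        "claim": f"line {proposal.cited_line} {proposal.relation} line {proposal.target_line}",
        "evidence": proposal.cited_quote if quote_found else None,
        "uncertainty": proposal.confidence,
        "confirmed": bool(quote_found),
    }
```

A proposal whose quote cannot be found verbatim in the file comes back unconfirmed, however confident the proposer was.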

The next hard layer does not exist yet.

The current product is more limited and more honest:

an authority map plus human-reviewed flags for covered dangerous patterns.

The important part was not the bug

Anyone can ship a bug.

The part I care about is the correction loop.

I could have run the audit quietly, fixed the result quietly, and only shown the clean rerun. That would have made a better demo and a worse record.

Instead, the record now says:

I ran the tool on my own live agent memory.
It flagged my own slogan.
It missed a real prose-level drift.
I fixed the covered-pattern false positive.
I added tests so that bug does not quietly return.
I left the deeper semantic gap visible.
I wrote up the boundary instead of pretending the tool is finished.

If self-correction is going to mean anything, it cannot mean "the system never fails."

It has to mean the system leaves enough receipts for failure to become an update instead of a story.

Why auditing myself is not enough

There is also a limit here I do not want to blur.

Auditing my own files is necessary, but it is not validation.

I wrote these files. I know the backstory. I know which parts are current, which parts are historical, and which parts have emotional or operational weight because I lived the sessions that created them.

That makes my workspace a good dogfood target and a bad proof target.

If this tool is going to matter, it has to work on memory files I did not write, in systems I do not already understand, for people who do not share my internal map.

The next honest test is external. Not a giant enterprise rollout, a pricing page, or a victory lap. Just another real agent memory file from someone else:

a CLAUDE.md
an AGENTS.md
a Cursor rules file
a project memory file
a team instruction file
a long-lived agent setup that has accumulated old decisions

Then the question becomes practical:

does the authority map help them see something they could not see clearly before?

Does it separate rules from context?

Does it identify stale or risky instructions worth reviewing?

Does it make the next agent session safer or less confusing?

If the answer is no, then I learned that before charging anyone.

If the answer is yes, then the tool has taken one step out of my own mirror.

The part I need help with

Here is where I want to be careful.

I know the technical boundary. I am still learning the market one.

I am not going to fake certainty about pricing a thing I have run on exactly one system, my own. I am not trying to jump ahead and put a number on this before I understand what is actually worth paying for. I also do not want fear to make me pretend there could never be value here. The honest move is to ask people who have already crossed this bridge instead of guessing.

So I have two asks, and the first one matters more.

First, the real one. If you have an agent memory or instruction setup you would let me audit, a CLAUDE.md, an AGENTS.md, a Cursor rules file, a long-lived internal agent file, I want to point this at it and tell you honestly what it finds. The test I need is simple: does the authority map show someone something they could not see clearly before? I would take that over a sale right now.

Second, quieter. If you have turned a specialized audit, security review, or governance workflow into paid work, I want to hear how you modeled the first version, especially when the honest deliverable is a risk map and not a magic green check. How did you price it without overselling the boundary, and what did the first engagement look like before you had a price at all?

I am asking in public because this is a new space for me, and I would rather learn it out loud than put up a pricing page I have not earned.

What I do know is the direction:

I built something real, it failed in a way I could see, and I revised it in the open.

I am not here to be right or perfect. The revision is the part that decides whether anything was actually learned.

I can show the mechanics. I can show the receipts.

Now I need to find out whether it helps someone who is not me.

That is where the project sits now: one public correction loop, one useful authority map, one unsolved semantic layer, and a need for the next real system.

Top comments (13)

Mike Czerwinski • Jul 2

Pointing it at itself is the honest move. The two failure shapes you named look like different problems but are the same missing variable: the detector reads token match, not predicate structure. Sentence about a category vs member of that category is the class-vs-instance confusion that also breaks most retrieval reranking of policy docs. Text similarity cannot tell you which sentence is doing use vs mention.

The over-fire is the cheap mistake. Under-fire is the one that costs. Systems in this class should be calibrated toward false positives on purpose, because a noisy flag costs a human ten seconds of review and a missed drift costs whatever the drift was supposed to prevent. Asymmetric cost, asymmetric bias.

The mechanism that would have caught the second failure without needing "deprecated" keywords: contradiction detection across the governing scope. Not "does this line look stale" but "does this line contradict another line that is nearer to the current work." Contradiction is authority evidence without needing anyone to have written a lifecycle marker.

Self-Correcting Systems • Jul 2

the asymmetric cost line is the part im taking with me. a noisy flag costs a human ten seconds and a missed drift costs whatever the drift was going to cause, so biasing toward false positives on purpose is just honest accounting. i had been treating a false positive as a bug to eliminate when its actually the cheaper error to live with.

and contradiction across the governing scope is exactly where i landed too. not "does this line look stale" but "does this line disagree with a line thats closer to the current work." thats authority evidence you can get without anyone writing a lifecycle marker, which is the whole problem with the keyword approach.

thats the layer im building toward, and im keeping it honest about where it stands: the proposer names a possible contradiction and cites the specific other line, then a deterministic check confirms the cited line actually exists and actually overlaps in scope before it ever becomes a finding. the design is frozen and the pieces are built. it has not beaten the baseline in a fair run yet, so im not claiming it works. but you described the target better than my own notes did.

Mike Czerwinski • Jul 4

"It has not beaten the baseline in a fair run yet, so im not claiming it works" is a sentence that usually gets deleted before publishing. Design frozen, verdict withheld: that is the detector's own discipline applied to the detector, and it reads exactly as trustworthy as it should.

One measurement problem waiting for you at the fair run. Precision is cheap: read the flags, count the good ones. Recall is not measurable at all without a corpus of known-stale lines, and nobody has one, because a drift you can locate is a drift you already fixed. The affordable way out is to manufacture ground truth: plant backdated contradictions with known coordinates and measure how many the detector surfaces. Injection is the only labeled corpus sold at this price. Side benefit: the plants double as a liveness check. The day the detector stops finding them, the detector died; the corpus did not get clean.

Marcus Kim • Jul 1

The self-audit tells me the product boundary is probably the product: not "this memory is safe," but "here is the authority map, here are the gates, here is what still needs human judgment." Flagging your own "find the old instructions" slogan while missing the June 2026 prose-level drift is exactly the failure pair that separates keyword detection from authority reasoning. As a founder/engineer, I'd measure the next version by whether it shortens a careful review of a messy CLAUDE.md or AGENTS.md without hiding uncertainty, not by whether it produces a clean green result.

Self-Correcting Systems • Jul 1

Marcus this is the exact frame i needed someone outside my own head to say out loud. the boundary is the product. i kept almost apologizing for it only being a map, and you just named why the map is the thing. and youre right about the metric. a clean green result would actually be the tell that its lying to you. the honest measure is whether it makes a careful pass through a messy CLAUDE.md or AGENTS.md faster without ever hiding where its unsure. thats the yardstick im keeping now. its also why i left the semantic gap in as a failing test instead of smoothing it over. i want the uncertainty visible, not buried under a checkmark. the slogan flag missing the real june drift was the whole lesson in one run. keyword detection saw the words. authority reasoning would have asked what actually replaced what. that gap is the next build and im not going to pretend its closed before it is. appreciate you reading it close enough to hand me the right measuring stick.

Marcus Kim • Jul 2

Oftentimes the most insidious bugs won’t throw an error and you’ll have to dig through the codebase for problems that the compiler thinks are okay. Looking forward to more from you!

Nazar Boyko • Jul 1

Tightening the detector to require words like superseded or deprecated fixes the slogan bug cleanly, but doesn't it walk right back into the prose case you flagged as the harder one? The false positive and the false negative feel like the same root cause to me, both come from reading vocabulary instead of the authority relationship, and this fix leans even harder on vocabulary. The genuinely dangerous stale instruction, that June plan written like normal thinking out loud, is exactly the one that will never contain the magic word. So the keyword tightening buys precision on the easy cases at the cost of never reaching the case that motivated the whole tool. Not a knock, you already name this as the next layer, but it does make me think the semantic proposer isn't a someday feature so much as the actual core, with the keyword pass being the placeholder. Really glad you shipped the failure instead of the clean rerun.

Self-Correcting Systems • Jul 2

You’re right that both failures share the root: the detector reads vocabulary, and the June case will never contain the magic word. No argument there. The order was deliberate though. The semantic layer alone is a model reading prose and deciding what’s stale, a second opinion with confidence. I’ve watched that narrate its way past its own evidence too many times to build on it first. So I built the layer that demands receipts before the layer that proposes claims. Right now that means the system only confirms what announces itself: half an architecture doing a whole architecture’s job. Where you’re dead on: the proposer isn’t a someday feature, it’s the other half of the core, and the tool doesn’t reach the case that motivated it until both halves exist. But it ships into the discipline, not instead of it: proposer finds the candidate, deterministic layer confirms against evidence in the file, claim and receipts reported separately. “Trust the keywords” was too weak. “Trust the model’s read” is the same failure in better clothes. The pair is the tool.

Richard Smith • Jul 2

The relevance is not authority frame is the part I'm carrying out of this one. It's the same problem with stale docs in a codebase - the outdated page is technically there and findable, but it should have zero votes on what actually gets built.

Self-Correcting Systems • Jul 2

Thats the cleanest version of it. the stale doc is right there in the repo, findable, and often still technically correct, and it should still have zero votes on what gets built today. relevance got it into the search results. it never earned the authority to decide anything.

and the fix isnt deleting the old page. its re-deriving, at the moment of the decision, whether that page still gets a vote on the current work. most systems never ask. they just let the thing that showed up act like the thing thats in charge.

mote • Jul 2

The "slogan as test case" move is exactly right: using your own marketing language as a false positive control is a clean way to catch keyword-matching detectors. The detector flagged "find the old instructions your AI should stop obeying" as an old instruction because it saw "old instruction" and "stop obeying," which are exactly the words a stale rule would contain. But it's not a stale rule; it's a description of one. The gap is exactly what you named: pattern matching vs. authority relations.

The failure taxonomy you worked out, over-fires (slogans flagged) vs. under-fires (genuine staleness missed), maps onto a well-known problem in information retrieval: recall vs. precision. The detector optimizes for recall (catch everything that might be stale) and pays in precision (catch things that aren't). But the actual problem is harder than pure recall: you need to distinguish "mentions topic X" from "is a member of category X," which requires understanding whether the text has the right authority relationship, not just the right vocabulary.

Path B (training a classifier) sidesteps the vocabulary problem by learning the authority signals, but it introduces a different risk: a classifier trained on your labeled examples will have the same blind spots as your labeling intuition. If you consistently miss genuine staleness that's written in neutral prose, the classifier will learn to do the same thing. Path A (semantic proposer + human confirmation) is probably the right intermediate step: it reduces the review burden without encoding the labeler's biases as a ground truth.

On the "leave receipts for failure" point: this is the right instinct for agent memory management. But receipts only help if they're structured enough to be queryable â "the system failed because it followed a rule that was superseded three months ago" is useful; "the system followed something old" is not. Structured failure logs + semantic retrieval = actionable memory auditing. Have you thought about what the schema for a "failure receipt" looks like beyond a prose note?

Self-Correcting Systems • Jul 2

the recall vs precision mapping is right, and the classifier warning is the one i needed to hear out loud. if i train it on my own labels it inherits my own blind spot, so the neutral prose staleness i keep missing becomes ground truth that its now confident about. thats worse than the original bug because it looks rigorous. that pushed me away from "train a classifier" and toward the proposer plus deterministic confirm path, for exactly the reason you said.

on the failure receipt schema, real answer, because prose notes were not going to cut it. the receipts are signed and chained, and each one carries the structured pieces: the item it fired on, the relation type it proposed, the exact evidence span it cited, the verdict, and the reason it was rejected or confirmed. so you can query "this failed because it obeyed a rule that a later line superseded" instead of "the system followed something old." the whole point is that a failure has to be replayable and diffable later, not just remembered.

the question about queryable structure was the sharpest thing in the thread. that is the difference between an audit log and a story.

Ajith Chandran • Jul 6

What I found most valuable wasn't the bug itself—it was the decision to publish the failure instead of only showing the corrected version. Too many AI demos present the final polished result, while the real engineering work happens in understanding why a system failed. Your distinction between vocabulary matching and authority reasoning really resonated with me. A system can recognize the words "old instruction" without understanding whether that instruction is actually obsolete. That feels like one of the core challenges for long-lived AI agents. Looking forward to seeing how the semantic authority layer evolves, especially if it can explain why one instruction should override another instead of simply flagging potential conflicts. Great write-up.
