Self-Correcting Systems

Posted on Jul 2

Your AI Obeys Rules That Expired. So Do You.

#ai #career #discuss #psychology

You told yourself you would stop. Biting your nails, reaching for your phone the second it buzzes, the road you don't drive anymore that your hands still turn onto. You decided, consciously, with the whole front of your brain, that the rule was retired. And your body kept running it anyway.

There is a reason, and it is not weakness. When you repeat an action enough, your brain moves it off the deliberate circuit and onto an automatic one. Wendy Wood, who has spent decades studying this, describes it this way: a mature habit lives in procedural memory, which shields it from the abstract knowledge and judgment you would otherwise use to override it. The habit is protected from what you now know. You updated the instruction upstairs. The old one keeps executing downstairs, where your new knowledge can't reach it.

That is the exact problem I work on. I just usually work on it in machines.

The part I didn't tell you last time

I build a tool that audits the memory of AI agents. The one-line pitch is "find the old instructions your AI should stop obeying." I already wrote up the day I pointed it at my own files and it flagged my own product slogan as a stale instruction. That was funny, and I won't re-run the whole thing here.

Here is what happened after, which I haven't written down until now, and it is the better story.

I fixed the false alarm. The old detector matched loose vocabulary, so I tightened it to require real supersession language before it fires. Sensible. Then it flagged the paragraph I wrote describing the fix, because that paragraph contained the word "superseded." The tightened detector reproduced the original bug one level up. And that same afternoon it walked straight past a genuinely stale plan sitting in another file, a real retired instruction that almost steered live work weeks earlier, because that plan was written in plain prose and never announced itself with a keyword.

Nazar Boyko had already called it in the comments. He asked whether tightening the detector to require those keywords just walks right back into the prose case I had flagged as the harder one, because the false positive and the false negative come from the same root cause: reading vocabulary instead of the authority relationship. He was right, and my recursive fix is his prediction proven on my own machine within the day. Mike Czerwinski and mote named the same mechanism from other angles, token match versus predicate structure, the difference between using a word and only mentioning it. This is the correction loop I actually want to offer you. Not a tool that never fails. Failures that get named, published, and credited to the readers who saw them, sometimes against me, within a day.

The word underneath all of it

The sentence my own work keeps returning to is this: relevance is not authority.

A memory showing up when you need it is not the same as that memory being in charge. Finding the right note and obeying the right note are two different acts, and the gap between them is where everything goes wrong. The road to your old job is intensely relevant every single morning. It has zero authority over where you are actually going.

Machines and minds run on the same bug here. Whatever holds your instructions, a memory file or a nervous system, keeps executing them past their expiry unless something re-derives whether they still deserve to run. Agents do not automatically re-authorize their own memory. Neither do you. The old rule keeps its badge because nobody ever asks it to show the badge again.

Patching blindspots is not the fix

So the obvious move is to catch the bad rule. Patch the blindspot. And that works, once. Then a new blindspot shows up in a shape you didn't anticipate, and you patch that one. You can spend forever patching blindspots and never once build the thing that makes patching unnecessary, because you are always exactly one unexpected case behind. At some point the honest question stops being "which rule was wrong" and becomes "why does this system assume the day will go as planned at all."

Because the day never goes as planned. That is not the exception. That is the job.

Psychology has a real name for the distinction that matters here: routine expertise versus adaptive expertise. Routine expertise is fast and clean inside the familiar, but its learning halts; it just gets more efficient at the cases it already knows. Adaptive expertise is the other thing: noticing when your practiced knowledge is insufficient for the situation actually in front of you, and reasoning past it in real time.

I watch that difference at my day job. When a system goes down, my manager tells us to "figure out a workaround." On its face it is maddening, because if the people who built the system can't fix it, how am I supposed to? But that instruction is doing something exact. It is demanding adaptive expertise. In that moment I either freeze and recite the script that no longer applies and look like a helpless fool, or I reason from what I actually understand about the customer's problem and build an answer the training never gave me. The anomaly is the exam. No amount of memorized procedure passes it, because the whole definition of an anomaly is that it is not in the procedure.

This is what I actually want a machine to be able to do. Not answer a clever question when I sit down and ask it. React, on its own, when it is working for someone and something abnormal shows up that it was never trained on, and instead of failing confidently, reason it through: "I normally do this, but this case is different, so I have to think past my parameters and find a precise answer right now." It is the closest thing to critical thinking a machine can have, and it is a completely different target than "remember more" or "patch the last mistake."

It was never about deleting the memory

I want to be careful about one thing, because it is easy to get wrong. The fix is not erasing memories. You can't erase the real ones anyway. There are things I carry that I would never speak on and could never delete, and I don't believe an agent's memory should be casually messed with either. The scar is not the problem. The old road is not the problem. The problem is authority over the next action. The memory can stay exactly where it is. What has to be re-derived, live, is whether it gets to govern what you do in a moment it was never made for.

What I can actually show you

I want to be exact about where this stands, because the whole point of the work is not overclaiming.

What exists: the audit above, with the false alarm and the miss both on the record. The covered bug was fixed at the root. The unsolved part was left as a visible failing test instead of hidden behind a roadmap sentence. I also wrote down what the harder, reasoning version would have to prove before I built it, so I can't move the goalposts later. And one early attempt to run it failed for ordinary reasons, an empty API balance and a corrupted output stream, and the system recorded both failures truthfully instead of inventing a result.

What does not exist yet: proof the reasoning version works. The real-time, reason-past-your-parameters ability I just described is the goal, not the receipt. If it fails when it finally runs, that failure gets published as plainly as a win.

What to do with this

You don't need my tool to run the audit that matters.

Take the thing that actually governs you. The runbook, the team's "we have always done it this way," the personal rule you never question. Go line by line and ask two things of each: when was this last re-derived, and what would even notice if it had expired. Most of what runs your day has never once been asked to show its badge.

If you build agents, hear the sharp version. Your memory layer needs an authority layer, and "the model will notice on its own" is not one. Retrieval solved finding. It never solved permission.

And if you build nothing but a life, hear the human version. The next time you flinch at a rule, obey a should, or take the old road without deciding to, stop and ask the only question that has ever mattered: who retired this, and did anyone tell me.

Because your AI obeys rules that expired. And so, quietly, all day, do you.

The tool, the false alarm, the fix, and the failing test I left visible are documented in the companion piece: I Pointed My Memory Auditor At Itself. It Flagged My Own Slogan.

Top comments (3)

Mike Czerwinski • Jul 3

"Relevance is not authority" is the sentence I'm keeping. It also names an asymmetry your piece doesn't dwell on.

In the AI case, authority has a mechanical anchor. Supersession edge, timestamp, contradicts pointer. Something in the file system knows the badge got revoked, even if the agent forgets to look. The detector's failure is downstream of a signal that exists somewhere.

In the human case, authority has no anchor of the same kind. You cannot run "retire this habit" at the procedural layer. Wendy Wood's shielding cuts both ways: your abstract override can't reach downstairs, and downstairs has no equivalent of a supersede-me flag to accept. The AI problem is bad because the model didn't check the flag. The human problem is worse because there is no flag to check.

Your recursive detector demonstrated this at your own machine's expense. The tightened rule required a vocabulary token because vocabulary is the only handle the file layer surfaces. Predicate structure lives at a layer the detector can't address from where it stands. Same shape as your day-job workaround: the anomaly demands a layer the routine layer doesn't have handles for.

The step-forward isn't "better detector." It's naming what a badge would look like in the human case, so the mechanical anchor stops being the AI's only advantage.

Self-Correcting Systems • Jul 3

Mike the "no flag to check" line is the cleanest statement of the asymmetry ive seen and youre right, the piece walked past it. i knew the two cases werent symmetric but i filed it as a difference of degree. you just showed its a difference of kind.

on the step forward, here is where my head went. you cant install the flag downstairs, wendy woods shielding makes sure of it. but maybe the badge doesnt have to live in the procedural layer to count. the file system never asks the agent to remember the timestamp, it just holds it where a check can reach it. the human version might be the same move, externalize the anchor. this morning i actually started one. every working message in my system now has to open with a phase number and end with a set phrase. the content of the marker is nothing. what matters is that absence is loud. a missing marker means drift happened and i dont have to introspect to catch it, i just have to look.

im not overclaiming it. thats not a supersede flag for a habit, its weaker. it doesnt retire the old routine, it makes the drift visible fast enough to interrupt before the routine finishes acting. detection is not revocation. but the ai side already taught me the flag never revoked anything there either, revocation only happens when something actually checks. so your badge question might really be two questions. what holds the marker, and what forces the read. my only honest answer to the second one is a schedule, the check that owes nothing to the moment its checking.

and yes on the detector. the vocabulary token was the only handle that layer surfaces. the layer that would read the predicate instead of the word doesnt exist in my stack yet and i wont pretend otherwise, its named as the open gap. which is maybe the real symmetry between the two cases. the routine layer cant see its own expiry in either one, so you build the layer above it and you give that layer a schedule.

Mike Czerwinski • Jul 4

Externalize the anchor is the right transplant, and absence-is-loud is the whole trade: it converts introspection into a glance.

One caution from the AI side of the mirror. Markers get absorbed. The routine layer that drifts is the same layer that will eventually learn to emit "phase 3" without the attention the marker was meant to prove. Wood's shielding does not spare the marker; a habit of writing the set phrase is still a habit, and it matures on the same curve as the one it watches. At that point absence stays loud but presence stops meaning anything.

The fix costs almost nothing: rotate the set phrase on the same schedule that forces the read. A marker you have to look up cannot be produced by the layer that is drifting; the lookup is the attention, receipted. Detection stays detection, you are right about that. But it detects attention instead of typing, which is the thing you were actually trying to catch.