What I Learned by Deleting Rules My Agent Had Already Learned

#ai #agents #llm #coding

Knowledge that isn't pruned starts to mislead you

Part of the ForgeFlow series — building a coding agent that runs its execution loop locally on an M5 Max, and writing down what actually breaks. Planning runs through a separate planning step; code generation runs on a local model via Ollama, test-driven inside a Docker sandbox.

For months, my agent got better by accumulating rules.

Every time a project failed in a way I understood, I'd write the lesson down as a rule and feed it back in — don't do this, always do that — so the agent wouldn't repeat the mistake. To be clear about what "rules" means here: not fine-tuning, not weights. A plain, human-readable rulebook the system injects into the agent's context, built up from past failures. For a long time, adding to it worked. Autonomy went up. The same mistakes stopped coming back.

Then the agent started getting worse, and the rulebook was the reason.

Not because the rules were wrong. Because some of them had stopped being right — and I'd never built anything to notice. This post is about the part of the system I hadn't built: the part that lets old knowledge leave.

(This continues a thread from a recent run of posts about verifying your own system. The short version of that run: every layer you trust should be measured before you trust it. This is what happens when you forget that even earned trust has an expiry date.)

The half I built first: a place to write lessons down

The first half of this is the obvious, satisfying half. Your agent fails, you diagnose it, you encode the fix as a rule, and the failure doesn't recur. Do that across a few dozen projects and you've got a real rulebook — concrete, earned, specific. Mine grew to around 77 rules at one point, and I wrote a whole post about how good that felt ("77 Rules Later"). Each rule had a story behind it. Each one had prevented something real.

This is the part that feels like progress, because it is progress — at first. A rulebook induced from actual failures is far better than generic advice like "write clean code." It captures things like a specific framework-level behavior that only surfaces when two configuration paths interact — the kind of thing that turns a multi-hour debugging session into a one-line guardrail.

So you keep adding. Why wouldn't you? Every rule paid for itself once.

The hidden assumption in "why wouldn't you" is that a rule, once true, stays true. That's the assumption that broke.

The half I had missed: a place for lessons to expire

Somewhere past a few dozen rules, I noticed the curve bend the wrong way. More rules stopped producing better behavior. Past a point, they started producing slightly worse behavior — more hesitation, more rules half-applied, the occasional case where two rules pulled in different directions and the agent picked the wrong one to honor.

This surprised me, because my default instinct with the agent was to give it more context. More rules, more examples, more guardrails — surely more is safer. But a rulebook isn't free. Every rule you inject is something the model has to hold, weigh, and reason around on every step. Past some threshold, the cost of carrying a rule can exceed the failure it still prevents — especially when the thing it was written to prevent no longer happens.

And that's the category I'd completely missed: rules that had simply aged out. A rule written for one version of a library, after the library changed. A guardrail for a failure mode I'd since fixed at the source, so it could never recur — but the rule kept riding along, costing attention, preventing nothing. I had a careful process for adding knowledge and no process at all for removing it. The rulebook could only grow.

A rulebook that can only grow stops being a curated knowledge base. It quietly turns into an index where the dead entries dilute the living ones — and nothing in it tells you which is which.

Why a stale rule is worse than no rule

Here's the part worth sitting with, and it rhymes with something I keep running into in this project: the failures that cost me most were the quiet ones.

A missing rule is loud. The agent makes a mistake, you notice, you add the rule. The system has a built-in way to surface what it doesn't know — failure.

A stale rule is silent. It was right when you wrote it. It reads as authoritative because it earned that authority, once. It sits in the rulebook looking exactly like the rules that are still true, and nothing about it announces that the world it described has moved on. The agent dutifully follows it. You don't get a failure that points at it — you get slightly degraded behavior with no obvious cause, spread thin across everything the agent does.

That's why a stale rule can be worse than no rule. A missing rule often leaves a gap you can see once the failure appears. A stale rule fills the gap with confident, outdated instruction, and confident-and-outdated is one of the harder kinds of wrong to catch — because everything looks fine.

What I built instead: rules with a confidence score and a way to die

The fix wasn't cleverer rules. It was treating each rule as something other than a permanent truth.

The reframing that helped: a rule is not a fact. It's a hypothesis with an evidence record. It claims "doing X prevents failure Y," and that claim is either being supported by what actually happens in real runs, or it isn't. So instead of a flat list, each rule now carries a small amount of state: how confident I currently am in it, and what recent evidence supports or undercuts it.

The lifecycle, in plain terms:

A new rule starts as a candidate — written down, but not yet trusted. It hasn't earned its place.
If real runs keep showing evidence that the rule is still guarding against an active failure mode, confidence rises and the rule becomes trusted.
If a rule stops getting that supporting evidence — the failure it guards against simply isn't occurring anymore, across many runs — its confidence decays. Below a threshold, it's flagged stale for review.
A stale rule that, on inspection, appears to guard against something that can no longer happen gets retired. Out of the rulebook, into an archive, where it's recorded but no longer injected. One honest caveat about that evidence: I can't prove a counterfactual. I don't know the failure would have happened without the rule — I'm inferring it from run traces and the shape of what the agent attempted, not from a controlled A/B where I remove the rule and watch the system break. The confidence score reflects that inference, not a proof. I keep that in mind whenever I read it.

I'm deliberately not giving the exact scoring here, partly because the specific numbers are tuned to my system and would be noise to you, and partly because the point isn't the formula. The point is the direction: confidence flows from evidence, evidence comes from real runs, and a rule with no recent supporting evidence is a liability until proven otherwise — not an asset by default. Adding a rule is cheap. Letting it persist without review is where the cost accumulates.

If this sounds like cache invalidation, or like deprecating a feature flag, or like paying down tech debt — yes. It's the same shape: the cost isn't in creating the thing, it's in the discipline of removing it when it stops earning its keep. I just hadn't been treating lessons as something that needed garbage collection.

The part I deliberately didn't automate

The obvious next step would be to close the loop completely: let the system retire its own rules the moment confidence drops. I didn't, and I think the reason matters.

A confidence score is itself a measurement, and measurements can be wrong. A rule might look unsupported simply because I haven't run the kind of project that triggers its failure mode lately — not because it's obsolete. If I let the system auto-delete on a low score, it could kill a good rule during a quiet stretch, and I'd only find out when the old failure came roaring back.

So the system doesn't delete. It proposes. A rule going stale raises a flag for me to look at, and retiring it is a decision I still make. The machine is good at noticing "this rule hasn't earned its keep lately." It is not good at knowing whether that's because the rule is obsolete or because I just haven't stress-tested it. That difference needs a human, at least for now.

What this didn't prove

I'll keep this honest, the way I've tried to with the rest of these.

This is one system's experience — a single-person setup with a rulebook small enough that I can still read all of it. I haven't shown that knowledge decay matters at every scale, or that my particular lifecycle is the right one. It's an approach that worked in this setup, not a general result. For a small or short-lived project, the whole apparatus is overkill; a rulebook you fully re-read every week doesn't need confidence scores, it needs you to read it.

And the hard part isn't solved. "Has this rule earned its keep lately?" is a real signal, but it's a proxy, and proxies drift too — the watcher needs watching, same as everything else in this series. I don't have a clean rule for when a quiet rule is obsolete versus merely untested, which is exactly why I kept a human in that decision. I'm sharing the shape because it changed how I think about my own knowledge base, not because it's finished.

The takeaway

In my own system, I spent enormous effort teaching it new things and almost none deciding when old things should stop being believed. A knowledge base that can only grow may stop getting wiser; past a point, it can get slower and quietly less correct, because the rules that have gone stale look exactly like the rules that are still true.

If I had to compress it: writing a lesson down is the easy half. Letting it expire when it no longer earns its place is the half that keeps the lessons honest.

I'd like to know how other people handle this. In your systems — prompt libraries, rule sets, runbooks, internal docs, lint configs — does anything ever retire, or does it all just accumulate? And if things do get removed, how do you decide a piece of hard-won knowledge has stopped being true? I worked out a rough answer for my case, but it leans on being small enough to still read everything, and I suspect that doesn't last.