DEV Community

Thi An
Thi An

Posted on

Your AI coding agent keeps re-trying the fix that already failed. I built a zero-dependency cure.

There's a specific loop that anyone who codes with AI agents knows by heart.

The agent tries a fix. The test fails. It tries something else. Fails again. Then — a few prompts later, or after its context window gets compacted — it confidently re-applies the exact first fix that already failed.

The failure got erased from its memory. The confidence didn't.

I watched Claude Code do this four times in one session, each time presenting the same broken patch as a fresh idea. Every loop burned tokens, time, and a little bit of my soul. So I built RegressionLedger.

The core idea: verdicts should outlive the context window

An agent's context is ephemeral — compaction, new sessions, /clear all wipe it. But what was tried and how it went is exactly the knowledge that shouldn't die with the context.

RegressionLedger is a Claude Code hook + CLI that:

  1. Fingerprints every edit the agent makes — a normalized token stream where whitespace, comments, and string/number literals are abstracted away (but truefalse; opposite behavior must never collide)
  2. Links each edit to the next test/build outcome — it recognizes 12+ toolchains (jest, vitest, pytest, go, cargo, tsc, eslint, gradle…) and parses pass/fail plus the error signature
  3. Persists everything to a local JSON ledger that survives restarts and compaction
  4. Hard-blocks re-applying a fix that already failed — a PreToolUse deny with the original error attached:
RegressionLedger: you already tried the same fix to src/auth.js 2 hours ago.
It failed with: AssertionError: expected 200, got 401
Re-applying it will reproduce the same failure. Change strategy instead.
Enter fullscreen mode Exit fullscreen mode

The key design decision: enforced, not advisory. Memory systems that merely suggest get ignored exactly when the agent is most confident — which is exactly when it's about to repeat the mistake. The block is at the tool layer; the agent can't argue its way past it.

The parts I haven't seen anywhere else

The agent wakes up knowing its dead ends. A SessionStart hook injects a compact "what already failed here" briefing every time a session starts — including right after compaction wipes the agent's memory. Failures get blocked before they're re-conceived, not just re-applied.

Renaming variables doesn't help. A second, structure-only fingerprint (identifiers erased, shape kept) catches "the same fix, renamed" — it never blocks on that weaker signal, but it annotates: this may be the same fix, paraphrased.

Thrash detection. Blocking identical fixes catches one doom loop. The sneakier one: different fixes all dying on the same error. At 3 distinct failed approaches on one error signature, the hook escalates: "The patches differ; the error doesn't. The diagnosis is wrong, not the patches. State 2–3 root-cause hypotheses before editing again."

Herd immunity. rl export / rl import — your agent inherits the dead ends a teammate's agent already paid for, attributed and auditable.

Trust, but verify — literally

A guardrail that blocks wrongly gets uninstalled within the hour. So:

  • Reproducible benchmark: npm run bench runs a deterministic 310-case corpus — 120/120 cosmetically-disguised repeat fixes caught, 0/190 false blocks on genuinely different fixes
  • rl doctor proves the install works end-to-end: it fires synthetic events through the real hook process — a first-time edit must pass, a seeded repeat failure must be denied
  • warn mode + hit auditing: run it advisory-only for a week, then check rl stats to see exactly what it would have blocked on your codebase before enabling hard-block
  • rl unblock <file> for when the context genuinely changed (postgres → aurora) — deliberately a human decision, because an agent claiming "it's different this time" is exactly the confidence loop this tool exists to break

Honest limitation: the fingerprint is a lexer, not an AST. A heavily restructured-but-equivalent patch can slip under the similarity threshold (it then earns its own ledger entry when the test fails again). Opt-in tree-sitter mode is on the roadmap — but a lexer that ships beats a tree-sitter that's roadmapped.

Try it

# inside your project
npx regressionledger init
npx regressionledger doctor   # proves the guardrail works end-to-end
Enter fullscreen mode Exit fullscreen mode

Or as a Claude Code plugin:

/plugin marketplace add anlor1002-alt/regressionledger
/plugin install regressionledger@anlor1002-plugins
Enter fullscreen mode Exit fullscreen mode

Zero dependencies. Fully offline — no network calls, no API keys, nothing leaves your machine. MIT.

Repo (with a 30-second demo GIF): https://github.com/anlor1002-alt/regressionledger

If you live in the doom loop too, I'd love feedback — especially on the false-positive/false-negative balance. The best features so far were all requested by people who tried to poke holes in it.

Top comments (0)