I Built a Closed-Loop Self-Healing System for My AI Config — By Accident

#ai #claudecode #automation #systems

Series: Building AI Systems That Verify Themselves

Part 1: Self-verifiable rules

Part 2 (you are here): Closed-loop self-healing

Part 3: Why zero output is the goal

I didn't set out to build a self-healing system. I just wanted to stop forgetting things. What emerged was a feedback loop that fixes itself without me.

After 40+ Claude Code sessions, I noticed a pattern: identify a problem → write a rule → forget about it → rediscover the same problem 5 sessions later → write the same rule again.

The rules folder was growing. But behavior wasn't changing.

The Alertmanager Problem

Prometheus Alertmanager has a concept that changed how I think about systems: alert → acknowledge → fix → auto-resolve.

Without the "auto-resolve" step, alerts pile up. You get a graveyard of acknowledged-but-never-fixed issues. They're not "open" — they're dead. Nobody looks at them. Nobody fixes them. They just sit there, eroding trust in the entire monitoring system.

My AI config had the same problem. Problems were identified. Rules were written. But nothing tracked whether the rules were actually followed — or whether the problem was actually fixed.

The missing piece wasn't more rules. It was a closed loop.

The Accidental Architecture

Three files. That's all it took to close the loop:

config-health.py (Stop hook)
    ↓ counts markers, detects gaps
pending-verifications.md (persistent memory)
    ↓ surfaces unresolved items
BODY.md (startup rules, read every session)
    ↓ tells AI to focus on pending items
    ↓ AI follows the rule, produces markers
config-health.py (next Stop hook)
    ↓ detects markers → auto-deletes from pending

The circuit:

Measure: config-health.py greps the transcript for rule markers like [✓THINK] and [✓CONTEXT]
Remember: rules with low execution rates get written to pending-verifications.md — they persist across sessions
Surface: BODY.md reads pending-verifications.md on startup — the AI sees "these rules need attention"
Re-check: next Stop hook, config-health.py counts again — if the rule is now being followed, auto-delete from pending

I didn't design this loop. It emerged from three independent decisions:

Rules should be self-verifiable (embed markers in rule text)
Monitoring should be silent on success (zero tokens on normal path)
Unresolved issues should persist across sessions (write to file, read on startup)

Each decision solved an immediate problem. Together, they formed a complete circuit.

The Self-Healing Part

Here's the part I didn't expect: the loop runs without me.

I don't check dashboards. I don't review reports. I don't maintain a tracking spreadsheet. The system surfaces issues when they exist and silences them when they're fixed. My only job is to follow the rules when BODY.md tells me to.

Last week's session: BODY.md surfaced "THINK rule execution: 40% (target: >70%)." I paid extra attention. Hit [✓THINK] 12 times. Next session, pending-verifications.md was clean — config-health had auto-deleted the entry.

I didn't manually "close" anything. The loop closed itself.

What Makes This Work

Three properties that any self-healing loop needs:

Measurement is mechanical, not judgment. Regex counting doesn't get tired, doesn't rationalize, doesn't say "eh, probably fine." It counts or it doesn't.
Memory survives sessions. pending-verifications.md is a file on disk. It doesn't disappear when the AI's context window resets. Problems can't hide by waiting you out.
Re-check is automatic. You don't have to remember to verify. The hook runs every session. Fixed problems auto-resolve. Only real issues stay visible.

The General Principle

Closed loops don't require complex architecture. They require a measurement, a memory, and a re-check. Three files, if they form a complete circuit, are enough.

This is the same pattern that makes TDD work (write test → run test → fix → re-run). The same pattern that makes CI work (push → build → test → report). The same pattern that makes any feedback system work.

The insight isn't new. What's new is applying it to personal AI configuration — the rules and habits that govern how you work with AI assistants.

Try It Yourself

Take one rule in your AI config. Add these three pieces:

A measurement: grep -c '\[✓YOUR_RULE\]' transcript.jsonl
A memory: write the count to a file that gets read on startup
A re-check: run the same grep next session, compare

That's your first closed loop. If you get the circuit right, it might start self-healing without you noticing.

The loop lives in delivery-gate (the mechanical layer) and checkgrow (the methodology layer). Both MIT licensed, both open to contributions.

This article is part of the *Engineering Trustworthy AI Output** series:*

Part 1: I Made My AI Rules Self-Verifiable — Here's How
Part 2: Your AI Agent's Best Work Produces Zero Output — And That's the Point
Part 3: I Built a Closed-Loop Self-Healing System for My AI Config — By Accident ← you are here

What config bugs have you fixed more than twice? If you've built any automation to catch recurring issues before they need a third fix, I want to hear about it.

中文版：掘金/YuhaoLin2005yhl · Code on GitHub