From Learning Capture to Self-Evolving Rules: Adding Verification Sweeps to terraphim-agent
A self-evolving AI coding agent sounds like science fiction. It is not. It is a shell script, a markdown file with grep patterns, and a weekly review discipline.
We have been running terraphim-agent in production for months. It captures every failed bash command from Claude Code and OpenCode, stores them in a persistent learning database, and lets agents query past mistakes before repeating them. The capture loop works. The query system works. The correction mechanism works.
What was missing was verification. We could capture mistakes and add corrections, but we had no way to prove the corrections were being followed. No machine-checkable enforcement. No audit trail. No quantitative measure of whether the system was actually improving.
Then Meta Alchemist published a viral guide on transforming Claude Code into a self-evolving system, and two ideas jumped out: verification patterns on every rule and session scorecards. We already had the foundation. The article showed us what to build on top.
This post covers what we added, what we deliberately did not copy, and why the combination of a Rust CLI with a thin shell verification layer is more robust than an all-in-JSONL approach.
If you have not read the foundation post on configuring terraphim-agent for Claude Code and OpenCode, start there. This post assumes you have the capture system running.
What we already had
Before reading the Meta Alchemist article, our learning infrastructure had three layers:
Layer 1: Learning capture (PostToolUse hook)
Every failed bash command in Claude Code triggers our post_tool_use.sh hook. The hook extracts the command, exit code, and error output, then pipes them to terraphim-agent learn hook --format claude. The learning is stored as a structured file in ~/.local/share/terraphim/learnings/ (global) or .terraphim/learnings/ (project-scoped).
# What the hook does on every failed command:
terraphim-agent learn capture "$COMMAND" --error "$ERROR_OUTPUT" --exit-code "$EXIT_CODE"
The design is fail-open: if terraphim-agent is missing or crashes, the hook passes through silently. An observability tool must never break the tool it observes.
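The fail-open contract can be sketched as a tiny wrapper. This is an illustrative sketch, not the actual hook code; the function name and variables are assumptions:

```shell
#!/bin/bash
# Fail-open capture sketch: if the CLI is missing or crashes, succeed anyway.
# An observability tool must never break the tool it observes.
capture_learning() {
  local cmd="$1" err="$2" code="$3"
  # CLI absent: pass through silently
  command -v terraphim-agent >/dev/null 2>&1 || return 0
  # CLI present but failing: swallow the error
  terraphim-agent learn capture "$cmd" --error "$err" --exit-code "$code" \
    >/dev/null 2>&1 || true
}
capture_learning "pip install requests" "command not found" 127
echo "hook exit: $?"   # prints "hook exit: 0" whether or not the CLI exists
```

The `|| true` and the `command -v` check together guarantee the hook can never propagate a failure back into the agent's tool call.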
Layer 2: Safety guard (PreToolUse hook)
Before any bash command executes, our pre_tool_use.sh hook runs two checks:
- terraphim-agent guard --json blocks destructive commands (rm -rf, git push --force, etc.)
- terraphim-agent replace --role "Terraphim Engineer" performs knowledge graph text replacement (npm -> bun, pip -> uv, etc.)
The guard blocks. The replacement corrects. Neither depends on the LLM remembering instructions.
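Conceptually, the guard is a pattern match that runs before execution. A minimal stand-in (the real check is terraphim-agent guard --json; this case table is illustrative):

```shell
#!/bin/bash
# Minimal stand-in for the pre-execution guard: block known-destructive
# commands before they run. The patterns here are illustrative examples.
is_destructive() {
  case "$1" in
    *"rm -rf"*|*"git push --force"*) return 0 ;;
    *) return 1 ;;
  esac
}
if is_destructive "git push --force origin main"; then
  echo "BLOCKED: destructive command"
fi
```

The point of the mechanism is that blocking happens in the hook, deterministically, regardless of what the LLM intended.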
Layer 3: Query and correct
Humans and agents query the learning database:
# List recent learnings
terraphim-agent learn list
# Search by pattern (with synonym expansion via thesaurus)
terraphim-agent learn query "docker"
# Add a correction to a captured learning
terraphim-agent learn correct 3 --correction "Use 'docker compose' (v2 plugin)"
The thesaurus has 20 semantic categories and 160+ synonym mappings. Search for "error" and you find "failure", "bug", "issue". Search for "setup" and you find "configuration", "install", "init". This is not keyword matching. It is structured retrieval.
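As a rough sketch of what synonym expansion does to a query (this lookup table is illustrative; the real mappings come from the knowledge-graph thesaurus):

```shell
#!/bin/bash
# Illustrative synonym expansion: turn one query term into an alternation
# a regex search can use. The real data lives in the terraphim thesaurus.
expand_query() {
  case "$1" in
    error) echo "error|failure|bug|issue" ;;
    setup) echo "setup|configuration|install|init" ;;
    *)     echo "$1" ;;
  esac
}
expand_query "error"   # prints "error|failure|bug|issue"
```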
What was missing
The capture-query-correct loop is a journal. It records mistakes and lets you look them up. What it does not do:
- Enforce rules mechanically. A rule saying "never use pip" exists only in CLAUDE.md text that the LLM might or might not follow.
- Verify compliance at session start. No sweep checks whether graduated rules are actually being obeyed.
- Track improvement quantitatively. No session scorecards. No trend data. No way to prove the system is getting better.
- Provide an audit trail for rule changes. Rules appear and disappear without record.
These gaps are exactly what the Meta Alchemist article addressed.
What Meta Alchemist proposed
The full guide describes a four-layer system:
- Cognitive core (CLAUDE.md): A decision framework Claude runs before writing code, plus completion criteria that must pass before any task is done.
- Specialised agents: An architect (plans, read-only) and a reviewer (validates, read-only) that spawn as subagents.
- Path-scoped rules: Security rules that only load when editing auth code. API design rules that only activate in handler directories. Keeps context lean.
- Evolution engine: A memory system that captures corrections in JSONL files, runs verification sweeps at session start, generates session scorecards, and promotes patterns through a confidence ladder.
The genuinely good ideas:
- Verify lines on every rule. Each learned rule gets a machine-checkable grep pattern, e.g. verify: Grep("\.\.\.options", path="src/api/") -> 0 matches. The sweep runs the grep and reports PASS/FAIL. This is brilliant. It turns instructions into guardrails.
- Session scorecards. Quantitative tracking of corrections received, rules checked, rules passed, violations found. Trend detection over time. If corrections are flat or increasing, the rules are not working.
- Promotion ladder. Corrected once = logged. Corrected twice = auto-promoted to permanent rule. In learned-rules for 10+ sessions = candidate for graduation to CLAUDE.md.
- Capacity management. Max 50 lines in learned-rules.md forces graduation or pruning. Prevents unbounded growth.
The key quote: "A rule without a verification check is a wish. A rule with a verification check is a guardrail. Only guardrails survive."
We agree with the principle. We disagree with the implementation.
Where our approaches diverge
The Meta Alchemist article builds everything from scratch using JSONL files parsed by the LLM:
- corrections.jsonl -- user corrections as JSON objects
- observations.jsonl -- verified discoveries
- violations.jsonl -- rule violations caught by the sweep
- sessions.jsonl -- session scorecards
This is a reasonable approach if you have no existing infrastructure. We do. terraphim-agent already provides structured file storage with frontmatter, synonym-expanded querying, project/global scoping, and correction chaining. Adding parallel JSONL files would create a split-brain problem: two sources of truth for the same data.
The article also proposes auto-promotion: when the same correction appears twice, it automatically becomes a permanent rule. This is risky. A correction might be context-dependent (correct for one project, wrong for another). It might be a preference rather than a constraint. Auto-promotion without a quality gate means the system accumulates rules without human judgement about which ones deserve to be permanent.
Our approach: capture in terraphim-agent, verify with shell scripts, promote with CTO approval.
The verification layer we added
Three new components, all configuration. No Rust code changes.
learned-rules.md: graduated rules with verify patterns
The file lives at .claude/memory/learned-rules.md. Each rule has three parts: the constraint text, a machine-checkable verify pattern, and a source annotation.
# Learned Rules
Rules graduated from terraphim-agent corrections and CLAUDE.md conventions.
Each rule has a `verify:` pattern checked by the /boot verification sweep.
---
- Never use pip, pip3, or pipx; always use uv instead.
verify: Grep("pip install|pip3 install|pipx install", path="automation/") -> 0 matches
[source: CLAUDE.md convention, terraphim-agent learning #4, 2026-03-30]
- Never use npm, yarn, or pnpm; always use bun instead.
verify: Grep("npm install|yarn add|pnpm add", path="automation/") -> 0 matches
[source: CLAUDE.md convention, terraphim KG hook replacement, 2026-03-30]
- Never use double dashes in document titles or markdown headings.
verify: Grep("^#.*--", path="knowledge/") -> 0 matches
[source: corrected 2x, terraphim-agent learning #5, 2026-03-30]
- Never hardcode API keys as default values in bash scripts.
verify: Grep("API_KEY=.[a-zA-Z0-9]", path="automation/") -> 0 matches
[source: MEMORY.md security lesson, 2026-03-30]
The format is deliberately simple. No JSON. No YAML frontmatter. Just markdown that a human can read and a shell script can parse. The verify line follows a consistent pattern:
verify: Grep("regex_pattern", path="scope/") -> N matches
Where -> 0 matches means the pattern should NOT appear (absence check) and -> 1+ matches means the pattern MUST appear (presence check).
Rules without a verify line are flagged as technical debt during evolution review.
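A debt check of that kind can be sketched in a few lines of awk. The here-doc below is sample input; a real check would read .claude/memory/learned-rules.md instead:

```shell
#!/bin/bash
# Flag rules that have no verify: line (technical debt check).
# Reads rules on stdin; prints "NO VERIFY:" for each unverified rule.
flag_unverified() {
  awk '
    /^- /     { if (rule != "") print "NO VERIFY: " rule; rule = $0 }
    /verify:/ { rule = "" }
    END       { if (rule != "") print "NO VERIFY: " rule }
  '
}
flag_unverified <<'EOF'
- Never use pip; always use uv instead.
verify: Grep("pip install", path="automation/") -> 0 matches
- Always run date before date-sensitive operations.
EOF
```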
verify-sweep.sh: the verification engine
The core script parses learned-rules.md, extracts each verify line, runs the check, and reports PASS/FAIL. It uses rg (ripgrep) when available for native output limiting (no SIGPIPE issues from pipe chains).
#!/bin/bash
# Verification Sweep: parse learned-rules.md, run verify: checks, report PASS/FAIL
# Always exits 0 (advisory tool, never blocks).
set -uo pipefail
RULES_FILE="${1:-.claude/memory/learned-rules.md}"
PROJECT_ROOT="$(git rev-parse --show-toplevel 2>/dev/null || pwd)"
RG="$(command -v rg 2>/dev/null || echo '')"
TOTAL=0; PASSED=0; FAILED=0; MANUAL=0
current_rule=""
while IFS= read -r line; do
# Capture rule text (lines starting with "- ")
if echo "$line" | grep -qE '^[[:space:]]*- .+'; then
current_rule=$(echo "$line" | sed 's/^[[:space:]]*- //')
fi
# Process verify: lines
if echo "$line" | grep -qE '^[[:space:]]*verify:'; then
TOTAL=$((TOTAL + 1))
# Skip manual checks
if echo "$line" | grep -qi 'manual'; then
MANUAL=$((MANUAL + 1))
echo "SKIP: $current_rule (manual check)"
continue
fi
# Extract pattern, path, and expected count
pattern=$(echo "$line" | sed -n 's/.*Grep("\([^"]*\)".*/\1/p')
path_scope=$(echo "$line" | sed -n 's/.*path="\([^"]*\)".*/\1/p')
expected=$(echo "$line" | sed -n 's/.*-> \([0-9]*\).*/\1/p')
[ -z "$pattern" ] && { MANUAL=$((MANUAL + 1)); echo "SKIP: $current_rule (unparseable)"; continue; }
search_path="${path_scope:-.}"
[ ! -d "$search_path" ] && [ -d "$PROJECT_ROOT/$search_path" ] && search_path="$PROJECT_ROOT/$search_path"
# Count matches
if [ -n "$RG" ]; then
match_count=$("$RG" -c "$pattern" "$search_path" 2>/dev/null \
| awk -F: '{s+=$NF} END {print s+0}') || true
else
match_count=$(grep -rEc "$pattern" "$search_path" 2>/dev/null \
| awk -F: '{s+=$NF} END {print s+0}') || true
fi
# Compare against expectation
if [ "$expected" = "0" ]; then
if [ "$match_count" -eq 0 ]; then
PASSED=$((PASSED + 1)); echo "PASS: $current_rule"
else
FAILED=$((FAILED + 1))
echo "FAIL: $current_rule (found $match_count matches, expected 0)"
if [ -n "$RG" ]; then
"$RG" -n --max-count 1 "$pattern" "$search_path" 2>/dev/null \
| head -3 | sed 's/^/ >> /' || true
else
grep -rEn "$pattern" "$search_path" 2>/dev/null \
| head -3 | sed 's/^/ >> /' || true
fi
fi
else
if [ "$match_count" -gt 0 ]; then
PASSED=$((PASSED + 1)); echo "PASS: $current_rule"
else
FAILED=$((FAILED + 1))
echo "FAIL: $current_rule (found 0 matches, expected 1+)"
fi
fi
fi
done < "$RULES_FILE"
echo ""; echo "--- Verification Summary ---"
echo "Rules checked: $TOTAL | Passed: $PASSED | Failed: $FAILED | Skipped: $MANUAL"
Real output from our production environment:
PASS: Never use pip, pip3, or pipx; always use uv instead.
PASS: Never use npm, yarn, or pnpm; always use bun instead.
FAIL: Never use double dashes in document titles or markdown headings. (found 252 matches, expected 0)
>> knowledge/mem-layer-graph-memory.md:1:# Mem-Layer -- Graph-Based AI Memory System
>> knowledge/claude-1m-context-ga.md:7:# Claude 1M Context Window -- General Availability
>> knowledge/ukri-funding-opportunities-2026.md:7:# UKRI Funding Opportunities -- 2026 Active/Recent
FAIL: Use British English spelling in all generated content. (found 316 matches, expected 0)
>> knowledge/topics/context-engineering.md:41:- locality-of-behavior-dev-community.md
>> knowledge/topics/conway-vs-strongdm-identity.md:49:- Govern AI agent behavior at runtime
SKIP: Always run date before date-sensitive operations. (manual check)
PASS: Never hardcode API keys as default values in bash scripts.
PASS: Never use git commit --amend in pre-push hooks.
--- Verification Summary ---
Rules checked: 7 | Passed: 4 | Failed: 2 | Skipped: 1
The two failures are from imported external articles that use American English spelling and double dashes. These are expected: imported content is not generated content. This kind of nuance is exactly why we do not auto-fix violations. The sweep surfaces them; the human decides what to do.
/boot skill: session-start verification
The /boot skill wraps the verification sweep into a session-start ritual:
- Run date to establish the actual current date (never trust stale context)
- Read learned-rules.md to load all graduated rules
- Execute verify-sweep.sh to check compliance
- Run terraphim-agent learn list to surface recent learnings
- Report a one-line summary:
Boot complete: 7 rules checked, 4 passed, 2 failed. 10 recent learnings loaded.
The Meta Alchemist article proposes a SessionStart hook to trigger this automatically. The Claude Code version we run has no SessionStart hook type, so their setup assumes a feature ours lacks. We invoke /boot manually at the start of each session. A manual invocation that runs reliably is better than an automatic hook that does not fire.
The evolution engine
Verification tells you what is wrong. Evolution fixes it over time.
/evolve skill: weekly review with approval gate
The /evolve skill is the mechanism by which the system improves. It runs weekly (or on demand) and does the following:
- Gathers corrections from terraphim-agent learn list (recent captures and corrections)
- Reads current rules from learned-rules.md
- Reads the evolution log from evolution-log.md (to avoid re-proposing rejected rules)
- Groups failures by pattern and identifies repeat corrections
- Proposes changes using a structured format:
PROPOSE: PROMOTE
Rule: Never use timeout command on macOS (does not exist)
Source: terraphim-agent learning #8, #12
Evidence: Corrected twice across different sessions
Verify: Grep("timeout ", path="automation/") -> 0 matches
Destination: learned-rules.md
Risk: Low. The command genuinely does not exist on macOS.
- Waits for CTO approval. No changes are applied until each proposal is individually approved, rejected, or modified.
- Logs everything to evolution-log.md: approved changes, rejected proposals, and the reasoning.
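The never-re-propose guarantee is easy to make mechanical. A sketch (the log path and the "REJECTED:" line format are assumptions about our log layout, not the actual skill code):

```shell
#!/bin/bash
# Before proposing a rule, consult the evolution log for a prior rejection.
# The default log path and "REJECTED:" prefix are illustrative assumptions.
should_propose() {
  local rule="$1" log="${2:-.claude/memory/evolution-log.md}"
  if grep -qF "REJECTED: $rule" "$log" 2>/dev/null; then
    echo "skip: previously rejected"
    return 1
  fi
  echo "PROPOSE: $rule"
}
should_propose "Never use timeout command on macOS"
```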
This is where we differ most from the Meta Alchemist approach. Their system auto-promotes on the second correction. Ours proposes and waits. The CTO reviews and approves.
Why? Because a correction is context. "Don't use pip" is correct for our projects. It is not correct for a project that deliberately uses pip. Auto-promotion assumes all corrections are universally true. They are not.
Our principle: A rule without CTO approval is an assumption.
Promotion ladder
| Signal | Destination |
|---|---|
| Failed command captured | terraphim-agent learn database |
| Human adds correction | Same learning, enriched |
| Same correction appears twice | Flagged for /evolve review |
| Approved during /evolve | learned-rules.md with verify: pattern |
| Passing for 5+ sessions | Candidate for graduation to CLAUDE.md |
| Rejected during /evolve | evolution-log.md (never re-proposed) |
The ladder is one-way unless the CTO explicitly overrides. Rejected rules do not come back. Graduated rules do not regress. The evolution log is the audit trail that makes this provable.
Session scorecards
The session-scorecard.sh script generates a quantitative summary:
=== Session Scorecard: 2026-03-30 ===
--- Recent Learnings ---
Total learnings in database: 10
With corrections: 2
--- Verification Sweep ---
Rules checked: 7 | Passed: 4 | Failed: 2 | Skipped: 1
--- Relevant Past Learnings ---
No learnings matching 'cto-executive-system'.
=== End Scorecard ===
Over time, the trend data answers a fundamental question: is the system getting better? If corrections decrease and pass rates increase, the evolution loop is working. If they are flat, the rules are too vague or too disconnected from actual work.
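The trend question reduces to a trivial comparison over saved scorecards. A naive sketch (the pass/total numbers below are made up for illustration; real input would be parsed from stored session-scorecard.sh output):

```shell
#!/bin/bash
# Naive trend detection over per-session "passed/total" counts, oldest first.
trend() {
  local first="${1%%/*}"   # pass count of the oldest session
  shift $(($# - 1))        # jump to the newest session
  local last="${1%%/*}"
  if [ "$last" -gt "$first" ]; then echo "improving"
  elif [ "$last" -lt "$first" ]; then echo "declining"
  else echo "flat"; fi
}
trend 4/7 5/7 6/7   # prints "improving"
```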
What we deliberately did not build
Engineering is as much about what you leave out as what you put in. Here is what the Meta Alchemist article proposes that we skipped, and why.
JSONL files for corrections and observations
The article creates corrections.jsonl, observations.jsonl, violations.jsonl, and sessions.jsonl. Each is an append-only log of JSON objects that Claude parses at session start.
We already have terraphim-agent learn which provides structured file storage, thesaurus-expanded querying, project/global scoping, and correction chaining. Adding JSONL files would create two sources of truth for the same data. The agent CLI is the single source.
Auto-promotion on second correction
The article promotes automatically when the same correction appears twice. We flag for review. The difference matters:
- Auto-promotion: fast, no human bottleneck, but accumulates rules without judgement
- Reviewed promotion: slower, requires CTO time, but every rule is intentional
We chose reviewed promotion because our project spans multiple contexts (CTO executive system, Terraphim AI, client projects). A correction that is right in one context might be wrong in another. The human knows the difference.
SessionStart and Stop hooks
The article configures SessionStart and Stop hooks in Claude Code's settings.json. Neither hook type is available in the Claude Code version we run; the hook types we rely on here are PreToolUse and PostToolUse (some builds add subagent hooks). The article either assumes a newer release or describes a different version.
We replaced the SessionStart hook with a /boot skill. We replaced the Stop hook with a manual session-scorecard.sh invocation. Both work reliably because they use mechanisms that actually exist.
Path-scoped rules
The article loads different rule files based on which file Claude is editing: security rules for auth code, API design rules for handlers, performance rules everywhere. This is a genuine Claude Code feature (.claude/rules/ with paths: frontmatter).
We skipped it because our project is not a single codebase. The CTO executive system contains knowledge articles, automation scripts, domain models, plans, and publishing workflows. Path-scoped rules make sense for a web application with clear directory boundaries. They are premature for a heterogeneous knowledge system.
Hard capacity cap
The article enforces a maximum of 50 lines in learned-rules.md. If you hit the cap, you must graduate or prune before adding more.
This is a useful forcing function for projects that might accumulate hundreds of rules. We started with 7. When we approach a natural limit, /evolve will recommend pruning. We do not need an artificial constraint to force a behaviour that good engineering practice already demands.
Architecture comparison
| Dimension | Meta Alchemist | terraphim-agent + verify layer |
|---|---|---|
| Storage | JSONL files parsed by LLM | Rust CLI with structured file storage |
| Capture trigger | Custom evolution SKILL.md (auto-triggered) | PostToolUse bash hook (fail-open) |
| Query mechanism | LLM reads and interprets JSONL | CLI with thesaurus-expanded search |
| Verification | Grep patterns in learned-rules.md | Same (we adopted this idea) |
| Promotion | Auto on 2nd correction | Manual via /evolve with CTO approval |
| Audit trail | evolution-log.md | Same (we adopted this idea) |
| Session scoring | sessions.jsonl (auto-written) | session-scorecard.sh (manual) |
| Cross-tool support | Claude Code only | Claude Code + OpenCode + any CLI |
| Safety guard | settings.json deny list | terraphim-agent guard (pattern matching) |
| Text replacement | Not included | terraphim-agent replace (KG-based) |
The fundamental difference: Meta Alchemist builds a complete system inside Claude Code's configuration. We build a thin verification layer on top of an existing CLI. The CLI handles storage, querying, and correction chaining. The verification layer handles enforcement and evolution. Each does what it is good at.
Where this fits in the broader landscape
The idea of self-improving AI coding agents is not new. Several approaches exist:
- Devin's knowledge suggestions: captures corrections as project-specific "knowledge" entries that load into future sessions
- OpenClaw's metacognitive loops: three-phase review cycle that captures Phase 2 findings as learnings for future Phase 1s
- Ouroboros pattern: self-modifying agents with constitutional guardrails, event sourcing, and multi-model review chains
- Compound agency learning architecture: six nested learning loops from failure-to-guardrail up to loop-evolution
Our approach sits between Devin (simple capture) and Ouroboros (full self-modification). We capture automatically, verify mechanically, but promote deliberately. The human stays in the loop for rule changes. The machine handles enforcement.
Getting started
If you already have terraphim-agent configured with the PostToolUse hook (see the foundation post), adding the verification layer takes five steps:
1. Create the directory structure
mkdir -p automation/learning .claude/skills/boot .claude/skills/evolve .claude/memory
2. Seed learned-rules.md
Start with 3 to 5 rules from your existing CLAUDE.md or project conventions. Each rule needs a verify pattern. If you cannot write a grep check for a rule, the rule is too vague.
3. Write verify-sweep.sh
Copy the script from above. Make it executable. Test it:
chmod +x automation/learning/verify-sweep.sh
bash automation/learning/verify-sweep.sh
You should see PASS/FAIL for each rule. If a rule fails, either fix the violation or refine the verify pattern.
4. Create the /boot and /evolve skills
These are SKILL.md files in .claude/skills/boot/ and .claude/skills/evolve/. The boot skill runs the sweep and surfaces learnings. The evolve skill reviews corrections and proposes rule changes. Full skill definitions are in our repository.
5. Add to CLAUDE.md
### Learning Evolution System
- Run /boot at session start to verify learned rules and surface past learnings
- Run /evolve weekly to review corrections and propose rule promotions
- Graduated rules with verify: patterns: .claude/memory/learned-rules.md
- Evolution audit trail: .claude/memory/evolution-log.md
Conclusion
The Meta Alchemist article gave us the idea we were missing: machine-checkable verification patterns on every rule. That single concept transforms a learning journal into an immune system. We credit the article for the insight.
What we brought to the table: a Rust CLI that already handles capture, storage, querying, and correction chaining. The verification layer is 100 lines of bash on top of a structured backend, not 500 lines of JSONL parsing instructions for an LLM.
The combination works. Corrections are captured automatically by the PostToolUse hook. Rules are verified mechanically by the sweep script. Promotions are approved deliberately by a human. The system gets better every week, and we can prove it with session scorecards.
Two principles emerged from building this:
From Meta Alchemist: A rule without a verification check is a wish.
From us: A rule without CTO approval is an assumption.
Only verified, approved guardrails survive.
The terraphim-agent learning system is open source at github.com/terraphim/terraphim-ai. The verification layer described in this post is configuration, not code: shell scripts and markdown files on top of the existing CLI.
This is the third post in a series: Part 1: Configuring terraphim-agent for Claude Code and OpenCode | Part 2: Verification Checklist | Part 3: Self-Evolving Rules (this post)