Gábor Mészáros

Posted on May 19

Instruction systems capability ladder: harness leveling

#hermesagentchallenge #devchallenge #agents #claude

Hermes Agent Challenge Submission

A few months ago I drew a maturity ladder for CLAUDE.md files — does the file exist, are constraints explicit, do skills load on demand. Useful for self-locating, and the ladder generalizes past Claude — CLAUDE.md, AGENTS.md, .cursorrules, Copilot instructions all live on the same rungs.

After a lot (lot lot lot) more time with these setups, I now think the ladder rests on a different, broader axis than the one I first drew it on: the channel each rung runs on.

The new ladder

Level	Name	What's added	Channel
L0	System	System prompt only	attention
L1	Primer	One instruction file (`CLAUDE.md`, `AGENTS.md`, `.cursorrules`)	attention
L2	Composite	Multiple files — user defaults, project overrides	attention
L3	Scoped	Path-scoped rules (`.claude/rules/*.md`)	attention
L4	Delegated	Skills — procedures invoked on demand	attention
L5	Abstracted	Sub-agents — child contexts called by the parent	attention (interface)
L6	Governed	Hooks, MCP gates, deny-permissions	enforcement
L7	Adaptive	Self-improving skills written by the agent	self-writing

Two cuts split the ladder: one between L5 and L6 (soft to hard), one between L6 and L7 (read to write). I'll use attention and soft channel interchangeably from here, and the same for enforcement / hard channel.

A quick tour

L0 (System) is the cold start: the model with whatever the vendor injected, nothing else.

L1 (Primer) is your single root file — the entry every model sees first.

L2 (Composite) is the moment you split user-level config from project-level: ~/.claude/CLAUDE.md vs ./CLAUDE.md, or your global Cursor settings vs a project .cursorrules.

L3 (Scoped) introduces path scoping — the rule about Python tests only loads when the agent touches tests/*.py.
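To make the scoping idea concrete, here is a minimal sketch of path-scoped selection. It is not how any particular tool loads its rule files; the `SCOPED_RULES` table, the patterns, and the rule texts are all made up for illustration, and the matching uses Python's `fnmatch` rather than any tool's real glob syntax.

```python
from fnmatch import fnmatch

# Hypothetical rule table: glob pattern -> rule text.
# Real tools define their own scoping syntax; this only shows the selection idea.
SCOPED_RULES = {
    "tests/*.py": "Use pytest fixtures; never hit the network in tests.",
    "migrations/*.sql": "Every migration needs a matching rollback.",
}


def rules_for(touched_paths):
    """Return the rule texts whose pattern matches at least one touched path."""
    return [
        text
        for pattern, text in SCOPED_RULES.items()
        if any(fnmatch(path, pattern) for path in touched_paths)
    ]


print(rules_for(["tests/test_api.py"]))  # only the test rule is selected
print(rules_for(["README.md"]))          # nothing is selected
```

The point is the second call: a task that never touches `tests/` pays no context cost for the test rule.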

L4 (Delegated) is skills, which let you ship procedures the agent can pull on demand instead of dumping every workflow into the root file.

L5 (Abstracted) is sub-agents — child processes with their own context, called by the parent for a focused subtask. The child's reasoning runs in its own context window, separate from the parent's. What flows back is the result, which re-enters the parent as a new source. The parent–child interface is on the soft channel; the child's internal work runs on its own soft channel, not the parent's.
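A toy model of that interface, assuming contexts are just lists of strings. The `delegate` function and the sample messages are hypothetical; no real agent runtime works on plain lists like this. What it shows is the shape of the coupling: the child's intermediate steps stay in the child, and only one result line lands in the parent.

```python
def delegate(parent_context, task, child_work):
    """Run a subtask in a fresh context and hand only the result back."""
    child_context = [task]            # the child starts from the task prompt alone
    child_context.extend(child_work)  # intermediate reasoning stays in the child
    result = child_context[-1]        # stand-in for the child's final answer
    parent_context.append(f"sub-agent result: {result}")
    return len(child_context)


parent = ["system prompt", "CLAUDE.md", "user: why is CI red?"]
child_size = delegate(
    parent,
    "find the failing test",
    ["ran pytest", "read traceback", "test_auth.py::test_expiry fails on timezone"],
)
print(child_size, len(parent))
```

The parent grows by one entry no matter how long the child's own context became.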

L0 through L4 share one context — they all compete for the same finite slot against the user's prompt and the recent diff. L5 spawns a second context but couples to the parent on attention. Together that's the soft channel — attention dynamics, by another name.

L6 (Governed) is where the channel changes. Hooks are not in the model's context window. A PreToolUse hook that blocks git push when pytest exits non-zero doesn't get downweighted by a long task. An MCP server that requires authentication before reading a file doesn't depend on the model remembering your auth rule. Deny-permissions in .claude/settings.json for .env and .pem files don't compete with the rest of the spec. L6 is enforcement: outside the context dynamics, deterministic, not subject to load or context rot.
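For a sense of how small an L6 surface can be, here is a sketch that builds a settings file with deny-permissions and one hook registration. The key names follow my reading of Claude Code's settings format, so check the current docs before copying it; the hook script path `.claude/hooks/block_push.py` is hypothetical.

```python
import json

# Assumed shape of .claude/settings.json; the hook script path is hypothetical.
settings = {
    "permissions": {
        # Deny rules are checked by the runtime, not read by the model.
        "deny": ["Read(./.env)", "Read(./**/*.pem)"],
    },
    "hooks": {
        "PreToolUse": [
            {
                "matcher": "Bash",  # run the hook before every Bash tool call
                "hooks": [
                    {
                        "type": "command",
                        "command": "python3 .claude/hooks/block_push.py",
                    }
                ],
            }
        ]
    },
}

print(json.dumps(settings, indent=2))
```

None of this text is ever loaded into the context window, which is the whole point of the rung.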

L7 (Adaptive) is different again. The agent writes its own instructions — not because the user said "remember this," but because the agent finished a task and decided some part of the trajectory was worth saving for next time. At read time the artifact lands in the attention channel like anything else. What's different is the writer: the model wrote the file, the trigger was task completion, and the user never saw the prompt that produced it. L7 is self-writing.

That's the ladder.

Figure: two cuts under the ladder. The attention channel covers L0–L5, enforcement is L6 alone, and self-writing is L7 alone. Cut 1, at L5/L6, is marked soft to hard; cut 2, at L6/L7, is marked read to write.

The first cut: L5 / L6

The load-bearing observation in this reframe is the cut between L5 and L6.

L0 through L5 all run on the soft channel — either directly on the parent's field (L0–L4) or on a child's that couples back to the parent through prompts and results (L5). They compete. They decay with load. The model can downweight any of them, lose track of any of them, prioritize the user's prompt over any of them. You can tell a sub-agent "always check tests before reporting done" and it'll do it 80% of the time, or 95%, or 60% — you don't know without measurement. The same instruction in a CLAUDE.md and the same instruction passed to a sub-agent are running on the same physics, just on different fields.
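Measurement here can be as simple as counting compliant runs and putting an interval around the rate. The sketch below uses the standard Wilson score interval; the 19-out-of-20 figure is invented for illustration, not data from any real run.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a compliance rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical: the sub-agent checked tests in 19 of 20 observed runs.
low, high = wilson_interval(19, 20)
print(f"{low:.2f} to {high:.2f}")
```

With 20 runs, an observed 95% is compatible with a true rate anywhere from roughly 0.76 to 0.99, which is why a handful of good sessions tells you little about a soft-channel rule.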

L6 is outside that physics entirely.

Generic example. Suppose your CLAUDE.md says "never push without running tests." That's L1. The model reads it, integrates it into context, weights it against everything else loaded — your other rules, the recent diff, the user's prompt. If you have four thousand tokens of instructions and the model is mid-task, that line is competing with everything else for attention. Sometimes it follows. Sometimes it doesn't.

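Now suppose you have a PreToolUse hook on Bash that runs pytest before any git push and returns a blocking exit code when the tests fail (in Claude Code that is exit code 2; other non-zero codes are reported but don't block). That's L6. The model can decide to push or not push. With failing tests it doesn't matter: the hook stops the command before the model's intent reaches the network.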
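A minimal sketch of such a hook, under a few assumptions: the runtime passes the pending tool call as JSON on stdin with `tool_name` and `tool_input.command` fields, and exit code 2 is the blocking status, which is my understanding of Claude Code's hook contract. The `is_push` check is deliberately crude, and every function name here is mine.

```python
import json
import subprocess
import sys


def is_push(command):
    """Rough check for a git push inside a shell command string."""
    return "git push" in command


def decide(payload, tests_pass):
    """Return (exit_code, message) for one PreToolUse payload.

    `tests_pass` is a zero-argument callable, so the decision logic can be
    exercised without actually running pytest.
    """
    if payload.get("tool_name") != "Bash":
        return 0, ""
    command = payload.get("tool_input", {}).get("command", "")
    if not is_push(command):
        return 0, ""
    if tests_pass():
        return 0, ""
    return 2, "Blocked: pytest failed, so git push is not allowed."


def pytest_passes():
    """Run pytest quietly; True when it exits 0."""
    return subprocess.run(["pytest", "-q"], capture_output=True).returncode == 0


def main():
    payload = json.load(sys.stdin)  # assumed: the tool call arrives as JSON on stdin
    code, message = decide(payload, pytest_passes)
    if message:
        print(message, file=sys.stderr)
    return code

# When saved as a hook script, finish with: raise SystemExit(main())
```

Nothing in `decide` asks what the model intended. It looks at the command and the test result, and that is all it needs.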

Same constraint, two channels, two failure modes. Soft channel fails probabilistically. Hard channel fails deterministically. They take different fixes — soft constraints want better content and ordering (the Pink Elephant piece is about that fight), hard constraints want a better hook script, a tighter PreToolUse matcher, or a stricter permission rule.

Calling these the same thing because they're both "in your .claude/ directory" hides the architectural difference.

The second cut: L7 writes itself

L7 (Adaptive) isn't exactly a third channel: at read time, what L7 wrote lands in the same context as everything else. The cut is at write time. The agent writes its own instructions.

Most "memory" features in shipping agents aren't L7 by this cut. Claude Code's saved memory writes when the user signals remember this or accepts a prompt to save. Cursor's notepads, Copilot's pinned context, Gemini's saved facts — same pattern. The agent keeps the artifact, but the user authored it. That's persistent context, not self-writing. Call it L6.5 if you want a name for it.

The clearest L7 in print today is Hermes Agent, released by Nous Research. The mechanism is documented: when the agent identifies a saveable trajectory — after a successful task with five or more tool calls, after recovering from errors and finding a working path, after the user corrects its approach, or after discovering a non-trivial workflow — it invokes its skill_manage tool to extract a SKILL.md (markdown with YAML frontmatter) into ~/.hermes/skills/. Future sessions load the skill automatically and it becomes available as a slash command. The user didn't ask for it. The agent decided the trajectory was worth saving.
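The four triggers read naturally as a predicate over a finished trajectory. This is my paraphrase of the documented conditions, not Hermes's code: the `Trajectory` fields and the trigger labels are invented, and the real agent presumably judges these conditions from the transcript rather than from tidy booleans.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    # Hypothetical summary of one finished task.
    succeeded: bool
    tool_calls: int
    recovered_from_error: bool = False
    user_corrected: bool = False
    novel_workflow: bool = False


def save_triggers(t):
    """Return the names of the triggers this trajectory satisfies."""
    triggers = []
    if t.succeeded and t.tool_calls >= 5:
        triggers.append("complex-success")
    if t.recovered_from_error:
        triggers.append("error-recovery")
    if t.user_corrected:
        triggers.append("user-correction")
    if t.novel_workflow:
        triggers.append("novel-workflow")
    return triggers


print(save_triggers(Trajectory(succeeded=True, tool_calls=7, recovered_from_error=True)))
print(save_triggers(Trajectory(succeeded=True, tool_calls=2)))
```

An empty list means nothing gets written. A non-empty one is the moment the agent, not the user, decides to author a skill.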

Three of the four triggers are what make this clearly L7 and not L6.5. Error recovery, user correction, novel-workflow discovery — these are cases where only the agent knows the saveable moment happened. A user-driven memory feature can capture "this task was useful enough to want it remembered" by asking the user after the fact. It can't capture "I tried three approaches and the third worked" unless the agent volunteers it. The artifact format matters too: an auto-extracted SKILL.md lands in ~/.hermes/skills/ in the same format human-written skills use. Next session, the agent loads it and can't tell who the author was. That symmetry is what makes the loop close — every successful trajectory can shape the next one.

Concretely, here's what an auto-extracted skill might look like — illustrative, in the shape Hermes's documented SKILL.md schema specifies, fitting the second trigger (the agent worked through a pytest debugging session, found the working path, and saved the lesson):

---
name: debug-pytest-import-errors
description: When pytest reports ModuleNotFoundError despite a successful editable install, check src-layout configuration before chasing PYTHONPATH.
version: 1.0.0
platforms: [macos, linux]
metadata:
  hermes:
    tags: [python, testing]
    category: dev-workflow
---
# Debug pytest ImportError on src-layout projects

## When to Use
pytest fails with `ModuleNotFoundError` after a fresh clone, even though `pip install -e .` ran and the import works in a Python REPL.

## Procedure
1. Check `pyproject.toml` for `where = ["src"]` under `[tool.setuptools.packages.find]`.
2. Confirm `pythonpath = ["src"]` is set in `[tool.pytest.ini_options]`.
3. Re-run `pip install -e .`; confirm `.egg-info` lands at the package root, not inside `src/`.

## Pitfalls
- `PYTHONPATH=src` as an env var works locally but doesn't survive CI.

## Verification
`uv run pytest` runs without `ModuleNotFoundError`.

The frontmatter is functional — tags and category route the skill in Hermes's index; platforms gates it by OS. The body's When to Use / Procedure / Pitfalls / Verification is the schema's recommended shape. Notice what the agent saved: not the original failing command, not the dead-ends, just the working path plus the trap that would have lured a next session into chasing PYTHONPATH. That's curation, not transcription.

This is why L7 is safer to leave unsupervised in Hermes than almost anywhere else. The SKILL.md schema enforces the moves a well-coupled instruction needs: imperative voice, directive ordering, named constructs, the warning placed after the working path rather than before it. A free-form memory feature has no such structural prior; the agent writes whatever feels worth saving, and the writes degrade as the agent's writing discipline does.

Schema is the cheap version of supervision.

The new failure mode is the self-writing layer running unsupervised. An auto-extracted skill that overfits to one project. A trajectory summary written under a stale assumption that surfaces six weeks later as a phantom instruction. There's no rule file the user authored to grep for the source — the rule is in a markdown file the agent wrote and the user never read, sitting in ~/.hermes/skills/ or its equivalent.
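One cheap countermeasure is to make the agent-written files visible. The sketch below lists every SKILL.md under a directory, newest first, with the name from its frontmatter. It assumes only that skills are files named SKILL.md with a `name:` line in the frontmatter; the exact layout under ~/.hermes/skills/ is not something I am asserting, and `list_skills` is my own helper, not part of any tool.

```python
import time
from pathlib import Path


def list_skills(root):
    """Return (age_days, name, path) for each SKILL.md under root, newest first."""
    root = Path(root).expanduser()
    if not root.is_dir():
        return []
    rows = []
    for path in root.rglob("SKILL.md"):
        name = path.parent.name  # fall back to the folder name
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.startswith("name:"):
                name = line.split(":", 1)[1].strip()
                break
        age_days = (time.time() - path.stat().st_mtime) / 86400
        rows.append((age_days, name, str(path)))
    return sorted(rows)


for age, name, path in list_skills("~/.hermes/skills"):
    print(f"{age:6.1f} days  {name}  {path}")
```

Run on a schedule, a listing like this turns "a file the user never read" into a short review queue.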

L7 doesn't replace L0–L6. It runs alongside, with its own writes and its own decay. Most agent setups don't have it because most agents don't expose it. The ones that ship a memory feature mostly do L6.5 and call it L7.

When to climb

The dominant pattern I see in real repos is L1 with a thin L6: a CLAUDE.md, maybe a few rule files at L3, deny-permissions for .env. L4 (skills) is rare; most authors haven't built any. L5 (sub-agents) is rarer; most use cases haven't surfaced a need for one. L7 is mostly absent: most agents don't expose a self-writing surface, and the few setups that do run one leave it on defaults nobody has reviewed.

Across 28,721 public repositories with AI configs, 89.9% don't name specific constructs in their instructions — no backticks, no file paths, no function names. That's most of the soft channel running at low coupling: easily downweighted, easily lost. The hard channel is thinner. The adaptive channel is mostly absent.
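To check your own file against that bar, a rough heuristic is enough. The patterns below are my guess at what "names a specific construct" could mean (backticked spans, file paths, `name()` style function references); they are not the methodology behind the numbers above, and they will produce false positives on phrases like "and/or".

```python
import re

# Heuristic patterns, chosen for illustration only.
BACKTICK = re.compile(r"`[^`\n]+`")
FILE_PATH = re.compile(
    r"(?:[\w.-]+/)+[\w.-]+|\b[\w-]+\.(?:py|ts|js|md|json|toml|yaml|yml)\b"
)
FUNCTION_REF = re.compile(r"\b[A-Za-z_]\w*\(\)")


def names_a_construct(instruction):
    """True when the instruction mentions something concrete by name."""
    return any(
        pattern.search(instruction)
        for pattern in (BACKTICK, FILE_PATH, FUNCTION_REF)
    )


print(names_a_construct("Write clean, well-tested code."))
print(names_a_construct("Run `pytest -q` before pushing."))
print(names_a_construct("Keep handlers in src/api/handlers.py"))
```

The first instruction names nothing and the other two do. Lines that come back `False` are the ones most likely to be running at low coupling.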

Large spec, small contract, no adaptive layer. That's the asymmetry — but it's not always a bug. Each rung exists because the rung below it fails in a specific way. The trigger is the failure, not a feature wishlist.

From	To	Symptom that triggers the climb
L0	L1	Re-explaining the same project context every session
L1	L2	One file got long enough that important rules get ignored
L2	L3	Path-irrelevant rules pollute every task
L3	L4	The same procedure gets described inline across multiple rules
L4	L5	A procedure pollutes the parent's context with reasoning chains the parent doesn't need
L5	L6	A constraint must hold 100% of the time, not 95%
L6	L7	You keep correcting the same preference across sessions
L6	L7	You keep watching the agent re-derive the same workaround

The mistake is climbing without the symptom. A repo with three rules in one file doesn't need L3. A solo developer's CLAUDE.md doesn't need a sub-agent. Premature climbs cost context budget for no return; you've added structure the model has to navigate without solving a problem you actually had.

The opposite mistake is more common: under-building the higher rungs because the symptoms feel like model failures rather than rung failures. "The agent didn't run tests before pushing" reads like a prompt-engineering problem; it's a missing L6. "The agent forgot we use Cloudflare Workers" reads like context drift; it's a missing L7. "The agent keeps describing the deploy process every time I ask" reads like verbosity; it's a missing L4.

Climb when the rung below stops working.

Three questions for your repo

Not a recipe. A diagnostic. For any rule in your setup, ask:

Does this fail loudly when violated, or silently? Loudly is L6. Silently is L0–L5.
Does the model see this, or does the runtime enforce it? Sees is the soft channel. Runtime is the hard channel.
Does it get worse when you add unrelated rules to the same file? Yes is L0–L5. No is L6. "Sometimes" is probably L7.

Most rules answer silently / sees / yes. That tells you which channel you're in. The interesting question is whether anything in your setup is on the other channels at all.
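The three questions can be written down as a small lookup, if that helps when auditing a long rule list. This is only the diagnostic above restated as code; the function name, argument names, and return labels are mine.

```python
def channel_for(fails_loudly, runtime_enforced, degrades_with_unrelated_rules):
    """Map the three answers to a channel.

    degrades_with_unrelated_rules is "yes", "no", or "sometimes".
    """
    if degrades_with_unrelated_rules == "sometimes":
        return "probably self-writing (L7)"
    if fails_loudly and runtime_enforced and degrades_with_unrelated_rules == "no":
        return "enforcement (L6)"
    if (
        not fails_loudly
        and not runtime_enforced
        and degrades_with_unrelated_rules == "yes"
    ):
        return "attention (L0-L5)"
    return "mixed signals: check how the rule is actually wired"


print(channel_for(False, False, "yes"))  # the common case
print(channel_for(True, True, "no"))
```

The last branch matters in practice: a rule you believe is enforced but that fails silently is usually a hook that was never registered.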

A note on related taxonomies

There are other progressive ladders for AI agent setups in print. Vellum's L0–L5 is an autonomy axis — how much the agent decides on its own. Blake Crosley's 4-tier is a concurrent-decomposition axis — how many agents run in parallel. Anthropic's 5-layer ADK frame for Claude Code is a content-boundary axis — what kind of content goes where. Zylon's 5-architectural and GitHub's 3-tier carve different cuts again, mostly around how the agent is wired into a product surface.

The ladder above is on a different axis from any of those. It sorts by the channel each mechanism runs on — soft attention, hard enforcement, self-writing memory — and progresses through the named constructs an agent exposes (CLAUDE.md, scoped rule files, skills, sub-agents, hooks, auto-memory). The two cuts (L5/L6 and L6/L7) are the load-bearing claim; the autonomy and concurrency taxonomies don't draw those cuts because they're sorting on different things.

Different axis, different cuts, different diagnostic. Use whichever maps onto the question you're actually asking.