DEV Community: Gábor Mészáros

Prompt Engineering, Context Engineering, Loop Engineering: What Actually Changed

Gábor Mészáros — Wed, 08 Jul 2026 11:16:04 +0000

A few years back the skill had one name: prompt engineering. You rewrote a sentence until the model did the thing. Last year the same corner of the job picked up a new name, context engineering. This year the threads call it loop engineering, and the line repeated under it is the verifier is the bottleneck, not the model anymore.

Every rename draws the same reaction from the same developers: is this a real shift, or the same work with fresh paint so someone can sell a course. That reflex is usually right. Most AI-tooling vocabulary turns over faster than the problems it describes.

So here is the read, one rename at a time. At each step, the thing you were actually engineering changed. What changed was the unit of work, not the marketing around it. The problems the newest name points at have a dated, public record. The dates are on the articles; the arithmetic is yours to do.

The unit was a prompt

Call it 2022 through 2024. The unit of work is a single prompt. You engineer the wording of one request: few-shot examples, a role frame, think step by step, the ordering of the ask. The entire surface you controlled was the text of one message, and the craft was getting that message right.

That discipline is not gone. It is no longer the whole job, because the whole job stopped fitting in one message.

The unit became the surface

Around mid-2025 Andrej Karpathy put a name on what people were already doing: context engineering. The unit stopped being a message and became everything in the window: the system prompt, the retrieved documents, the tool definitions, and the CLAUDE.md / AGENTS.md instruction files that ride along on every single turn. You are no longer tuning a sentence. You are curating a surface.

That surface has a property developers kept rediscovering the expensive way: most of what you put on it does not couple to behavior. The State of AI Instruction Quality pointed a deterministic analyzer at 28,721 repositories and found the median instruction file carries 50 content items and 12 actual directives, and the rest is headings, context, and structure the model is free to ignore. A sharper failure mode got its own writeup. Do NOT Think of a Pink Elephant shows how a constraint phrased as a negation (do not use mocks) can raise the odds of the exact thing it forbids, for the same reason the title just did to you.

Both of those are context-engineering problems, measured and published.

The unit is the loop

June 2026, the term is loop engineering, popularized by Addy Osmani synthesizing Boris Cherny and Peter Steinberger, and everywhere on the timeline within a few weeks. The unit is now the running loop: generate, check, steer, retry, stop. The prompt is one node in it. The context surface is the state it carries between iterations. The claim repeated across every explainer is that the model is no longer the limiting part. The check that decides good enough, stop is.

Take the name generously. It points at something the two earlier names left implicit: the mechanism that decides whether an iteration was any good and whether to run another. That mechanism was always present; every agent that retries has one. Loop engineering's contribution is making it the object you engineer instead of a default you inherit.

Prompt engineering tuned node one. Context engineering curated the state feeding it. Loop engineering points at the diamond, and that is where the interesting question hides.

A verifier that checks what, exactly?

The verifier is the bottleneck is a good slogan and an incomplete one. It names the bottleneck without saying what the verifier is supposed to check, and that gap is the engineering problem the slogan skips.

There are two kinds of check, and they are not interchangeable. A deterministic check runs the code, asserts the exit status, scans for the forbidden import, and returns the same verdict for the same input every time, with no judgment in the middle. A model-graded check asks another model is this good? The second reaches criteria the first cannot express, like is this explanation clear or does this read as rude, and it pays for that reach by inheriting the exact unreliability the loop was built to contain. You have put a probabilistic judge in charge of deciding when the probabilistic generator is finished.

Watch where they diverge on a real loop. An agent is told to refactor a module and stop when the work is done. A deterministic verifier can prove the test suite still passes and no banned import crept in. Those are facts, checkable on every iteration, with no argument about them. It cannot decide whether the refactor was worth doing, so a model-graded check gets bolted on to answer that, and now the stop condition rests on one model grading another's taste. The loop terminates when the judge is satisfied, and the judge is the same class of component the loop exists to supervise. Neither check is wrong. They answer different questions, and confusing which question you asked is how a loop stops confidently on the wrong iteration.

That trade-off, what the check actually checks, is the content of the verifier, and it is not a new problem. Green Tests Don't Mean Better Software works the CI version of it: a green test proves the code conforms to its spec and says nothing about whether the change improved the system. The check answered a question you never asked. Aim the same distinction at an agent loop instead of a test suite and you have the loop-engineering question stated for a case every developer already trusts.

What the earlier articles measured

Three articles published earlier in 2026 each worked one piece of this problem.

CLAUDE.md: Check, Score, Improve & Repeat, 27 January 2026, describes a loop run over an instruction file: score it, change it, score it again. Do NOT Think of a Pink Elephant, 31 March 2026, shows a single steering instruction that produces the opposite of its intent, and names the mechanism behind it. The State of AI Instruction Quality, 21 April 2026, analyzed 28,721 repositories and found that most of a typical instruction file is content the model can ignore rather than directives it has to follow.

Between them they cover the three problems loop engineering now names: what the verifier checks, steering that drifts, and instructions that do not couple to behavior. None of the three used the term. The dates are on the links, and they are the whole of the evidence.

What the new name solves

Loop engineering gives you two things worth having. It makes the check an explicit object of engineering, something you design and revise on purpose instead of a default you inherit from whatever your agent framework happened to ship. And it gives the verifier question a name, so teams can argue about it directly: what should this check check, and can it be deterministic.

Both are important, and both have an edge worth stating plainly. The name hands you the object. Building the verifier is still yours to do, and it is most of the work: deciding what correct means for a task, then turning that into a check you can run.

Part of that work is choosing the kind of check. A criterion like is this explanation clear needs a model to judge it. A criterion like no banned import is a deterministic scan. The cost of picking wrong runs both ways: a model-graded check where a deterministic one would do buys unreliability for nothing, and a deterministic check standing in for a judgment it cannot make gives you a green light that means less than it looks like. The name makes that choice visible and leaves it in your hands.

Underneath the frame, the generator still guesses. What loop engineering adds is a place to put the check and a reason to take it seriously, which is more than the earlier names offered the same problem.

The loop, one part at a time

The vocabulary will change again. Harness engineering was an intermediate name for part of this, and something will come after loop engineering too. Each time the word changes, the useful question is the same: what is the new unit of work, and what is the hard part inside it?

For loop engineering the unit is the loop, and the hard part is the verifier: what it checks. The verifier is one component of several, and a name that points at the whole loop is easy to nod along to and harder to act on. So the plan from here is to take the loop apart a component at a time and put a measurement under each one. The questions this piece raised are the ones the next few pieces go after:

A check that fires on the wrong thing comes back quiet, and so does a signal that was never there. They look identical from the outside. How do you tell a broken instrument from an absent one, before you conclude that the problem you were checking for does not exist?
Not every rule in the loop is a verifier, and the rules that steer are not the rules that enforce. A prompt asks; a gate refuses. What belongs in the channel that can only ask, what belongs in the channel that can say no, and what breaks when a rule is filed under the wrong one?
Everything on the context surface is paid for on every turn, whether the model needs it that turn or not. What does an instruction actually cost to keep loaded, and what changes when you load it only where it applies instead of putting the whole surface in front of the model every time?

Each of those is one part of the loop, looked at on its own, with the evidence attached. Same discipline as this piece: a claim is worth exactly what the measurement under it is worth, and no more.

I work on Reporails, deterministic diagnostics for the instruction files, rules, and prompts that steer coding agents. It reads the steering surface and tells you, with measured evidence, where it drifts. It does not run your loop; it checks the part of the loop you wrote down.

Wiring a "read this first" hook as "ask" (don't)

Gábor Mészáros — Wed, 01 Jul 2026 14:52:30 +0000

In retrospect, it was pretty obvious... sharing it anyway.

I'm currently working on a self-steering progressive disclosure system, and I wanted my agent to read a design doc before it touched the code that doc governs. Simple PreToolUse hook: match Edit/Write, check whether the doc was Read this session, and if not, stop the edit and tell the agent to read it first.

I returned permissionDecision: "ask". My reasoning at that time: "ask has an escape hatch. Approve and the call proceeds, so a bad condition can never wedge the agent - you can always wave it through."

That safety was the whole problem. The approval is an escape from the prerequisite, not a path through it. The edits went through without the doc reads.

Here is what ask actually does. It surfaces the tool call for approval. Approve it - or auto-approve it - and the action runs. The doc never gets read. ask is a checkpoint for a human, not a gate on the agent. Nothing about it makes the agent do the prerequisite step.

The fix was one field. Return deny:

// PreToolUse hook, matcher: Edit|Write
if (!wasReadThisSession(governingDoc)) {
  return {
     hookSpecificOutput: {
         hookEventName: "PreToolUse",
         permissionDecision: "deny",
         permissionDecisionReason: `Read ${governingDoc} before editing this, then retry.`
     }
  };
}

deny refuses the tool call and hands the agent the reason. The agent can't proceed, so it does what the reason says - reads the file - and retries. The read clears the check, the retry passes.

This is the behavior Claude Code already ships natively: when it tries to Edit a file that haven't been Read before, the tool refuses the action until the relevant Read happens. The deny hook lets you apply that same read-before-edit rule to any file you designate, not only the file being edited.

The distinction I missed:

ask gates on approving the call.
deny gates on the agent doing the prerequisite work first.

If the goal is to force the agent's own behavior - read this, run that check, load that context - deny is the lever. ask just adds a human to the loop.

Important: this only works, if the gate releases accurately, when the prerequisite is actually done. A deny on a condition that never clears will wedge the agent. Mine keys off "was this file read this session," which the read itself flips.

Green Tests Don't Mean Better Software

Gábor Mészáros — Tue, 30 Jun 2026 17:15:05 +0000

You spent the week on a refactor because it was supposed to make the next change cheaper. The tests go green, the PR merges, and that's the last anyone thinks about it. Nobody comes back to check whether the next change actually got cheaper. The green bar said the code conforms to its spec, everyone read that as done, and the reason you did the work in the first place went unmeasured.

What green actually checks

Your CI is green. The test asserts assert response.status == 200, and it passes. It does not assert that the caching layer you just shipped cut p99 latency, or that the reworded onboarding step lifted activation, or that the prompt change made the agent follow instructions more often.

The green bar answered a question you never asked.

What did that green actually tell you? That the change conforms to the spec — the inputs map to the asserted outputs, the contract holds, nothing you wrote a test for regressed. That is correctness. It is real and it is necessary. It told you nothing about whether the system got better. You shipped because the bar was green, and the bar was never measuring the thing you shipped the change for.

Those latency, activation, and adherence numbers are effects. CI does not measure effects. It measures conformance, and then we read the green bar as if it answered the other question too.

The silent assumption is that "passes" and "better" are the same axis — that a change which is correct is, by that fact, an improvement. They are not the same axis. A perfectly correct change can make the system worse, and a perfectly correct change can leave every metric exactly where it was; the suite goes green for both. Green is necessary for "better" and nowhere near sufficient for it, and almost no team instruments the difference.

Two disciplines, pitched as rivals

The industry already named both halves of this. It named them as rivals.

Spec-Driven Development says: write the specification first, then build to it, then prove the build conforms. The test suite is the executable form of the spec. Correctness is the whole game. SDD is disciplined, auditable, and it is what most engineering orgs actually run. As a flow it terminates the moment the bar goes green:

Notice where it stops. It stops at correct. Nothing in that loop asks whether the shipped change improved anything.

Hypothesis-Driven Development asks the opposite question. Thoughtworks framed it as a triple: we believe X will result in outcome Y; we will know we are right when we see measurable signal Z. The unit of progress is not a passing build — it is validated learning. You predict an effect, you ship, you measure, and the measurement tells you whether you were right. PMI's expectation-management literature says the same thing from a different room: an expectation is a managed object with an owner and a due date, tracked and surfaced early rather than discovered late. As a flow it begins where SDD ends — at the prediction, and it runs past the ship:

These get pitched as competing philosophies — spec-first versus hypothesis-first, prove-correct versus prove-better, pick a camp. Look at the two flows again. One ends at correct. The other starts at predicted and ends at better. They answer two different questions, and the pick-a-camp framing assumes they answer one. Thoughtworks HDD and PMI expectation-management are not novel claims I am making here; they are prior art, decades of it, and I am citing them as validation. What none of them did was notice the two flows were never aimed at the same question.

The orthogonality move

Two different questions means two different axes. You cannot answer one by measuring the other — and a single pass/fail bar is built to answer only the first. Correct sits on one axis; better sits on the one perpendicular to it.

One axis is spec-conformance: does the change do what it was specified to do? Pass or fail, and pytest already answers it. The other axis is effect-verdict: did the change move the metric it was supposed to move? Confirmed, refuted, or not yet known — and nothing in your pipeline answers it today.

Lay them on a 2×2.


|                         | **effect confirmed**     | **effect refuted / unmeasured** |
|-------------------------|--------------------------|---------------------------------|
| **spec passes (green)** | shipped and works        | **green but no better**         |
| **spec fails (red)**    | blocked (correctly)      | blocked (correctly)             |

Three of those quadrants are familiar. Red blocks the merge, whichever way the effect would have gone. Green-and-confirmed is the win you wanted. The quadrant nobody names is the top-right: green but no better. The change is correct, it conforms to spec, and it shipped — and it did not improve the system, or you never checked, which from the system's point of view is the same thing.

That quadrant is where most shipped changes actually live, and it is invisible because the only instrument pointed at it is the green bar, which is pointed at the wrong axis. This is the reusable mental model: stop asking "did it pass" as if it were one question. It is two questions on two axes, and you are only instrumenting one of them.

A worked example: green architecture tests

Make it concrete with a case every Python team recognizes. You adopt the hexagonal layout — Ports and Adapters. A pure core (contract/, dto/, policy/) imports only the standard library. Adapters wrap the I/O. Subsystems compose the adapters. The dependency rule is one sentence: dependencies point inward, and the pure core points at nothing.

You enforce it with an architecture test that runs on every commit:

def test_core_purity():
    for module in pure_core_modules():
        assert no_imports(module, {"yaml", "requests", "subprocess", "open"})

It is green. Every pure module is import-clean. The dependency graph conforms to the rule — proven correct, mechanically, on every push.

Now ask the other question. You did not adopt the hexagonal layout for its own sake. You adopted it on a claim: changes would get cheaper, blast radius would shrink, the core would be testable without a single mock. Did that happen? test_core_purity cannot tell you. It measures the shape of the import graph, not whether the shape paid off. Green here means conformant. A conformant architecture that made nothing cheaper is the green-but-no-better quadrant with a .py file pointing straight at it.

That is the whole thesis in one test file. The architecture test lives on the spec axis. The promise that sold the refactor lives on the effect axis. They do not touch, and only one of them has an instrument.

The expectation primitive

If "better" is its own axis, it needs its own instrument. That instrument is an expectation — the HDD flow above, made into a first-class artifact.

An expectation is a change-scoped, falsifiable, deadline-carrying prediction. It is the HDD triple — believe X, expect outcome Y, know it by criterion Z — turned into an artifact that travels with the change instead of living in a planning doc nobody reopens. It carries a baseline, a bound measurement view — the query, tied to that change, that reads the metric — a threshold, and a due date. Concretely: the caching change should cut p99 latency below 200ms, measured against last week's baseline, re-checked Friday. That sentence is the whole artifact, and it rides attached to the diff.

Here is the part that is genuinely new, and the part I want to be precise about: the verdict is measured by the system.

HDD and PMI both already told you to make a falsifiable prediction with an owner and a deadline. That is not the advance. The advance is that the prediction binds to a real measurement view, and a system reads that view and stamps the verdict itself. The human who made the prediction does not come back and grade it. The human leaves the verification loop.

This is not a dashboard alert. A dashboard alert fires on a metric crossing a line, untethered to any one change. An expectation binds one prediction to one change and grades that prediction — the dashboard tells you latency rose; the expectation tells you the change you predicted would cut it did the opposite.

The verdict vocabulary is three words, and the discipline is in the third:

confirmed — the measured value met the threshold. Auto-resolved, silently. Announcing it would be filler.
refuted — the measured value missed the threshold. The expectation stays open and surfaces, with the predicted delta and the actual sitting next to each other.
inconclusive — the measurement produced no scalar yet. It stays open and surfaces. It is never read as confirmed. "We didn't measure" is not "it worked."

That third word is the one most dashboards quietly drop. A prediction you cannot yet grade is not a pass. Keeping inconclusive distinct from confirmed is what stops the green-but-no-better quadrant from refilling under a different name.

The selector routes by the change's effect, not its shape. A one-line diff can carry a large measurable effect and owe an expectation; a thousand-line refactor can be pure conformance and owe none — and you pick the lane when you ship, not after.

Run the selector over that same hexagonal project. Re-homing a module to clear a test_subsystem_isolation failure is a shape change. It restores the boundary, its effect resists measurement, it owes nothing past the green test. The refactor that promised cheaper changes is the other lane — it made a measurable claim, so it owes an expectation: adding the next artifact class touches three files or fewer, re-checked when the next one lands. Same codebase, two changes, two lanes.

This is what the spec-versus-hypothesis fight always missed. The selector does not pick a winner between SDD and HDD. It is conditional. It runs SDD's discipline on changes whose effect resists measurement and HDD's discipline on changes whose effect can be measured, and it picks per change. A discipline that reconciles two camps looks like a routing rule that knows when each camp is right.

There is one trap to name: do not attach an expectation to an unmeasurable change just to satisfy the selector. A prediction with no real threshold is conformance wearing a costume, and it teaches everyone to ignore the verdicts. If the effect resists measurement, declare none. The honesty of the refuted and inconclusive verdicts depends on the selector being willing to say "this change owes nothing."

This is the part that has to survive contact with a real backlog. The tempting first cut is to flag every shipped change as owing a prediction. Try it and it floods on contact: the renames, the prose-clarity passes, the reorganizations — the bulk of any change history — all light up red against a rule they cannot satisfy, because there was never a metric for them to move. The flood is the proof. A selector that cannot say "this change owes nothing" does not enforce the discipline; it discredits it. The conditional design is not a softening of the rule — it is the only version of the rule that does not collapse the first time you run it over changes that already shipped.

Proof of existence, and the cost of skipping it

A coordination system shipped exactly this layer, in working code. A change to a governed surface declares an expectation: predicted delta, bound measurement view, due date. The system re-measures when the due date passes, stamps confirmed / refuted / inconclusive against the threshold, auto-resolves the confirmed ones silently, and surfaces the refuted and inconclusive ones the next session, one line each. The selector decides which changes owe a prediction and which fall back to conformance. The point here is only that the pairing — a governance selector plus an auto-measured verdict — is buildable today, because it has been built. The mechanism is the proof; the product is not the subject.

`confirmed` is the minority verdict

What running it actually teaches is the part the diagram cannot. Before the verdicts come back, you assume most predictions will land — you shipped the change believing it would help, so of course the measurement will agree.

It does not. Once an instrument is actually pointed at the effect axis, confirmed turns out to be the minority verdict. Most changes come back inconclusive or refuted: the metric did not move past the threshold, or there was no scalar to read in the window at all.

The first time a batch of verdicts surfaces and almost none of them are confirmed, the 2×2 stops being a diagram. The green-but-no-better quadrant is not a rare corner you occasionally fall into — it is where most changes sit, and you only learn that the moment you can finally see it.

`inconclusive` is the common case

inconclusive is the verdict you underestimate most. "We didn't measure" happens far more often than any prediction would lead you to expect — the window was too short, the metric needed traffic that had not arrived, the effect was real but not yet legible.

The entire value of the third word is that it refuses to round itself up. Without it, every one of those un-graded predictions quietly refiles as a pass, and the quadrant refills under a name that looks like success. The discipline lives in holding inconclusive open, in plain sight, until there is a scalar — and discovering, run after run, how often that takes longer than you thought.

Even `confirmed` is not permanent

Even the confirmed ones are not closed for good. A threshold set too loose can confirm a change that later regresses, and a silent auto-resolve would bury exactly that. A confirmed verdict earns its silence; it does not earn permanent trust.

One discipline already exists for this. SkillOpt (Yang et al., 2026) accepts a self-authored skill edit only when it improves a held-out validation split — not the data the edit was tuned against. The held-out split is the guard against exactly that failure: a verdict that reads confirmed because the threshold was set on the same signal the change was shaped to move. An auto-measured verdict is only as honest as the separation between what tunes the change and what grades it.

The claim worth taking away is the framework, not any one implementation: a governance selector that routes each change to the right discipline, paired with an auto-measured verdict that proves the effect without a human grading it, is a reconciliation of spec-driven and hypothesis-driven development. It is not a third methodology stacked beside the two you have; it tells you which of the two you already know to apply, and then it does the grading.

That reconciliation stops being optional the moment the machine ships the change.

When a human writes every diff, the green-but-no-better quadrant is a slow tax — improvements that were not improvements, paid for in drift you notice eventually. When an AI-assisted or autonomous system writes and ships changes, there is no human in the loop reading the effect at all. The tests still run, so correctness still gets checked; nothing measures the effect, so nothing catches it slipping. Unmeasured-effect is precisely where a self-running system's regressions hide — each change correct, each change green, the system quietly getting worse along the one axis nobody instrumented.

The practice fits on one line. Before you merge the next change that makes a measurable claim, write down what you expect it to move, where that gets measured, and when you will look — then let the verdict come back to you instead of going to find it.

Green tells you the machine built the thing right. Only a measured verdict tells you the thing was worth building. If you are handing more of the building to machines, that second instrument is the one you cannot afford to leave off.

Reporails — deterministic instruction diagnostics and governance for the rules, prompts, and agent files that steer AI coding agents.

Instruction systems capability ladder: harness leveling

Gábor Mészáros — Tue, 19 May 2026 06:53:07 +0000

submission for the Hermes Agent Challenge

A few months ago I drew a maturity ladder for CLAUDE.md files — does the file exist, are constraints explicit, do skills load on demand. Useful for self-locating, and the ladder generalizes past Claude — CLAUDE.md, AGENTS.md, .cursorrules, Copilot instructions all live on the same rungs.

After a lot (lot lot lot) more time spent with these setups, the ladder is built on a different, broader axis than I first drew it on: the channel each rung runs on.

The new ladder

Level	Name	What's added	Channel
L0	System	System prompt only	attention
L1	Primer	One instruction file (`CLAUDE.md`, `AGENTS.md`, `.cursorrules`)	attention
L2	Composite	Multiple files — user defaults, project overrides	attention
L3	Scoped	Path-scoped rules (`.claude/rules/*.md`)	attention
L4	Delegated	Skills — procedures invoked on demand	attention
L5	Abstracted	Sub-agents — child contexts called by the parent	attention (interface)
L6	Governed	Hooks, MCP gates, deny-permissions	enforcement
L7	Adaptive	Self-improving skills written by the agent	self-writing

Two cuts split the ladder: one between L5 and L6 (soft to hard), one between L6 and L7 (read to write). I'll use attention and soft channel interchangeably from here, and the same for enforcement / hard channel.

A quick tour

L0 (System) is the cold start: the model with whatever the vendor injected, nothing else.

L1 (Primer) is your single root file — the entry every model sees first.

L2 (Composite) is the moment you split user-level config from project-level: ~/.claude/CLAUDE.md vs ./CLAUDE.md, or your global Cursor settings vs a project .cursorrules.

L3 (Scoped) introduces path scoping — the rule about Python tests only loads when the agent touches tests/*.py.

L4 (Delegated) is skills, which let you ship procedures the agent can pull on demand instead of dumping every workflow into the root file.

L5 (Abstracted) is sub-agents — child processes with their own context, called by the parent for a focused subtask. The child's reasoning runs in its own context window, separate from the parent's. What flows back is the result, which re-enters the parent as a new source. The parent–child interface is on the soft channel; the child's internal work runs on its own soft channel, not the parent's.

L0 through L4 share one context — they all compete for the same finite slot against the user's prompt and the recent diff. L5 spawns a second context but couples to the parent on attention. Together that's the soft channel — attention dynamics, by another name.

L6 (Governed) is where it profoundly changes. Hooks are not in the model's context window. A PreToolUse hook that blocks git push on a non-zero pytest exit doesn't get downweighted by a long task. An MCP server that requires authentication before reading a file doesn't depend on the model remembering your auth rule. Deny-permissions in .claude/settings.json for .env and .pem files don't compete with the rest of the spec. L6 is enforcement — outside the context dynamics, deterministic, not subject to load or context rot.

L7 (Adaptive) is different again. The agent writes its own instructions — not because the user said "remember this," but because the agent finished a task and decided some part of the trajectory was worth saving for next time. At read time the artifact lands in the attention channel like anything else. What's different is the writer: the model wrote the file, the trigger was task completion, and the user never saw the prompt that produced it. L7 is self-writing.

That's the ladder.

Two cuts under the ladder: attention channel covers L0–L5, enforcement is L6 alone, self-writing is L7 alone. Cut 1 at L5/L6 marked soft to hard. Cut 2 at L6/L7 marked read to write.

The first cut: L5 / L6

The load-bearing observation in this reframe is the cut between L5 and L6.

L0 through L5 all run on the soft channel — either directly on the parent's field (L0–L4) or on a child's that couples back to the parent through prompts and results (L5). They compete. They decay with load. The model can downweight any of them, lose track of any of them, prioritize the user's prompt over any of them. You can tell a sub-agent "always check tests before reporting done" and it'll do it 80% of the time, or 95%, or 60% — you don't know without measurement. The same instruction in a CLAUDE.md and the same instruction passed to a sub-agent are running on the same physics, just on different fields.

L6 is outside that physics entirely.

Generic example. Suppose your CLAUDE.md says "never push without running tests." That's L1. The model reads it, integrates it into context, weights it against everything else loaded — your other rules, the recent diff, the user's prompt. If you have four thousand tokens of instructions and the model is mid-task, that line is competing with everything else for attention. Sometimes it follows. Sometimes it doesn't.

Now suppose you have a PreToolUse hook on Bash that exits non-zero if pytest fails. That's L6. The model can decide to push or not push. It doesn't matter. The push fails before the model's intent reaches the network.

Same constraint, two channels, two failure modes. Soft channel fails probabilistically. Hard channel fails deterministically. They take different fixes — soft constraints want better content and ordering (the Pink Elephant piece is about that fight), hard constraints want a better hook script, a tighter PreToolUse matcher, or a stricter permission rule.

Calling these the same thing because they're both "in your .claude/ directory" hides the architectural difference.

The second cut: L7 writes itself

L7 - Adaptive isn't a third channel exactly — at read time, what L7 wrote lands in the same context with everything else. The cut is at write time. The agent writes its own instructions.

Most "memory" features in shipping agents aren't L7 by this cut. Claude Code's saved memory writes when the user signals remember this or accepts a prompt to save. Cursor's notepads, Copilot's pinned context, Gemini's saved facts — same pattern. The agent keeps the artifact, but the user authored it. That's persistent context, not self-writing. Call it L6.5 if you want a name for it.

The clearest L7 in print today is Hermes Agent, released by Nous Research. The mechanism is documented: when the agent identifies a saveable trajectory — after a successful task with five or more tool calls, after recovering from errors and finding a working path, after the user corrects its approach, or after discovering a non-trivial workflow — it invokes its skill_manage tool to extract a SKILL.md (markdown with YAML frontmatter) into ~/.hermes/skills/. Future sessions load the skill automatically and it becomes available as a slash command. The user didn't ask for it. The agent decided the trajectory was worth saving.

Three of the four triggers are what make this clearly L7 and not L6.5. Error recovery, user correction, novel-workflow discovery — these are cases where only the agent knows the saveable moment happened. A user-driven memory feature can capture "this task was useful enough to want it remembered" by asking the user after the fact. It can't capture "I tried three approaches and the third worked" unless the agent volunteers it. The artifact format matters too: an auto-extracted SKILL.md lands in ~/.hermes/skills/ in the same format human-written skills use. Next session, the agent loads it and can't tell who the author was. That symmetry is what makes the loop close — every successful trajectory can shape the next one.

Concretely, here's what an auto-extracted skill might look like — illustrative, in the shape Hermes's documented SKILL.md schema specifies, fitting the second trigger (the agent worked through a pytest debugging session, found the working path, and saved the lesson):

---
name: debug-pytest-import-errors
description: When pytest reports ModuleNotFoundError despite a successful editable install, check src-layout configuration before chasing PYTHONPATH.
version: 1.0.0
platforms: [macos, linux]
metadata:
  hermes:
    tags: [python, testing]
    category: dev-workflow
---
# Debug pytest ImportError on src-layout projects

## When to Use
pytest fails with `ModuleNotFoundError` after a fresh clone, even though `pip install -e .` ran and the import works in a Python REPL.

## Procedure
1. Check `pyproject.toml` for `where = ["src"]` under the build-system packages section.
2. Confirm `pythonpath = ["src"]` is set in `[tool.pytest.ini_options]`.
3. Re-run `pip install -e .`; confirm `.egg-info` lands at the package root, not inside `src/`.

## Pitfalls
- `PYTHONPATH=src` as an env var works locally but doesn't survive CI.

## Verification
`uv run pytest` runs without `ModuleNotFoundError`.

The frontmatter is functional — tags and category route the skill in Hermes's index; platforms gates it by OS. The body's When to Use / Procedure / Pitfalls / Verification is the schema's recommended shape. Notice what the agent saved: not the original failing command, not the dead-ends, just the working path plus the trap that would have lured a next session into chasing PYTHONPATH. That's curation, not transcription.

This is why L7 is safe to leave unsupervised in Hermes and risky most places else. The SKILL.md schema enforces moves a well-coupled instruction needs — imperative voice, directive ordering, named constructs, the warning placed after the working path rather than before it. A free-form memory feature has no such structural prior; the agent writes whatever feels worth saving, and the writes degrade as the agent's writing discipline does.

Schema is the cheap version of supervision.

The new failure mode is the self-writing layer running unsupervised. An auto-extracted skill that overfits to one project. A trajectory summary written under a stale assumption that surfaces six weeks later as a phantom instruction. There's no rule file the user authored to grep for the source — the rule is in a markdown file the agent wrote and the user never read, sitting in ~/.hermes/skills/ or its equivalent.

L7 doesn't replace L0–L6. It runs alongside, with its own writes and its own decay. Most agent setups don't have it because most agents don't expose it. The ones that ship a memory feature mostly do L6.5 and call it L7.

When to climb

The dominant pattern I see in real repos is L1 with a thin L6: a CLAUDE.md, maybe a few rule files at L3, deny-permissions for .env. L4 (skills) is rare — most authors haven't built any. L5 (sub-agents) is rarer — most use cases haven't surfaced. L7 is mostly absent — most agents don't expose a self-writing surface, and the few setups that do have one running treat it as opt-in defaults nobody reviewed.

Across 28,721 public repositories with AI configs, 89.9% don't name specific constructs in their instructions — no backticks, no file paths, no function names. That's most of the soft channel running at low coupling: easily downweighted, easily lost. The hard channel is thinner. The adaptive channel is mostly absent.

Large spec, small contract, no adaptive layer. That's the asymmetry — but it's not always a bug. Each rung exists because the rung below it fails in a specific way. The trigger is the failure, not a feature wishlist.

From	To	Symptom that triggers the climb
L0	L1	Re-explaining the same project context every session
L1	L2	One file got long enough that important rules get ignored
L2	L3	Path-irrelevant rules pollute every task
L3	L4	The same procedure gets described inline across multiple rules
L4	L5	A procedure pollutes the parent's context with reasoning chains the parent doesn't need
L5	L6	A constraint must hold 100% of the time, not 95%
L6	L7	You keep correcting the same preference across sessions
L6	L7	You keep watching the agent re-derive the same workaround

The mistake is climbing without the symptom. A repo with three rules in one file doesn't need L3. A solo developer's CLAUDE.md doesn't need a sub-agent. Premature climbs cost context budget for no return; you've added structure the model has to navigate without solving a problem you actually had.

The opposite mistake is more common: under-building the higher rungs because the symptoms feel like model failures rather than rung failures. "The agent didn't run tests before pushing" reads like a prompt-engineering problem; it's a missing L6. "The agent forgot we use Cloudflare Workers" reads like context drift; it's a missing L7. "The agent keeps describing the deploy process every time I ask" reads like verbosity; it's a missing L4.

Climb when the rung below stops working.

Three questions for your repo

Not a recipe. A diagnostic. For any rule in your setup, ask:

Does this fail loudly when violated, or silently? Loudly is L6. Silently is L0–L5.
Does the model see this, or does the runtime enforce it? Sees is the soft channel. Runtime is the hard channel.
Does it get worse when you add unrelated rules to the same file? Yes is L0–L5. No is L6. "Sometimes" is probably L7.

Most rules answer silently / sees / yes. That tells you which channel you're in. The interesting question is whether anything in your setup is on the other channels at all.

A note on related taxonomies

There are other progressive ladders for AI agent setups in print. Vellum's L0–L5 is an autonomy axis — how much the agent decides on its own. Blake Crosley's 4-tier is a concurrent-decomposition axis — how many agents run in parallel. Anthropic's 5-layer ADK frame for Claude Code is a content-boundary axis — what kind of content goes where. Zylon's 5-architectural and GitHub's 3-tier carve different cuts again, mostly around how the agent is wired into a product surface.

The ladder above is on a different axis from any of those. It sorts by the channel each mechanism runs on — soft attention, hard enforcement, self-writing memory — and progresses through the named constructs an agent exposes (CLAUDE.md, scoped rule files, skills, sub-agents, hooks, auto-memory). The two cuts (L5/L6 and L6/L7) are the load-bearing claim; the autonomy and concurrency taxonomies don't draw those cuts because they're sorting on different things.

Different axis, different cuts, different diagnostic. Use whichever maps onto the question you're actually asking.

Terminal output: ails check on a .claude directory. LADDER reads 8 rungs across 3 channels. SETUP reads L1 + L3 + L6 (Primer, Scoped, Governed). Channel/Levels/Count table shows attention L0–L5 count 5, enforcement L6 count 1, self-writing L7 count 1. Cuts named at L5/L6 soft→hard and L6/L7 read→write.

*Previously: CLAUDE.md Best Practices: From Basic to Adaptive — where I drew the ladder the first way. The State of AI Instruction Quality for additional data.

I'm building Reporails, measurement for the attention channel. npx @reporails/cli check runs locally, no account needed.*

The State of AI Instruction Quality

Gábor Mészáros — Tue, 21 Apr 2026 12:41:52 +0000

Everybody has opinions about AGENTS.md/CLAUDE.md files.

Best practices get shared. Templates get copied, and this folk-type knowledge dominates the industry. Last year, GitHub analyzed 2,500 repos and published best-practice advice. We wanted to go further: measure at scale, publish the data, and let anyone verify.

When the agent doesn't follow instructions and does something contradictory, the usual suspects are: the model is inconsistent, LLMs are not deterministic, you need better guardrails, you need retries.

The failures almost always get attributed to the model.

So we decided to measure. We built a diagnostic tool that treats instruction files as structured objects with measurable properties. Deterministic. Reproducible. No LLM-as-judge. Then we pointed it at GitHub repositories with instruction files for five agents - Claude, Codex, Copilot, Cursor, and Gemini.

28,721 repositories. 165,063 files. 3.3 million instructions.

... and one question:

What if the instructions are the problem?

The dataset

28,721 projects. Sourced from GitHub via API search, cloned, and deterministically analyzed. Each project was scanned for instruction files across five coding agents — then deduplicated to remove false positives from agent detection overlap.

Agent	Projects	% of corpus
Claude	12,356	43.0%
Codex	11,206	39.0%
Copilot	7,755	27.0%
Cursor	7,291	25.4%
Gemini	5,942	20.7%

The percentages add up to more than 100% because 37% of projects configure multiple agents. More on that later.

Key distributions stabilized early. A 9,582-repo sub-sample produced identical tier shares (±0.2pp) and the same mean scores as the full 12,076-repo intermediate sample. The final 28,721-repo corpus moved nothing. The patterns reported below are not small-sample artifacts.

All classifications are deterministic — the same file produces the same result every time. No LLM-as-judge. Sample classifications are published for inspection (methodology below). The tool is source-available.

How we measured

The analyzer parses each instruction file into atoms — the smallest semantically distinct units of content. A heading is one atom. A bullet point is one atom. A paragraph is one atom. Each atom gets classified along a few dimensions, all deterministic, no LLM involved:

Charge classification. A three-phase pipeline determines whether an atom is a directive ("use X"), a constraint ("do not use Y"), neutral content (context, explanation, structure), or ambiguous (could be read either way). Phase 1 detects negation and prohibition patterns. Phase 2 detects modal auxiliaries and direct commands. Phase 3 uses syntactic dependency parsing to catch imperatives that the first two phases missed. First definitive match wins. Atoms that partially match but don't clear any phase are marked ambiguous. Everything else is neutral.

Specificity. Binary: does the instruction name a specific construct — a tool, file, command, flag, function, or config key — or does it stay at the category level? "Use consistent formatting" is abstract. "Format with ruff format" is named. This is a text property, not a judgment call.

File categorization. Each file is classified as base config (your main CLAUDE.md or .cursorrules), a rule file, a skill definition, or a sub-agent definition — based on file path conventions for each agent.

Content type. Charge classification separates behavioral content (directives and constraints) from structural content (headings, context paragraphs, examples). That's how we know what fraction of your file is actually doing work.

The full tool is source-available (BUSL-1.1). You can run npx @reporails/cli check on your own project and inspect every finding. More on that at the end.

Finding 1: Most of your instruction file isn't instructions

Here's what the median instruction file actually contains:

50 content items total
12 of those are actual directives
The rest is headings, context paragraphs, examples, structure

Only 27% of your instruction file is doing what you think it does.

The other 73% is scaffolding. Headings that organize but don't instruct. Explanation paragraphs that compete for the model's attention without adding behavioral weight. Example blocks. Context-setting prose.

That's not inherently bad. Structure matters. But if you're writing a 200-line CLAUDE.md and only 54 lines are actual instructions, you should probably know that.

The average instruction is 8.9 words long. That's a sentence fragment.

Finding 2: 90% of instructions don't name what they're talking about

This is the big one.

We measured whether each instruction references specific tools, files, commands, or constructs by name — or whether it stays at the category level.

Two-thirds of all instructions are abstract.

Agent	Names specific constructs	Uses category language
Gemini	39.3%	60.7%
Codex	38.3%	61.7%
Copilot	33.3%	66.7%
Cursor	30.8%	69.2%
Claude	30.6%	69.4%

What does this look like in practice?

Abstract: "Use consistent code formatting"
Specific: "Format with ruff format before committing"

Abstract: "Avoid using mocks in tests"
Specific: "Do not use unittest.mock — use the real database via test_db fixture"

In previous controlled experiments, specificity produced a 10.9x odds ratio in compliance (N=1000, p<10⁻³⁰). The instruction that names the exact construct gets followed. The one that describes it abstractly... mostly doesn't. This is consistent with independent findings from RuleArena (Zhou et al., ACL 2025), where LLMs struggled systematically with complex rule-following tasks — even strong models fail when the rules themselves are ambiguous or underspecified.

89.9% of all agent configurations contain at least one instruction that doesn't name what it means. It's not a few projects. It's nearly everyone.

Finding 3: `agents.md` is the most common instruction file

Before we get into quality, let's look at what people are actually naming their files:

#	File	Count
1	`agents.md` / `AGENTS.md`	20,654
2	`claude.md` / `CLAUDE.md`	14,014
3	`gemini.md` / `GEMINI.md`	5,703
4	`.github/copilot-instructions.md`	5,647
5	`.cursorrules`	2,415

49,071 unique file paths across the corpus. That's not a typo. The format fragmentation is real.

A few things jumped out:

claude.md (lowercase, 10,642) is 3x more common than CLAUDE.md (3,372). Both work. The community clearly prefers lowercase.
agents.md dominates — the Codex/generic format is the single most popular instruction file name.
Skills and rules are already showing up in meaningful numbers: .claude/rules/testing.md (422), .agents/skills/tailwindcss-development/skill.md (334).

Finding 4: Different agents, completely different config philosophies

Not all agents are configured the same way. Not even close.

We categorized every file into four types: base config (your main CLAUDE.md, .cursorrules, etc.), rules (scoped rule files), skills (task-specific skill definitions), and sub-agents (role-based agent definitions).

Agent	Base	Rules	Skills	Sub-agents	Total files
Claude	18,733	4,638	10,692	10,538	44,601
Cursor	5,903	19,843	6,237	1,716	33,699
Copilot	16,026	4,486	10,352	3,012	33,876
Codex	19,001	81	8,911	165	28,158
Gemini	10,253	74	3,039	53	13,419

Cursor is 60% rules files. The .cursor/rules/ system dominates its configuration surface. One agent's config looks nothing like another's.

Claude is the only agent with a roughly balanced architecture across all four config types. Codex and Gemini are almost entirely base config — single-file setups.

The median Cursor project has 3 instruction files. The median Codex project has 1. These aren't just different tools. They're different configuration philosophies.

Finding 5: 37% of projects configure multiple agents

10,620 projects in the corpus target two or more agents. That's not a niche pattern — it's over a third of all projects.

Agents	Projects
1	18,101
2	6,776
3	2,687
4	949
5	208

The dominant pair is Claude + Codex (5,038 projects). Makes sense — CLAUDE.md + AGENTS.md is the most natural multi-agent starting point.

Here's what's interesting about multi-agent repos: the same developer, writing instructions at the same time, for the same project, produces measurably different instruction quality across agents. The person didn't change. The project didn't change. The instruction format did.

Some of that is structural. Cursor's .mdc rules enforce a different format than Claude's markdown. Codex's AGENTS.md invites a different writing style than Copilot's copilot-instructions.md. The format shapes the content.

Finding 6: The most-copied skills are the vaguest

This is where it gets interesting.

13,309 unique skills across the corpus. Some of them appear in hundreds of repos — clearly copied from shared templates or community sources. So we measured them.

Named% = what fraction of a skill's instructions name a specific tool, file, or command (instead of using category language).

Skill	Repos	Named%	What it means
`frontend-design`	271	2.8%	Almost entirely abstract advice
`web-design-guidelines`	197	10.2%	Generic design principles
`vercel-react-best-practices`	315	30.7%	Mix of specific and vague
`pest-testing`	216	55.1%	Names actual test constructs
`livewire-development`	87	75.5%	Names specific Livewire components
`next-best-practices`	76	92.6%	Names almost everything

frontend-design is in 271 repos with 2.8% specificity. It's a wall of "follow responsive design principles" and "ensure accessibility compliance." That reads well. It sounds professional. It gives the model almost nothing concrete to act on.

next-best-practices is in 76 repos with 92.6% specificity. It says things like "use next/image for all images" and "prefer server components over client." It reads like a checklist. It tells the model exactly what to do.

One is shared 3.5x more than the other.

The most popular skills are the most decorative. The well-written ones barely spread.

The best and worst skills (>50 repos)

Most specific:

Skill	Repos	Named%
`next-best-practices`	76	92.6%
`shadcn`	74	82.6%
`livewire-development`	87	75.5%
`pest-testing`	216	55.1%
`laravel-best-practices`	94	49.7%

Most vague:

Skill	Repos	Named%
`openspec-explore`	110	2.5%
`frontend-design`	271	2.8%
`web-design-guidelines`	197	10.2%
`vercel-composition-patterns`	131	10.7%
`find-skills`	113	18.9%

Notice a pattern? The Laravel/Livewire ecosystem produces specific skills. The generic frontend/design ones stay abstract. Domain-specific communities write better instructions than cross-cutting ones.

Finding 7: Sub-agents are almost entirely persona prompts

5,526 unique sub-agent roles in the corpus. Developers are building agent teams: code reviewers, architects, debuggers, testers, security auditors.

The problem? Sub-agents are the most abstract config type in the entire corpus. Only 17% of sub-agent instructions name specific constructs.

Role	Repos	Named%
`code-reviewer.md`	236	14.4%
`architect.md`	89	18.2%
`debugger.md`	66	9.4%
`security-auditor.md`	57	14.8%
`test-runner.md`	54	10.5%
`frontend-developer.md`	47	9.0%

Most of these are persona prompts. "You are a senior code reviewer. You care about code quality, security, and maintainability." That's a role description, not an instruction set. It tells the model who to be, not what to do.

Compare this to a base config that says "run uv run pytest tests/ -v before suggesting any commit" — that's 100% named, and the model knows exactly what action to take.

The anatomy chart: more directives, worse quality

Here's where it all comes together.

We measured three things for each config type: how big the files are, how many directives they contain, and what fraction of those directives actually name something specific.

Sub-agents have the largest files (61 items median), the most directives (17), and the worst specificity (17%). They're the wordiest config type in the corpus and the least effective.

Base configs are the opposite. Fewer directives (11), but 40% of them name specific constructs. The developer writing their own CLAUDE.md by hand, for their own project, produces the most actionable instructions.

Config type	Files	Median size	Median directives	Specificity
Base configs	69,916	50 items	11	39.8%
Rules files	29,122	34 items	9	31.2%
Skills	39,231	59 items	14	30.8%
Sub-agents	15,484	61 items	17	17.0%

The pattern is clear: what developers write by hand is the most specific. What gets templated and shared gets progressively vaguer. And what tries hardest to sound authoritative — sub-agent persona prompts — is the most hollow.

More instructions is not better instructions.

Independent research supports the structural angle: FlowBench (Xiao et al., 2024) found that presenting workflow knowledge in structured formats (flowcharts, numbered steps) improved LLM agent planning by 5-6 percentage points over prose — across GPT-4o, GPT-4-Turbo, and GPT-3.5-Turbo. Structure is not decoration. It changes what the model retrieves.

Limitations

Five things to know about these numbers.

Sampling bias. GitHub API search, public repos only, English-skewed. Enterprise configurations, private repos, and non-English projects are not represented. This is not a random sample of all instruction files in production.

Classification accuracy. The charge classifier is deterministic but not perfect. Edge cases exist: mixed-charge sentences, implicit constructs, domain jargon that looks like a category term but is actually a named tool. Specificity detection (named vs abstract) is simpler and more robust. Sample classifications are published for inspection.

Association, not causation. "More directives correlate with lower specificity" is an observed pattern. We do not claim that adding directives causes quality to drop.

Snapshot. Collected March–April 2026. Instruction practices are changing fast — agents.md didn't exist six months ago. These numbers describe the ecosystem at collection time.

No popularity weighting. A 10-star hobby project counts the same as a 50K-star production repo. The distribution of instruction quality in production agent work may differ.

What this means

This isn't an article about AI models being bad at following instructions. The models are fine.

This is an article about what we actually give them to work with.

Most instruction files are three-quarters scaffolding. Two-thirds of the actual instructions don't name what they're talking about. The most popular community skills are the most decorative. Sub-agent definitions are the wordiest files in the corpus and the least specific.

None of that is obvious from reading your own files. It wasn't obvious to us before we measured it. A well-structured CLAUDE.md feels thorough. A shared skill with 271 repos feels battle-tested. A sub-agent with 17 directives feels comprehensive.

Measurement shows something different.

In The Undiagnosed Input Problem, I argued that the industry is great at inspecting outputs and weak at inspecting inputs. This corpus analysis is the evidence for that claim.

The instruction files are there. The developers wrote them. They just have no way to know which parts are working and which parts are wallpaper.

Try it yourself

The analyzer we used for this corpus analysis is available as a CLI you can run against your own instruction files.

Reporails — instruction diagnostics for coding agents. Deterministic. No LLM-as-judge. 97 rules across structure, content, efficiency, maintenance, and governance.

npx @reporails/cli check

That scans your project, detects which agents are configured, and reports findings with specific line numbers and rule IDs. Here's what the output looks like:

Reporails — Diagnostics

  ┌─ Main (1)
  │ CLAUDE.md
  │   ⚠       Missing directory layout             CORE:C:0035
  │   ⚠ L9    7 of 7 instruction(s) lack reinfor…  CORE:C:0053
  │     ... and 16 more
  │
  └─ 21 findings

  Score: 7.9 / 10  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░

  21 findings · 4 warnings · 1 info
  Compliance: HIGH

The corpus analysis used the same classification pipeline at scale. Fix the findings, run again, watch your score improve.

The dataset

The full corpus is published at reporails/30k-corpus. Three files:

File	Records	What it contains
`repos.jsonl`	28,721	Per-project record: agents configured, stars, language, license, topics
`stats_public.json`	1	Every aggregate statistic in this article
`validation_key.csv`	2,814	Sample classifications with source text for inspection

Verify any claim:

# "28,721 repositories"
cat repos.jsonl | wc -l

# "43% Claude"
cat repos.jsonl | python3 -c "
import sys, json
repos = [json.loads(l) for l in sys.stdin]
claude = sum(1 for r in repos if 'claude' in r['canonical_agents'])
print(f'{claude}/{len(repos)} = {claude/len(repos)*100:.1f}%')
"

Every number in every table traces to that dataset. If you disagree with a finding, count the rows.

This is part of the Instruction Quality series. Previous: The Undiagnosed Input Problem. Related: Precision Beats Clarity · Do Not Think of a Pink Elephant · 7 Formatting Rules for the Machine.

The Undiagnosed Input Problem

Gábor Mészáros — Wed, 08 Apr 2026 11:51:12 +0000

The AI agent ecosystem has built a serious industry around controlling outputs. Guardrails. Safety classifiers. Output validation. Monitoring. Retry systems. Human review.

All of that matters, but there is simpler upstream question that still goes mostly unmeasured:

Are the instructions any good?

That sounds obvious, yet it is not how the industry behaves.

When an agent fails to follow instructions, the usual explanations come fast:

Models are probabilistic
Agents are inconsistent
You need stronger guardrails
You need better monitoring
You need retries
You need humans in the loop

… and while those explanations are right to a certain degree, they also have a side effect: they turn instruction quality into a blind spot.

The ecosystem has become extremely good at inspecting what comes out of the model, and surprisingly weak at inspecting what goes in.

The symptom

Consider τ-bench.

It gives agents policy instructions and measures whether they follow them in realistic customer-service tasks. Airline and retail workflows. Real constraints. Real multi-step behavior.

The benchmark result that gets repeated is the model result: even strong systems still fail a large share of tasks, and consistency across repeated attempts remains weak.

The conclusion most people draw is straightforward: we need better models, better agents, better orchestration.

My take: Maybe.

But there is another question sitting underneath the benchmark:

Were the instructions themselves well-formed and well structured?

Not just present. Not just long enough. Not just sincere.

Well-formed. Well-structured. Well-organized.

Specific enough to anchor behavior. Structured enough to survive context mixing. Non-conflicting across files. Positioned where the model can actually use them.

Those questions usually never gets asked.

The industry response

I had a conversation recently where a lead solutions architect put the standard view plainly:

“The instruction merely influences the probability distribution over outputs. It doesn’t override it.”

That is right about the mechanism but it is wrong about what follows from it.

Yes, instructions operate probabilistically. But that does not mean all instructions are weak in the same way.

The shape of the distribution is not fixed. It changes with the properties of the instruction itself. Specificity sharpens it. Structure sharpens it. Conflict flattens it. Vague abstractions flatten it. Bad formatting can suppress it almost entirely.

Across my earlier controlled experiments, small changes in wording and placement produced large changes in compliance:

Instruction ordering moved compliance by 25 percentage points with the same model and the same directive.
Specificity produced roughly a 10x compliance effect when the instruction named the exact construct instead of describing it abstractly.
Formatting changed whether the model reliably registered the instruction at all.

The problem is that most instruction systems are built without diagnostics.

That is not an AI limitation. That is an engineering failure.

The folk system

Right now, instruction practice spreads mostly through imitation.

A popular repository posts “best practices” for Claude Code. Shared Cursor rules circulate as templates. People copy AGENTS.md files between projects. Teams accumulate CLAUDE.md, .cursorrules, copilot-instructions.md, etc and project-specific rule files across multiple tools.

Copy, paste, hope, repeat.

Some of that advice is useful. Almost none of it is tested in any controlled, reproducible way. That would be fine if instruction quality were self-evident. It is not.

A long instruction file can feel thorough while being internally contradictory. A highly opinionated ruleset can feel disciplined while producing almost no behavioral influence on the model.

A sprawling multi-file setup can look sophisticated while making the system worse.

Without diagnostics, developers do not know which instructions are binding, which are noise, and which are actively interfering with each other.

The gap

The tooling split is now pretty clear.

Output tooling is mature. Guardrails AI validates structure. Lakera focuses on prompt injection and security. NeMo Guardrails enforces safety and conversational rails. Llama Guard classifies risky content. The output edge is crowded.

Prompt testing is real. Promptfoo, Braintrust, and LangSmith can all help evaluate behavior. But they are primarily black-box systems: did the prompt produce the output you wanted?

That is useful.

It is not the same as measuring the instruction artifact itself.

Instruction-quality tooling exists only in fragments. Some tools use LLM-as-judge. Some use deterministic local rules. But the category is still early, inconsistent, and mostly disconnected from measured behavioral outcomes.

What is still largely missing is a deterministic way to inspect instruction files as engineered objects:

how specific they are
how directly they state intent
whether they conflict across files
whether they overuse headings
whether they provide alternatives instead of bare prohibitions
whether the system is getting denser while getting weaker

Code gets static analysis.

Instruction systems usually get vibes.

What we measured

We built an analyzer that treats instruction files as structured objects with measurable properties. Deterministic. Reproducible. No LLM-as-judge.

I am running it across a large live corpus of real repositories. The full run completes this week; what follows is what the partial sample already shows - stable enough to publish, not yet the full picture.

Quality is reported on a 0-to-100 scale: 0 means the file produces no measurable influence on model behavior, 100 is the ceiling the framework can score.

A fresh aggregation over 12,076 completed instruction-file scans is virtually identical to an earlier 9,582-repo sample:

bottom tier: 40.3% vs 40.1%
top tier: 12.1% vs 12.2%
mean quality score: 27 vs 27
directive content ratio: 27.9% vs 27.9% - the share of instruction sentences that directly tell the model what to do

That matters because it means the pattern is stable.

This does not look like a small-sample artifact.

And the strongest finding is not what I expected.

More rules, lower quality

The common response to bad agent behavior is to add more rules.

More files. More guidance. More scoping. More edge-case coverage.

The corpus says that strategy tends to backfire.

Across 12,076 repositories, instruction quality falls as instruction-file count rises:

Files per repo     N      Mean score   Bottom tier %   Top tier %
1                  4681   28           46.3%           16.9%
2-5                4796   26           37.3%            9.5%
6-20               1972   26           36.0%            8.8%
21-50               438   25           31.3%            5.7%
51-500              186   25           33.3%            5.4%

The key number is the top-tier share.

It collapses from 16.9% in single-file setups to 5.4% in repositories with 51 to 500 instruction files.

That is a roughly 3x drop.

The article version of that finding is simple:

Developers respond to bad agent behavior by adding more rules. In the corpus, that strategy correlates with a 3x collapse in the probability of landing in the top tier.

That does not prove file count causes low quality by itself.

But it does show that rule proliferation is not rescuing these systems. At scale, it is associated with weaker instruction quality, not stronger.

The sweet spot

There is also a more subtle result in the partial sample. Instruction quality appears to be non-monotonic in directive density: more directives help at first, then stop helping, and past a point start to hurt.

The full curve is in next week’s piece. The short version is that there is an optimal density range, after which additional directives stop strengthening the system.

Enough force to bind behavior. Not so much that the system turns into an overpacked rules document.

A real example

Here is the kind of instruction block the corpus is full of:

# Code should be clear, well documented, clear PHPDocs.

# Code must meet SOLID DRY KISS principles.

# Should be compatible with PSR standards when it need.

# Take care about performance

It is not malicious. It is not absurd.

It is just weak.

Everything is abstract. Nothing is anchored. Headings are doing the work prose should do. The agent can read it, represent it, and still walk past most of it.

Now compare:

Never use `var_dump()` or `dd()` in committed code. Use `Log::debug()` instead.
Run `./vendor/bin/phpstan analyse src/` before every commit. Level 6 minimum.

Same general intent. Completely different binding strength.

The second version names the construct, names the alternative, names the command, and names the threshold. It gives the model something concrete to hold onto.

That is what diagnostics should make visible.

What this means

Output guardrails still matter.

Prompt evaluation still matters.

Safety systems still matter.

But they do not answer the upstream question: Are the instructions themselves well-formed?

If the answer is no, then a large class of downstream failures will keep showing up as mysterious agent unreliability when the real problem is earlier and simpler.

The agent loaded the instruction and walked past it.

That is often not a model problem.

It is an input problem.

And input quality is measurable.

What’s next

These are corpus-level findings from a partial sample, not universal laws.

The sample is still in flight. The strongest claims here are about association, not proof of causality. Specific conflict-count case studies need source verification before publication. Popularity weighting is not yet applied, so “40% of repositories score in the bottom tier” is not the same claim as “40% of production agent work scores in the bottom tier.”

The full corpus run completes this week. Next week I publish the end-of-run analysis across the full sample — the complete distribution, the cross-cuts the partial sample cannot yet support, and the specific case studies this article deliberately held back. If you want to know where your stack lands, that is the piece to come back for.

For now, the central pattern is already stable enough to matter:

The ecosystem keeps responding to weak agent behavior by adding more instructions, while the corpus shows that more instruction files are usually associated with lower measured quality.

That is the undiagnosed input problem.

Not that instructions do not matter.
That they matter, measurably, and most teams still have no way to see whether theirs are helping or hurting.

This is part of the Instruction Best Practices series. Previous: Do NOT Think of a Pink Elephant, Precision Beats Clarity, 7 Formatting Rules for the Machine. I’m building instruction diagnostics for coding agents. Follow for the full corpus analysis.

Do NOT Think of a Pink Elephant

Gábor Mészáros — Tue, 31 Mar 2026 12:19:14 +0000

You thought of a pink elephant, didn't you?

Same goes for LLMs too.

"Do not use mocks in tests."

Clear, direct, unambiguous instruction. The agent read it — I can see it in the trace. Then it wrote a test file with unittest.mock on line 3. Thanks...

I've seen this play out hundreds of times. A developer writes a rule, the agent loads it, and it does exactly what the rule said not to do. The natural conclusion: instructions are unreliable. The agent is probabilistic. You can't trust it.

That's wrong. The instruction was the problem.

The pink elephant

There's a well-known effect in psychology called ironic process theory (Daniel Wegner, 1987). Tell someone "don't think of a pink elephant," and they immediately think of a pink elephant. The act of suppressing a thought requires activating it first.

Something structurally similar happens with AI instructions.

"Do not use mocks in tests" introduces the concept of mocking into the context. The tokens mock, tests, use — these are exactly the tokens the model would produce when writing test code with mocks. You've put the thing you're banning right in the generation path.

This doesn't mean restrictive instructions are useless. It means a bare restriction is incomplete.

The anatomy of a complete instruction

The instructions that work — reliably, across thousands of runs — have three components. But the order you write them in matters as much as whether they're there at all.

Here's how most people write it:

# Human-natural ordering — constraint first
Do not use unittest.mock in tests.
Use real service clients from tests/fixtures/.
Mocked tests passed CI last quarter while the production
integration was broken — real clients catch this.

All three components are present. Restriction, directive, context. But the restriction fires first — the model activates {mock, unittest, tests} before it ever sees the alternative. You've front-loaded the pink elephant.

Now flip it:

# Golden ordering — directive first
Use real service clients from tests/fixtures/.
Real integration tests catch deployment failures and configuration
errors that would otherwise reach production undetected.
Do not use unittest.mock.

Same three components. Different order. The directive establishes the desired pattern first. The reasoning reinforces it. The restriction fires last, when the positive frame is already dominant.

In my experiments — 500 runs per condition, same model, same context — constraint-first produces violations 31% of the time. Directive-first with positive reasoning: 7%.

The pink elephant isn't just about missing components. It's about which concept the model sees first.

Three layers, in this order:

Directive — what to do. This goes first. It establishes the pattern you want in the generation path before the prohibited concept appears.
Context — why. Reasoning that reinforces the directive without mentioning the prohibited concept. "Real integration tests catch deployment failures" adds mass to the positive pattern. Reasoning that mentions the prohibited concept doubles the violation rate.
Restriction — what not to do. This goes last. Negation provides weak suppression — but weak suppression is enough when the positive pattern is already dominant.

The part nobody expects

Here's what surprised me: the ordering effect is larger than any other variable I've measured.

Precise naming vs. vague categories? 28 percentage points. Exact scope vs. broad scope? 74 points across the range. But reordering — same words, same components, just flipped — accounts for 25 points on its own. And it compounds with everything else.

Most developers write instructions the way they'd write them for a human: state the problem, then the solution. "Don't do X. Instead, do Y." It's natural. It's also the worst ordering for an LLM.

Never write "Don't use X. Instead, use Y." Write "Use Y. Here's why Y works. Don't use X."

Formatting helps too — structure is not decoration. I covered that in depth in 7 Formatting Rules for the Machine. But formatting on top of bad ordering is polishing the wrong end. Get the order right first.

What this looks like in practice

Here's a real instruction I see in the wild:

When writing tests, avoid mocking external services. Try to
use real implementations where possible. This helps catch
integration issues early. If you must mock, keep mocks minimal
and focused.

Count the problems:

"Avoid" — hedged, not direct
"external services" — category, not construct
"Try to" — escape hatch built into the instruction
"where possible" — another escape hatch
"If you must mock" — reintroduces mocking as an option within the instruction that prohibits it
Constraint-first ordering — the prohibition leads, the alternative follows
No structural separation — restriction, directive, hedge, and escape hatch all in one paragraph

Now rewrite it:

**Use the service clients** in `tests/fixtures/stripe.py` and
`tests/fixtures/redis.py`.

> Real service clients caught a breaking Stripe API change
> that went undetected for 3 weeks in payments - integration
> tests against live endpoints surface these immediately.

*Do not import* `unittest.mock` or `pytest.monkeypatch`.

Directive first — names the exact files. Context second — the specific incident, reinforcing why the directive matters without mentioning the prohibited concept. Restriction last — names the exact imports, fires after the positive pattern is established. No hedging. No escape hatches.

Try it

For any instruction in your AGENTS.md/CLAUDE.md or SKILLS.md files:

Start with the directive. Name the file, the path, the pattern. Use backticks. If there's no alternative to lead with, you're writing a pink elephant.
Add the context. One sentence. The specific incident or the specific reason the directive works. Do not mention the thing you're about to prohibit — reasoning that references the prohibited concept halves the benefit.
End with the restriction. Name the construct — the import, the class, the function. Bold it. No "try to avoid" or "where possible."
Format each component distinctly. The directive, context, and restriction should be visually and structurally separate. Don't merge them into one paragraph.

If your instruction is just "don't do X" — you've told the model to think about X.

Tell it what to think about instead. And tell it first.

Instruction Best Practices: Precision Beats Clarity

Gábor Mészáros — Tue, 24 Mar 2026 13:12:30 +0000

Two rules in the same file. Both say "don't mock."

When working with external services, avoid using mock objects in tests.

When writing tests for src/payments/, do not use unittest.mock.

Same intent. Same file. Same model. One gets followed. One gets ignored.

I stared at the diff for a while, convinced something was broken. The model loaded the file. It read both rules. It followed one and walked past the other like it wasn't there.

Nothing was broken. The words were wrong.

The experiment

I ran controlled behavioral experiments: same model, same context window, same position in the file. One variable changed at a time. Over a thousand runs per finding, with statistically significant differences between conditions.

Two findings stood out.

First (and the one that surprised me most): when instructions have a conditional scope ("When doing X..."), precision matters enormously. A broad scope is worse than a wrong scope.

Second: instructions that name the exact construct get followed roughly 10 times more often than instructions that describe the category. "unittest.mock" vs "mock objects" — same rule, same meaning to a human. Not the same to the model.

Scope it or drop it

Most instructions I see in the wild look like this:

When working with external services, do not use unittest.mock.

That "When working with external services" is the scope — it tells the agent when to apply the rule. Scopes are useful. But the wording matters more than you'd expect.

I tested four scope wordings for the same instruction:

# Exact scope — best compliance
When writing tests for src/payments/, do not use unittest.mock.

# Universal scope — nearly as good
When writing tests, do not use unittest.mock.

# Wrong domain — degraded
When working with databases, do not use unittest.mock.

# Broad category — worst compliance
When working with external services, do not use unittest.mock.

Read that ranking again. Broad is worse than wrong.

"When working with databases" has nothing to do with the test at hand. But it gives the agent something concrete - a specific domain to anchor on. The instruction is scoped to the wrong context, but it's still a clear, greppable constraint.

"When working with external services" is technically correct. It even sounds more helpful. But it activates a cloud of associations - HTTP clients, API wrappers, service meshes, authentication, retries - and the instruction gets lost in the noise.

The rule: if your scope wouldn't work as a grep pattern, rewrite it or drop it.

An unconditional instruction beats a badly-scoped conditional:

# Broad scope — fights itself
When working with external services, prefer real implementations
over mock objects in your test suite.

# No scope — just say it
Do not use unittest.mock.

The second version is blunter. It's also more effective. Universal scopes ("When writing tests") cost almost nothing — they frame the context without introducing noise. But broad category scopes actively hurt.

Name the thing

Here's what the difference looks like across domains.

# Describes the category — low compliance
Avoid using mock objects in tests.

# Names the construct — high compliance
Do not use unittest.mock.

# Category
Handle errors properly in API calls.

# Construct
Wrap calls to stripe.Customer.create() in try/except StripeError.

# Category
Don't use unsafe string formatting.

# Construct
Do not use f-strings in SQL queries. Use parameterized queries
with cursor.execute().

# Category
Avoid storing secrets in code.

# Construct
Do not hardcode values in os.environ[]. Read from .env
via python-dotenv.

The pattern: if the agent could tab-complete it, use that form. If it's something you'd type into an import statement, a grep, or a stack trace - that's the word the agent needs.

Category names feel clearer to us, humans. "Mock objects" is plain English. But the model matches against what it would actually generate, not against what the words mean in English. "unittest.mock" matches the tokens the model would produce when writing test code. "Mock objects" matches everything and nothing.

Think of it like search. A query for unittest.mock returns one result. A query for "mocking libraries" returns a thousand. The agent faces the same problem: a vague instruction activates too many associations, and the signal drowns.

The compound effect

When both parts of the instruction are vague - vague scope, vague body - the failures compound. When both are precise, the gains compound.

# Before — vague everywhere
When working with external services, prefer using real implementations
over mock objects in your test suite.

# After — precise everywhere
When writing tests for `src/payments/`:
Do not import `unittest.mock`.
Use the sandbox client from `tests/fixtures/stripe.py`.

Same intent. The rewrite takes ten seconds. The difference is not incremental, it's categorical.

Formatting gets the instruction read - headers, code blocks, hierarchy make it scannable. Precision gets the instruction followed - exact constructs and tight scopes make it actionable. They work together. A well-formatted vague instruction still gets ignored. A precise instruction buried in a wall of text still gets missed. You need both.

When to adopt this

This matters most when:

Your instruction files mention categories more than constructs, like "services," "libraries," "objects," "errors" etc.
You use broad conditional scopes: "when working with...," "for external...," "in general..."
You have rules that are loaded and read but not followed
You want to squeeze more compliance out of existing instructions without restructuring the file

It matters less when your instructions are already construct-level ("do not call eval()") or unconditional.

Try it

Open your instruction files.
Find every instruction that uses a category word -> "services," "objects," "libraries," "errors," "dependencies."
Replace it with the construct the agent would encounter at runtime - the import path, the class name, the file glob, the CLI flag.
For conditional instructions: replace broad scopes with exact paths or file patterns. If you can't be exact, drop the condition entirely - unconditional is better than vague.

Then run your agent on the same task that was failing. You'll see the difference.

Formatting is the signal. Precision is the target.

CLAUDE.md Best Practices: 7 formatting rules for the Machine

Gábor Mészáros — Tue, 03 Mar 2026 13:06:00 +0000

I watched an agent ignore a rule I wrote 2 hours earlier.

Not a vague rule. A specific one. "run pytest before committing." It was right there in the CLAUDE.md, paragraph two, between the project description and the linting setup. The agent read the file. I saw it in the context. It just... didn't follow it.

I moved the same instruction under a ## Testing header, wrapped pytest in backticks, and added a one-line rationale. Next run, the agent followed it to the letter.

The instruction didn't change. The signal strength did.

In the last post, we got the agent oriented — /bootstrap loads the map, the workflows, the boundaries. But orientation and compliance are different things. You can hand someone a perfect briefing and still lose them if the briefing is a wall of text. Same with agents.

The question isn't whether your instructions are loaded. It's whether the agent follows them.

The comparison

Here's the same instruction, two ways.

Version A:

When working on this project, always make sure to run the test suite
before committing any changes. The command to run tests is pytest and
you should run it from the project root. If tests fail, fix them before
committing. Also make sure to use ruff for formatting.

Version B:

## Testing

- `pytest` — run from project root before every commit
- Fix failures before committing

## Formatting

- `ruff check --fix && ruff format` — run before committing

Same content. Version B gets followed. Version A gets buried.

This isn't about aesthetics. Structural elements — headers, code fences, lists — create anchor points that agents latch onto. Prose paragraphs don't. The more structure you provide, the more reliably each instruction lands.

It's not just about length

You already learned to keep your CLAUDE.md short. It's a good start but it's not sufficient. A 20-line prose paragraph gets lost just as easily as a 200-line one. The variable isn't word count. It's structure.

A short file with no headers, no code blocks, and no rationale will underperform a longer file that's well-structured.

Length is the ceiling. Formatting is the signal.

Seven structural rules

These aren't content guidelines. They're formatting choices that determine whether instructions survive the trip from file to agent behavior. I'll start with the three you won't find in other guides, then cover the four that everyone mentions but nobody explains why.

1. Include rationale

"Never force push" is an instruction. "Never force push — rewrites shared history, unrecoverable for collaborators" is an instruction the agent weighs.

# Without rationale
- Never use `rm -rf` on the project root
- Always run tests before committing
- Don't modify package-lock.json manually

# With rationale
- Never use `rm -rf` on the project root — irrecoverable
- Always run tests before committing — CI will reject untested code
- Don't modify package-lock.json manually — causes merge conflicts
  and dependency resolution issues

The rationale doesn't just explain — it gives the agent a way to generalize. An agent that understands why force push is forbidden will also avoid git reset --hard origin/main without being told. The "why" turns a single rule into a class of behaviors.

This is the most undervalued formatting choice. Every prohibition should carry its reason.

2. Keep heading hierarchy shallow

Three levels is enough. h1 for the file title, h2 for sections, h3 for subsections. That's it.

# Before (5 levels deep)
# Project
## Development
### Testing
#### Unit Tests
##### Mocking Strategy

# After (3 levels max)
# Project
## Testing
### Unit tests

Deep nesting dilutes attention. An h5 competes with every heading above it for the agent's focus. It doesn't lose the h2, but the hierarchy creates ambiguity about which level governs. Flat structures keep every instruction at the surface. If you need an h4, you probably need a separate file.

3. Name files descriptively

When an agent searches your project - browsing a directory listing, running a glob, deciding which file to read - the file name is the first filter. Before content, before headers, before anything.

# Before
docs/guide.md
docs/notes.md
scripts/setup.sh

# After
docs/api-authentication.md
docs/deployment-checklist.md
scripts/setup-local-dev.sh

The agent sees a directory listing and picks what to open. api-authentication.md tells it whether the file might be relevant to the current task. guide.md forces it to open and read before it can decide. Descriptive names save the agent a round trip. In a project with dozens of files, that adds up.

This applies to any file the agent might discover: docs, scripts, configs.

Now the four you've heard before - but with a why.

4. Use headers

Agents scan headers the way developers scan a README: as a table of contents. A header says "new topic, reset attention."

# Before
The project uses TypeScript with strict mode enabled. For testing we
use vitest. The CI pipeline runs on GitHub Actions.

# After
## Language

TypeScript with strict mode enabled.

## Testing

- `npx vitest` — run from project root

## CI

- `.github/workflows/` — GitHub Actions

One topic per header. The agent navigates to the right section instead of parsing the whole paragraph. Without headers, every instruction competes with every other instruction for attention.

5. Put commands in code blocks

Commands in prose get read as descriptions. Commands in code blocks get treated as executable.

# Before
You can run the linter by running npm run lint and the tests
by running npm test.

# After
- `npm run lint` — check for issues
- `npm test` — run test suite

If you do nothing else from this post, wrap your commands in backticks. It's the single highest-impact change - a command in a code fence is a command. A command in a sentence is a suggestion.

6. Use standard section names

## Testing gets recognized instantly. ## Quality Assurance Verification Process doesn't.

Agents have been trained on millions of README files. They know what ## Testing, ## Commands, ## Structure, and ## Conventions mean. Those names carry built-in context.

Instead of	Use
Quality Assurance	Testing
Development Guidelines	Conventions
Operational Instructions	Commands
Safety and Compliance	Boundaries
Project Organization	Structure

The familiar name is the signal. The creative name is noise.

7. Make instructions actionable

"Follow best practices" is not an instruction. "Use ruff for formatting, run before committing" is.

The test: could an agent execute this instruction right now, without asking a clarifying question? If not, it's too vague.

# Before
Make sure code quality is maintained and follows our standards.

# After
## Conventions

- Format with `ruff format` before committing
- Type annotations on all public functions
- No `print()` in production code — use `logging`

Every instruction should pass the "act on it immediately" test. If it can't be acted on, it's a wish, not an instruction.

The compound effect

Each rule alone is a small improvement. Together, they're multiplicative - not because the rules add up, but because they reinforce each other. Headers create sections. Sections hold code blocks. Code blocks contain actionable commands. Rationale explains why. Descriptive file names route attention to the right file. Shallow hierarchy keeps everything findable.

Here's a realistic before/after applying all seven:

Before:

This project is a Python CLI tool. We use pytest for testing and ruff
for linting. Make sure to run tests before you commit anything. The
source code is in src/myapp and tests are in tests/. Don't modify
anything in the dist/ folder because that's generated. Also we have
some rules about how to write tests — they should test behavior not
implementation details, and use parametrize instead of writing lots
of individual test functions that do the same thing.

After:

## Testing

- `pytest` — run from project root before every commit
- Test behavior, not implementation — assert on outcomes, not internal calls
- Use `@pytest.mark.parametrize` when cases share the same assertion shape

## Formatting

- `ruff check --fix && ruff format`

## Structure

- Source: `src/myapp/`
- Tests: `tests/`

## Boundaries

- `dist/` — generated, do not modify

Same information. Half the words. Every instruction lands.

When to reformat

If you notice:

The agent apologizes for missing an instruction that's in your file
The same rule gets violated in consecutive sessions
You keep adding more words to an instruction hoping the agent will "get it"
Your CLAUDE.md is one long section with no headers
Commands appear in sentences instead of code blocks

Your instructions don't need more content. They need more structure.

The connection to /bootstrap

In the previous posts we built the delivery system: backbone.yml maps the project, Mermaid draws the workflows, /bootstrap loads both in seconds. That's the orientation layer - the agent knows where it is and how things work.

This is about attention budget allocation. The agent has a limited context window. What matters isn't just what's in it — it's how the agent decides what's relevant at each step. Structure is what makes your instructions win that competition.

Orientation without compliance means the agent knows your project but ignores your rules. Compliance without orientation means the agent follows instructions but works in the wrong place. You need both.

Try it

Open your CLAUDE.md (or whatever instruction file your agent reads)
Find the longest prose paragraph
Break it: one header per topic, one code block per command, one sentence of rationale per prohibition
Run your agent on the same task you ran yesterday

The instructions didn't change. The signal did.

Don't just write more instructions. Format the ones you have.

Why /bootstrap should be the first Command in every Agent session

Gábor Mészáros — Tue, 24 Feb 2026 12:39:23 +0000

After a 2.5 hour session you accidentally close your coding agent terminal mid session. The output is there, the commits are there, but something important is gone.

That synergy that you spent hours to build up.

You reopen the console and hope you two can start over, but it feels like now you are strangers. The agent is now "Somebody that you used to know."

No, this is not an intro of a light love novel, it's the usual experience with coding agents. Coding agents are stateless by design so each and every new session is a new beginning.

The resume illusion

Some agents have --resume functionality. Claude Code has it. Codex has it. Gemini CLI has it. It's useful, but it has limitations.

--resume only replays the conversation log. It doesn't restore the loaded and curated mental model - the understanding of your project's topology, constraints, and current state that the agent built up over those 2.5 hours.

Resume gives you only the transcript. Not the understanding.

Two primitives I already had

Over the last few weeks I wrote about two separate ideas:

In The backbone.yml Pattern, I introduced a YAML manifest that maps your project's topology - agents, directories, configs, schemas. Information. The agent reads it once and knows where everything is. No more exploration tax.

In Mermaid for Workflows, I showed how flowcharts give agents reliable step-by-step processes to follow. Process. Structured syntax that sticks out in a context window full of prose, backed by research showing agents follow flowcharts more reliably than natural language.

Backbone tells the agent what exists. Workflows tell the agent how to operate.

But I was using them separately. I'd tell Claude "read the backbone" at session start, then invoke workflows as needed. Manual orchestration. Every session, same ritual.

Why am I doing this separately? Isn't context just Information + Process ?

Read the map. Follow the process. Produce a working mental model. Every session, one command.

That's /bootstrap.

What /bootstrap does

One command. Two modes.

First run (no backbone exists): scans the project, detects agents and structure, generates a backbone.yml, then synthesizes a context report.

Every subsequent run (backbone exists): reads the backbone, maps agents, loads constraints, checks project state, and produces a mental model.

Both modes use the diagram + prose combo from the mermaid post - flowcharts for the branching, prose for the reasoning behind each step.

Bootstrap workflow

The output looks like this:

Bootstrap complete.

Project: my-app v1.2.0 (branch: feature/auth)
Agents: claude (CLAUDE.md), copilot (.github/copilot-instructions.md)
Structure: src/, tests/, docs/, config/

Navigation:
  Agent config → backbone.agents.{agent}
  Project dirs → backbone.paths.{key}
  Schemas      → backbone.schemas.{name}

Operations:
  Build  → npm run build
  Test   → npm test
  Deploy → ./scripts/deploy.sh

Constraints:
  - Never modify config/production.yml directly
  - Always run tests before committing

State: v1.2.0, 3 unreleased changes (auth module)

After this, the agent knows where things are, how to operate, what's off limits, and what's in progress. No exploration. No guessing.

Seed mode: the smart first run

Most bootstrapping tools drop a blank template and say "fill this in." That's 0% useful on day one.

/bootstrap scans first, generates second. It detects agents across the ecosystem:

Pattern	Agent
`CLAUDE.md`	Claude
`AGENTS.md`	Codex
`.github/copilot-instructions.md`	Copilot
`.cursorrules`	Cursor
`.windsurfrules`	Windsurf
`.clinerules`	Cline
`.aider*`	Aider
`.continue/config.json`	Continue

It maps directories, finds configs, detects build/test workflows from package.json, Makefile, CI configs. The generated backbone is 70-80% correct from the scan alone.

The remaining 20% - semantic connections, domain concepts - gets marked with # TODO: refine so you know exactly where to invest review time. Verified topology. Flagged guesses. One command.

The skill structure

I built this as an Agent Skill - the open standard for packaging reusable instructions across agents:

bootstrap/
  SKILL.md              # Entry point - frontmatter + instructions
  workflows/
    seed.md             # Scan + generate (mermaid flowchart)
    bootstrap.md        # Read + synthesize (mermaid flowchart)
  templates/
    backbone.yml        # Starter backbone shape

See the two primitives? The templates/backbone.yml is the information layer from the backbone post. The workflows/*.md files are the process layer from the mermaid post - complete with flowcharts, key decisions, and edge cases.

/bootstrap is their love child. One skill that reads both primitives and turns them into a loaded context.

Cross-agent by design

The SKILL.md format is an open standard created by Anthropic and now adopted by OpenAI, Google, Cursor, and others. A skill authored once works across 30+ agents - the format is filesystem-based, not API-dependent.

Drop the bootstrap/ folder into .claude/skills/ for Claude Code, .agents/skills/ for Codex CLI, or wherever your agent looks. Same skill, same result.

This matters because the bootstrap concept isn't Claude-specific. Every coding agent is stateless. Every agent benefits from a loaded mental model at session start. The problem is universal, so the solution should be too.

What changes after bootstrap

Before bootstrap, every session starts with the agent exploring. After bootstrap, every session starts with the agent understanding.

No more find / ls / grep loops to discover what the backbone already maps
No more wrong assumptions about where configs live
No more repeated corrections - "no, the tests are in spec/, not tests/"
No more context poisoning from exploration artifacts cluttering the window

The agent reads the backbone, follows the workflow, synthesizes the context, and starts working. Every session. In seconds.

The progression

Looking back at this series, the progression is clear:

In the capability levels post - what maturity looks like for instruction files.
In the backbone.yml post - give the agent a map (information).
In the mermaid post - give the agent reliable processes (workflows).
Now - combine both into a single command that loads a mental model.

Map + Process = Understanding. That's the whole idea.

Try it

The bootstrap skill will be published as a cross-agent compatible Agent Skill in the Reporails skills repo this week.

In the meantime, the pattern works even without the skill:

Create a backbone.yml mapping your project (template here)
Add a workflow with a mermaid flowchart for session initialization (approach here)
Start every session with: "Load the backbone, follow the bootstrap workflow, and tell me what you understand"

That's manual bootstrap. The skill just makes it /bootstrap.

Don't start a session. Bootstrap it.

This post is part of the Reporails series. Previous: Mermaid for Workflows.

CLAUDE.md Best Practices: Mermaid for Workflows

Gábor Mészáros — Tue, 17 Feb 2026 12:04:57 +0000

I picture says a thousand words. I wanted to see my system.

Not the code. I wanted to see the workflows. What happens when a rule gets validated. What happens when a session starts. What happens when compaction triggers. Systems are workflows, and I couldn't see mine.

I had them written down, of course. Prose paragraphs in CLAUDE.md/SKILL.md or RULES describing each process step by step. But past four or five steps with branching, the prose became unreadable. I'd write it, come back a week later, and need to re-parse the whole thing to understand what I'd written. Mental overload, every time.

My coding agent had the same problem. Research calls it "lost in the middle" - LLMs perform best with information at the beginning and end of their context, and significantly worse with information buried in the middle. My prose workflows were exactly that: critical branching logic buried in paragraphs, sandwiched between other instructions. Claude would miss steps. Skip branches. Drift from the intended process.

And the workflows themselves drifted too. I'd remove a pipeline phase and update one paragraph but miss another. Prose makes that invisible - three sentences can reference a removed step and nothing looks broken.

So I rewrote my workflows as Mermaid diagrams. And three things happened at once:

I could see the system. Rendered Mermaid gives you a visual map of what's happening - for free.
Claude followed them more reliably. Structured syntax sticks out in a context window full of prose.
They stopped rotting. You can't leave a dangling arrow in a flowchart the way you can leave a stale sentence in a paragraph.

Turns out there's research backing all three.

The research

FlowBench (Xiao et al., EMNLP 2024) tested how LLM agents perform when given the same workflow knowledge in different formats - natural language, pseudo-code, and flowcharts. Across 51 scenarios on GPT-4o, GPT-4-Turbo, and GPT-3.5-Turbo:

Flowcharts achieved the best trade-off for agent performance
Combining formats (text + code + flowcharts) outperformed any single format

Format matters. It measurably affects how well the agent follows your instructions.

What to convert

Not everything benefits equally from a diagram. The rule:

If it has branches, it needs a diagram. If it has judgment, it also needs prose. Most real workflows need both.

Deterministic pipelines - CI/CD, deployment, validation, review workflows - are pure flowchart territory. Every step has a defined outcome, every branch has a condition.

But most workflows aren't purely deterministic. They have branching and judgment: "if the tests fail with a type error, fix inline; if it's a logic error, rethink the approach." The diagram captures the branch. The prose below it captures the judgment. Neither format alone carries both.

Before and after

Here's what my rule validation workflow looked like before - prose only, describing the same process:

## Rule Validation

Run validation on all rules. For each rule, first validate the
schema (fields, types, format). If that passes, check the contract
(.md and .yml matching). If the contract is valid, resolve template
variables and run OpenGrep validation on pattern syntax. If OpenGrep
returns exit 2 or 7, report the error. If it returns 0 or 1,
the rule passes. After all rules are checked, output a summary.

And here's what the Mermaid version looks like:

flowchart TD
    START([/validate-rules options]) --> COLLECT[Collect rules from paths]
    COLLECT --> LOOP[For each rule]
    LOOP --> SCHEMA[1. Schema validation<br/>Fields, types, format]
    SCHEMA -->|fail| REPORT
    SCHEMA -->|pass| CONTRACT[2. Contract validation<br/>.md and .yml matching]
    CONTRACT -->|fail| REPORT
    CONTRACT -->|pass| RESOLVE[Resolve template variables]
    RESOLVE --> OPENGREP[3. OpenGrep validation<br/>Pattern syntax]
    OPENGREP -->|exit 2 or 7| REPORT
    OPENGREP -->|exit 0 or 1| REPORT[Report results]
    REPORT --> NEXT{More rules?}
    NEXT -->|yes| LOOP
    NEXT -->|no| SUMMARY[Summary output]

And the result:

Rendered Mermaid workflow from Reporails rule validation

Same information. But the flowchart makes every branch explicit and every failure path visible. Claude can't accidentally skip a validation step or misinterpret which exit codes mean failure.

But the diagram alone is still only half the answer.

The combo: diagram + prose

FlowBench's strongest finding wasn't "use flowcharts" - it was "combine formats." Each format carries what it's best at.

Here's what one of my actual workflows looks like after conversion - rule-validation.md from Reporails:

## Rule Validation Workflow

mermaid
flowchart TD
    START([/validate-rules options]) --> COLLECT[Collect rules from paths]
    COLLECT --> LOOP[For each rule]
    LOOP --> SCHEMA[1. Schema validation<br/>Fields, types, format]
    SCHEMA -->|fail| REPORT
    SCHEMA -->|pass| CONTRACT[2. Contract validation<br/>.md and .yml matching]
    CONTRACT -->|fail| REPORT
    CONTRACT -->|pass| RESOLVE[Resolve template variables]
    RESOLVE --> OPENGREP[3. OpenGrep validation<br/>Pattern syntax]
    OPENGREP -->|exit 2 or 7| REPORT
    OPENGREP -->|exit 0 or 1| REPORT[Report results]
    REPORT --> NEXT{More rules?}
    NEXT -->|yes| LOOP
    NEXT -->|no| SUMMARY[Summary output]


## Why Three Layers in This Order

1. **Schema validation** catches structural errors (missing fields, wrong
   types) with zero external dependencies. Cheapest check - filters out
   rules that would cause confusing downstream failures.

2. **Contract validation** confirms that rule.md and rule.yml agree.
   Catches the class of bugs where one file was updated but the other
   wasn't. Requires both files to be schema-valid first.

3. **OpenGrep validation** runs actual patterns against the syntax
   checker. Most expensive step - requires template resolution, file I/O,
   agent config loading. Only runs on rules that are already structurally
   sound.

The diagram shows the three-step pipeline with its branches. The prose explains why that ordering - cheapest first, most expensive last, each layer depending on the previous one being clean. Neither format alone carries both the flow and the reasoning.

When to adopt this

If your CLAUDE.md has any of these, you have a flowchart waiting to happen:

"First do X. If X passes, do Y. If Y fails, do Z."
"Run A, then B, then C. If any step fails, stop."
"Check for X. If found, do Y. Otherwise, do Z."

Sequential steps with conditions = flowchart. Convert those, leave everything else as prose.

Try it

Find a workflow in your CLAUDE.md that reads like a recipe with conditions
Rewrite the control flow as Mermaid
Keep the rationale and judgment calls as prose below the diagram
Delete the original prose-only version

One converted workflow. See if Claude follows it more reliably - and enjoy being able to see your system for the first time.

Don't describe the path. Draw the map.

*The FlowBench paper is at arxiv.org/abs/2406.14884. The "lost in the middle" paper is at arxiv.org/abs/2307.03172.

I'm building instruction file governance at Reporails - this finding led to a new rule category (Context Quality) that I'll cover in the next post.*

Previous in series: The backbone.yml Pattern

Reporails: Copilot adapter, built with copilot, for copilot.

Gábor Mészáros — Mon, 16 Feb 2026 07:54:28 +0000

This is a submission for the GitHub Copilot CLI Challenge

What I Built

Reporails is a validator for AI agent instruction files: CLAUDE.md, AGENTS.md, copilot-instructions.md. It scores your files, tells you what's missing, and helps you fix it.

The project already supported Claude Code and Codex. For this challenge, I added GitHub Copilot CLI as a first-class supported agent - using Copilot CLI itself to build the adapter.

The architecture was already multi-agent by design. A .shared/ directory holds agent-agnostic workflows and knowledge. Each agent gets its own adapter that wires into the shared content. Claude does it through .claude/skills/, Copilot through .github/copilot-instructions.md.

Adding Copilot took 113 lines. Not because the work was trivial - but because the architecture was ready.

Repos:

CLI: reporails/cli (v0.3.0)
Rules: reporails/rules (v0.4.0)
Recommended: reporails/recommended (v0.2.0)

Demo

After adding Copilot support, each agent gets its own rule set with no cross-contamination:

Agent	Rules	Breakdown
Copilot	29	30 CORE - 1 excluded + 0 COPILOT-specific
Claude	39	30 CORE - 1 excluded + 10 CLAUDE-specific
Codex	37	30 CORE + 7 CODEX-specific

Run it yourself:

npx @reporails/cli check --agent copilot

My Experience with GitHub Copilot CLI

It understood the architecture immediately

I explained the .shared/ folder — that it was created specifically so both Claude and Copilot (and other agents) can reference the same workflows and knowledge without duplication. Copilot got it on the first exchange:

Copilot understanding .shared/ architecture

The key insight it surfaced: "The .shared/ content is already agent-agnostic. Both agents reference the same workflows. No duplication is needed - just different entry points."

That's exactly right. Claude reaches shared workflows through /generate-rule → .claude/skills/ → .shared/workflows/rule-creation.md. Copilot reads instructions → .shared/workflows/rule-creation.md. Same destination, different front doors.

What it built

Copilot created the full adapter in three phases:

Foundation - .github/copilot-instructions.md, agents/copilot/config.yml, updated backbone.yml, verified test harness supports --agent copilot
Workflow Wiring - entry points in copilot-instructions.md, context-specific conditional instructions, wired to .shared/workflows/ and .shared/knowledge/
Documentation - updated README and CONTRIBUTING with agent-agnostic workflow guidance

Copilot Contribution Parity Complete

The bug it found (well, helped find)

While testing the Copilot adapter, I discovered that the test harness had a cross-contamination bug. When running --agent copilot, it was testing CODEX rules too — because _scan_root() scanned ALL agents/*/rules/ directories indiscriminately.

The fix was three lines of Python:

# If agent is specified, only scan that agent's rules directory
if agent and agent_dir.name != agent:
    continue

Test Harness Agent Isolation Fix

The model selector surprise

When I opened the Copilot CLI model selector, the default model was Claude Sonnet 4.5. The irony of building a Copilot adapter using Copilot CLI running Claude was not lost on me.

What worked, honestly

Copilot CLI understood multi-agent architecture without hand-holding. It generated correct config files matching existing adapter patterns. The co-author signature was properly included in all commits. It didn't try to duplicate content that was already shared - it just wired the entry points.

The whole experience reinforced something I've been thinking about: the tool matters less than the architecture underneath. If your project is structured well, any competent agent can extend it. That's the whole point of reporails - making sure your instruction files are good enough that the agent can actually help you.

What also happened during this challenge

While building the Copilot adapter, I also rebuilt the entire rules framework from scratch. Went from 47 rules (v0.3.1) to 35 rules (v0.4.0) - fewer rules, dramatically higher quality. Every rule is now distinct, detectable, and backed by evidence. But that's a story for another post.

Try it: npx @reporails/cli check

GitHub | Previous posts

DEV Community: Gábor Mészáros

Prompt Engineering, Context Engineering, Loop Engineering: What Actually Changed

The unit was a prompt

The unit became the surface

The unit is the loop

A verifier that checks what, exactly?

What the earlier articles measured

What the new name solves

The loop, one part at a time

Wiring a "read this first" hook as "ask" (don't)

Green Tests Don't Mean Better Software

What green actually checks

Two disciplines, pitched as rivals

The orthogonality move

A worked example: green architecture tests

The expectation primitive

Proof of existence, and the cost of skipping it

confirmed is the minority verdict

inconclusive is the common case

Even confirmed is not permanent

Instruction systems capability ladder: harness leveling

The new ladder

A quick tour

The first cut: L5 / L6

The second cut: L7 writes itself

When to climb

Three questions for your repo

A note on related taxonomies

The State of AI Instruction Quality

The dataset

How we measured

Finding 1: Most of your instruction file isn't instructions

Finding 2: 90% of instructions don't name what they're talking about

Finding 3: agents.md is the most common instruction file

Finding 4: Different agents, completely different config philosophies

Finding 5: 37% of projects configure multiple agents

Finding 6: The most-copied skills are the vaguest

The best and worst skills (>50 repos)

Finding 7: Sub-agents are almost entirely persona prompts

The anatomy chart: more directives, worse quality

Limitations

What this means

Try it yourself

The dataset

The Undiagnosed Input Problem

The symptom

The industry response

The folk system

The gap

What we measured

More rules, lower quality

The sweet spot

A real example

What this means

What’s next

Do NOT Think of a Pink Elephant

The pink elephant

The anatomy of a complete instruction

The part nobody expects

What this looks like in practice

Try it

Instruction Best Practices: Precision Beats Clarity

The experiment

Scope it or drop it

Name the thing

The compound effect

When to adopt this

Try it

CLAUDE.md Best Practices: 7 formatting rules for the Machine

The comparison

It's not just about length

Seven structural rules

1. Include rationale

2. Keep heading hierarchy shallow

3. Name files descriptively

4. Use headers

5. Put commands in code blocks

6. Use standard section names

7. Make instructions actionable

The compound effect

`confirmed` is the minority verdict

`inconclusive` is the common case

Even `confirmed` is not permanent

Finding 3: `agents.md` is the most common instruction file