DEV Community: Reporails

Loop Engineering: Fine-Tuning the Guardrail That Fired Wrong

Gábor Mészáros — Tue, 14 Jul 2026 11:28:12 +0000

The check had been green for a week. It greps every diff under src/ for import mock, because production code has no business importing a mock library. Then it went red on src/payments.py, a file with no mock import anywhere in it. The word mock was sitting in a docstring, in a sentence telling the next developer not to import one. The grep matched the sentence. The quickest way to clear that red is to delete the check.

The series opener named the loop that runs generate, check, steer, retry, stop, and put the hard part on the check that decides good enough, stop. Last week's piece built that check as no-mocks.sh and set a gate beside it, a hook that refuses a bad write instead of catching it after the fact. This one stays with the check and asks what happens when it is wrong: how to tell a broken instrument from a signal that was never there.

What a check is here

A check is a function from the agent's output to a verdict: pass or fail. In the loop it is the thing that decides whether an iteration is acceptable and whether to run another. A check comes in two kinds. A deterministic check runs the code or scans the text and returns the same verdict every time: grep, pytest, an exit status. A model-graded check asks another model is this good? and reaches criteria the first cannot express, at the cost of the same unreliability the loop was built to contain.

This piece is about the deterministic kind, because it is the kind you can hold in your hand and debug. When a deterministic check gives you a verdict you disagree with, the verdict is wrong in a way you can inspect line by line. That is exactly the property the argument below leans on.

Two readings of the same quiet

A check that comes back quiet is telling you one of two things, and it does not tell you which:

Absent signal. The thing the check guards against genuinely is not there. This is the good news you wanted.
Broken instrument. The thing is there, or would be there, but the check cannot see it. The guard is off and the diff looks the same as if it were on.

A false alarm, the check firing on a clean file, is the visible failure. It is annoying, but it announces itself; you know the instrument is off because it is screaming at nothing. The dangerous failure is the opposite one, and it is silent: a check weakened until it stops seeing real violations returns the same green as a check that correctly found nothing. You cannot feel the difference between "safe" and "blind" from the verdict alone.

So the misfire is really about the check, not the file it landed on, and it sets a trap worth naming early. A red mark on a file you know is clean reads as a reason to stop trusting the check, and it tends to show up on the very run where that same check is catching something real elsewhere. Look at what the pattern actually matched before you act on the verdict it returned.

The misfire, up close

Make it concrete with the check itself. The version that misfires is the naive one, a bare substring match, before the ^ anchor last week's piece put on the pattern:

grep -rn 'import mock' src/ && exit 1

For a week it did its job. Then you added src/payments.py, whose docstring happens to document the rule the check enforces:

"""Charge processing. Do not import mock here, use the fakes in tests/."""

import stripe

There is no mock import in that file. Run the check and it fires anyway, alongside a real leftover it correctly caught in src/badcache.py:

$ bash no-mocks.sh
src/payments.py:1:"""Charge processing. Do not import mock here, use the fakes in tests/."""
src/badcache.py:1:import mock   # leftover from a spike

The badcache.py line is the catch, an import mock that has been sitting in production code. The payments.py line is the misfire: that file is clean, and the check flagged it because import mock appears as a substring inside prose documenting the very rule the check enforces.

The tempting fix, and what it costs

You have a red check on a clean file. The path of least resistance has three forms, and all three are the same move:

Delete no-mocks.sh. The red goes away.
Weaken it: add --exclude=payments.py, or only run it on Tuesdays, or drop it to a warning nobody reads.
Conclude the check was never worth much: "grep is dumb, mock imports are rare anyway, this is more trouble than it catches."

Every one of these makes the noise stop, and stopping the noise feels like fixing the problem. Re-read the output. The same run that misfired on payments.py also, correctly, caught the real import mock in badcache.py. Delete or defang the check and that catch disappears with the false alarm. The next real mock import to land in src/ sails through, and the suite stays green, because green is now what this check returns for everything.

That is the inversion. The misfire tempts you to conclude the signal is not worth watching, at the exact moment the same check proved the signal is real. You would be reasoning about the world from the failure of your instrument to see it.

Fix the instrument instead

The misfire told you something specific: the pattern matched prose, not code. So make the pattern match code. That is the one-character change last week's piece already carried, the ^ that anchors the pattern to the start of a line, where an import statement lives and a docstring sentence does not. Add it and payments.py goes quiet while badcache.py still fails.

Both make the red on payments.py disappear, and only one keeps catching badcache.py. Deleting the check clears the false alarm by throwing away the real catch with it; anchoring the pattern clears the false alarm and leaves the real catch standing.

I run the same shape of check on my own agent's replies. A deterministic scan looks for a short list of banned phrases: a placation opener, or an unbacked the work is verified when nothing was actually run. One day it fired on a reply that was clean. The banned phrase was sitting inside a quotation of the rule that bans it, and inside a backticked identifier the agent was naming rather than using. The check had matched the words, not their use, the same way the grep matched the docstring. Loosening the phrase list until that alarm stopped would have let the next real unbacked claim through, so I scoped the matcher instead. It blanks the cited spans, the backtick spans and quoted strings and negated clauses, before the scan runs, so an empty the work is verified still fires while a citation of the rule goes quiet.

The anchored check still is not a perfect mock detector. An aliased import unittest.mock as m is a different pattern, and an indented import inside a function slips past a start-of-line anchor. A real linter closes both without any of this hand-rolling: flake8-tidy-imports carries a banned-modules list, and ruff's F401 flags imports the code never actually uses. If mock-in-production is the exact thing you are guarding, reach for one of those. The grep still earns its place as the first thing you write, because it is the cheapest deterministic check a team reaches for in a hook, one line in a script that already runs. And the lesson holds after you upgrade the instrument, because a linter rule is a pattern too. F401 matches what looks like an unused import, not the fact of use, and it has edge cases where it reads the text wrong. Every one of those is a calibration reading you use to sharpen the rule, never a reason to stop measuring.

The rule that generalizes

A misfire is data about the instrument, never a verdict about the world. When a deterministic check fires on something you know is clean, the finding is "my check matches too much," not "the thing I was checking for does not exist."

The asymmetry is what makes the discipline worth holding. A false alarm is loud and self-correcting: it interrupts you until you deal with it, and it costs you the few minutes the fix takes. A deleted or defanged check fails the other way. It goes quiet and stays quiet: no output, no red, no interruption, just a signal nobody is watching anymore, indistinguishable from a signal that was never there. Every real violation that lands afterward passes clean, and nothing on screen tells you the guard is gone.

This was one of the questions the series opener left open: how to tell a broken instrument from an absent one. Its sibling piece took another, the gate that refuses a bad write instead of catching it a beat too late, and the line between a rule that observes and one that refuses. The question underneath both is still open. Every check and every gate is fed by rules the agent carries as loaded text, paid for on every turn whether they apply that turn or not. What an instruction costs to keep loaded, and what changes when you load it only where it applies, is where the series goes next.

I work on Reporails, deterministic diagnostics for the instruction files, rules, and prompts that steer coding agents. It reads the steering surface and tells you, with measured evidence, where it drifts. The checks here are the same idea pointed at a diff: what a rule actually matches, held up against what you meant it to match.

Deterministic Guardrails: Prompts Steer, Hooks Enforce

Gábor Mészáros — Thu, 09 Jul 2026 20:35:46 +0000

A loop that refactors code until a check passes is a few lines of shell. Most guardrails beside such a loop only watch what an agent already wrote; the one worth building refuses a bad edit before it lands, and that is what this piece constructs.

The rule it enforces: production code under src/ must not import a mock library. That rule can live in two places. Put it in the prompt, do not import mock, and it is clear and still only a request: text loaded into a model is a lever on probability, not a switch, so it leaves the model free to write the import anyway. Put it in a gate that fires before a write lands, and it refuses the edit outright. Both state the same rule; only one can make it hold every time.

The ask channel has a ceiling: a prompt pushes the odds up, never to one. How close a prompt gets to that ceiling is measurable work. Reporails reads the asking channel and shows you, with measured evidence, which instructions pull the lever and which are text the model is free to ignore, so you can write the ask against evidence instead of on faith. Even a well-measured prompt tops out below certainty, because the model underneath stays probabilistic. Closing that last gap is what the gate is for.

The loop, and where it can only ask

This is the second component of the loop taken on its own: the gate. The series opener named the loop (generate, check, steer, retry, stop) and posed three questions about the components it turns on. The first is the check, the arm that decides good enough, stop. This piece takes the second: the rules that can refuse a diff, and the difference between a channel that can ask and a channel that can say no.

Here is the loop the series is dissecting. It refactors src/ until a guard holds. The agent never decides it is done; the guard does.

#!/usr/bin/env bash
# work-until-checked: refactor src/ until the guard holds.
# The steering rule lives in the agent's own instructions (CLAUDE.md),
# stated once; the loop feeds back only what changed, the guard's output.
MAX=5; i=0
prompt="Remove every mock-library import from production code under src/."
while [ "$i" -lt "$MAX" ]; do
    run_agent --task "$prompt"                      # GENERATE (rule already in context)
    if bash no-mocks.sh; then                       # CHECK
        echo "stop: guard holds after $i retries"; exit 0
    fi
    prompt="The last attempt still tripped the guard; fix it:
$(bash no-mocks.sh 2&gt;&amp;1)"                            # STEER: only the new signal
    i=$((i + 1))
done
echo "stop: budget exhausted, guard still red"; exit 1

Read the arms one at a time. The steering rule lives once in the agent's own instructions, its CLAUDE.md, so run_agent --task "$prompt" never re-ships it. That first $prompt is the ask, and the rule already in context is text the model reads as a lever on probability, not a switch: Remove every mock-library import moves the odds that the next diff is clean without setting them to one. The STEER arm rewrites $prompt to carry only what changed, the guard's latest output, and asks again against that. A tighter, more specific ask, but an ask.

no-mocks.sh is the only arm that is not asking. It scans src/ after run_agent returns, so it is a detector: it reads the diff the agent already wrote and returns a verdict on it. When the verdict is red, the loop retries. Nothing here stops the bad write from landing in the first place. The guard catches after the fact, and the loop re-rolls. Between the write and that catch, a forbidden import can sit in production code for a whole iteration before the detector sees it.

For a refactor loop that converges, the window is usually harmless, because the next iteration overwrites it. But "usually harmless" is doing a lot of work, and the cases where it is not are exactly the constraints you cared enough to write down twice.

Two channels: one asks, one refuses

A prompt is the channel that can only ask. Everything on the context surface, the task, the agent's instructions, the system prompt, is an input to a probabilistic process. It shifts the distribution of what the model produces, which is why prompt and context engineering are real disciplines. It is also, by construction, unable to force any single outcome. The prompt asks, and any given iteration is free to ignore it.

A gate is the channel that can refuse. It is a deterministic function that fires at a fixed point and returns a hard verdict the model does not get to route around. It does not shift odds; it either lets the diff through or it does not.

The base loop has an asker (the task), a steerer (the retry), and a detector (the guard). It has no refuser. Nothing in it can stop a write before the write happens, so a loop that checks for exactly the thing you forbade can still let it land and catch it a beat too late. Adding the refuser is the rest of this piece.

Build the missing arm: a hook that refuses

Claude Code fires hooks at named transitions in the harness: SessionStart, PreToolUse, PostToolUse, Stop. A PreToolUse hook runs before a tool call executes and can block it. That is the transition to key on, because it is the one point where you can refuse a write before it happens rather than detect it after. Here is where the gate sits in the lifecycle of a single write:

The gate is the diamond, and everything downstream of the allow edge is the write reaching disk. Refuse there and the write never reaches the Tool executes node.

Here is a hook that blocks any Write or Edit whose content adds a mock import under src/. Save it as .claude/hooks/block-mock-imports.sh:

#!/usr/bin/env bash
# block-mock-imports.sh: refuse any Write/Edit that adds a mock import to src/.
# Runs as a Claude Code PreToolUse hook. Exit 2 blocks the tool call.
payload=$(cat)   # the tool call arrives as JSON on stdin
path=$(jq -r '.tool_input.file_path // ""' &lt;&lt;&lt;"$payload")
content=$(jq -r '.tool_input.content // .tool_input.new_string // ""' &lt;&lt;&lt;"$payload")

case "$path" in src/*) ;; *) exit 0 ;; esac   # only guard production code

if grep -qE '^(import mock|from mock|from unittest import mock)\b' &lt;&lt;&lt;"$content"; then
    echo "blocked: '$path' would add a mock import to production code. \
Use a real dependency or a constructor-injected fake." &gt;&amp;2
    exit 2   # exit 2 == hard block; the message on stderr goes back to the model
fi
exit 0

Wire it up as a PreToolUse hook matched to the write tools, in .claude/settings.json:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          { "type": "command", "command": ".claude/hooks/block-mock-imports.sh" }
        ]
      }
    ]
  }
}

Now the agent tries to write import mock into src/badcache.py. The transition fires before the write:

The write never happened. src/badcache.py is untouched. The hook read the pending tool call, matched ^import mock in the content the agent was about to write, and returned exit code 2, which Claude Code treats as a block with the stderr message handed back to the model. The model sees the refusal in the same turn and has to try something else. There was no window, because there was no write to catch.

Notice the pattern is anchored: ^import mock, not a bare substring. A docstring that merely names the rule does not start a line with an import, so it does not trip the hook. Anchoring matters more here than in a detector that only reports: a match keyed too broadly refuses clean work, and that carries higher stakes at a gate, which is the next section.

Compare the hook to no-mocks.sh. They match nearly the same rule. What differs is where they sit. The guard sits after the write and reports; the hook sits before the write and refuses. Same constraint, two channels, and only one of them closes the window between the write and the catch.

Which rule goes in which channel

Not every rule belongs in the refusing channel, and filing one in the wrong channel fails in a specific, quiet way in each direction.

A must-hold constraint filed as mere steering gets ignored the one time it matters. The prompt asks, the model is free to decline, and on the iteration where it declines, the diff lands. A must-hold rule in an ask-only channel holds on every iteration except the one where the model declines, and that iteration is the case you filed it for.

A rule that only needed to steer, but is hard-gated, causes friction and false blocks. This is the over-matching failure pointed at a gate instead of a detector, and it costs more here. A gate keyed too broadly refuses clean work: the developer who names the rule in a docstring, the test file that legitimately imports a mock, the production helper named mock_response that a bare substring match trips on. A false alarm from a detector is noise you can ignore for one iteration. A false block from a gate is work that cannot proceed until someone fixes the gate. Refusal is load-bearing in the moment, so an over-broad refuser has a larger blast radius than an over-broad detector.

The heuristic: gate what must hold every time and is cheaply, deterministically checkable; steer what needs judgment or flexibility. no mock import in production code is a fixed, must-hold, one-line-of-grep constraint, so gate it. prefer constructor injection over a service locator is a judgment call with real exceptions, so steer it and let a check observe rather than refuse. A criterion you cannot express as a deterministic match at a transition is a criterion you cannot gate; if it needs a model to judge it, it belongs in the asking channel, with all the probability that implies.

The generator's "done" is a claim to re-derive

Underneath both channels is the posture the series keeps returning to. An agent's done is a claim to re-derive, not a fact to relay. The loop already refuses to trust the generator's report that the work is finished; that is what the check is for. The gate extends the same distrust one step earlier, to the moment of the write. The agent believes its edit is fine. The gate does not accept the belief; it re-derives the verdict from the diff, deterministically, before the edit is allowed to exist. Steering is where you tell the generator what you want. Gating is where you stop taking its word for it.

What holds, and what it still costs to keep loaded

With the hook wired in, the rule you needed to hold every time does. do not import mock holds on every iteration now, on the ones you read and the ones you never open, because a PreToolUse gate does not depend on your attention or the model's cooperation. That is what "hold every time" costs: a channel that can refuse. No amount of rewording the ask gets you there. The prompt still does its job, moving the odds toward a good diff, and the gate makes the one diff you cannot afford impossible. You want both, filed correctly.

One component the opener flagged sits underneath both channels: the context surface. Every rule you steer with is text loaded into the model, and the whole surface is paid for on every turn, whether the model needs a given rule that turn or not. The loop here states its one rule once, but most agents carry far more. What does an instruction cost to keep loaded, and what changes when you load it only where it applies instead of keeping the whole surface in front of the model every turn? That is the question the next piece takes up.

I work on Reporails, deterministic diagnostics and governance for the instruction files, rules, and prompts that steer coding agents. It reads the steering channel, the text you use to ask, and tells you, with measured evidence, where it drifts. The hooks here are the enforcing channel that sits beside it: what a rule refuses at a transition, not what it asks for on the surface.

Prompt Engineering, Context Engineering, Loop Engineering: What Actually Changed

Gábor Mészáros — Wed, 08 Jul 2026 11:16:04 +0000

A few years back the skill had one name: prompt engineering. You rewrote a sentence until the model did the thing. Last year the same corner of the job picked up a new name, context engineering. This year the threads call it loop engineering, and the line repeated under it is the verifier is the bottleneck, not the model anymore.

Every rename draws the same reaction from the same developers: is this a real shift, or the same work with fresh paint so someone can sell a course. That reflex is usually right. Most AI-tooling vocabulary turns over faster than the problems it describes.

So here is the read, one rename at a time. At each step, the thing you were actually engineering changed. What changed was the unit of work, not the marketing around it. The problems the newest name points at have a dated, public record. The dates are on the articles; the arithmetic is yours to do.

The unit was a prompt

Call it 2022 through 2024. The unit of work is a single prompt. You engineer the wording of one request: few-shot examples, a role frame, think step by step, the ordering of the ask. The entire surface you controlled was the text of one message, and the craft was getting that message right.

That discipline is not gone. It is no longer the whole job, because the whole job stopped fitting in one message.

The unit became the surface

Around mid-2025 Andrej Karpathy put a name on what people were already doing: context engineering. The unit stopped being a message and became everything in the window: the system prompt, the retrieved documents, the tool definitions, and the CLAUDE.md / AGENTS.md instruction files that ride along on every single turn. You are no longer tuning a sentence. You are curating a surface.

That surface has a property developers kept rediscovering the expensive way: most of what you put on it does not couple to behavior. The State of AI Instruction Quality pointed a deterministic analyzer at 28,721 repositories and found the median instruction file carries 50 content items and 12 actual directives, and the rest is headings, context, and structure the model is free to ignore. A sharper failure mode got its own writeup. Do NOT Think of a Pink Elephant shows how a constraint phrased as a negation (do not use mocks) can raise the odds of the exact thing it forbids, for the same reason the title just did to you.

Both of those are context-engineering problems, measured and published.

The unit is the loop

June 2026, the term is loop engineering, popularized by Addy Osmani synthesizing Boris Cherny and Peter Steinberger, and everywhere on the timeline within a few weeks. The unit is now the running loop: generate, check, steer, retry, stop. The prompt is one node in it. The context surface is the state it carries between iterations. The claim repeated across every explainer is that the model is no longer the limiting part. The check that decides good enough, stop is.

Take the name generously. It points at something the two earlier names left implicit: the mechanism that decides whether an iteration was any good and whether to run another. That mechanism was always present; every agent that retries has one. Loop engineering's contribution is making it the object you engineer instead of a default you inherit.

Prompt engineering tuned node one. Context engineering curated the state feeding it. Loop engineering points at the diamond, and that is where the interesting question hides.

A verifier that checks what, exactly?

The verifier is the bottleneck is a good slogan and an incomplete one. It names the bottleneck without saying what the verifier is supposed to check, and that gap is the engineering problem the slogan skips.

There are two kinds of check, and they are not interchangeable. A deterministic check runs the code, asserts the exit status, scans for the forbidden import, and returns the same verdict for the same input every time, with no judgment in the middle. A model-graded check asks another model is this good? The second reaches criteria the first cannot express, like is this explanation clear or does this read as rude, and it pays for that reach by inheriting the exact unreliability the loop was built to contain. You have put a probabilistic judge in charge of deciding when the probabilistic generator is finished.

Watch where they diverge on a real loop. An agent is told to refactor a module and stop when the work is done. A deterministic verifier can prove the test suite still passes and no banned import crept in. Those are facts, checkable on every iteration, with no argument about them. It cannot decide whether the refactor was worth doing, so a model-graded check gets bolted on to answer that, and now the stop condition rests on one model grading another's taste. The loop terminates when the judge is satisfied, and the judge is the same class of component the loop exists to supervise. Neither check is wrong. They answer different questions, and confusing which question you asked is how a loop stops confidently on the wrong iteration.

That trade-off, what the check actually checks, is the content of the verifier, and it is not a new problem. Green Tests Don't Mean Better Software works the CI version of it: a green test proves the code conforms to its spec and says nothing about whether the change improved the system. The check answered a question you never asked. Aim the same distinction at an agent loop instead of a test suite and you have the loop-engineering question stated for a case every developer already trusts.

What the earlier articles measured

Three articles published earlier in 2026 each worked one piece of this problem.

CLAUDE.md: Check, Score, Improve & Repeat, 27 January 2026, describes a loop run over an instruction file: score it, change it, score it again. Do NOT Think of a Pink Elephant, 31 March 2026, shows a single steering instruction that produces the opposite of its intent, and names the mechanism behind it. The State of AI Instruction Quality, 21 April 2026, analyzed 28,721 repositories and found that most of a typical instruction file is content the model can ignore rather than directives it has to follow.

Between them they cover the three problems loop engineering now names: what the verifier checks, steering that drifts, and instructions that do not couple to behavior. None of the three used the term. The dates are on the links, and they are the whole of the evidence.

What the new name solves

Loop engineering gives you two things worth having. It makes the check an explicit object of engineering, something you design and revise on purpose instead of a default you inherit from whatever your agent framework happened to ship. And it gives the verifier question a name, so teams can argue about it directly: what should this check check, and can it be deterministic.

Both are important, and both have an edge worth stating plainly. The name hands you the object. Building the verifier is still yours to do, and it is most of the work: deciding what correct means for a task, then turning that into a check you can run.

Part of that work is choosing the kind of check. A criterion like is this explanation clear needs a model to judge it. A criterion like no banned import is a deterministic scan. The cost of picking wrong runs both ways: a model-graded check where a deterministic one would do buys unreliability for nothing, and a deterministic check standing in for a judgment it cannot make gives you a green light that means less than it looks like. The name makes that choice visible and leaves it in your hands.

Underneath the frame, the generator still guesses. What loop engineering adds is a place to put the check and a reason to take it seriously, which is more than the earlier names offered the same problem.

The loop, one part at a time

The vocabulary will change again. Harness engineering was an intermediate name for part of this, and something will come after loop engineering too. Each time the word changes, the useful question is the same: what is the new unit of work, and what is the hard part inside it?

For loop engineering the unit is the loop, and the hard part is the verifier: what it checks. The verifier is one component of several, and a name that points at the whole loop is easy to nod along to and harder to act on. So the plan from here is to take the loop apart a component at a time and put a measurement under each one. The questions this piece raised are the ones the next few pieces go after:

A check that fires on the wrong thing comes back quiet, and so does a signal that was never there. They look identical from the outside. How do you tell a broken instrument from an absent one, before you conclude that the problem you were checking for does not exist?
Not every rule in the loop is a verifier, and the rules that steer are not the rules that enforce. A prompt asks; a gate refuses. What belongs in the channel that can only ask, what belongs in the channel that can say no, and what breaks when a rule is filed under the wrong one?
Everything on the context surface is paid for on every turn, whether the model needs it that turn or not. What does an instruction actually cost to keep loaded, and what changes when you load it only where it applies instead of putting the whole surface in front of the model every time?

Each of those is one part of the loop, looked at on its own, with the evidence attached. Same discipline as this piece: a claim is worth exactly what the measurement under it is worth, and no more.

I work on Reporails, deterministic diagnostics for the instruction files, rules, and prompts that steer coding agents. It reads the steering surface and tells you, with measured evidence, where it drifts. It does not run your loop; it checks the part of the loop you wrote down.

Green Tests Don't Mean Better Software

Gábor Mészáros — Tue, 30 Jun 2026 17:15:05 +0000

You spent the week on a refactor because it was supposed to make the next change cheaper. The tests go green, the PR merges, and that's the last anyone thinks about it. Nobody comes back to check whether the next change actually got cheaper. The green bar said the code conforms to its spec, everyone read that as done, and the reason you did the work in the first place went unmeasured.

What green actually checks

Your CI is green. The test asserts assert response.status == 200, and it passes. It does not assert that the caching layer you just shipped cut p99 latency, or that the reworded onboarding step lifted activation, or that the prompt change made the agent follow instructions more often.

The green bar answered a question you never asked.

What did that green actually tell you? That the change conforms to the spec — the inputs map to the asserted outputs, the contract holds, nothing you wrote a test for regressed. That is correctness. It is real and it is necessary. It told you nothing about whether the system got better. You shipped because the bar was green, and the bar was never measuring the thing you shipped the change for.

Those latency, activation, and adherence numbers are effects. CI does not measure effects. It measures conformance, and then we read the green bar as if it answered the other question too.

The silent assumption is that "passes" and "better" are the same axis — that a change which is correct is, by that fact, an improvement. They are not the same axis. A perfectly correct change can make the system worse, and a perfectly correct change can leave every metric exactly where it was; the suite goes green for both. Green is necessary for "better" and nowhere near sufficient for it, and almost no team instruments the difference.

Two disciplines, pitched as rivals

The industry already named both halves of this. It named them as rivals.

Spec-Driven Development says: write the specification first, then build to it, then prove the build conforms. The test suite is the executable form of the spec. Correctness is the whole game. SDD is disciplined, auditable, and it is what most engineering orgs actually run. As a flow it terminates the moment the bar goes green:

Notice where it stops. It stops at correct. Nothing in that loop asks whether the shipped change improved anything.

Hypothesis-Driven Development asks the opposite question. Thoughtworks framed it as a triple: we believe X will result in outcome Y; we will know we are right when we see measurable signal Z. The unit of progress is not a passing build — it is validated learning. You predict an effect, you ship, you measure, and the measurement tells you whether you were right. PMI's expectation-management literature says the same thing from a different room: an expectation is a managed object with an owner and a due date, tracked and surfaced early rather than discovered late. As a flow it begins where SDD ends — at the prediction, and it runs past the ship:

These get pitched as competing philosophies — spec-first versus hypothesis-first, prove-correct versus prove-better, pick a camp. Look at the two flows again. One ends at correct. The other starts at predicted and ends at better. They answer two different questions, and the pick-a-camp framing assumes they answer one. Thoughtworks HDD and PMI expectation-management are not novel claims I am making here; they are prior art, decades of it, and I am citing them as validation. What none of them did was notice the two flows were never aimed at the same question.

The orthogonality move

Two different questions means two different axes. You cannot answer one by measuring the other — and a single pass/fail bar is built to answer only the first. Correct sits on one axis; better sits on the one perpendicular to it.

One axis is spec-conformance: does the change do what it was specified to do? Pass or fail, and pytest already answers it. The other axis is effect-verdict: did the change move the metric it was supposed to move? Confirmed, refuted, or not yet known — and nothing in your pipeline answers it today.

Lay them on a 2×2.


|                         | **effect confirmed**     | **effect refuted / unmeasured** |
|-------------------------|--------------------------|---------------------------------|
| **spec passes (green)** | shipped and works        | **green but no better**         |
| **spec fails (red)**    | blocked (correctly)      | blocked (correctly)             |

Three of those quadrants are familiar. Red blocks the merge, whichever way the effect would have gone. Green-and-confirmed is the win you wanted. The quadrant nobody names is the top-right: green but no better. The change is correct, it conforms to spec, and it shipped — and it did not improve the system, or you never checked, which from the system's point of view is the same thing.

That quadrant is where most shipped changes actually live, and it is invisible because the only instrument pointed at it is the green bar, which is pointed at the wrong axis. This is the reusable mental model: stop asking "did it pass" as if it were one question. It is two questions on two axes, and you are only instrumenting one of them.

A worked example: green architecture tests

Make it concrete with a case every Python team recognizes. You adopt the hexagonal layout — Ports and Adapters. A pure core (contract/, dto/, policy/) imports only the standard library. Adapters wrap the I/O. Subsystems compose the adapters. The dependency rule is one sentence: dependencies point inward, and the pure core points at nothing.

You enforce it with an architecture test that runs on every commit:

def test_core_purity():
    for module in pure_core_modules():
        assert no_imports(module, {"yaml", "requests", "subprocess", "open"})

It is green. Every pure module is import-clean. The dependency graph conforms to the rule — proven correct, mechanically, on every push.

Now ask the other question. You did not adopt the hexagonal layout for its own sake. You adopted it on a claim: changes would get cheaper, blast radius would shrink, the core would be testable without a single mock. Did that happen? test_core_purity cannot tell you. It measures the shape of the import graph, not whether the shape paid off. Green here means conformant. A conformant architecture that made nothing cheaper is the green-but-no-better quadrant with a .py file pointing straight at it.

That is the whole thesis in one test file. The architecture test lives on the spec axis. The promise that sold the refactor lives on the effect axis. They do not touch, and only one of them has an instrument.

The expectation primitive

If "better" is its own axis, it needs its own instrument. That instrument is an expectation — the HDD flow above, made into a first-class artifact.

An expectation is a change-scoped, falsifiable, deadline-carrying prediction. It is the HDD triple — believe X, expect outcome Y, know it by criterion Z — turned into an artifact that travels with the change instead of living in a planning doc nobody reopens. It carries a baseline, a bound measurement view — the query, tied to that change, that reads the metric — a threshold, and a due date. Concretely: the caching change should cut p99 latency below 200ms, measured against last week's baseline, re-checked Friday. That sentence is the whole artifact, and it rides attached to the diff.

Here is the part that is genuinely new, and the part I want to be precise about: the verdict is measured by the system.

HDD and PMI both already told you to make a falsifiable prediction with an owner and a deadline. That is not the advance. The advance is that the prediction binds to a real measurement view, and a system reads that view and stamps the verdict itself. The human who made the prediction does not come back and grade it. The human leaves the verification loop.

This is not a dashboard alert. A dashboard alert fires on a metric crossing a line, untethered to any one change. An expectation binds one prediction to one change and grades that prediction — the dashboard tells you latency rose; the expectation tells you the change you predicted would cut it did the opposite.

The verdict vocabulary is three words, and the discipline is in the third:

confirmed — the measured value met the threshold. Auto-resolved, silently. Announcing it would be filler.
refuted — the measured value missed the threshold. The expectation stays open and surfaces, with the predicted delta and the actual sitting next to each other.
inconclusive — the measurement produced no scalar yet. It stays open and surfaces. It is never read as confirmed. "We didn't measure" is not "it worked."

That third word is the one most dashboards quietly drop. A prediction you cannot yet grade is not a pass. Keeping inconclusive distinct from confirmed is what stops the green-but-no-better quadrant from refilling under a different name.

The selector routes by the change's effect, not its shape. A one-line diff can carry a large measurable effect and owe an expectation; a thousand-line refactor can be pure conformance and owe none — and you pick the lane when you ship, not after.

Run the selector over that same hexagonal project. Re-homing a module to clear a test_subsystem_isolation failure is a shape change. It restores the boundary, its effect resists measurement, it owes nothing past the green test. The refactor that promised cheaper changes is the other lane — it made a measurable claim, so it owes an expectation: adding the next artifact class touches three files or fewer, re-checked when the next one lands. Same codebase, two changes, two lanes.

This is what the spec-versus-hypothesis fight always missed. The selector does not pick a winner between SDD and HDD. It is conditional. It runs SDD's discipline on changes whose effect resists measurement and HDD's discipline on changes whose effect can be measured, and it picks per change. A discipline that reconciles two camps looks like a routing rule that knows when each camp is right.

There is one trap to name: do not attach an expectation to an unmeasurable change just to satisfy the selector. A prediction with no real threshold is conformance wearing a costume, and it teaches everyone to ignore the verdicts. If the effect resists measurement, declare none. The honesty of the refuted and inconclusive verdicts depends on the selector being willing to say "this change owes nothing."

This is the part that has to survive contact with a real backlog. The tempting first cut is to flag every shipped change as owing a prediction. Try it and it floods on contact: the renames, the prose-clarity passes, the reorganizations — the bulk of any change history — all light up red against a rule they cannot satisfy, because there was never a metric for them to move. The flood is the proof. A selector that cannot say "this change owes nothing" does not enforce the discipline; it discredits it. The conditional design is not a softening of the rule — it is the only version of the rule that does not collapse the first time you run it over changes that already shipped.

Proof of existence, and the cost of skipping it

A coordination system shipped exactly this layer, in working code. A change to a governed surface declares an expectation: predicted delta, bound measurement view, due date. The system re-measures when the due date passes, stamps confirmed / refuted / inconclusive against the threshold, auto-resolves the confirmed ones silently, and surfaces the refuted and inconclusive ones the next session, one line each. The selector decides which changes owe a prediction and which fall back to conformance. The point here is only that the pairing — a governance selector plus an auto-measured verdict — is buildable today, because it has been built. The mechanism is the proof; the product is not the subject.

`confirmed` is the minority verdict

What running it actually teaches is the part the diagram cannot. Before the verdicts come back, you assume most predictions will land — you shipped the change believing it would help, so of course the measurement will agree.

It does not. Once an instrument is actually pointed at the effect axis, confirmed turns out to be the minority verdict. Most changes come back inconclusive or refuted: the metric did not move past the threshold, or there was no scalar to read in the window at all.

The first time a batch of verdicts surfaces and almost none of them are confirmed, the 2×2 stops being a diagram. The green-but-no-better quadrant is not a rare corner you occasionally fall into — it is where most changes sit, and you only learn that the moment you can finally see it.

`inconclusive` is the common case

inconclusive is the verdict you underestimate most. "We didn't measure" happens far more often than any prediction would lead you to expect — the window was too short, the metric needed traffic that had not arrived, the effect was real but not yet legible.

The entire value of the third word is that it refuses to round itself up. Without it, every one of those un-graded predictions quietly refiles as a pass, and the quadrant refills under a name that looks like success. The discipline lives in holding inconclusive open, in plain sight, until there is a scalar — and discovering, run after run, how often that takes longer than you thought.

Even `confirmed` is not permanent

Even the confirmed ones are not closed for good. A threshold set too loose can confirm a change that later regresses, and a silent auto-resolve would bury exactly that. A confirmed verdict earns its silence; it does not earn permanent trust.

One discipline already exists for this. SkillOpt (Yang et al., 2026) accepts a self-authored skill edit only when it improves a held-out validation split — not the data the edit was tuned against. The held-out split is the guard against exactly that failure: a verdict that reads confirmed because the threshold was set on the same signal the change was shaped to move. An auto-measured verdict is only as honest as the separation between what tunes the change and what grades it.

The claim worth taking away is the framework, not any one implementation: a governance selector that routes each change to the right discipline, paired with an auto-measured verdict that proves the effect without a human grading it, is a reconciliation of spec-driven and hypothesis-driven development. It is not a third methodology stacked beside the two you have; it tells you which of the two you already know to apply, and then it does the grading.

That reconciliation stops being optional the moment the machine ships the change.

When a human writes every diff, the green-but-no-better quadrant is a slow tax — improvements that were not improvements, paid for in drift you notice eventually. When an AI-assisted or autonomous system writes and ships changes, there is no human in the loop reading the effect at all. The tests still run, so correctness still gets checked; nothing measures the effect, so nothing catches it slipping. Unmeasured-effect is precisely where a self-running system's regressions hide — each change correct, each change green, the system quietly getting worse along the one axis nobody instrumented.

The practice fits on one line. Before you merge the next change that makes a measurable claim, write down what you expect it to move, where that gets measured, and when you will look — then let the verdict come back to you instead of going to find it.

Green tells you the machine built the thing right. Only a measured verdict tells you the thing was worth building. If you are handing more of the building to machines, that second instrument is the one you cannot afford to leave off.

Reporails — deterministic instruction diagnostics and governance for the rules, prompts, and agent files that steer AI coding agents.

The State of AI Instruction Quality

Gábor Mészáros — Tue, 21 Apr 2026 12:41:52 +0000

Everybody has opinions about AGENTS.md/CLAUDE.md files.

Best practices get shared. Templates get copied, and this folk-type knowledge dominates the industry. Last year, GitHub analyzed 2,500 repos and published best-practice advice. We wanted to go further: measure at scale, publish the data, and let anyone verify.

When the agent doesn't follow instructions and does something contradictory, the usual suspects are: the model is inconsistent, LLMs are not deterministic, you need better guardrails, you need retries.

The failures almost always get attributed to the model.

So we decided to measure. We built a diagnostic tool that treats instruction files as structured objects with measurable properties. Deterministic. Reproducible. No LLM-as-judge. Then we pointed it at GitHub repositories with instruction files for five agents - Claude, Codex, Copilot, Cursor, and Gemini.

28,721 repositories. 165,063 files. 3.3 million instructions.

... and one question:

What if the instructions are the problem?

The dataset

28,721 projects. Sourced from GitHub via API search, cloned, and deterministically analyzed. Each project was scanned for instruction files across five coding agents — then deduplicated to remove false positives from agent detection overlap.

Agent	Projects	% of corpus
Claude	12,356	43.0%
Codex	11,206	39.0%
Copilot	7,755	27.0%
Cursor	7,291	25.4%
Gemini	5,942	20.7%

The percentages add up to more than 100% because 37% of projects configure multiple agents. More on that later.

Key distributions stabilized early. A 9,582-repo sub-sample produced identical tier shares (±0.2pp) and the same mean scores as the full 12,076-repo intermediate sample. The final 28,721-repo corpus moved nothing. The patterns reported below are not small-sample artifacts.

All classifications are deterministic — the same file produces the same result every time. No LLM-as-judge. Sample classifications are published for inspection (methodology below). The tool is source-available.

How we measured

The analyzer parses each instruction file into atoms — the smallest semantically distinct units of content. A heading is one atom. A bullet point is one atom. A paragraph is one atom. Each atom gets classified along a few dimensions, all deterministic, no LLM involved:

Charge classification. A three-phase pipeline determines whether an atom is a directive ("use X"), a constraint ("do not use Y"), neutral content (context, explanation, structure), or ambiguous (could be read either way). Phase 1 detects negation and prohibition patterns. Phase 2 detects modal auxiliaries and direct commands. Phase 3 uses syntactic dependency parsing to catch imperatives that the first two phases missed. First definitive match wins. Atoms that partially match but don't clear any phase are marked ambiguous. Everything else is neutral.

Specificity. Binary: does the instruction name a specific construct — a tool, file, command, flag, function, or config key — or does it stay at the category level? "Use consistent formatting" is abstract. "Format with ruff format" is named. This is a text property, not a judgment call.

File categorization. Each file is classified as base config (your main CLAUDE.md or .cursorrules), a rule file, a skill definition, or a sub-agent definition — based on file path conventions for each agent.

Content type. Charge classification separates behavioral content (directives and constraints) from structural content (headings, context paragraphs, examples). That's how we know what fraction of your file is actually doing work.

The full tool is source-available (BUSL-1.1). You can run npx @reporails/cli check on your own project and inspect every finding. More on that at the end.

Finding 1: Most of your instruction file isn't instructions

Here's what the median instruction file actually contains:

50 content items total
12 of those are actual directives
The rest is headings, context paragraphs, examples, structure

Only 27% of your instruction file is doing what you think it does.

The other 73% is scaffolding. Headings that organize but don't instruct. Explanation paragraphs that compete for the model's attention without adding behavioral weight. Example blocks. Context-setting prose.

That's not inherently bad. Structure matters. But if you're writing a 200-line CLAUDE.md and only 54 lines are actual instructions, you should probably know that.

The average instruction is 8.9 words long. That's a sentence fragment.

Finding 2: 90% of instructions don't name what they're talking about

This is the big one.

We measured whether each instruction references specific tools, files, commands, or constructs by name — or whether it stays at the category level.

Two-thirds of all instructions are abstract.

Agent	Names specific constructs	Uses category language
Gemini	39.3%	60.7%
Codex	38.3%	61.7%
Copilot	33.3%	66.7%
Cursor	30.8%	69.2%
Claude	30.6%	69.4%

What does this look like in practice?

Abstract: "Use consistent code formatting"
Specific: "Format with ruff format before committing"

Abstract: "Avoid using mocks in tests"
Specific: "Do not use unittest.mock — use the real database via test_db fixture"

In previous controlled experiments, specificity produced a 10.9x odds ratio in compliance (N=1000, p<10⁻³⁰). The instruction that names the exact construct gets followed. The one that describes it abstractly... mostly doesn't. This is consistent with independent findings from RuleArena (Zhou et al., ACL 2025), where LLMs struggled systematically with complex rule-following tasks — even strong models fail when the rules themselves are ambiguous or underspecified.

89.9% of all agent configurations contain at least one instruction that doesn't name what it means. It's not a few projects. It's nearly everyone.

Finding 3: `agents.md` is the most common instruction file

Before we get into quality, let's look at what people are actually naming their files:

#	File	Count
1	`agents.md` / `AGENTS.md`	20,654
2	`claude.md` / `CLAUDE.md`	14,014
3	`gemini.md` / `GEMINI.md`	5,703
4	`.github/copilot-instructions.md`	5,647
5	`.cursorrules`	2,415

49,071 unique file paths across the corpus. That's not a typo. The format fragmentation is real.

A few things jumped out:

claude.md (lowercase, 10,642) is 3x more common than CLAUDE.md (3,372). Both work. The community clearly prefers lowercase.
agents.md dominates — the Codex/generic format is the single most popular instruction file name.
Skills and rules are already showing up in meaningful numbers: .claude/rules/testing.md (422), .agents/skills/tailwindcss-development/skill.md (334).

Finding 4: Different agents, completely different config philosophies

Not all agents are configured the same way. Not even close.

We categorized every file into four types: base config (your main CLAUDE.md, .cursorrules, etc.), rules (scoped rule files), skills (task-specific skill definitions), and sub-agents (role-based agent definitions).

Agent	Base	Rules	Skills	Sub-agents	Total files
Claude	18,733	4,638	10,692	10,538	44,601
Cursor	5,903	19,843	6,237	1,716	33,699
Copilot	16,026	4,486	10,352	3,012	33,876
Codex	19,001	81	8,911	165	28,158
Gemini	10,253	74	3,039	53	13,419

Cursor is 60% rules files. The .cursor/rules/ system dominates its configuration surface. One agent's config looks nothing like another's.

Claude is the only agent with a roughly balanced architecture across all four config types. Codex and Gemini are almost entirely base config — single-file setups.

The median Cursor project has 3 instruction files. The median Codex project has 1. These aren't just different tools. They're different configuration philosophies.

Finding 5: 37% of projects configure multiple agents

10,620 projects in the corpus target two or more agents. That's not a niche pattern — it's over a third of all projects.

Agents	Projects
1	18,101
2	6,776
3	2,687
4	949
5	208

The dominant pair is Claude + Codex (5,038 projects). Makes sense — CLAUDE.md + AGENTS.md is the most natural multi-agent starting point.

Here's what's interesting about multi-agent repos: the same developer, writing instructions at the same time, for the same project, produces measurably different instruction quality across agents. The person didn't change. The project didn't change. The instruction format did.

Some of that is structural. Cursor's .mdc rules enforce a different format than Claude's markdown. Codex's AGENTS.md invites a different writing style than Copilot's copilot-instructions.md. The format shapes the content.

Finding 6: The most-copied skills are the vaguest

This is where it gets interesting.

13,309 unique skills across the corpus. Some of them appear in hundreds of repos — clearly copied from shared templates or community sources. So we measured them.

Named% = what fraction of a skill's instructions name a specific tool, file, or command (instead of using category language).

Skill	Repos	Named%	What it means
`frontend-design`	271	2.8%	Almost entirely abstract advice
`web-design-guidelines`	197	10.2%	Generic design principles
`vercel-react-best-practices`	315	30.7%	Mix of specific and vague
`pest-testing`	216	55.1%	Names actual test constructs
`livewire-development`	87	75.5%	Names specific Livewire components
`next-best-practices`	76	92.6%	Names almost everything

frontend-design is in 271 repos with 2.8% specificity. It's a wall of "follow responsive design principles" and "ensure accessibility compliance." That reads well. It sounds professional. It gives the model almost nothing concrete to act on.

next-best-practices is in 76 repos with 92.6% specificity. It says things like "use next/image for all images" and "prefer server components over client." It reads like a checklist. It tells the model exactly what to do.

One is shared 3.5x more than the other.

The most popular skills are the most decorative. The well-written ones barely spread.

The best and worst skills (>50 repos)

Most specific:

Skill	Repos	Named%
`next-best-practices`	76	92.6%
`shadcn`	74	82.6%
`livewire-development`	87	75.5%
`pest-testing`	216	55.1%
`laravel-best-practices`	94	49.7%

Most vague:

Skill	Repos	Named%
`openspec-explore`	110	2.5%
`frontend-design`	271	2.8%
`web-design-guidelines`	197	10.2%
`vercel-composition-patterns`	131	10.7%
`find-skills`	113	18.9%

Notice a pattern? The Laravel/Livewire ecosystem produces specific skills. The generic frontend/design ones stay abstract. Domain-specific communities write better instructions than cross-cutting ones.

Finding 7: Sub-agents are almost entirely persona prompts

5,526 unique sub-agent roles in the corpus. Developers are building agent teams: code reviewers, architects, debuggers, testers, security auditors.

The problem? Sub-agents are the most abstract config type in the entire corpus. Only 17% of sub-agent instructions name specific constructs.

Role	Repos	Named%
`code-reviewer.md`	236	14.4%
`architect.md`	89	18.2%
`debugger.md`	66	9.4%
`security-auditor.md`	57	14.8%
`test-runner.md`	54	10.5%
`frontend-developer.md`	47	9.0%

Most of these are persona prompts. "You are a senior code reviewer. You care about code quality, security, and maintainability." That's a role description, not an instruction set. It tells the model who to be, not what to do.

Compare this to a base config that says "run uv run pytest tests/ -v before suggesting any commit" — that's 100% named, and the model knows exactly what action to take.

The anatomy chart: more directives, worse quality

Here's where it all comes together.

We measured three things for each config type: how big the files are, how many directives they contain, and what fraction of those directives actually name something specific.

Sub-agents have the largest files (61 items median), the most directives (17), and the worst specificity (17%). They're the wordiest config type in the corpus and the least effective.

Base configs are the opposite. Fewer directives (11), but 40% of them name specific constructs. The developer writing their own CLAUDE.md by hand, for their own project, produces the most actionable instructions.

Config type	Files	Median size	Median directives	Specificity
Base configs	69,916	50 items	11	39.8%
Rules files	29,122	34 items	9	31.2%
Skills	39,231	59 items	14	30.8%
Sub-agents	15,484	61 items	17	17.0%

The pattern is clear: what developers write by hand is the most specific. What gets templated and shared gets progressively vaguer. And what tries hardest to sound authoritative — sub-agent persona prompts — is the most hollow.

More instructions is not better instructions.

Independent research supports the structural angle: FlowBench (Xiao et al., 2024) found that presenting workflow knowledge in structured formats (flowcharts, numbered steps) improved LLM agent planning by 5-6 percentage points over prose — across GPT-4o, GPT-4-Turbo, and GPT-3.5-Turbo. Structure is not decoration. It changes what the model retrieves.

Limitations

Five things to know about these numbers.

Sampling bias. GitHub API search, public repos only, English-skewed. Enterprise configurations, private repos, and non-English projects are not represented. This is not a random sample of all instruction files in production.

Classification accuracy. The charge classifier is deterministic but not perfect. Edge cases exist: mixed-charge sentences, implicit constructs, domain jargon that looks like a category term but is actually a named tool. Specificity detection (named vs abstract) is simpler and more robust. Sample classifications are published for inspection.

Association, not causation. "More directives correlate with lower specificity" is an observed pattern. We do not claim that adding directives causes quality to drop.

Snapshot. Collected March–April 2026. Instruction practices are changing fast — agents.md didn't exist six months ago. These numbers describe the ecosystem at collection time.

No popularity weighting. A 10-star hobby project counts the same as a 50K-star production repo. The distribution of instruction quality in production agent work may differ.

What this means

This isn't an article about AI models being bad at following instructions. The models are fine.

This is an article about what we actually give them to work with.

Most instruction files are three-quarters scaffolding. Two-thirds of the actual instructions don't name what they're talking about. The most popular community skills are the most decorative. Sub-agent definitions are the wordiest files in the corpus and the least specific.

None of that is obvious from reading your own files. It wasn't obvious to us before we measured it. A well-structured CLAUDE.md feels thorough. A shared skill with 271 repos feels battle-tested. A sub-agent with 17 directives feels comprehensive.

Measurement shows something different.

In The Undiagnosed Input Problem, I argued that the industry is great at inspecting outputs and weak at inspecting inputs. This corpus analysis is the evidence for that claim.

The instruction files are there. The developers wrote them. They just have no way to know which parts are working and which parts are wallpaper.

Try it yourself

The analyzer we used for this corpus analysis is available as a CLI you can run against your own instruction files.

Reporails — instruction diagnostics for coding agents. Deterministic. No LLM-as-judge. 97 rules across structure, content, efficiency, maintenance, and governance.

npx @reporails/cli check

That scans your project, detects which agents are configured, and reports findings with specific line numbers and rule IDs. Here's what the output looks like:

Reporails — Diagnostics

  ┌─ Main (1)
  │ CLAUDE.md
  │   ⚠       Missing directory layout             CORE:C:0035
  │   ⚠ L9    7 of 7 instruction(s) lack reinfor…  CORE:C:0053
  │     ... and 16 more
  │
  └─ 21 findings

  Score: 7.9 / 10  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░

  21 findings · 4 warnings · 1 info
  Compliance: HIGH

The corpus analysis used the same classification pipeline at scale. Fix the findings, run again, watch your score improve.

The dataset

The full corpus is published at reporails/30k-corpus. Three files:

File	Records	What it contains
`repos.jsonl`	28,721	Per-project record: agents configured, stars, language, license, topics
`stats_public.json`	1	Every aggregate statistic in this article
`validation_key.csv`	2,814	Sample classifications with source text for inspection

Verify any claim:

# "28,721 repositories"
cat repos.jsonl | wc -l

# "43% Claude"
cat repos.jsonl | python3 -c "
import sys, json
repos = [json.loads(l) for l in sys.stdin]
claude = sum(1 for r in repos if 'claude' in r['canonical_agents'])
print(f'{claude}/{len(repos)} = {claude/len(repos)*100:.1f}%')
"

Every number in every table traces to that dataset. If you disagree with a finding, count the rows.

This is part of the Instruction Quality series. Previous: The Undiagnosed Input Problem. Related: Precision Beats Clarity · Do Not Think of a Pink Elephant · 7 Formatting Rules for the Machine.

The Undiagnosed Input Problem

Gábor Mészáros — Wed, 08 Apr 2026 11:51:12 +0000

The AI agent ecosystem has built a serious industry around controlling outputs. Guardrails. Safety classifiers. Output validation. Monitoring. Retry systems. Human review.

All of that matters, but there is simpler upstream question that still goes mostly unmeasured:

Are the instructions any good?

That sounds obvious, yet it is not how the industry behaves.

When an agent fails to follow instructions, the usual explanations come fast:

Models are probabilistic
Agents are inconsistent
You need stronger guardrails
You need better monitoring
You need retries
You need humans in the loop

… and while those explanations are right to a certain degree, they also have a side effect: they turn instruction quality into a blind spot.

The ecosystem has become extremely good at inspecting what comes out of the model, and surprisingly weak at inspecting what goes in.

The symptom

Consider τ-bench.

It gives agents policy instructions and measures whether they follow them in realistic customer-service tasks. Airline and retail workflows. Real constraints. Real multi-step behavior.

The benchmark result that gets repeated is the model result: even strong systems still fail a large share of tasks, and consistency across repeated attempts remains weak.

The conclusion most people draw is straightforward: we need better models, better agents, better orchestration.

My take: Maybe.

But there is another question sitting underneath the benchmark:

Were the instructions themselves well-formed and well structured?

Not just present. Not just long enough. Not just sincere.

Well-formed. Well-structured. Well-organized.

Specific enough to anchor behavior. Structured enough to survive context mixing. Non-conflicting across files. Positioned where the model can actually use them.

Those questions usually never gets asked.

The industry response

I had a conversation recently where a lead solutions architect put the standard view plainly:

“The instruction merely influences the probability distribution over outputs. It doesn’t override it.”

That is right about the mechanism but it is wrong about what follows from it.

Yes, instructions operate probabilistically. But that does not mean all instructions are weak in the same way.

The shape of the distribution is not fixed. It changes with the properties of the instruction itself. Specificity sharpens it. Structure sharpens it. Conflict flattens it. Vague abstractions flatten it. Bad formatting can suppress it almost entirely.

Across my earlier controlled experiments, small changes in wording and placement produced large changes in compliance:

Instruction ordering moved compliance by 25 percentage points with the same model and the same directive.
Specificity produced roughly a 10x compliance effect when the instruction named the exact construct instead of describing it abstractly.
Formatting changed whether the model reliably registered the instruction at all.

The problem is that most instruction systems are built without diagnostics.

That is not an AI limitation. That is an engineering failure.

The folk system

Right now, instruction practice spreads mostly through imitation.

A popular repository posts “best practices” for Claude Code. Shared Cursor rules circulate as templates. People copy AGENTS.md files between projects. Teams accumulate CLAUDE.md, .cursorrules, copilot-instructions.md, etc and project-specific rule files across multiple tools.

Copy, paste, hope, repeat.

Some of that advice is useful. Almost none of it is tested in any controlled, reproducible way. That would be fine if instruction quality were self-evident. It is not.

A long instruction file can feel thorough while being internally contradictory. A highly opinionated ruleset can feel disciplined while producing almost no behavioral influence on the model.

A sprawling multi-file setup can look sophisticated while making the system worse.

Without diagnostics, developers do not know which instructions are binding, which are noise, and which are actively interfering with each other.

The gap

The tooling split is now pretty clear.

Output tooling is mature. Guardrails AI validates structure. Lakera focuses on prompt injection and security. NeMo Guardrails enforces safety and conversational rails. Llama Guard classifies risky content. The output edge is crowded.

Prompt testing is real. Promptfoo, Braintrust, and LangSmith can all help evaluate behavior. But they are primarily black-box systems: did the prompt produce the output you wanted?

That is useful.

It is not the same as measuring the instruction artifact itself.

Instruction-quality tooling exists only in fragments. Some tools use LLM-as-judge. Some use deterministic local rules. But the category is still early, inconsistent, and mostly disconnected from measured behavioral outcomes.

What is still largely missing is a deterministic way to inspect instruction files as engineered objects:

how specific they are
how directly they state intent
whether they conflict across files
whether they overuse headings
whether they provide alternatives instead of bare prohibitions
whether the system is getting denser while getting weaker

Code gets static analysis.

Instruction systems usually get vibes.

What we measured

We built an analyzer that treats instruction files as structured objects with measurable properties. Deterministic. Reproducible. No LLM-as-judge.

I am running it across a large live corpus of real repositories. The full run completes this week; what follows is what the partial sample already shows - stable enough to publish, not yet the full picture.

Quality is reported on a 0-to-100 scale: 0 means the file produces no measurable influence on model behavior, 100 is the ceiling the framework can score.

A fresh aggregation over 12,076 completed instruction-file scans is virtually identical to an earlier 9,582-repo sample:

bottom tier: 40.3% vs 40.1%
top tier: 12.1% vs 12.2%
mean quality score: 27 vs 27
directive content ratio: 27.9% vs 27.9% - the share of instruction sentences that directly tell the model what to do

That matters because it means the pattern is stable.

This does not look like a small-sample artifact.

And the strongest finding is not what I expected.

More rules, lower quality

The common response to bad agent behavior is to add more rules.

More files. More guidance. More scoping. More edge-case coverage.

The corpus says that strategy tends to backfire.

Across 12,076 repositories, instruction quality falls as instruction-file count rises:

Files per repo     N      Mean score   Bottom tier %   Top tier %
1                  4681   28           46.3%           16.9%
2-5                4796   26           37.3%            9.5%
6-20               1972   26           36.0%            8.8%
21-50               438   25           31.3%            5.7%
51-500              186   25           33.3%            5.4%

The key number is the top-tier share.

It collapses from 16.9% in single-file setups to 5.4% in repositories with 51 to 500 instruction files.

That is a roughly 3x drop.

The article version of that finding is simple:

Developers respond to bad agent behavior by adding more rules. In the corpus, that strategy correlates with a 3x collapse in the probability of landing in the top tier.

That does not prove file count causes low quality by itself.

But it does show that rule proliferation is not rescuing these systems. At scale, it is associated with weaker instruction quality, not stronger.

The sweet spot

There is also a more subtle result in the partial sample. Instruction quality appears to be non-monotonic in directive density: more directives help at first, then stop helping, and past a point start to hurt.

The full curve is in next week’s piece. The short version is that there is an optimal density range, after which additional directives stop strengthening the system.

Enough force to bind behavior. Not so much that the system turns into an overpacked rules document.

A real example

Here is the kind of instruction block the corpus is full of:

# Code should be clear, well documented, clear PHPDocs.

# Code must meet SOLID DRY KISS principles.

# Should be compatible with PSR standards when it need.

# Take care about performance

It is not malicious. It is not absurd.

It is just weak.

Everything is abstract. Nothing is anchored. Headings are doing the work prose should do. The agent can read it, represent it, and still walk past most of it.

Now compare:

Never use `var_dump()` or `dd()` in committed code. Use `Log::debug()` instead.
Run `./vendor/bin/phpstan analyse src/` before every commit. Level 6 minimum.

Same general intent. Completely different binding strength.

The second version names the construct, names the alternative, names the command, and names the threshold. It gives the model something concrete to hold onto.

That is what diagnostics should make visible.

What this means

Output guardrails still matter.

Prompt evaluation still matters.

Safety systems still matter.

But they do not answer the upstream question: Are the instructions themselves well-formed?

If the answer is no, then a large class of downstream failures will keep showing up as mysterious agent unreliability when the real problem is earlier and simpler.

The agent loaded the instruction and walked past it.

That is often not a model problem.

It is an input problem.

And input quality is measurable.

What’s next

These are corpus-level findings from a partial sample, not universal laws.

The sample is still in flight. The strongest claims here are about association, not proof of causality. Specific conflict-count case studies need source verification before publication. Popularity weighting is not yet applied, so “40% of repositories score in the bottom tier” is not the same claim as “40% of production agent work scores in the bottom tier.”

The full corpus run completes this week. Next week I publish the end-of-run analysis across the full sample — the complete distribution, the cross-cuts the partial sample cannot yet support, and the specific case studies this article deliberately held back. If you want to know where your stack lands, that is the piece to come back for.

For now, the central pattern is already stable enough to matter:

The ecosystem keeps responding to weak agent behavior by adding more instructions, while the corpus shows that more instruction files are usually associated with lower measured quality.

That is the undiagnosed input problem.

Not that instructions do not matter.
That they matter, measurably, and most teams still have no way to see whether theirs are helping or hurting.

This is part of the Instruction Best Practices series. Previous: Do NOT Think of a Pink Elephant, Precision Beats Clarity, 7 Formatting Rules for the Machine. I’m building instruction diagnostics for coding agents. Follow for the full corpus analysis.

DEV Community: Reporails

Loop Engineering: Fine-Tuning the Guardrail That Fired Wrong

What a check is here

Two readings of the same quiet

The misfire, up close

The tempting fix, and what it costs

Fix the instrument instead

The rule that generalizes

Deterministic Guardrails: Prompts Steer, Hooks Enforce

The loop, and where it can only ask

Two channels: one asks, one refuses

Build the missing arm: a hook that refuses

Which rule goes in which channel

The generator's "done" is a claim to re-derive

What holds, and what it still costs to keep loaded

Prompt Engineering, Context Engineering, Loop Engineering: What Actually Changed

The unit was a prompt

The unit became the surface

The unit is the loop

A verifier that checks what, exactly?

What the earlier articles measured

What the new name solves

The loop, one part at a time

Green Tests Don't Mean Better Software

What green actually checks

Two disciplines, pitched as rivals

The orthogonality move

A worked example: green architecture tests

The expectation primitive

Proof of existence, and the cost of skipping it

confirmed is the minority verdict

inconclusive is the common case

Even confirmed is not permanent

The State of AI Instruction Quality

The dataset

How we measured

Finding 1: Most of your instruction file isn't instructions

Finding 2: 90% of instructions don't name what they're talking about

Finding 3: agents.md is the most common instruction file

Finding 4: Different agents, completely different config philosophies

Finding 5: 37% of projects configure multiple agents

Finding 6: The most-copied skills are the vaguest

The best and worst skills (>50 repos)

Finding 7: Sub-agents are almost entirely persona prompts

The anatomy chart: more directives, worse quality

Limitations

What this means

Try it yourself

The dataset

The Undiagnosed Input Problem

The symptom

The industry response

The folk system

The gap

What we measured

More rules, lower quality

The sweet spot

A real example

What this means

What’s next

`confirmed` is the minority verdict

`inconclusive` is the common case

Even `confirmed` is not permanent

Finding 3: `agents.md` is the most common instruction file