DEV Community

Gábor Mészáros for Reporails

Posted on • Originally published at Medium

The Undiagnosed Input Problem

The AI agent ecosystem has built a serious industry around controlling outputs. Guardrails. Safety classifiers. Output validation. Monitoring. Retry systems. Human review.

All of that matters, but there is a simpler upstream question that still goes mostly unmeasured:

Are the instructions any good?

That sounds obvious, yet it is not how the industry behaves.

When an agent fails to follow instructions, the usual explanations come fast:

  • Models are probabilistic
  • Agents are inconsistent
  • You need stronger guardrails
  • You need better monitoring
  • You need retries
  • You need humans in the loop

… and while those explanations are right to a certain degree, they also have a side effect: they turn instruction quality into a blind spot.

The ecosystem has become extremely good at inspecting what comes out of the model, and surprisingly weak at inspecting what goes in.

The symptom

Consider τ-bench.

It gives agents policy instructions and measures whether they follow them in realistic customer-service tasks. Airline and retail workflows. Real constraints. Real multi-step behavior.

The benchmark result that gets repeated is the model result: even strong systems still fail a large share of tasks, and consistency across repeated attempts remains weak.

The conclusion most people draw is straightforward: we need better models, better agents, better orchestration.

My take: Maybe.

But there is another question sitting underneath the benchmark:

Were the instructions themselves well-formed and well-structured?

Not just present. Not just long enough. Not just sincere.

Well-formed. Well-structured. Well-organized.

Specific enough to anchor behavior. Structured enough to survive context mixing. Non-conflicting across files. Positioned where the model can actually use them.

Those questions usually never get asked.

The industry response

I had a conversation recently where a lead solutions architect put the standard view plainly:

The instruction merely influences the probability distribution over outputs. It doesn’t override it.

That is right about the mechanism, but wrong about what follows from it.

Yes, instructions operate probabilistically. But that does not mean all instructions are weak in the same way.

The shape of the distribution is not fixed. It changes with the properties of the instruction itself. Specificity sharpens it. Structure sharpens it. Conflict flattens it. Vague abstractions flatten it. Bad formatting can suppress it almost entirely.

Across my earlier controlled experiments, small changes in wording and placement produced large changes in compliance:

  • Instruction ordering moved compliance by 25 percentage points with the same model and the same directive.
  • Specificity produced roughly a 10x compliance effect when the instruction named the exact construct instead of describing it abstractly.
  • Formatting changed whether the model reliably registered the instruction at all.

The problem is that most instruction systems are built without diagnostics.

That is not an AI limitation. That is an engineering failure.

The folk system

Right now, instruction practice spreads mostly through imitation.

A popular repository posts “best practices” for Claude Code. Shared Cursor rules circulate as templates. People copy AGENTS.md files between projects. Teams accumulate CLAUDE.md, .cursorrules, copilot-instructions.md, and other project-specific rule files across multiple tools.

Copy, paste, hope, repeat.

Some of that advice is useful. Almost none of it is tested in any controlled, reproducible way. That would be fine if instruction quality were self-evident. It is not.

A long instruction file can feel thorough while being internally contradictory. A highly opinionated ruleset can feel disciplined while producing almost no behavioral influence on the model.

A sprawling multi-file setup can look sophisticated while making the system worse.

Without diagnostics, developers do not know which instructions are binding, which are noise, and which are actively interfering with each other.

The gap

The tooling split is now pretty clear.

Output tooling is mature. Guardrails AI validates structure. Lakera focuses on prompt injection and security. NeMo Guardrails enforces safety and conversational rails. Llama Guard classifies risky content. The output edge is crowded.

Prompt testing is real. Promptfoo, Braintrust, and LangSmith can all help evaluate behavior. But they are primarily black-box systems: did the prompt produce the output you wanted?

That is useful.

It is not the same as measuring the instruction artifact itself.

Instruction-quality tooling exists only in fragments. Some tools use LLM-as-judge. Some use deterministic local rules. But the category is still early, inconsistent, and mostly disconnected from measured behavioral outcomes.

What is still largely missing is a deterministic way to inspect instruction files as engineered objects:

  • how specific they are
  • how directly they state intent
  • whether they conflict across files
  • whether they overuse headings
  • whether they provide alternatives instead of bare prohibitions
  • whether the system is getting denser while getting weaker

Code gets static analysis.

Instruction systems usually get vibes.
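To make that concrete, here is a minimal sketch of what deterministic checks over an instruction file can look like. Everything here is illustrative: the regexes, thresholds, and function names are placeholders I chose for this post, not the implementation of any real tool.

```python
import re

# Illustrative heuristic: directives tend to open with imperatives or modals.
# This word list is a placeholder, not a real analyzer's rule set.
DIRECTIVE_PATTERN = re.compile(
    r"^\s*(always|never|must|do not|don't|use|run|avoid|prefer)\b", re.IGNORECASE
)

def directive_ratio(text: str) -> float:
    """Share of sentences that directly tell the model what to do."""
    sentences = [s.strip() for s in re.split(r"[.!?]\s+|\n+", text) if s.strip()]
    if not sentences:
        return 0.0
    directives = sum(1 for s in sentences if DIRECTIVE_PATTERN.match(s))
    return directives / len(sentences)

def heading_density(text: str) -> float:
    """Share of non-empty lines that are markdown headings (overuse signal)."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(1 for line in lines if line.lstrip().startswith("#")) / len(lines)
```

Checks like these are deterministic and reproducible: run them twice on the same file and you get the same numbers, which is exactly the property LLM-as-judge approaches lack.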

What we measured

We built an analyzer that treats instruction files as structured objects with measurable properties. Deterministic. Reproducible. No LLM-as-judge.

I am running it across a large live corpus of real repositories. The full run completes this week; what follows is what the partial sample already shows: stable enough to publish, but not yet the full picture.

Quality is reported on a 0-to-100 scale: 0 means the file produces no measurable influence on model behavior; 100 is the highest score the framework can assign.

A fresh aggregation over 12,076 completed instruction-file scans is virtually identical to an earlier 9,582-repo sample:

  • bottom tier: 40.3% vs 40.1%
  • top tier: 12.1% vs 12.2%
  • mean quality score: 27 vs 27
  • directive content ratio: 27.9% vs 27.9% (the share of instruction sentences that directly tell the model what to do)

That matters because it means the pattern is stable.

This does not look like a small-sample artifact.

And the strongest finding is not what I expected.

More rules, lower quality

The common response to bad agent behavior is to add more rules.

More files. More guidance. More scoping. More edge-case coverage.

The corpus says that strategy tends to backfire.

Across 12,076 repositories, instruction quality falls as instruction-file count rises:

Files per repo     N      Mean score   Bottom tier %   Top tier %
1                  4681   28           46.3%           16.9%
2-5                4796   26           37.3%            9.5%
6-20               1972   26           36.0%            8.8%
21-50               438   25           31.3%            5.7%
51-500              186   25           33.3%            5.4%
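The aggregation behind a table like this is simple to reproduce. The bucket ranges below mirror the table; the helper names and the input shape are hypothetical, chosen for this sketch:

```python
from collections import defaultdict
from typing import Optional

# Buckets mirror the table above: (low, high, label). Illustrative only.
BUCKETS = [(1, 1, "1"), (2, 5, "2-5"), (6, 20, "6-20"),
           (21, 50, "21-50"), (51, 500, "51-500")]

def bucket_label(file_count: int) -> Optional[str]:
    """Map an instruction-file count onto its bucket label, if any."""
    for lo, hi, label in BUCKETS:
        if lo <= file_count <= hi:
            return label
    return None

def aggregate(repos):
    """repos: iterable of (instruction_file_count, quality_score) pairs.
    Returns mean quality score per file-count bucket."""
    groups = defaultdict(list)
    for count, score in repos:
        label = bucket_label(count)
        if label:
            groups[label].append(score)
    return {label: sum(scores) / len(scores) for label, scores in groups.items()}
```

The same grouping pattern extends to the bottom-tier and top-tier shares: count the scores below and above the tier cutoffs inside each bucket instead of averaging.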

The key number is the top-tier share.

It collapses from 16.9% in single-file setups to 5.4% in repositories with 51 to 500 instruction files.

That is a roughly 3x drop.

The article version of that finding is simple:

Developers respond to bad agent behavior by adding more rules. In the corpus, that strategy correlates with a 3x collapse in the probability of landing in the top tier.

That does not prove file count causes low quality by itself.

But it does show that rule proliferation is not rescuing these systems. At scale, it is associated with weaker instruction quality, not stronger.

The sweet spot

There is also a more subtle result in the partial sample. Instruction quality appears to be non-monotonic in directive density: more directives help at first, then stop helping, and past a point start to hurt.

The full curve is in next week’s piece. The short version is that there is an optimal density range, after which additional directives stop strengthening the system.

Enough force to bind behavior. Not so much that the system turns into an overpacked rules document.
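The shape of that claim is an inverted U. The real framework's curve and parameters are not published here, so the function below is purely an illustration of the shape, with placeholder numbers I made up for this sketch:

```python
def density_bonus(density: float, optimal: float = 0.3, width: float = 0.2) -> float:
    """Illustrative inverted-U weighting for directive density.

    Scores peak at an assumed optimum and fall off on both sides:
    too few directives fail to bind behavior, too many crowd each other out.
    The `optimal` and `width` values are placeholders, not measured findings.
    """
    return max(0.0, 1.0 - ((density - optimal) / width) ** 2)
```

Under these placeholder parameters, a file with no directives and a file that is wall-to-wall directives both score near zero, while a file near the assumed optimum scores highest.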

A real example

Here is the kind of instruction block the corpus is full of:

# Code should be clear, well documented, clear PHPDocs.

# Code must meet SOLID DRY KISS principles.

# Should be compatible with PSR standards when it need.

# Take care about performance

It is not malicious. It is not absurd.

It is just weak.

Everything is abstract. Nothing is anchored. Headings are doing the work prose should do. The agent can read it, represent it, and still walk past most of it.

Now compare:

Never use `var_dump()` or `dd()` in committed code. Use `Log::debug()` instead.
Run `./vendor/bin/phpstan analyse src/` before every commit. Level 6 minimum.

Same general intent. Completely different binding strength.

The second version names the construct, names the alternative, names the command, and names the threshold. It gives the model something concrete to hold onto.

That is what diagnostics should make visible.
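A diagnostic for exactly this difference can be sketched deterministically. The heuristic below flags an instruction as "anchored" when it names a concrete construct (inline code, a function call, a command path, or a numeric threshold) and as "abstract" when it leans on vague quality words. The patterns and term list are illustrative assumptions, not a real tool's rules:

```python
import re

# Anchored: inline code in backticks, a call like foo(), a number, or a ./path.
ANCHOR_PATTERN = re.compile(r"`[^`]+`|\b\w+\(\)|\b\d+\b|\./\S+")

# Abstract quality words that rarely bind behavior on their own. Placeholder list.
ABSTRACT_TERMS = re.compile(
    r"\b(clear|clean|performant|maintainable|best practices|SOLID|DRY|KISS)\b",
    re.IGNORECASE,
)

def specificity(instruction: str) -> str:
    """Classify one instruction line by how concretely it is anchored."""
    if ANCHOR_PATTERN.search(instruction):
        return "anchored"
    if ABSTRACT_TERMS.search(instruction):
        return "abstract"
    return "unclassified"
```

Run against the two examples above, the weak block classifies as abstract on every line, while both strong lines classify as anchored.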

What this means

Output guardrails still matter.

Prompt evaluation still matters.

Safety systems still matter.

But they do not answer the upstream question: Are the instructions themselves well-formed?

If the answer is no, then a large class of downstream failures will keep showing up as mysterious agent unreliability when the real problem is earlier and simpler.

The agent loaded the instruction and walked past it.

That is often not a model problem.

It is an input problem.

And input quality is measurable.

What’s next

These are corpus-level findings from a partial sample, not universal laws.

The sample is still in flight. The strongest claims here are about association, not proof of causality. Specific conflict-count case studies need source verification before publication. Popularity weighting is not yet applied, so “40% of repositories score in the bottom tier” is not the same claim as “40% of production agent work scores in the bottom tier.”

The full corpus run completes this week. Next week I publish the end-of-run analysis across the full sample — the complete distribution, the cross-cuts the partial sample cannot yet support, and the specific case studies this article deliberately held back. If you want to know where your stack lands, that is the piece to come back for.

For now, the central pattern is already stable enough to matter:

The ecosystem keeps responding to weak agent behavior by adding more instructions, while the corpus shows that more instruction files are usually associated with lower measured quality.

That is the undiagnosed input problem.

Not that instructions do not matter.
That they matter, measurably, and most teams still have no way to see whether theirs are helping or hurting.


This is part of the Instruction Best Practices series. Previous: Do NOT Think of a Pink Elephant, Precision Beats Clarity, 7 Formatting Rules for the Machine. I’m building instruction diagnostics for coding agents. Follow for the full corpus analysis.
