<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Reporails</title>
    <description>The latest articles on DEV Community by Reporails (@reporails).</description>
    <link>https://dev.to/reporails</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F12470%2Feb6c7166-21ab-442d-a75f-7ec3c525f1cd.png</url>
      <title>DEV Community: Reporails</title>
      <link>https://dev.to/reporails</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/reporails"/>
    <language>en</language>
    <item>
      <title>The State of AI Instruction Quality</title>
      <dc:creator>Gábor Mészáros</dc:creator>
      <pubDate>Tue, 21 Apr 2026 12:41:52 +0000</pubDate>
      <link>https://dev.to/reporails/the-state-of-ai-instruction-quality-35mn</link>
      <guid>https://dev.to/reporails/the-state-of-ai-instruction-quality-35mn</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Everybody has opinions about AGENTS.md/CLAUDE.md files. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Best practices get shared, templates get copied, and this folk knowledge dominates the industry. Last year, &lt;a href="https://github.blog/ai-and-ml/github-copilot/how-to-write-a-great-agents-md-lessons-from-over-2500-repositories/" rel="noopener noreferrer"&gt;GitHub analyzed 2,500 repos&lt;/a&gt; and published best-practice advice. We wanted to go further: measure at scale, publish the data, and let anyone verify.&lt;/p&gt;

&lt;p&gt;When the agent doesn't follow instructions and does something contradictory, the usual suspects are: &lt;em&gt;the model is inconsistent, LLMs are not deterministic, you need better guardrails, you need retries.&lt;/em&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The failures almost always get attributed to the model.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So we decided to measure. We built a diagnostic tool &lt;strong&gt;that treats instruction files as structured objects with measurable properties&lt;/strong&gt;. Deterministic. Reproducible. No LLM-as-judge. Then we pointed it at GitHub repositories with instruction files for five agents - &lt;strong&gt;Claude, Codex, Copilot, Cursor, and Gemini&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;28,721 repositories. 165,063 files. 3.3 million instructions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;... and one question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What if the instructions are the problem?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;The dataset&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;28,721 projects.&lt;/strong&gt; Sourced from GitHub via API search, cloned, and deterministically analyzed. Each project was scanned for instruction files across five coding agents — then deduplicated to remove false positives from agent detection overlap.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Projects&lt;/th&gt;
&lt;th&gt;% of corpus&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;12,356&lt;/td&gt;
&lt;td&gt;43.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex&lt;/td&gt;
&lt;td&gt;11,206&lt;/td&gt;
&lt;td&gt;39.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;td&gt;7,755&lt;/td&gt;
&lt;td&gt;27.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;7,291&lt;/td&gt;
&lt;td&gt;25.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;5,942&lt;/td&gt;
&lt;td&gt;20.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0j1xpnj80ntk84v8g6e3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0j1xpnj80ntk84v8g6e3.png" alt="Claude leads adoption at 43%, but all five agents have significant presence."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The percentages add up to more than 100% because &lt;strong&gt;37% of projects configure multiple agents&lt;/strong&gt;. More on that later.&lt;/p&gt;

&lt;p&gt;Key distributions stabilized early. A 9,582-repo sub-sample produced tier shares within ±0.2 percentage points of, and the same mean scores as, the 12,076-repo intermediate sample. The final 28,721-repo corpus moved nothing. The patterns reported below are not small-sample artifacts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All classifications are deterministic&lt;/strong&gt; — the same file produces the same result every time. No LLM-as-judge. Sample classifications are published for inspection (methodology below). The tool is &lt;a href="https://github.com/reporails/cli" rel="noopener noreferrer"&gt;source-available&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;How we measured&lt;/h2&gt;

&lt;p&gt;The analyzer parses each instruction file into &lt;strong&gt;atoms&lt;/strong&gt; — the smallest semantically distinct units of content. A heading is one atom. A bullet point is one atom. A paragraph is one atom. Each atom gets classified along a few dimensions, all deterministic, no LLM involved:&lt;/p&gt;
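&lt;p&gt;&lt;em&gt;A minimal sketch of the atom-splitting step, assuming a simplified markdown model; the function name and heuristics are illustrative, not the reporails implementation:&lt;/em&gt;&lt;/p&gt;

```python
# Illustrative sketch: split a markdown instruction file into "atoms".
# A heading is one atom, a bullet is one atom, a paragraph is one atom.
# The real analyzer handles more markdown constructs than this.
import re


def split_atoms(markdown: str) -> list[str]:
    atoms: list[str] = []
    paragraph: list[str] = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line closes the current paragraph atom.
            if paragraph:
                atoms.append(" ".join(paragraph))
                paragraph = []
        elif stripped.startswith("#") or re.match(r"[-*+]\s", stripped):
            # Headings and bullet points each become their own atom.
            if paragraph:
                atoms.append(" ".join(paragraph))
                paragraph = []
            atoms.append(stripped)
        else:
            paragraph.append(stripped)
    if paragraph:
        atoms.append(" ".join(paragraph))
    return atoms
```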

&lt;p&gt;&lt;strong&gt;Charge classification.&lt;/strong&gt; A three-phase pipeline determines whether an atom is a directive ("use X"), a constraint ("do not use Y"), neutral content (context, explanation, structure), or ambiguous (could be read either way). Phase 1 detects negation and prohibition patterns. Phase 2 detects modal auxiliaries and direct commands. Phase 3 uses syntactic dependency parsing to catch imperatives that the first two phases missed. First definitive match wins. Atoms that partially match but don't clear any phase are marked ambiguous. Everything else is neutral.&lt;/p&gt;
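&lt;p&gt;&lt;em&gt;The three phases can be sketched as an ordered pipeline where the first definitive match wins. This regex-only stand-in is illustrative (the real phase 3 uses dependency parsing, and the ambiguous category is omitted here):&lt;/em&gt;&lt;/p&gt;

```python
# Simplified sketch of the three-phase charge pipeline. Word lists are
# illustrative assumptions, not the reporails rule set.
import re

# Phase 1: negation and prohibition patterns.
CONSTRAINT = re.compile(r"\b(do not|don't|never|avoid|must not)\b", re.I)
# Phase 2: modal auxiliaries and direct commands.
DIRECTIVE = re.compile(r"\b(must|should|always|use|run|prefer)\b", re.I)
# Phase 3: leading imperative verb (stand-in for dependency parsing).
IMPERATIVE = re.compile(r"^(format|add|write|keep|check|test)\b", re.I)


def classify_charge(atom: str) -> str:
    if CONSTRAINT.search(atom):
        return "constraint"
    if DIRECTIVE.search(atom):
        return "directive"
    if IMPERATIVE.match(atom.strip()):
        return "directive"
    return "neutral"
```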

&lt;p&gt;&lt;strong&gt;Specificity.&lt;/strong&gt; Binary: does the instruction name a specific construct — a tool, file, command, flag, function, or config key — or does it stay at the category level? "Use consistent formatting" is abstract. "Format with &lt;code&gt;ruff format&lt;/code&gt;" is named. This is a text property, not a judgment call.&lt;/p&gt;
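&lt;p&gt;&lt;em&gt;A hedged sketch of the binary check: does an instruction name a concrete construct (code span, file, flag, command)? The patterns here are illustrative heuristics, not the exact reporails rules:&lt;/em&gt;&lt;/p&gt;

```python
# Illustrative heuristics for "named vs abstract". An instruction counts
# as named if it references a concrete construct by name.
import re

NAMED_PATTERNS = [
    re.compile(r"`[^`]+`"),                                    # code span: `ruff format`
    re.compile(r"\b\S+\.(py|md|json|toml|yml|yaml|ts|js)\b"),  # file name
    re.compile(r"\s--?[A-Za-z][\w-]*\b"),                      # CLI flag: -v, --fix
    re.compile(r"\b(npx|npm|pip|uv|cargo|pytest)\s+\S+"),      # command invocation
]


def is_named(instruction: str) -> bool:
    return any(p.search(instruction) for p in NAMED_PATTERNS)
```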

&lt;p&gt;&lt;strong&gt;File categorization.&lt;/strong&gt; Each file is classified as base config (your main CLAUDE.md or .cursorrules), a rule file, a skill definition, or a sub-agent definition — based on file path conventions for each agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content type.&lt;/strong&gt; Charge classification separates behavioral content (directives and constraints) from structural content (headings, context paragraphs, examples). That's how we know what fraction of your file is actually doing work.&lt;/p&gt;

&lt;p&gt;The full tool is source-available (&lt;a href="https://github.com/reporails/cli/blob/main/LICENSE" rel="noopener noreferrer"&gt;BUSL-1.1&lt;/a&gt;). You can run &lt;code&gt;npx @reporails/cli check&lt;/code&gt; on your own project and inspect every finding. More on that at the end.&lt;/p&gt;




&lt;h2&gt;Finding 1: Most of your instruction file isn't instructions&lt;/h2&gt;

&lt;p&gt;Here's what the median instruction file actually contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;50 content items&lt;/strong&gt; total&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;12 of those are actual directives&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;The rest is headings, context paragraphs, examples, structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg45f7k3xx4n6naso8gok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg45f7k3xx4n6naso8gok.png" alt="Median instruction file: 50 content items, 12 actual directives. The rest is structure."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Only 27% of your instruction file is doing what you think it does.&lt;/strong&gt; &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The other 73% is scaffolding. Headings that organize but don't instruct. Explanation paragraphs that compete for the model's attention without adding behavioral weight. Example blocks. Context-setting prose.&lt;/p&gt;

&lt;p&gt;That's not inherently bad. Structure matters. But if you're writing a 200-line CLAUDE.md and only 54 lines are actual instructions, you should probably know that.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The average instruction is &lt;strong&gt;8.9 words&lt;/strong&gt; long. That's a sentence fragment.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;Finding 2: 90% of instructions don't name what they're talking about&lt;/h2&gt;

&lt;p&gt;This is the big one.&lt;/p&gt;

&lt;p&gt;We measured whether each instruction references specific tools, files, commands, or constructs by name — or whether it stays at the category level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two-thirds of all instructions are abstract.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Names specific constructs&lt;/th&gt;
&lt;th&gt;Uses category language&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;39.3%&lt;/td&gt;
&lt;td&gt;60.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex&lt;/td&gt;
&lt;td&gt;38.3%&lt;/td&gt;
&lt;td&gt;61.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;td&gt;33.3%&lt;/td&gt;
&lt;td&gt;66.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;30.8%&lt;/td&gt;
&lt;td&gt;69.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;30.6%&lt;/td&gt;
&lt;td&gt;69.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;What does this look like in practice?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abstract&lt;/strong&gt;: "Use consistent code formatting"&lt;br&gt;
&lt;strong&gt;Specific&lt;/strong&gt;: "Format with &lt;code&gt;ruff format&lt;/code&gt; before committing"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abstract&lt;/strong&gt;: "Avoid using mocks in tests"&lt;br&gt;
&lt;strong&gt;Specific&lt;/strong&gt;: "Do not use &lt;code&gt;unittest.mock&lt;/code&gt; — use the real database via &lt;code&gt;test_db&lt;/code&gt; fixture"&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/cleverhoods/instruction-best-practices-precision-beats-clarity-lod"&gt;previous controlled experiments&lt;/a&gt;, specificity produced a 10.9x odds ratio in compliance (N=1000, p&amp;lt;10⁻³⁰). The instruction that names the exact construct gets followed. The one that describes it abstractly... mostly doesn't. This is consistent with independent findings from RuleArena (&lt;a href="https://arxiv.org/abs/2412.08972" rel="noopener noreferrer"&gt;Zhou et al., ACL 2025&lt;/a&gt;), where LLMs struggled systematically with complex rule-following tasks — even strong models fail when the rules themselves are ambiguous or underspecified.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;89.9% of all agent configurations&lt;/strong&gt; contain at least one instruction that doesn't name what it means. It's not a few projects. It's nearly everyone.&lt;/p&gt;


&lt;h2&gt;Finding 3: &lt;code&gt;agents.md&lt;/code&gt; is the most common instruction file&lt;/h2&gt;

&lt;p&gt;Before we get into quality, let's look at what people are actually naming their files:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;agents.md&lt;/code&gt; / &lt;code&gt;AGENTS.md&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;20,654&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;claude.md&lt;/code&gt; / &lt;code&gt;CLAUDE.md&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;14,014&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;gemini.md&lt;/code&gt; / &lt;code&gt;GEMINI.md&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;5,703&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.github/copilot-instructions.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5,647&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.cursorrules&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2,415&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;49,071 unique file paths&lt;/strong&gt; across the corpus. That's not a typo. The format fragmentation is real.&lt;/p&gt;

&lt;p&gt;A few things jumped out:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;claude.md&lt;/code&gt; (lowercase, 10,642) is &lt;strong&gt;3x more common&lt;/strong&gt; than &lt;code&gt;CLAUDE.md&lt;/code&gt; (3,372). Both work. The community clearly prefers lowercase.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;agents.md&lt;/code&gt; dominates — the Codex/generic format is the single most popular instruction file name.&lt;/li&gt;
&lt;li&gt;Skills and rules are already showing up in meaningful numbers: &lt;code&gt;.claude/rules/testing.md&lt;/code&gt; (422), &lt;code&gt;.agents/skills/tailwindcss-development/skill.md&lt;/code&gt; (334).&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;Finding 4: Different agents, completely different config philosophies&lt;/h2&gt;

&lt;p&gt;Not all agents are configured the same way. Not even close.&lt;/p&gt;

&lt;p&gt;We categorized every file into four types: &lt;strong&gt;base config&lt;/strong&gt; (your main CLAUDE.md, .cursorrules, etc.), &lt;strong&gt;rules&lt;/strong&gt; (scoped rule files), &lt;strong&gt;skills&lt;/strong&gt; (task-specific skill definitions), and &lt;strong&gt;sub-agents&lt;/strong&gt; (role-based agent definitions).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Base&lt;/th&gt;
&lt;th&gt;Rules&lt;/th&gt;
&lt;th&gt;Skills&lt;/th&gt;
&lt;th&gt;Sub-agents&lt;/th&gt;
&lt;th&gt;Total files&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Claude&lt;/td&gt;
&lt;td&gt;18,733&lt;/td&gt;
&lt;td&gt;4,638&lt;/td&gt;
&lt;td&gt;10,692&lt;/td&gt;
&lt;td&gt;10,538&lt;/td&gt;
&lt;td&gt;44,601&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor&lt;/td&gt;
&lt;td&gt;5,903&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;19,843&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6,237&lt;/td&gt;
&lt;td&gt;1,716&lt;/td&gt;
&lt;td&gt;33,699&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Copilot&lt;/td&gt;
&lt;td&gt;16,026&lt;/td&gt;
&lt;td&gt;4,486&lt;/td&gt;
&lt;td&gt;10,352&lt;/td&gt;
&lt;td&gt;3,012&lt;/td&gt;
&lt;td&gt;33,876&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex&lt;/td&gt;
&lt;td&gt;19,001&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;8,911&lt;/td&gt;
&lt;td&gt;165&lt;/td&gt;
&lt;td&gt;28,158&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;10,253&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;3,039&lt;/td&gt;
&lt;td&gt;53&lt;/td&gt;
&lt;td&gt;13,419&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3bqyzkqqkzg0cs1vag1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3bqyzkqqkzg0cs1vag1.png" alt="Cursor is 60% rules files. Codex is 68% base config. Same goal, completely different structure."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor is 60% rules files.&lt;/strong&gt; The &lt;code&gt;.cursor/rules/&lt;/code&gt; system dominates its configuration surface. One agent's config looks nothing like another's.&lt;/p&gt;

&lt;p&gt;Claude is the only agent with a roughly balanced architecture across all four config types. Codex and Gemini are almost entirely base config — single-file setups.&lt;/p&gt;

&lt;p&gt;The median Cursor project has &lt;strong&gt;3 instruction files&lt;/strong&gt;. The median Codex project has &lt;strong&gt;1&lt;/strong&gt;. These aren't just different tools. They're different &lt;em&gt;configuration philosophies&lt;/em&gt;.&lt;/p&gt;


&lt;h2&gt;Finding 5: 37% of projects configure multiple agents&lt;/h2&gt;

&lt;p&gt;10,620 projects in the corpus target two or more agents. That's not a niche pattern — it's over a third of all projects.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agents&lt;/th&gt;
&lt;th&gt;Projects&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;18,101&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;6,776&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2,687&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;949&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;208&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bnski07fneblyvftsat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4bnski07fneblyvftsat.png" alt="Over a third of projects configure instructions for multiple coding agents."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dominant pair is &lt;strong&gt;Claude + Codex&lt;/strong&gt; (5,038 projects). Makes sense — &lt;code&gt;CLAUDE.md&lt;/code&gt; + &lt;code&gt;AGENTS.md&lt;/code&gt; is the most natural multi-agent starting point.&lt;/p&gt;

&lt;p&gt;Here's what's interesting about multi-agent repos: &lt;strong&gt;the same developer, writing instructions at the same time, for the same project, produces measurably different instruction quality across agents.&lt;/strong&gt; The person didn't change. The project didn't change. The instruction format did.&lt;/p&gt;

&lt;p&gt;Some of that is structural. Cursor's &lt;code&gt;.mdc&lt;/code&gt; rules enforce a different format than Claude's markdown. Codex's &lt;code&gt;AGENTS.md&lt;/code&gt; invites a different writing style than Copilot's &lt;code&gt;copilot-instructions.md&lt;/code&gt;. The format shapes the content.&lt;/p&gt;


&lt;h2&gt;Finding 6: The most-copied skills are the vaguest&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting.&lt;/p&gt;

&lt;p&gt;13,309 unique skills across the corpus. Some of them appear in hundreds of repos — clearly copied from shared templates or community sources. So we measured them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Named%&lt;/strong&gt; = what fraction of a skill's instructions name a specific tool, file, or command (instead of using category language).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Repos&lt;/th&gt;
&lt;th&gt;Named%&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;frontend-design&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;271&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.8%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Almost entirely abstract advice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;web-design-guidelines&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;197&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10.2%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generic design principles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vercel-react-best-practices&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;315&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mix of specific and vague&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pest-testing&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;216&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;55.1%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Names actual test constructs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;livewire-development&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;87&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;75.5%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Names specific Livewire components&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;next-best-practices&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;76&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92.6%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Names almost everything&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;frontend-design&lt;/code&gt; is in 271 repos with 2.8% specificity. It's a wall of "follow responsive design principles" and "ensure accessibility compliance." That reads well. It sounds professional. It gives the model almost nothing concrete to act on.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;next-best-practices&lt;/code&gt; is in 76 repos with 92.6% specificity. It says things like "use &lt;code&gt;next/image&lt;/code&gt; for all images" and "prefer &lt;code&gt;server&lt;/code&gt; components over &lt;code&gt;client&lt;/code&gt;." It reads like a checklist. It tells the model exactly what to do.&lt;/p&gt;

&lt;p&gt;One is shared 3.5x more than the other.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The most popular skills are the most decorative.&lt;/strong&gt; The well-written ones barely spread.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feeksw2bf5spqdgvfba2x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feeksw2bf5spqdgvfba2x.png" alt="Each bubble is a community skill. The most popular ones cluster in the top-left — widely adopted, almost entirely abstract."&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;The best and worst skills (&amp;gt;50 repos)&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Most specific:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Repos&lt;/th&gt;
&lt;th&gt;Named%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;next-best-practices&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;76&lt;/td&gt;
&lt;td&gt;92.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;shadcn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;74&lt;/td&gt;
&lt;td&gt;82.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;livewire-development&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;87&lt;/td&gt;
&lt;td&gt;75.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pest-testing&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;216&lt;/td&gt;
&lt;td&gt;55.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;laravel-best-practices&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;49.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Most vague:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Repos&lt;/th&gt;
&lt;th&gt;Named%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openspec-explore&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;td&gt;2.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;frontend-design&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;271&lt;/td&gt;
&lt;td&gt;2.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;web-design-guidelines&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;197&lt;/td&gt;
&lt;td&gt;10.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vercel-composition-patterns&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;131&lt;/td&gt;
&lt;td&gt;10.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;find-skills&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;113&lt;/td&gt;
&lt;td&gt;18.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice a pattern? The Laravel/Livewire ecosystem produces specific skills. The generic frontend/design ones stay abstract. &lt;strong&gt;Domain-specific communities write better instructions than cross-cutting ones.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;Finding 7: Sub-agents are almost entirely persona prompts&lt;/h2&gt;

&lt;p&gt;5,526 unique sub-agent roles in the corpus. Developers are building agent teams: code reviewers, architects, debuggers, testers, security auditors.&lt;/p&gt;

&lt;p&gt;The problem? &lt;strong&gt;Sub-agents are the most abstract config type in the entire corpus.&lt;/strong&gt; Only 17% of sub-agent instructions name specific constructs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Repos&lt;/th&gt;
&lt;th&gt;Named%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;code-reviewer.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;236&lt;/td&gt;
&lt;td&gt;14.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;architect.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;89&lt;/td&gt;
&lt;td&gt;18.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;debugger.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;9.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;security-auditor.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;57&lt;/td&gt;
&lt;td&gt;14.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;test-runner.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;54&lt;/td&gt;
&lt;td&gt;10.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;frontend-developer.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;td&gt;9.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;p&gt;Most of these are persona prompts. "You are a senior code reviewer. You care about code quality, security, and maintainability." That's a role description, not an instruction set. It tells the model &lt;em&gt;who to be&lt;/em&gt;, not &lt;em&gt;what to do&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Compare this to a base config that says "run &lt;code&gt;uv run pytest tests/ -v&lt;/code&gt; before suggesting any commit" — that's 100% named, and the model knows exactly what action to take.&lt;/p&gt;


&lt;h2&gt;The anatomy chart: more directives, worse quality&lt;/h2&gt;

&lt;p&gt;Here's where it all comes together.&lt;/p&gt;

&lt;p&gt;We measured three things for each config type: how big the files are, how many directives they contain, and what fraction of those directives actually name something specific.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8i6m6ud3s9dwnj3jft3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk8i6m6ud3s9dwnj3jft3.png" alt="Sub-agents have the most directives per file — and the least specific ones. More instructions doesn’t mean better instructions."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sub-agents have the &lt;strong&gt;largest&lt;/strong&gt; files (61 items median), the &lt;strong&gt;most&lt;/strong&gt; directives (17), and the &lt;strong&gt;worst&lt;/strong&gt; specificity (17%). They're the wordiest config type in the corpus and the least effective.&lt;/p&gt;

&lt;p&gt;Base configs are the opposite. Fewer directives (11), but 40% of them name specific constructs. The developer writing their own CLAUDE.md by hand, for their own project, produces the most actionable instructions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config type&lt;/th&gt;
&lt;th&gt;Files&lt;/th&gt;
&lt;th&gt;Median size&lt;/th&gt;
&lt;th&gt;Median directives&lt;/th&gt;
&lt;th&gt;Specificity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Base configs&lt;/td&gt;
&lt;td&gt;69,916&lt;/td&gt;
&lt;td&gt;50 items&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;39.8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rules files&lt;/td&gt;
&lt;td&gt;29,122&lt;/td&gt;
&lt;td&gt;34 items&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31.2%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skills&lt;/td&gt;
&lt;td&gt;39,231&lt;/td&gt;
&lt;td&gt;59 items&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30.8%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sub-agents&lt;/td&gt;
&lt;td&gt;15,484&lt;/td&gt;
&lt;td&gt;61 items&lt;/td&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;17.0%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern is clear: &lt;strong&gt;what developers write by hand is the most specific. What gets templated and shared gets progressively vaguer. And what tries hardest to sound authoritative — sub-agent persona prompts — is the most hollow.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;More instructions is not better instructions.&lt;/p&gt;

&lt;p&gt;Independent research supports the structural angle: FlowBench (&lt;a href="https://arxiv.org/abs/2406.14884" rel="noopener noreferrer"&gt;Xiao et al., 2024&lt;/a&gt;) found that presenting workflow knowledge in structured formats (flowcharts, numbered steps) improved LLM agent planning by 5-6 percentage points over prose — across GPT-4o, GPT-4-Turbo, and GPT-3.5-Turbo. Structure is not decoration. It changes what the model retrieves.&lt;/p&gt;


&lt;h2&gt;Limitations&lt;/h2&gt;

&lt;p&gt;Five things to know about these numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sampling bias.&lt;/strong&gt; GitHub API search, public repos only, English-skewed. Enterprise configurations, private repos, and non-English projects are not represented. This is not a random sample of all instruction files in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Classification accuracy.&lt;/strong&gt; The charge classifier is deterministic but not perfect. Edge cases exist: mixed-charge sentences, implicit constructs, domain jargon that looks like a category term but is actually a named tool. Specificity detection (named vs abstract) is simpler and more robust. Sample classifications are &lt;a href="https://github.com/reporails/30k-corpus" rel="noopener noreferrer"&gt;published for inspection&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Association, not causation.&lt;/strong&gt; "More directives correlate with lower specificity" is an observed pattern. We do not claim that adding directives &lt;em&gt;causes&lt;/em&gt; quality to drop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Snapshot.&lt;/strong&gt; Collected March–April 2026. Instruction practices are changing fast — &lt;code&gt;agents.md&lt;/code&gt; didn't exist six months ago. These numbers describe the ecosystem at collection time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No popularity weighting.&lt;/strong&gt; A 10-star hobby project counts the same as a 50K-star production repo. The distribution of instruction quality in &lt;em&gt;production&lt;/em&gt; agent work may differ.&lt;/p&gt;


&lt;h2&gt;
  
  
  What this means
&lt;/h2&gt;

&lt;p&gt;This isn't an article about AI models being bad at following instructions. The models are fine.&lt;/p&gt;

&lt;p&gt;This is an article about what we actually give them to work with.&lt;/p&gt;

&lt;p&gt;Most instruction files are three-quarters scaffolding. Two-thirds of the actual instructions don't name what they're talking about. The most popular community skills are the most decorative. Sub-agent definitions are the wordiest files in the corpus and the least specific.&lt;/p&gt;

&lt;p&gt;None of that is obvious from reading your own files. It wasn't obvious to us before we measured it. A well-structured CLAUDE.md &lt;em&gt;feels&lt;/em&gt; thorough. A shared skill with 271 repos &lt;em&gt;feels&lt;/em&gt; battle-tested. A sub-agent with 17 directives &lt;em&gt;feels&lt;/em&gt; comprehensive.&lt;/p&gt;

&lt;p&gt;Measurement shows something different.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://medium.com/@cleverhoods/the-undiagnosed-input-problem-03231442219d" rel="noopener noreferrer"&gt;The Undiagnosed Input Problem&lt;/a&gt;, I argued that the industry is great at inspecting outputs and weak at inspecting inputs. This corpus analysis is the evidence for that claim.&lt;/p&gt;

&lt;p&gt;The instruction files are there. The developers wrote them. They just have no way to know which parts are working and which parts are wallpaper.&lt;/p&gt;


&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;The analyzer we used for this corpus analysis is available as a CLI you can run against your own instruction files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/reporails/cli" rel="noopener noreferrer"&gt;Reporails&lt;/a&gt;&lt;/strong&gt; — instruction diagnostics for coding agents. Deterministic. No LLM-as-judge. 97 rules across structure, content, efficiency, maintenance, and governance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx @reporails/cli check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That scans your project, detects which agents are configured, and reports findings with specific line numbers and rule IDs. Here's what the output looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reporails — Diagnostics

  ┌─ Main (1)
  │ CLAUDE.md
  │   ⚠       Missing directory layout             CORE:C:0035
  │   ⚠ L9    7 of 7 instruction(s) lack reinfor…  CORE:C:0053
  │     ... and 16 more
  │
  └─ 21 findings

  Score: 7.9 / 10  ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░

  21 findings · 4 warnings · 1 info
  Compliance: HIGH
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The corpus analysis used the same classification pipeline at scale. Fix the findings, run again, watch your score improve.&lt;/p&gt;

&lt;h3&gt;
  
  
  The dataset
&lt;/h3&gt;

&lt;p&gt;The full corpus is published at &lt;strong&gt;&lt;a href="https://github.com/reporails/30k-corpus" rel="noopener noreferrer"&gt;reporails/30k-corpus&lt;/a&gt;&lt;/strong&gt;. Three files:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Records&lt;/th&gt;
&lt;th&gt;What it contains&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;repos.jsonl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;28,721&lt;/td&gt;
&lt;td&gt;Per-project record: agents configured, stars, language, license, topics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stats_public.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Every aggregate statistic in this article&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;validation_key.csv&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2,814&lt;/td&gt;
&lt;td&gt;Sample classifications with source text for inspection&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Verify any claim:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# "28,721 repositories"&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;repos.jsonl | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;

&lt;span class="c"&gt;# "43% Claude"&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;repos.jsonl | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
import sys, json
repos = [json.loads(l) for l in sys.stdin]
claude = sum(1 for r in repos if 'claude' in r['canonical_agents'])
print(f'{claude}/{len(repos)} = {claude/len(repos)*100:.1f}%')
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every number in every table traces to that dataset. If you disagree with a finding, count the rows.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is part of the Instruction Quality series. Previous: &lt;a href="https://medium.com/@cleverhoods/the-undiagnosed-input-problem-03231442219d" rel="noopener noreferrer"&gt;The Undiagnosed Input Problem&lt;/a&gt;. Related: &lt;a href="https://cleverhoods.medium.com/instruction-best-practices-precision-beats-clarity-e1bcae806671" rel="noopener noreferrer"&gt;Precision Beats Clarity&lt;/a&gt; · &lt;a href="https://cleverhoods.medium.com/do-not-think-of-a-pink-elephant-7d40a26cd072" rel="noopener noreferrer"&gt;Do Not Think of a Pink Elephant&lt;/a&gt; · &lt;a href="https://cleverhoods.medium.com/claude-md-best-practices-7-formatting-rules-for-the-machine-a591afc3d9a9" rel="noopener noreferrer"&gt;7 Formatting Rules for the Machine&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>The Undiagnosed Input Problem</title>
      <dc:creator>Gábor Mészáros</dc:creator>
      <pubDate>Wed, 08 Apr 2026 11:51:12 +0000</pubDate>
      <link>https://dev.to/reporails/the-undiagnosed-input-problem-4pmc</link>
      <guid>https://dev.to/reporails/the-undiagnosed-input-problem-4pmc</guid>
      <description>&lt;p&gt;The AI agent ecosystem has built a serious industry around controlling outputs. Guardrails. Safety classifiers. Output validation. Monitoring. Retry systems. Human review.&lt;/p&gt;

&lt;p&gt;All of that matters, but there is a simpler upstream question that still goes mostly unmeasured:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are the instructions any good?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That sounds obvious, &lt;strong&gt;yet it is not how the industry behaves.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When an agent fails to follow instructions, the usual explanations come fast:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Models are probabilistic&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Agents are inconsistent&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;You need stronger guardrails&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;You need better monitoring&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;You need retries&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;You need humans in the loop&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;… and while those explanations are right to a certain degree, they also have a side effect: &lt;strong&gt;they turn instruction quality into a blind spot.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ecosystem has become extremely good at inspecting what comes out of the model, and surprisingly weak at inspecting what goes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The symptom
&lt;/h2&gt;

&lt;p&gt;Consider &lt;a href="https://sierra.ai/blog/benchmarking-ai-agents" rel="noopener noreferrer"&gt;τ-bench&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It gives agents policy instructions and measures whether they follow them in realistic customer-service tasks. Airline and retail workflows. Real constraints. Real multi-step behavior.&lt;/p&gt;

&lt;p&gt;The benchmark result that gets repeated is the model result: even strong systems still fail a large share of tasks, and consistency across repeated attempts remains weak.&lt;/p&gt;

&lt;p&gt;The conclusion most people draw is straightforward: &lt;strong&gt;we need better models, better agents, better orchestration.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;My take: &lt;strong&gt;&lt;em&gt;Maybe&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But there is another question sitting underneath the benchmark:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Were the instructions themselves well-formed and well structured?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not just present. Not just long enough. Not just sincere.&lt;/p&gt;

&lt;p&gt;Well-formed. Well-structured. Well-organized.&lt;/p&gt;

&lt;p&gt;Specific enough to anchor behavior. Structured enough to survive context mixing. Non-conflicting across files. Positioned where the model can actually use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Those questions usually never get asked.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The industry response
&lt;/h2&gt;

&lt;p&gt;I had a conversation recently where a lead solutions architect put the standard view plainly:&lt;/p&gt;

&lt;p&gt;“&lt;em&gt;The instruction merely influences the probability distribution over outputs. It doesn’t override it.&lt;/em&gt;”&lt;/p&gt;

&lt;p&gt;That is right about the mechanism, but wrong about what follows from it.&lt;/p&gt;

&lt;p&gt;Yes, instructions operate probabilistically. &lt;strong&gt;But that does not mean all instructions are weak in the same way.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The shape of the distribution is not fixed. It changes with the properties of the instruction itself. Specificity sharpens it. Structure sharpens it. Conflict flattens it. Vague abstractions flatten it. Bad formatting can suppress it almost entirely.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Across my earlier controlled experiments, small changes in wording and placement produced large changes in compliance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://cleverhoods.medium.com/do-not-think-of-a-pink-elephant-7d40a26cd072" rel="noopener noreferrer"&gt;Instruction&lt;/a&gt; ordering moved compliance by 25 percentage points with the same model and the same directive.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cleverhoods.medium.com/instruction-best-practices-precision-beats-clarity-e1bcae806671" rel="noopener noreferrer"&gt;Specificity&lt;/a&gt; produced roughly a 10x compliance effect when the instruction named the exact construct instead of describing it abstractly.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cleverhoods.medium.com/claude-md-best-practices-7-formatting-rules-for-the-machine-a591afc3d9a9" rel="noopener noreferrer"&gt;Formatting&lt;/a&gt; changed whether the model reliably registered the instruction at all.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The problem is that most instruction systems are built without diagnostics.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;That is not an AI limitation. That is an engineering failure.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The folk system
&lt;/h2&gt;

&lt;p&gt;Right now, instruction practice spreads mostly through imitation.&lt;/p&gt;

&lt;p&gt;A popular repository posts “best practices” for Claude Code. Shared Cursor rules circulate as templates. People copy &lt;code&gt;AGENTS.md&lt;/code&gt; files between projects. Teams accumulate &lt;code&gt;CLAUDE.md&lt;/code&gt;, &lt;code&gt;.cursorrules&lt;/code&gt;, &lt;code&gt;copilot-instructions.md&lt;/code&gt;, and other project-specific rule files across multiple tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Copy, paste, hope, repeat.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some of that advice is useful. Almost none of it is tested in any controlled, reproducible way. That would be fine if instruction quality were self-evident. &lt;strong&gt;It is not.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A long instruction file can feel thorough while being internally contradictory. A highly opinionated ruleset can feel disciplined while producing almost no behavioral influence on the model.&lt;/p&gt;

&lt;p&gt;A sprawling multi-file setup can look sophisticated while making the system worse.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Without diagnostics, developers do not know which instructions are binding, which are noise, and which are actively interfering with each other.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The gap
&lt;/h2&gt;

&lt;p&gt;The tooling split is now pretty clear.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output tooling&lt;/strong&gt; is mature. Guardrails AI validates structure. Lakera focuses on prompt injection and security. NeMo Guardrails enforces safety and conversational rails. Llama Guard classifies risky content. The output edge is crowded.&lt;/p&gt;

&lt;p&gt;Prompt testing is real. Promptfoo, Braintrust, and LangSmith can all help evaluate behavior. But they are primarily black-box systems: did the prompt produce the output you wanted?&lt;/p&gt;

&lt;p&gt;That is useful.&lt;/p&gt;

&lt;p&gt;It is not the same as measuring the instruction artifact itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instruction-quality tooling&lt;/strong&gt; exists only in fragments. Some tools use LLM-as-judge. Some use deterministic local rules. But the category is still early, inconsistent, and mostly disconnected from measured behavioral outcomes.&lt;/p&gt;

&lt;p&gt;What is still largely missing is a deterministic way to inspect instruction files as engineered objects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how specific they are&lt;/li&gt;
&lt;li&gt;how directly they state intent&lt;/li&gt;
&lt;li&gt;whether they conflict across files&lt;/li&gt;
&lt;li&gt;whether they overuse headings&lt;/li&gt;
&lt;li&gt;whether they provide alternatives instead of bare prohibitions&lt;/li&gt;
&lt;li&gt;whether the system is getting denser while getting weaker&lt;/li&gt;
&lt;/ul&gt;
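&lt;p&gt;To illustrate what inspecting instruction files as engineered objects could look like, here is a toy sketch (illustrative only; Reporails' actual rule set is more involved) computing two of the properties above for a markdown instruction file: heading density, and prohibitions that name no alternative. The specific regexes are assumptions for the example.&lt;/p&gt;

```python
import re

def inspect_file(md: str) -> dict:
    """Two toy deterministic checks (illustrative; not Reporails' rules)."""
    lines = [l for l in md.splitlines() if l.strip()]
    headings = [l for l in lines if l.lstrip().startswith("#")]
    prohibitions = [l for l in lines
                    if re.search(r"\b(never|don't|do not|avoid)\b", l, re.I)]
    # A bare prohibition forbids something without naming an alternative.
    bare = [l for l in prohibitions if not re.search(r"\binstead\b", l, re.I)]
    return {
        "heading_ratio": len(headings) / len(lines) if lines else 0.0,
        "bare_prohibitions": len(bare),
    }

sample = """# Style
# Quality
Never use print statements.
Never use var_dump(); use Log::debug() instead.
"""
print(inspect_file(sample))  # {'heading_ratio': 0.5, 'bare_prohibitions': 1}
```

&lt;p&gt;Nothing here needs a model. Every check is a deterministic property of the text, which is the point.&lt;/p&gt;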

&lt;p&gt;Code gets static analysis.&lt;/p&gt;

&lt;p&gt;Instruction systems usually get &lt;em&gt;vibes&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we measured
&lt;/h2&gt;

&lt;p&gt;We built an analyzer that treats instruction files as structured objects with measurable properties. Deterministic. Reproducible. No LLM-as-judge.&lt;/p&gt;

&lt;p&gt;I am running it across a large live corpus of real repositories. The full run completes this week; what follows is what the partial sample already shows: stable enough to publish, not yet the full picture.&lt;/p&gt;

&lt;p&gt;Quality is reported on a 0-to-100 scale: &lt;code&gt;0&lt;/code&gt; means the file produces no measurable influence on model behavior, &lt;code&gt;100&lt;/code&gt; is the ceiling the framework can score.&lt;/p&gt;

&lt;p&gt;A fresh aggregation over &lt;strong&gt;12,076&lt;/strong&gt; completed instruction-file scans is virtually identical to an earlier &lt;strong&gt;9,582&lt;/strong&gt;-repo sample:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;bottom tier:&lt;/strong&gt; &lt;code&gt;40.3%&lt;/code&gt; vs &lt;code&gt;40.1%&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;top tier:&lt;/strong&gt; &lt;code&gt;12.1%&lt;/code&gt; vs &lt;code&gt;12.2%&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;mean quality score:&lt;/strong&gt; &lt;code&gt;27&lt;/code&gt; vs &lt;code&gt;27&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;directive content ratio:&lt;/strong&gt; &lt;code&gt;27.9%&lt;/code&gt; vs &lt;code&gt;27.9%&lt;/code&gt;, the share of instruction sentences that directly tell the model what to do&lt;/p&gt;

&lt;p&gt;That matters because it means the pattern is stable.&lt;/p&gt;

&lt;p&gt;This does not look like a small-sample artifact.&lt;/p&gt;

&lt;p&gt;And the strongest finding is not what I expected.&lt;/p&gt;
&lt;h2&gt;
  
  
  More rules, lower quality
&lt;/h2&gt;

&lt;p&gt;The common response to bad agent behavior is to add more rules.&lt;/p&gt;

&lt;p&gt;More files. More guidance. More scoping. More edge-case coverage.&lt;/p&gt;

&lt;p&gt;The corpus says that strategy tends to backfire.&lt;/p&gt;

&lt;p&gt;Across &lt;strong&gt;12,076&lt;/strong&gt; repositories, instruction quality falls as instruction-file count rises:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Files per repo     N      Mean score   Bottom tier %   Top tier %
1                  4681   28           46.3%           16.9%
2-5                4796   26           37.3%            9.5%
6-20               1972   26           36.0%            8.8%
21-50               438   25           31.3%            5.7%
51-500              186   25           33.3%            5.4%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key number is the top-tier share.&lt;/p&gt;

&lt;p&gt;It collapses from &lt;code&gt;16.9%&lt;/code&gt; in single-file setups to &lt;code&gt;5.4%&lt;/code&gt; in repositories with &lt;code&gt;51&lt;/code&gt; to &lt;code&gt;500&lt;/code&gt; instruction files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is a roughly 3x drop.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The article version of that finding is simple:&lt;/p&gt;

&lt;p&gt;Developers respond to bad agent behavior by adding more rules. In the corpus, that strategy correlates with a 3x collapse in the probability of landing in the top tier.&lt;/p&gt;

&lt;p&gt;That does not prove file count causes low quality by itself.&lt;/p&gt;

&lt;p&gt;But it does show that rule proliferation is not rescuing these systems. At scale, it is associated with weaker instruction quality, not stronger.&lt;/p&gt;

&lt;h2&gt;
  
  
  The sweet spot
&lt;/h2&gt;

&lt;p&gt;There is also a more subtle result in the partial sample. Instruction quality appears to be non-monotonic in directive density: more directives help at first, then stop helping, and past a point start to hurt.&lt;/p&gt;

&lt;p&gt;The full curve is in next week’s piece. The short version is that there is an optimal density range, after which additional directives stop strengthening the system.&lt;/p&gt;

&lt;p&gt;Enough force to bind behavior. Not so much that the system turns into an overpacked rules document.&lt;/p&gt;

&lt;h2&gt;
  
  
  A real example
&lt;/h2&gt;

&lt;p&gt;Here is the kind of instruction block the corpus is full of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Code should be clear, well documented, clear PHPDocs.

# Code must meet SOLID DRY KISS principles.

# Should be compatible with PSR standards when it need.

# Take care about performance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is not malicious. It is not absurd.&lt;/p&gt;

&lt;p&gt;It is just &lt;strong&gt;weak.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Everything is abstract. Nothing is anchored. Headings are doing the work prose should do. The agent can read it, represent it, and still walk past most of it.&lt;/p&gt;

&lt;p&gt;Now compare:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Never use &lt;span class="sb"&gt;`&lt;/span&gt;var_dump&lt;span class="o"&gt;()&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt; or &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nb"&gt;dd&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt; &lt;span class="k"&gt;in &lt;/span&gt;committed code. Use &lt;span class="sb"&gt;`&lt;/span&gt;Log::debug&lt;span class="o"&gt;()&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt; instead.
Run &lt;span class="sb"&gt;`&lt;/span&gt;./vendor/bin/phpstan analyse src/&lt;span class="sb"&gt;`&lt;/span&gt; before every commit. Level 6 minimum.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same general intent. Completely different binding strength.&lt;/p&gt;

&lt;p&gt;The second version names the construct, names the alternative, names the command, and names the threshold. &lt;strong&gt;It gives the model something concrete to hold onto.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is what diagnostics should make visible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means
&lt;/h2&gt;

&lt;p&gt;Output guardrails still matter.&lt;/p&gt;

&lt;p&gt;Prompt evaluation still matters.&lt;/p&gt;

&lt;p&gt;Safety systems still matter.&lt;/p&gt;

&lt;p&gt;But they do not answer the upstream question: &lt;strong&gt;Are the instructions themselves well-formed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the answer is no, then a large class of downstream failures will keep showing up as mysterious agent unreliability when the real problem is earlier and simpler.&lt;/p&gt;

&lt;p&gt;The agent loaded the instruction and walked past it.&lt;/p&gt;

&lt;p&gt;That is often not a model problem.&lt;/p&gt;

&lt;p&gt;It is an input problem.&lt;/p&gt;

&lt;p&gt;And input quality is measurable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next
&lt;/h2&gt;

&lt;p&gt;These are corpus-level findings from a partial sample, not universal laws.&lt;/p&gt;

&lt;p&gt;The sample is still in flight. The strongest claims here are about association, not proof of causality. Specific conflict-count case studies need source verification before publication. Popularity weighting is not yet applied, so “40% of repositories score in the bottom tier” is not the same claim as “40% of production agent work scores in the bottom tier.”&lt;/p&gt;

&lt;p&gt;The full corpus run completes this week. Next week I publish the end-of-run analysis across the full sample — the complete distribution, the cross-cuts the partial sample cannot yet support, and the specific case studies this article deliberately held back. If you want to know where your stack lands, that is the piece to come back for.&lt;/p&gt;

&lt;p&gt;For now, the central pattern is already stable enough to matter:&lt;/p&gt;

&lt;p&gt;The ecosystem keeps responding to weak agent behavior by adding more instructions, while the corpus shows that more instruction files are usually associated with lower measured quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is the undiagnosed input problem.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Not that instructions do not matter.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;That they matter, measurably, and most teams still have no way to see whether theirs are helping or hurting.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;This is part of the Instruction Best Practices series. Previous: &lt;a href="https://cleverhoods.medium.com/do-not-think-of-a-pink-elephant-7d40a26cd072" rel="noopener noreferrer"&gt;Do NOT Think of a Pink Elephant&lt;/a&gt;, &lt;a href="https://cleverhoods.medium.com/instruction-best-practices-precision-beats-clarity-e1bcae806671" rel="noopener noreferrer"&gt;Precision Beats Clarity&lt;/a&gt;, &lt;a href="https://cleverhoods.medium.com/claude-md-best-practices-7-formatting-rules-for-the-machine-a591afc3d9a9" rel="noopener noreferrer"&gt;7 Formatting Rules for the Machine&lt;/a&gt;. I’m building instruction diagnostics for coding agents. Follow for the full corpus analysis.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>claude</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
