Brad Kinnard
We Parsed 580 AI Instruction Files. 96% of the Content Can't Be Verified.

Every AI coding agent reads an instruction file. CLAUDE.md, AGENTS.md, .cursorrules, whatever your agent uses. You write rules in it. The agent says "Done." And you have no idea whether it followed any of them.

We wanted to know what's actually inside these files. Not what people think they contain, but what a machine can extract and verify through static analysis. So we scraped instruction files from 568 public GitHub repos with 10+ stars, ran them through a parser backed by 102 matchers across 8 verifier engines (AST, filesystem, regex, tree-sitter, preference, tooling, config-file, git-history), and counted what came out.

The short version: across the entire corpus, 3.8% of lines were extracted as verifiable coding rules. The other 96% is markdown headers, code examples, project descriptions, build commands, agent behavior directives, and contextual prose.

The dataset

580 instruction files from 568 repos, including Sentry (43k stars), PingCAP/TiDB (40k), Lerna (36k), Dragonfly (30k), Kubernetes/kops (17k), javascript-obfuscator (16k), RabbitMQ (14k), Google APIs (14k), Redpanda (12k), and hundreds of others. Six file formats represented: AGENTS.md (149 files), CLAUDE.md (111), .cursorrules (102), .windsurfrules (95), GEMINI.md (89), and copilot-instructions.md (34). This sample skews toward larger public repos. Enterprise internal repos with stricter governance, or solo projects with tightly scoped instruction files, may look different. We'd like to see that data.

The parser reads each file and classifies every line: is this a rule that can be checked against code, or is it something else? "Something else" includes headers, blank lines, code blocks, explanatory prose, build instructions, and agent personality configuration.
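As a rough illustration, that classification step can be sketched as a line-by-line heuristic. This is toy logic with made-up regexes, not RuleProbe's implementation, which uses 102 matchers across 8 verifier engines:

```typescript
// Hypothetical sketch: classify instruction-file lines as rule vs. non-rule.
type LineKind = "rule" | "other";

function classifyLine(line: string): LineKind {
  const trimmed = line.trim();
  // Skip markdown structure: blank lines, headers, fences, horizontal rules.
  if (trimmed === "" || /^(#|```|---|\*\*\*)/.test(trimmed)) return "other";
  // Crude rule signal: an imperative verb plus a concrete code-level target.
  const body = trimmed.replace(/^[-*]\s*/, "");
  const imperative = /^(use|prefer|avoid|never|always|no)\b/i.test(body);
  const concrete = /(camelCase|PascalCase|kebab-case|const|let|any|export|import)/.test(body);
  return imperative && concrete ? "rule" : "other";
}
```

A real parser needs far more than two heuristics, which is exactly why the extraction numbers below are interesting.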


Corpus stats: 8,222 total instruction lines parsed. 309 rules extracted. 7,913 lines classified as non-rule content.

What instruction files actually contain

The 96% that isn't rules breaks down into several categories. Some of it is necessary context (project structure explanations, build command documentation). Some of it is agent behavior configuration ("be succinct," "avoid providing explanations"). Some of it is just markdown formatting overhead.

Here's what stood out: 430 of the 580 files (74%) had zero extractable rules. Of those, 67 were completely empty to the parser: zero extracted, zero unparseable. Many were single-line redirects. Dragonfly's .cursorrules (30k stars) says "READ AGENTS.md." Umi's .cursorrules (16k stars) contains the single word "RULE.md." Mautic's GEMINI.md says "Read and follow all instructions in ./AGENTS.md."

At the other end, a few files were dense with rules. Apache Skywalking-java's CLAUDE.md extracted 6 rules from 26 lines (23%). Cloudflare chanfana's AGENTS.md: 5 rules from 21 lines (24%). But those files tend to be short, focused lists of concrete instructions.

The heavy files tell a different story. javascript-obfuscator's CLAUDE.md (16k stars): 197 lines, zero rules extracted. These files are documentation with no machine-verifiable instructions embedded.

Parse rate distribution across all 580 files
| Parse Rate | Files | Percentage |
|---|---|---|
| 0% (no rules) | 430 | 74.1% |
| 1-9% | 70 | 12.1% |
| 10-19% | 54 | 9.3% |
| 20-29% | 13 | 2.2% |
| 30-49% | 11 | 1.9% |
| >= 80% | 2 | 0.3% |

Only 2 files (0.3%) had parse rates at or above 80%. Nearly three quarters had zero.


Types of content the parser correctly skips

"3.8% extraction rate" sounds like the parser is broken. It isn't. These are lines that genuinely aren't rules:

- Markdown structure (headers, horizontal rules, blank lines)
- Code examples showing how to use a function or run a command
- Project descriptions explaining what the repo does
- Build and deployment instructions
- Links to external documentation
- Agent behavior directives that have no code-level representation ("be concise," "ask before making changes")
- Workflow instructions ("use this branch strategy," "run tests before pushing")

The parser isn't failing on these. It's correctly identifying them as not-rules. The denominator is every line in the file, not every line that looks like it could be a rule.

A second metric tells the complementary story. 150 of 580 files (25.9%) contained at least one extractable rule. Across those 150 files, 309 rules works out to an average of 2.1 rules per file. So only a quarter of instruction files contain anything enforceable at all, and when they do, they typically contain two rules. The 3.8% describes the corpus-wide line ratio. The 25.9% and 2.1-per-file numbers describe what rule-writers are actually producing.

What a "verifiable rule" looks like

The 309 rules that did get extracted map to concrete checks. Things like:

  • "Use camelCase for function names" (AST naming check)
  • "No any types" (TypeScript type safety check)
  • "Use named exports, not default exports" (import pattern check)
  • "Prefer const over let" (preference ratio check)
  • "Test files must exist for every source file" (filesystem check)
  • "Use Yarn, not npm" (tooling check)

Each rule gets a category, a verifier type (AST, filesystem, regex, tree-sitter, preference, tooling, config-file, or git-history), and a qualifier (always, prefer, when-possible, avoid-unless, try-to, never).
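A sketch of what an extracted rule record might look like. The field and type names here are assumptions for illustration, not RuleProbe's published API; only the verifier and qualifier vocabularies come from the analysis above:

```typescript
// Hypothetical shape for an extracted rule (field names are assumptions).
type Verifier =
  | "ast" | "filesystem" | "regex" | "tree-sitter"
  | "preference" | "tooling" | "config-file" | "git-history";

type Qualifier =
  | "always" | "prefer" | "when-possible" | "avoid-unless" | "try-to" | "never";

interface ExtractedRule {
  category: string;     // e.g. "naming", "type-safety"
  verifier: Verifier;   // which engine checks it
  qualifier: Qualifier; // how strictly it applies
  sourceLine: number;   // line in the instruction file
  text: string;         // the original instruction
}

const example: ExtractedRule = {
  category: "naming",
  verifier: "ast",
  qualifier: "always",
  sourceLine: 12,
  text: "Use camelCase for function names",
};
```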

Rule extraction by category
| Category | Rules Extracted |
|---|---|
| naming | 169 |
| structure | 44 |
| code-style | 29 |
| forbidden-pattern | 24 |
| type-safety | 20 |
| dependency | 12 |
| error-handling | 5 |
| import-pattern | 4 |
| test-requirement | 2 |

Naming rules dominate: 55% of all extracted rules. That's likely a combination of two factors. Naming conventions ("use camelCase," "kebab-case filenames") are the most concrete, unambiguous instructions people write, so they appear frequently. They're also the rule class that static analysis matchers handle most cleanly, so the parser has high affinity for them. We can't fully separate how much of the 55% is user behavior vs. parser strength, but both contribute.
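To make "static analysis handles naming cleanly" concrete, here's a deliberately simplified sketch of a camelCase function-name check. It's regex-based and my own toy logic; an AST matcher like the real ones would walk parsed declarations instead of scanning source text:

```typescript
// Toy camelCase check over `function name(...)` declarations.
const CAMEL_CASE = /^[a-z][a-zA-Z0-9]*$/;

function checkFunctionNames(source: string): string[] {
  const violations: string[] = [];
  // A real check would walk the AST; this only catches plain declarations.
  for (const m of source.matchAll(/function\s+([A-Za-z_$][\w$]*)\s*\(/g)) {
    if (!CAMEL_CASE.test(m[1])) violations.push(m[1]);
  }
  return violations;
}
```

The check is binary and local, which is what makes naming rules so tractable compared to, say, "write immutable code."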


Rule extraction by instruction file type
| Type | Files | Files with Rules | Rules | Total Lines | Rate |
|---|---|---|---|---|---|
| copilot-instructions.md | 34 | 13 | 33 | 556 | 5.9% |
| .cursorrules | 102 | 37 | 79 | 1,508 | 5.2% |
| AGENTS.md | 149 | 49 | 97 | 1,961 | 4.9% |
| .windsurfrules | 95 | 22 | 50 | 1,866 | 2.7% |
| CLAUDE.md | 111 | 20 | 38 | 1,501 | 2.5% |
| GEMINI.md | 89 | 9 | 12 | 830 | 1.4% |

copilot-instructions.md had the highest extraction rate (5.9%), likely because those files tend to be shorter and more prescriptive. GEMINI.md files had the lowest (1.4%).


E2E verification: does excalidraw follow its own instruction files?

This is a pipeline demonstration on one repo, not broad validation across ecosystems. We ran the full pipeline on excalidraw (~95k stars) because it's large, well-maintained, and has instruction files with extractable rules: both a CLAUDE.md and a copilot-instructions.md.

The parser found 9 verifiable rules across both files. Deterministic analysis scored 66.1% compliance. Semantic analysis (structural fingerprinting of 626 source files) produced 9 verdicts, all resolved via fast-path vector similarity. Zero LLM calls, zero cost:

| Rule | Compliance | Method |
|---|---|---|
| Prefer functional components | 0.976 | structural-fast-path |
| PascalCase type naming | 0.976 | structural-fast-path |
| Async try/catch usage | 0.983 | structural-fast-path |
| Contextual error logging | 0.979 | structural-fast-path |
| Yarn as package manager | 0.50 | no matching topic |
| TypeScript required | 0.50 | no matching topic |
| Optional chaining preference | 0.50 | no matching topic |
| camelCase variables | 0.50 | no matching topic |
| UPPER_CASE constants | 0.50 | no matching topic |

Rules that match established code pattern topics (component-structure, error-handling) score 0.97+, meaning the codebase's structural fingerprint strongly matches the instruction. The remaining five rules scored a neutral 0.50 because they describe tooling choices and naming conventions that don't have structural AST representations. That's itself a finding: even among the 4% of lines that get extracted as verifiable rules, some fall into categories that resist automated verification beyond simple presence checks. The verifier is real, but not comprehensive. No static analysis tool covers every rule class, and pretending otherwise would be dishonest.
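The fast path can be pictured as a similarity lookup between a rule's embedding and the codebase's topic fingerprints. This is an illustrative sketch under assumed mechanics; the actual fingerprinting math isn't described here, and the threshold is invented:

```typescript
// Cosine similarity between two vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// If the best topic match clears a (hypothetical) threshold, the verdict
// resolves on the fast path with no LLM call; otherwise score a neutral 0.50.
function fastPathScore(rule: number[], topics: number[][], threshold = 0.8): number {
  const best = Math.max(0, ...topics.map((t) => cosine(rule, t)));
  return best >= threshold ? best : 0.5;
}
```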

Privacy note: 626 files scanned, all file IDs are opaque sequential integers. No source code strings, file paths, variable names, or comments appear in any payload. In this case, no LLM was even called.

What this means for anyone writing instruction files

Two clarifications before the takeaways. First, "96% can't be verified" means can't be verified through static analysis, not "is useless." Agent behavior configuration, project context, and workflow documentation all have value. They guide the agent even if no tool can confirm compliance after the fact. Second, the 4% that is verifiable still matters. Excalidraw's 9 extractable rules produced a 66.1% deterministic compliance score with specific failures at specific line numbers. Nine rules doesn't sound like much until three of them fail and you find the agent ignored your naming conventions across 626 files.

The real problem isn't that instruction files contain documentation. It's that most people don't know which of their lines are enforceable and which are suggestions the agent can silently drop. That ratio isn't fixed, either. People write unverifiable instructions because nobody's told them which phrasings produce checkable rules.

To write rules that can actually be checked:

- **Use imperative verbs with specific targets.** "Use camelCase for all function names" is verifiable. "Follow good naming conventions" isn't.
- **Specify the tool or pattern, not the principle.** "Prefer const over let" is a ratio check. "Write immutable code" is philosophy.
- **Include the file patterns your rules apply to.** "All .ts files must use named exports" scopes the check. "Use named exports" is vague.
- **Keep rules and documentation separate.** Rules are instructions. Documentation explains why. Mixing them dilutes both.
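Put together, an instruction file that follows these guidelines might look like this (a hypothetical snippet, not taken from any corpus file):

```markdown
## Rules
- Use camelCase for all function names
- All .ts files must use named exports
- Prefer const over let
- Never use `any` types

## Context (documentation, not rules)
This package builds the public API client. See docs/ for architecture notes.
```

Everything under "Rules" maps to a concrete check; everything under "Context" is for the agent and humans, and no tool pretends to verify it.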

RuleProbe on GitHub: parse your own instruction files and see what's actually verifiable

The tool

RuleProbe is the parser and verifier behind this analysis. It reads 7 instruction file formats, extracts machine-verifiable rules using 102 built-in matchers across 14 categories, and checks agent output against each one. Deterministic by default, no API keys needed for the core pipeline. Optional semantic analysis for pattern-matching and consistency rules.

```shell
npx ruleprobe parse CLAUDE.md --show-unparseable
npx ruleprobe verify CLAUDE.md ./src --format summary
```

The --show-unparseable flag shows you exactly which lines were skipped and why. That's often the most useful output: it tells you which of your "rules" aren't rules at all.

GitHub: moonrunnerkc / ruleprobe

Verify whether AI coding agents follow the instruction files they're given.

Why

Every AI coding agent reads an instruction file. None of them prove they followed it.

You write CLAUDE.md or AGENTS.md with specific rules: camelCase variables, no any types, named exports only, test files for every source file. The agent says "Done." But did it actually follow them? Your code review catches some violations, misses others, and doesn't scale.

RuleProbe reads the same instruction file, extracts the machine-verifiable rules, and checks agent output against each one. Compliance scores with file paths and line numbers as evidence. Deterministic and reproducible by default. Optional semantic analysis for pattern-matching and consistency rules that require codebase-aware judgment.

Quick Start

```shell
npm install -g ruleprobe
```

Or run it directly:

```shell
npx ruleprobe --help
```

Parse an instruction file to see what rules RuleProbe can extract:

```shell
ruleprobe parse CLAUDE.md
ruleprobe parse AGENTS.md --show-unparseable
```
