Two AI coding agents were given the same task with the same 10-rule instruction file. Both scored 70% adherence. Here's the breakdown:
| Rule | Agent A | Agent B |
|---|---|---|
| camelCase variables | PASS | FAIL |
| No any type | FAIL | PASS |
| No console.log | FAIL | PASS |
| Named exports only | PASS | FAIL |
| Max 300 lines | PASS | FAIL |
| Test files exist | FAIL | PASS |
Agent A had a type safety gap: it used any for request parameters even though it had defined the correct types in its own types.ts file. Agent B had a structural discipline gap: it used snake_case for a variable, added a default export (preferring Express convention over the project rules), and generated a 338-line file by adding features beyond the task scope.
Same score. Completely different engineering weaknesses. That table came from RuleProbe.
## About this case study
The comparison uses simulated agent outputs with deliberate violations, not live agent runs; this limitation is documented in the case study itself. Raw JSON reports are in the repo under docs/case-study-data/.
## What RuleProbe is
RuleProbe is an open source CLI that reads AI coding agent instruction files and verifies whether the agent's output followed the rules. It covers six formats: CLAUDE.md, AGENTS.md, .cursorrules, copilot-instructions.md, GEMINI.md, and .windsurfrules.
Verification is deterministic. No LLM in the pipeline. The same input produces the same report every time.
## How it checks
Three methods, depending on the rule:
AST analysis via ts-morph handles code structure. Variable and function naming (camelCase), type and interface naming (PascalCase), type annotations (any detection), export style (named vs default), JSDoc presence on public functions, and import patterns (path aliases, deep relative imports).
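To illustrate the AST approach, here is a minimal `any` detector. RuleProbe's matchers are built on ts-morph; this sketch uses the TypeScript compiler API that ts-morph wraps, so it runs with no extra dependencies. It is not RuleProbe's actual matcher code.

```typescript
import * as ts from "typescript";

// Minimal sketch of AST-based `any` detection. Walk the syntax tree
// and record the 1-based line of every explicit `any` annotation.
function findAnyAnnotations(code: string): number[] {
  const sf = ts.createSourceFile("snippet.ts", code, ts.ScriptTarget.Latest, true);
  const lines: number[] = [];
  const visit = (node: ts.Node): void => {
    if (node.kind === ts.SyntaxKind.AnyKeyword) {
      const { line } = sf.getLineAndCharacterOfPosition(node.getStart(sf));
      lines.push(line + 1); // convert 0-based position to 1-based line
    }
    ts.forEachChild(node, visit);
  };
  visit(sf);
  return lines;
}

console.log(findAnyAnnotations("function handle(req: any) { return req; }"));
// one violation, on line 1
```

Because this inspects the parsed tree rather than raw text, it won't false-positive on the word "any" in a comment or string, which is the point of using an AST over a regex here.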
Filesystem inspection handles file-level rules. File naming conventions (kebab-case) and whether test files exist for source files.
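A filesystem matcher can be as small as a basename test. Here is a hypothetical sketch of the kebab-case check; the regex is an assumption for illustration, not RuleProbe's actual pattern.

```typescript
// Hypothetical kebab-case filename check. The exact pattern RuleProbe
// uses may differ (e.g. in how it treats suffixes like .test.ts).
const KEBAB = /^[a-z0-9]+(?:-[a-z0-9]+)*(?:\.[a-z0-9]+)*\.ts$/;

function isKebabCase(basename: string): boolean {
  return KEBAB.test(basename);
}

console.log(isKebabCase("user-service.ts")); // true
console.log(isKebabCase("UserService.ts")); // false
```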
Regex handles content patterns like max line length.
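The regex path is equally small. A sketch of a max-line-length check, where the 100-character limit is an illustrative assumption:

```typescript
// Hypothetical max-line-length matcher: return the 1-based numbers of
// lines exceeding the limit. The limit of 100 is illustrative only.
function longLines(source: string, max = 100): number[] {
  const over = new RegExp(`^.{${max + 1},}$`);
  return source
    .split("\n")
    .flatMap((line, i) => (over.test(line) ? [i + 1] : []));
}

console.log(longLines("short\n" + "x".repeat(120)));
// line 2 exceeds the limit
```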
v0.1.0 has 15 matchers across those three methods, covering TypeScript and JavaScript. ts-morph is the AST engine, so other languages aren't supported.
Output looks like this:
```
RuleProbe Adherence Report
Rules: 14 total | 11 passed | 3 failed | Score: 79%

PASS naming/naming-camelcase-variables-5
PASS naming/naming-pascalcase-types-7
FAIL forbidden-pattern/forbidden-no-any-type-1
  src/handler.ts:12 - found: req: any
  src/handler.ts:24 - found: data: any
FAIL forbidden-pattern/forbidden-no-console-log-10
  src/handler.ts:18 - found: console.log("handling request")
FAIL test-requirement/test-files-exist-11
  src/handler.ts - found: no test file found
```
File, line, violation. No ambiguity.
## The conservative parser
This is a design choice worth explaining. When RuleProbe reads an instruction file, it only extracts rules it can map to a deterministic mechanical check. Everything else gets reported as unparseable.
```sh
ruleprobe parse CLAUDE.md --show-unparseable
```
"Write clean code" is unparseable. "Use the repository pattern" is unparseable. "Handle errors gracefully" is unparseable. These can't be verified without judgment, and judgment means variance between runs. RuleProbe doesn't do that.
The tradeoff: a 30-rule instruction file might produce 12 verified rules and 18 unparseable ones. You see both counts so you know exactly what's being checked and what isn't.
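The extraction step can be pictured as a lookup against a table of known matcher patterns, where anything that misses is left unparseable. A simplified sketch of that idea; the patterns and matcher IDs here are made up, not RuleProbe's real tables:

```typescript
// Hypothetical sketch of conservative rule extraction: a rule line
// either maps to a known deterministic matcher or stays unparseable.
const MATCHERS: Array<[RegExp, string]> = [
  [/\bno\s+any\b/i, "forbidden-no-any-type"],
  [/\bcamelCase\b/, "naming-camelcase-variables"],
  [/\bnamed exports?\b/i, "export-named-only"],
];

function classify(rule: string): { matcher: string | null } {
  const hit = MATCHERS.find(([re]) => re.test(rule));
  return { matcher: hit ? hit[1] : null };
}

console.log(classify("- TypeScript strict mode, no any types").matcher);
// "forbidden-no-any-type"
console.log(classify("- Write clean code").matcher); // null (unparseable)
```

The design choice falls out of the data structure: no fuzzy scoring, no LLM judgment, just a deterministic hit-or-miss lookup.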
## Running it

```sh
npx ruleprobe --help
```

Parse an instruction file:

```sh
ruleprobe parse CLAUDE.md
```

```
Extracted 14 rules:

forbidden-no-any-type-1
  Category: forbidden-pattern
  Verifier: ast
  Pattern: no-any (*.ts)
  Source: "- TypeScript strict mode, no any types"

naming-kebab-case-files-4
  Category: naming
  Verifier: filesystem
  Pattern: kebab-case (*.ts)
  Source: "- File names: kebab-case"
```
### Verify agent output

```sh
ruleprobe verify CLAUDE.md ./agent-output --format text
```

Supports --format json, --format markdown, and --format rdjson (reviewdog-compatible). Exit code 0 means all rules passed; 1 means violations were found.
### Compare two agents

```sh
ruleprobe compare AGENTS.md ./claude-output ./copilot-output \
  --agents claude,copilot --format markdown
```
### CI with the GitHub Action

```yaml
name: RuleProbe
on: [pull_request]
jobs:
  check-rules:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: moonrunnerkc/ruleprobe@v0.1.0
        with:
          instruction-file: AGENTS.md
          output-dir: src
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
No external API keys. Posts results as a PR comment. Supports reviewdog rdjson for inline annotations if you use reviewdog in your pipeline. Exposes score, passed, failed, and total as step outputs, so you can gate merges on adherence thresholds in downstream steps.
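One way to use those step outputs is a downstream gate on the score. This is a hypothetical fragment: the step `id`, the 90% threshold, and the shell comparison (which assumes `score` is a bare integer) are all assumptions, not documented behavior.

```yaml
steps:
  - id: ruleprobe
    uses: moonrunnerkc/ruleprobe@v0.1.0
    with:
      instruction-file: AGENTS.md
  - name: Gate on adherence threshold
    # Hypothetical gate: fail the job if adherence drops below 90%
    run: |
      [ "${{ steps.ruleprobe.outputs.score }}" -ge 90 ] || exit 1
```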
### All action inputs

| Input | Default | What it does |
|---|---|---|
| `instruction-file` | (required) | Path to your instruction file |
| `output-dir` | `src` | Directory of code to verify |
| `agent` | `ci` | Agent label for report metadata |
| `model` | `unknown` | Model label for report metadata |
| `format` | `text` | `text`, `json`, or `markdown` |
| `severity` | `all` | `error`, `warning`, or `all` |
| `fail-on-violation` | `true` | Fail the check if any rule is violated |
| `post-comment` | `true` | Post results as a PR comment |
| `reviewdog-format` | `false` | Also output rdjson |
## Programmatic API

Five functions if you want to integrate verification into your own tooling:

```ts
import {
  parseInstructionFile,
  verifyOutput,
  generateReport,
  formatReport,
  extractRules
} from 'ruleprobe';
```

`parseInstructionFile` reads the instruction file. `verifyOutput` runs the rules. `generateReport` builds the adherence report with summary stats. `formatReport` renders it as text, JSON, markdown, or rdjson. `extractRules` works on raw markdown content if you don't have a file path.
## What it doesn't cover
15 matchers is a starting point, not full coverage. Real instruction files have rules RuleProbe can't verify yet: architectural patterns, error handling conventions, dependency constraints, API design rules. The parser will tell you what it skipped.
TypeScript and JavaScript only. ts-morph is the AST engine. Other languages would need a different parser.
No automated agent invocation. You run the agent separately and point RuleProbe at the output directory.
## Security and dependencies
RuleProbe never executes scanned code, never makes network calls, never writes to the scanned directory. Paths are resolved and bounded to process.cwd(). Symlinks outside the project are skipped by default.
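The path-bounding claim is simple to implement. A minimal sketch of the idea, not RuleProbe's actual code, assuming POSIX paths:

```typescript
import * as path from "node:path";

// Minimal sketch of path bounding: resolve a candidate path against
// the root and reject anything that escapes it (e.g. via "..").
function isInsideRoot(root: string, candidate: string): boolean {
  const rel = path.relative(root, path.resolve(root, candidate));
  // rel !== "" also excludes the root directory itself
  return rel !== "" && !rel.startsWith("..") && !path.isAbsolute(rel);
}

console.log(isInsideRoot("/project", "src/handler.ts")); // true
console.log(isInsideRoot("/project", "../../etc/passwd")); // false
```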
Four runtime dependencies: chalk 5.6.2, commander 12.1.0, glob 11.1.0, ts-morph 24.0.0. All pinned to exact versions. No semver ranges.
npm: ruleprobe | MIT license