<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Lukas Metzler</title>
    <description>The latest articles on DEV Community by Lukas Metzler (@lukasmetzler).</description>
    <link>https://dev.to/lukasmetzler</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3860119%2F81a968b9-b69c-4cc0-b993-6fbf40b13593.png</url>
      <title>DEV Community: Lukas Metzler</title>
      <link>https://dev.to/lukasmetzler</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/lukasmetzler"/>
    <language>en</language>
    <item>
      <title>You test your code. Why aren’t you testing your AI instructions?</title>
      <dc:creator>Lukas Metzler</dc:creator>
      <pubDate>Fri, 03 Apr 2026 21:38:59 +0000</pubDate>
      <link>https://dev.to/lukasmetzler/you-test-your-code-why-arent-you-testing-your-ai-instructions-4j2p</link>
      <guid>https://dev.to/lukasmetzler/you-test-your-code-why-arent-you-testing-your-ai-instructions-4j2p</guid>
      <description>&lt;h2&gt;
  
  
  You test your code. Why aren't you testing your AI instructions?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why instruction quality matters more than model choice, and a tool to measure it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every team using AI coding tools writes instruction files. &lt;code&gt;CLAUDE.md&lt;/code&gt; for Claude Code, &lt;code&gt;AGENTS.md&lt;/code&gt; for Codex, &lt;code&gt;copilot-instructions.md&lt;/code&gt; for GitHub Copilot, &lt;code&gt;.cursorrules&lt;/code&gt; for Cursor. You spend time crafting these files, change a paragraph, push it, and hope for the best.&lt;/p&gt;

&lt;p&gt;Your codebase has tests. Your APIs have contracts. Your AI instructions have hope.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/lukasmetzler/agenteval" rel="noopener noreferrer"&gt;&lt;strong&gt;agenteval&lt;/strong&gt;&lt;/a&gt; to fix that.&lt;/p&gt;

&lt;h2&gt;
  
  
  The variable nobody is testing
&lt;/h2&gt;

&lt;p&gt;A recent study tested three agent frameworks running the same model on 731 coding problems. Same model. Same tasks. The only difference was the instruction scaffolding.&lt;/p&gt;

&lt;p&gt;The spread was 17 points.&lt;/p&gt;

&lt;p&gt;We obsess over which model to use. Sonnet vs Opus vs GPT-5.4. But the instructions you give the model have a bigger effect on the outcome than the model itself. And nobody tests them.&lt;/p&gt;

&lt;p&gt;Think about that. You wouldn't deploy an API without tests. You wouldn't ship a feature without CI. But the file that controls how your AI writes code? You edit it in a text editor and hope.&lt;/p&gt;

&lt;h2&gt;
  
  
  What goes wrong in instruction files
&lt;/h2&gt;

&lt;p&gt;I've scanned a lot of instruction files at this point. The same problems show up everywhere.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dead references
&lt;/h3&gt;

&lt;p&gt;You renamed &lt;code&gt;src/auth.ts&lt;/code&gt; to &lt;code&gt;src/authentication.ts&lt;/code&gt; six months ago. Your instruction file still says "see src/auth.ts for the authentication module." The AI reads that instruction, looks for a file that doesn't exist, and gets confused.&lt;/p&gt;

&lt;p&gt;This is the most common issue. In my scans, almost every instruction file more than three months old has had at least one dead reference.&lt;/p&gt;
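&lt;p&gt;The check itself is conceptually simple. Here's a stripped-down sketch in Python; the regex and the helper name are mine for illustration, not agenteval's internals:&lt;/p&gt;

```python
import os
import re

# Matches path-like tokens such as src/auth.ts or docs/schema.md.
# Illustrative only; a real linter would also resolve relative paths.
PATH_RE = re.compile(r"\b[\w./-]+/[\w.-]+\.\w+\b")

def dead_references(instruction_text, repo_root="."):
    """Return referenced paths that do not exist under repo_root."""
    missing = []
    for match in PATH_RE.finditer(instruction_text):
        path = os.path.join(repo_root, match.group(0))
        if not os.path.exists(path):
            missing.append(match.group(0))
    return missing

print(dead_references("See src/auth.ts for the authentication module."))
```

&lt;p&gt;Because the check is pure string matching plus a filesystem lookup, it stays deterministic and fast, which is what makes it viable as a pre-commit or CI step.&lt;/p&gt;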

&lt;h3&gt;
  
  
  Filler that eats your context budget
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;"Make sure to always thoroughly test everything and ensure comprehensive coverage of all edge cases in a robust manner."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That sentence burns 25 tokens and says nothing the model doesn't already know. With a 200K context window and a 30% instruction budget, you have about 60,000 tokens. Every token spent on "make sure to" is a token not available for actual code context.&lt;/p&gt;

&lt;p&gt;The worst offenders: &lt;em&gt;"it is important that", "in order to", "at the end of the day", "make sure to", "please ensure that"&lt;/em&gt;. They're everywhere.&lt;/p&gt;
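&lt;p&gt;Catching these doesn't need an LLM either. A rough sketch of the idea follows; the phrase list and the 4-characters-per-token estimate are illustrative, not agenteval's actual heuristics:&lt;/p&gt;

```python
# Illustrative filler-phrase scan. The phrase list is a sample,
# not agenteval's real rule set.
FILLER_PHRASES = [
    "make sure to",
    "it is important that",
    "in order to",
    "at the end of the day",
    "please ensure that",
]

def filler_report(text):
    """Count filler phrases and crudely estimate wasted tokens
    (roughly 1 token per 4 characters of English text)."""
    lowered = text.lower()
    hits = {p: lowered.count(p) for p in FILLER_PHRASES if p in lowered}
    wasted_chars = sum(len(p) * n for p, n in hits.items())
    return hits, wasted_chars // 4

hits, tokens = filler_report(
    "Make sure to always thoroughly test everything. In order to ship, "
    "please ensure that coverage is comprehensive."
)
print(hits, tokens)
```

&lt;p&gt;Multiply that waste across every section of every instruction file and the context budget erodes quickly.&lt;/p&gt;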

&lt;h3&gt;
  
  
  Contradictions
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;"Always use semicolons"&lt;/em&gt; in your code style section. &lt;em&gt;"Follow the Prettier config"&lt;/em&gt; three sections later, where Prettier removes semicolons. The model gets conflicting instructions and picks one at random.&lt;/p&gt;

&lt;p&gt;It happens more than you'd think, especially in files maintained by multiple people over months.&lt;/p&gt;
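&lt;p&gt;The semicolon clash above is mechanically detectable. Here's a toy check that assumes a JSON &lt;code&gt;.prettierrc&lt;/code&gt;; it's a sketch of the idea, not agenteval's rule engine:&lt;/p&gt;

```python
import json

def semicolon_conflict(instruction_text, prettierrc_json):
    """Flag the specific clash described above: an 'always use
    semicolons' rule alongside a Prettier config with semi set to
    false. Illustrative only."""
    config = json.loads(prettierrc_json)
    wants_semicolons = "use semicolons" in instruction_text.lower()
    prettier_strips = config.get("semi") is False
    return wants_semicolons and prettier_strips

print(semicolon_conflict(
    "Always use semicolons. Follow the Prettier config.",
    '{"semi": false, "tabWidth": 2}',
))  # prints: True
```

&lt;p&gt;Generalizing this to arbitrary rule pairs is harder, but cross-checking instructions against the configs they point at catches a surprising share of the real conflicts.&lt;/p&gt;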

&lt;h3&gt;
  
  
  Context budget overruns
&lt;/h3&gt;

&lt;p&gt;Your &lt;code&gt;CLAUDE.md&lt;/code&gt; is 300 lines. Your &lt;code&gt;AGENTS.md&lt;/code&gt; is 200 lines. Your &lt;code&gt;copilot-instructions.md&lt;/code&gt; is 150 lines. Together they consume 40% of your model's context window before a single line of code is loaded.&lt;/p&gt;

&lt;p&gt;The AI's performance degrades uniformly as instruction count increases. It's not that later instructions get ignored. All instructions get followed less precisely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Overlap between files
&lt;/h3&gt;

&lt;p&gt;Your &lt;code&gt;CLAUDE.md&lt;/code&gt; says "use TypeScript strict mode, tabs for indentation." Your &lt;code&gt;AGENTS.md&lt;/code&gt; says the same thing. That's duplicated instructions consuming double the tokens for zero additional value. Worse, when you update one copy and forget the other, they drift apart and contradict each other.&lt;/p&gt;
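&lt;p&gt;Verbatim overlap is the easy case to detect. A minimal, line-exact sketch standing in for this kind of check:&lt;/p&gt;

```python
def overlapping_rules(file_a_text, file_b_text):
    """Return non-empty, non-heading lines that appear verbatim in
    both instruction files. Illustrative stand-in for an overlap check."""
    def rules(text):
        stripped = (line.strip() for line in text.splitlines())
        return set(line for line in stripped
                   if line and not line.startswith("#"))
    return sorted(rules(file_a_text).intersection(rules(file_b_text)))

claude_md = "# Style\nUse TypeScript strict mode.\nTabs for indentation.\n"
agents_md = "# Rules\nUse TypeScript strict mode.\nPrefer small PRs.\n"
print(overlapping_rules(claude_md, agents_md))
# prints: ['Use TypeScript strict mode.']
```

&lt;p&gt;Exact matching misses paraphrased duplicates, but it's enough to surface the copy-pasted blocks that drift apart later.&lt;/p&gt;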

&lt;h2&gt;
  
  
  What agenteval does about it
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;agenteval&lt;/strong&gt; is a CLI. You install it, run one command, and see what's wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;https://raw.githubusercontent.com/lukasmetzler/agenteval/main/install.sh]&lt;span class="o"&gt;(&lt;/span&gt;https://raw.githubusercontent.com/lukasmetzler/agenteval/main/install.sh&lt;span class="o"&gt;)&lt;/span&gt; | bash
agenteval lint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It reads your instruction files, parses the markdown, counts tokens, checks file references, and reports real problems with actionable suggestions. No LLM in the loop. Deterministic. Runs in under a second.&lt;/p&gt;

&lt;p&gt;Here's what it found on my own project the first time I ran it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLAUDE.md
  ERROR  Referenced file "docs/schema.md" does not exist
         → Remove the reference or create the missing file
  info   Section "Testing" contains 1 filler phrase(s)
         → Rewrite without phrases like 'make sure to'
  info   Vague instructions: "be careful with error handling"
         → Replace with a specific example or threshold
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every issue has a suggestion. You don't need to figure out what to do about it.&lt;/p&gt;

&lt;p&gt;It supports every major instruction format:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CLAUDE.md&lt;/code&gt; (Claude Code)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AGENTS.md&lt;/code&gt; (OpenAI Codex)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.github/copilot-instructions.md&lt;/code&gt; and scoped &lt;code&gt;.instructions.md&lt;/code&gt; files (GitHub Copilot)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.cursorrules&lt;/code&gt; (Cursor)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.claude/skills/*/SKILL.md&lt;/code&gt; (Anthropic skills)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Beyond linting: measuring instruction quality over time
&lt;/h2&gt;

&lt;p&gt;The linter catches problems statically. But what if you want to know whether your instruction changes actually made the AI perform better?&lt;/p&gt;

&lt;p&gt;agenteval has a deeper pipeline for that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Harvest&lt;/strong&gt; scans your git history for AI-assisted commits. It detects 14 tools (Claude, Copilot, Cursor, Devin, Aider, Amazon Q, Gemini, and more) and generates replayable benchmark tasks from them. Each task includes a snapshot of what your instruction files looked like at that commit. No synthetic test cases needed. Your own git history is the benchmark.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run&lt;/strong&gt; gives a task to an AI agent in an isolated git worktree, captures what it produces, and scores the result. Four dimensions: did it change the right files (correctness), did it only change what needed changing (precision), how many tokens did it use (efficiency), and did it follow your project's style rules (conventions).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compare&lt;/strong&gt; puts two runs side by side. Change your instruction files, re-run the same tasks, see if the scores improved. If both runs have instruction snapshots, it shows exactly what changed in your instructions alongside the score delta.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI&lt;/strong&gt; runs all your tasks and fails the build if instruction quality regresses. Add one line to your GitHub Actions workflow and instruction quality becomes a merge gate, just like tests:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agenteval ci&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If someone changes the instructions in a PR and quality drops below the threshold, the build fails.&lt;/p&gt;
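&lt;p&gt;In context, a fuller workflow fragment might look like this. Everything other than the &lt;code&gt;agenteval ci&lt;/code&gt; line is standard GitHub Actions boilerplate I'm assuming; adapt it to your setup:&lt;/p&gt;

```yaml
# Illustrative workflow; only the `agenteval ci` step comes from
# the tool itself. Job and workflow names are placeholders.
name: instruction-quality
on: [pull_request]
jobs:
  agenteval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: curl -fsSL https://raw.githubusercontent.com/lukasmetzler/agenteval/main/install.sh | bash
      - run: agenteval ci
```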

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live review&lt;/strong&gt; scores your working tree changes before you commit. Are your changes focused or scattered? Did you update tests? Any debug artifacts left in? Add &lt;code&gt;--analyze&lt;/code&gt; and it sends the diff to your AI tool for convention compliance scoring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trends&lt;/strong&gt; tracks scores over time. Is your team getting better at writing instructions this quarter? Which tasks are improving? Which are regressing?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The uncomfortable question
&lt;/h2&gt;

&lt;p&gt;How much time has your team spent debating which AI model to use? Now how much time have you spent testing whether your instruction files actually work?&lt;/p&gt;

&lt;p&gt;The model is a commodity. The instructions are your competitive advantage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Try it
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;https://raw.githubusercontent.com/lukasmetzler/agenteval/main/install.sh]&lt;span class="o"&gt;(&lt;/span&gt;https://raw.githubusercontent.com/lukasmetzler/agenteval/main/install.sh&lt;span class="o"&gt;)&lt;/span&gt; | bash
agenteval lint
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One command. Standalone binary. Works on Linux and macOS. Point it at any project with instruction files and see what it finds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/lukasmetzler/agenteval" rel="noopener noreferrer"&gt;https://github.com/lukasmetzler/agenteval&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you try it, I'd love to hear what it catches and what checks are missing.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
