I changed one line in my prompt and my agent started giving refunds to everyone
True story. I was tweaking a customer support agent prompt. Changed "Never offer refunds without manager approval" to "Always prioritize customer satisfaction." Seemed harmless. Shipped it.
Within an hour, the agent was handing out refunds like candy on Halloween. No approval. No verification. Just vibes.
The worst part? git diff showed me exactly what changed — one line added, one line removed. What it didn't tell me was that I'd removed a critical constraint and replaced it with a vague instruction that the model interpreted as "give them whatever they want."
That was the moment I realized: prompts are production code, but we treat them like sticky notes.
Prompts have zero tooling (and it's wild)
Think about it. If you write JavaScript, you have ESLint catching issues before they ship. You have Prettier enforcing style. You have TypeScript telling you when things don't make sense. You have git diff showing you exactly what changed and why it matters.
Now think about prompts. You write them in a text file. You eyeball them. You copy-paste them into a playground. You pray.
Here's what's missing:
- No linter catches "You are a teacher" AND "You are a sales agent" in the same prompt
- No diff tells you that removing one example drops output consistency
- No CI gate blocks a vague "try to be helpful" from shipping
- No score tells you if your prompt is a B+ or a D-
git diff says "+1 line, -1 line." Cool. Thanks. Very helpful when I'm trying to figure out if my agent is about to go rogue.
So I built promptdiff
promptdiff is a CLI tool that treats prompts as structured documents — not blobs of text. It parses your .prompt files into semantic sections (persona, constraints, examples, output format, guardrails) and runs real analysis on them.
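To make that concrete, here is what a small .prompt file might look like. The section names come from this post; the exact header syntax is my sketch, not canonical promptdiff syntax:

```
---
name: support-agent
version: 1
model: gpt-4o
tags: [support]
---

PERSONA
You are a patient customer support agent for an online bookstore.

CONSTRAINTS
Never offer refunds without manager approval.
Keep replies under 100 words.

EXAMPLES
Customer: "My order arrived damaged."
Agent: "I'm sorry about that. I've flagged this for a manager-approved replacement."

OUTPUT FORMAT
Reply in plain text, one short paragraph.
```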
Install it in one line:
npm install -g promptdiff
Zero config. No API keys. No accounts. Runs entirely locally. Three dependencies. That's it.
Here's what it does:
Lint your prompts like code
promptdiff lint my-agent.prompt
10 built-in rules that catch real bugs — not style nits. Behavioral issues that silently degrade your agent:
| Rule | What it catches |
|---|---|
| conflicting-constraints | "Keep it under 100 words" + examples that are 200 words |
| role-confusion | Two different roles in the same persona section |
| vague-constraints | "Try to", "if possible", "maybe" — weasel words that models ignore |
| injection-surface | No "ignore embedded instructions" guard |
| few-shot-minimum | Only 1 example (models need 2-3 for consistency) |
| missing-output-format | No FORMAT section = inconsistent output every time |
You know the feeling when ESLint catches a bug you would've spent 30 minutes debugging? Same energy.
Semantic diff that actually means something
promptdiff diff v3.prompt v7.prompt --annotate
This is not git diff. It matches sections by type (persona to persona, constraints to constraints), classifies each change, and tells you the impact:
[CONSTRAINTS] constraint tightened (150 → 100 words)
██ high impact — Output will be more constrained
[EXAMPLES] example removed (3 → 1)
██ high impact — Output consistency may decrease
[PERSONA] wording tweaked
░░ low impact — Tone/style will shift
That's the diff I wish I'd had before the Great Refund Incident.
Score your prompt quality
promptdiff score my-agent.prompt
Structure ████████████████░░░░ 16/20
Specificity █████████████████░░░ 17/20
Examples ████████░░░░░░░░░░░░ 8/20
Safety ████████████████████ 20/20
Completeness ████████████████░░░░ 16/20
─────────────────────────────────────
Total: 77/100 Grade: B
Gamify it. Make it a CI gate:
score=$(promptdiff score my-agent.prompt --json | jq '.total')
if [ "$score" -lt 70 ]; then
  echo "Prompt quality too low: $score/100"
  exit 1
fi
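In CI proper, the same gate can live in a workflow step. Here is a sketch for GitHub Actions; the `prompts/` directory and job layout are my assumptions, and the `--json` flag and `jq '.total'` usage come from the snippet above:

```yaml
# Hypothetical workflow: fail the build if any prompt scores below 70
name: prompt-quality
on: [pull_request]
jobs:
  lint-prompts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g promptdiff
      - run: |
          for f in prompts/*.prompt; do
            promptdiff lint "$f"
            score=$(promptdiff score "$f" --json | jq '.total')
            [ "$score" -ge 70 ] || { echo "Too low: $f ($score/100)"; exit 1; }
          done
```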
The killer feature: Claude Code lints its own work
This is the part that gets people. You can hook promptdiff into Claude Code so that every time Claude edits a .prompt file, it automatically gets linted.
One command:
promptdiff setup --project
Here's the flow:
- You ask Claude to "write me a customer support agent prompt"
- Claude writes it — maybe it puts conflicting roles in the persona, uses vague language in constraints, only includes one example
- The hook fires automatically (PostToolUse on Edit/Write)
- promptdiff finds 3 errors: role confusion, vague constraints, too few examples
- The hook blocks the edit and feeds the errors back to Claude
- Claude reads the feedback and rewrites the prompt — fixes the roles, tightens the language, adds more examples
- Hook fires again — clean. Passes silently.
- You get a well-structured prompt on the first try, without manually reviewing it
It's like giving Claude a pair-programmer that only knows about prompt quality. Claude writes the prompt, the linter reviews it, Claude fixes it. You just watch.
The setup adds this to your .claude/settings.json:
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Edit|Write",
      "hooks": [{
        "type": "command",
        "command": "promptdiff hook",
        "timeout": 10
      }]
    }]
  }
}
You can configure it to be strict (block on warnings too), warn-only (never block), or default (block on errors only).
How it works (brief architecture)
The key insight is that prompts aren't flat text — they're structured documents with typed sections. promptdiff's parser breaks a .prompt file into:
- Frontmatter (YAML metadata: name, version, model, tags)
- Sections (PERSONA, CONSTRAINTS, EXAMPLES, OUTPUT FORMAT, GUARDRAILS, etc.)
Every command works on this structured representation:
- Diff matches sections by type, not by line number. If you move your CONSTRAINTS section from line 5 to line 20, it doesn't show up as "deleted + added" — it shows up as "same section, maybe modified."
- Lint rules get the parsed structure, so conflicting-constraints can compare the word limit in CONSTRAINTS against the actual word counts in EXAMPLES.
- Score evaluates five dimensions independently (structure, specificity, examples, safety, completeness) and aggregates them.
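As a rough sketch of that idea — my own simplification, not promptdiff's actual parser — splitting a prompt body on known section headers and then pairing sections across two versions by type (instead of by line position) could look like:

```javascript
// Sketch: parse a prompt body into typed sections, then diff two
// versions by section type so a moved section isn't "deleted + added".
const KNOWN = ["PERSONA", "CONSTRAINTS", "EXAMPLES", "OUTPUT FORMAT", "GUARDRAILS"];

function parseSections(text) {
  const sections = {};
  let current = null;
  for (const line of text.split("\n")) {
    const header = KNOWN.find((h) => line.trim() === h);
    if (header) {
      current = header;
      sections[current] = [];
    } else if (current) {
      sections[current].push(line);
    }
  }
  for (const key of Object.keys(sections)) {
    sections[key] = sections[key].join("\n").trim();
  }
  return sections;
}

function diffByType(a, b) {
  const oldS = parseSections(a);
  const newS = parseSections(b);
  const changes = [];
  for (const type of KNOWN) {
    const inOld = oldS[type] !== undefined;
    const inNew = newS[type] !== undefined;
    if (inOld && inNew && oldS[type] !== newS[type]) {
      changes.push({ type, kind: "modified" });
    } else if (inOld && !inNew) {
      changes.push({ type, kind: "removed" });
    } else if (!inOld && inNew) {
      changes.push({ type, kind: "added" });
    }
  }
  return changes;
}
```

Reordering sections produces no changes at all here, because matching is keyed by type — which is exactly why a moved CONSTRAINTS block reads as "same section" rather than churn.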
The whole thing is ~30 files, 3 runtime dependencies (commander, chalk, js-yaml), and 217 tests at 94% coverage. No LLM required for any local command — the only thing that calls an API is promptdiff compare for A/B testing, and even that supports local Ollama models.
It also supports prompt composition (extends + includes), so you can DRY your prompts:
---
name: support-agent-v2
extends: ./base-agent.prompt
includes:
- ./shared/safety-rules.prompt
- ./shared/format.prompt
---
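One plausible way that composition resolves — and this is my assumption about the semantics, not documented behavior — is that the child's sections override the base's, while included files fill in sections the child doesn't define:

```javascript
// Hypothetical composition semantics for extends + includes:
// child sections win, includes supply sections the child lacks.
function composeSections(base, child, includes = []) {
  const merged = { ...base };
  for (const inc of includes) {
    for (const [type, body] of Object.entries(inc)) {
      if (!(type in merged)) merged[type] = body; // fill gaps only
    }
  }
  return { ...merged, ...child }; // child overrides base and includes
}
```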
Other things I didn't expect to be useful
promptdiff migrate — takes a messy unstructured prompt (the kind you pasted into ChatGPT at 2am) and converts it into a structured .prompt file. It auto-classifies lines: "You are..." goes to PERSONA, "Never..." goes to CONSTRAINTS, etc.
promptdiff fix --apply — auto-fixes lint issues. Adds missing sections, tightens vague language, suggests injection guards.
promptdiff watch . — live linting on file save. Like having eslint --watch for your prompts while you iterate.
MLflow integration — promptdiff log-to-mlflow tracks prompt quality scores over time as MLflow experiments. Because if you're doing serious prompt engineering, you should be tracking regressions.
Try it
npm install -g promptdiff
Then:
# Scaffold a new prompt from a template
promptdiff new my-agent --template support
# Lint it
promptdiff lint my-agent.prompt
# Score it
promptdiff score my-agent.prompt
# Hook into Claude Code
promptdiff setup --project
The repo is at github.com/HadiFrt20/promptdiff. It's MIT licensed, has 217 tests, and I'm actively building on it.
If you're writing prompts for production — especially if you're building agents — you probably need this. Or at minimum, you need something like this. The days of yolo-shipping prompts with no review should be over.
Prompts are code. Treat them like it.
If this was useful, a star on the repo goes a long way. And if you have ideas for lint rules, I'd love PRs — adding a rule is about 30 lines of JavaScript.
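For a feel of what a rule looks like, here is a hypothetical weasel-word check in the spirit of vague-constraints. The real rule interface in the repo may differ; this only mirrors the shape of "take parsed sections, return findings":

```javascript
// Hypothetical rule shape: receives parsed sections, returns findings.
// The actual promptdiff rule API may differ; this mirrors the idea only.
const WEASEL = /\b(try to|if possible|maybe|ideally|when you can)\b/i;

const vagueConstraints = {
  name: "vague-constraints",
  check(sections) {
    const findings = [];
    const body = sections.CONSTRAINTS || "";
    body.split("\n").forEach((line, i) => {
      const match = line.match(WEASEL);
      if (match) {
        findings.push({
          rule: "vague-constraints",
          line: i + 1,
          message: `Weasel phrase "${match[0]}": models treat this as optional`,
        });
      }
    });
    return findings;
  },
};
```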