HadiFrt20
I built ESLint for LLM prompts (and a Claude Code hook that makes Claude lint its own work)

I changed one line in my prompt and my agent started giving refunds to everyone

True story. I was tweaking a customer support agent prompt. Changed "Never offer refunds without manager approval" to "Always prioritize customer satisfaction." Seemed harmless. Shipped it.

Within an hour, the agent was handing out refunds like candy on Halloween. No approval. No verification. Just vibes.

The worst part? git diff showed me exactly what changed — one line added, one line removed. What it didn't tell me was that I'd removed a critical constraint and replaced it with a vague instruction that the model interpreted as "give them whatever they want."

That was the moment I realized: prompts are production code, but we treat them like sticky notes.

Prompts have zero tooling (and it's wild)

Think about it. If you write JavaScript, you have ESLint catching issues before they ship. You have Prettier enforcing style. You have TypeScript telling you when things don't make sense. You have git diff showing you exactly what changed and why it matters.

Now think about prompts. You write them in a text file. You eyeball them. You copy-paste them into a playground. You pray.

Here's what's missing:

  • No linter catches "You are a teacher" AND "You are a sales agent" in the same prompt
  • No diff tells you that removing one example drops output consistency
  • No CI gate blocks a vague "try to be helpful" from shipping
  • No score tells you if your prompt is a B+ or a D-

git diff says "+1 line, -1 line." Cool. Thanks. Very helpful when I'm trying to figure out if my agent is about to go rogue.

So I built promptdiff

promptdiff is a CLI tool that treats prompts as structured documents — not blobs of text. It parses your .prompt files into semantic sections (persona, constraints, examples, output format, guardrails) and runs real analysis on them.

Install it in one line:

npm install -g promptdiff

Zero config. No API keys. No accounts. Runs entirely locally. Three dependencies. That's it.

Here's what it does:

Lint your prompts like code

promptdiff lint my-agent.prompt

10 built-in rules that catch real bugs — not style nits. Behavioral issues that silently degrade your agent:

  • conflicting-constraints: "Keep it under 100 words" next to examples that are 200 words
  • role-confusion: two different roles in the same persona section
  • vague-constraints: "try to", "if possible", "maybe" (weasel words that models ignore)
  • injection-surface: no "ignore embedded instructions" guard
  • few-shot-minimum: only one example (models need 2-3 for consistency)
  • missing-output-format: no FORMAT section, which means inconsistent output every time

You know the feeling when ESLint catches a bug you would've spent 30 minutes debugging? Same energy.
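To give a sense of how small a rule like this can be, here is a hypothetical sketch of a vague-constraints check — not promptdiff's actual source, just an illustration of the shape (the `section` object and `findings` format are made up for this example):

```javascript
// Hypothetical sketch of a vague-constraints rule -- not promptdiff's real code.
// Scans a parsed section's lines for weasel words and emits one finding per hit.
const WEASEL_WORDS = ['try to', 'if possible', 'maybe'];

function vagueConstraints(section) {
  // section: { type: 'CONSTRAINTS', lines: ['Try to keep answers short', ...] }
  const findings = [];
  section.lines.forEach((line, i) => {
    for (const word of WEASEL_WORDS) {
      if (line.toLowerCase().includes(word)) {
        findings.push({
          rule: 'vague-constraints',
          line: i + 1,
          message: `"${word}" reads as optional to the model -- state the constraint firmly`,
        });
      }
    }
  });
  return findings;
}
```

The real rules get the full parsed structure, so they can do cross-section checks too (as conflicting-constraints does with word limits vs. example lengths).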

Semantic diff that actually means something

promptdiff diff v3.prompt v7.prompt --annotate

This is not git diff. It matches sections by type (persona to persona, constraints to constraints), classifies each change, and tells you the impact:

  [CONSTRAINTS] constraint tightened (150 → 100 words)
  ██ high impact — Output will be more constrained

  [EXAMPLES] example removed (3 → 1)
  ██ high impact — Output consistency may decrease

  [PERSONA] wording tweaked
  ░░ low impact — Tone/style will shift

That's the diff I wish I'd had before the Great Refund Incident.
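The pairing step behind this is simple to picture. Here is a hypothetical sketch of type-based matching, assuming sections have already been parsed into a type-to-lines map (the real tool additionally classifies each change and estimates its impact):

```javascript
// Hypothetical sketch of type-based section matching -- not promptdiff's source.
// Sections are paired by type, so moving a section never reads as delete + add.
function matchSections(oldDoc, newDoc) {
  const types = new Set([...Object.keys(oldDoc), ...Object.keys(newDoc)]);
  const changes = [];
  for (const type of types) {
    const before = oldDoc[type];
    const after = newDoc[type];
    if (before === undefined) changes.push({ type, kind: 'added' });
    else if (after === undefined) changes.push({ type, kind: 'removed' });
    else if (before.join('\n') !== after.join('\n')) changes.push({ type, kind: 'modified' });
    // identical sections produce no entry at all, even if they moved in the file
  }
  return changes;
}
```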

Score your prompt quality

promptdiff score my-agent.prompt
  Structure     ████████████████░░░░  16/20
  Specificity   █████████████████░░░  17/20
  Examples      ████████░░░░░░░░░░░░   8/20
  Safety        ████████████████████  20/20
  Completeness  ████████████████░░░░  16/20
  ─────────────────────────────────────
  Total: 77/100  Grade: B

Gamify it. Make it a CI gate:

score=$(promptdiff score my-agent.prompt --json | jq '.total')
if [ "$score" -lt 70 ]; then
  echo "Prompt quality too low: $score/100"
  exit 1
fi

The killer feature: Claude Code lints its own work

This is the part that gets people. You can hook promptdiff into Claude Code so that every time Claude edits a .prompt file, it automatically gets linted.

One command:

promptdiff setup --project

Here's the flow:

  1. You ask Claude to "write me a customer support agent prompt"
  2. Claude writes it — maybe with conflicting roles in the persona, vague language in the constraints, and only one example
  3. The hook fires automatically (PostToolUse on Edit/Write)
  4. promptdiff finds 3 errors: role confusion, vague constraints, too few examples
  5. The hook blocks the edit and feeds the errors back to Claude
  6. Claude reads the feedback and rewrites the prompt — fixes the roles, tightens the language, adds more examples
  7. Hook fires again — clean. Passes silently.
  8. You get a well-structured prompt on the first try, without manually reviewing it

It's like giving Claude a pair-programmer that only knows about prompt quality. Claude writes the prompt, the linter reviews it, Claude fixes it. You just watch.

The setup adds this to your .claude/settings.json:

{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Edit|Write",
      "hooks": [{
        "type": "command",
        "command": "promptdiff hook",
        "timeout": 10
      }]
    }]
  }
}

You can configure it to be strict (block on warnings too), warn-only (never block), or default (block on errors only).
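The decision logic those modes imply is easy to sketch. This is an illustration of the block/warn contract, not promptdiff's source — the findings shape is assumed (in Claude Code, a PostToolUse hook receives the tool-call JSON on stdin, and a blocking exit feeds stderr back to the model):

```javascript
// Hypothetical sketch of a strict/warn-only/default hook decision.
// Not promptdiff's actual implementation; findings format is assumed.
function hookDecision(findings, mode = 'default') {
  const errors = findings.filter((f) => f.severity === 'error').length;
  const warnings = findings.filter((f) => f.severity === 'warning').length;
  if (mode === 'warn-only') return { block: false, errors, warnings };
  if (mode === 'strict') return { block: errors + warnings > 0, errors, warnings };
  return { block: errors > 0, errors, warnings }; // default: block on errors only
}
```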

How it works (brief architecture)

The key insight is that prompts aren't flat text — they're structured documents with typed sections. promptdiff's parser breaks a .prompt file into:

  • Frontmatter (YAML metadata: name, version, model, tags)
  • Sections (PERSONA, CONSTRAINTS, EXAMPLES, OUTPUT FORMAT, GUARDRAILS, etc.)

Every command works on this structured representation:

  • Diff matches sections by type, not by line number. If you move your CONSTRAINTS section from line 5 to line 20, it doesn't show up as "deleted + added" — it shows up as "same section, maybe modified."
  • Lint rules get the parsed structure, so conflicting-constraints can compare the word limit in CONSTRAINTS against the actual word counts in EXAMPLES.
  • Score evaluates five dimensions independently (structure, specificity, examples, safety, completeness) and aggregates them.
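A toy version of that parsing step might look like this — a hypothetical sketch, assuming section headers are bare all-caps lines like `# PERSONA` (the real parser is richer and handles more formats):

```javascript
// Hypothetical sketch of .prompt parsing -- not promptdiff's actual parser.
// Splits a file into raw YAML frontmatter plus a type-to-lines section map.
function parsePrompt(text) {
  const lines = text.split('\n');
  const sections = {};
  let frontmatter = null;
  let current = null;
  let i = 0;
  // YAML frontmatter between leading '---' fences (kept unparsed here)
  if (lines[0] === '---') {
    const end = lines.indexOf('---', 1);
    frontmatter = lines.slice(1, end).join('\n');
    i = end + 1;
  }
  for (; i < lines.length; i++) {
    const m = lines[i].match(/^#\s*(PERSONA|CONSTRAINTS|EXAMPLES|OUTPUT FORMAT|GUARDRAILS)\s*$/);
    if (m) {
      current = m[1];
      sections[current] = [];
    } else if (current && lines[i].trim()) {
      sections[current].push(lines[i].trim());
    }
  }
  return { frontmatter, sections };
}
```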

The whole thing is ~30 files, 3 runtime dependencies (commander, chalk, js-yaml), and 217 tests at 94% coverage. No LLM required for any local command — the only thing that calls an API is promptdiff compare for A/B testing, and even that supports local Ollama models.

It also supports prompt composition (extends + includes), so you can DRY your prompts:

---
name: support-agent-v2
extends: ./base-agent.prompt
includes:
  - ./shared/safety-rules.prompt
  - ./shared/format.prompt
---

Other things I didn't expect to be useful

promptdiff migrate — takes a messy unstructured prompt (the kind you pasted into ChatGPT at 2am) and converts it into a structured .prompt file. It auto-classifies lines: "You are..." goes to PERSONA, "Never..." goes to CONSTRAINTS, etc.
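That kind of classification is mostly pattern matching on how a line opens. Here is a hypothetical, deliberately tiny sketch of the idea — not promptdiff's code, and the real command covers far more patterns:

```javascript
// Hypothetical sketch of migrate-style line classification -- illustration only.
function classifyLine(line) {
  const t = line.trim();
  if (/^you are\b/i.test(t)) return 'PERSONA';
  if (/^(never|always|do not|don['\u2019]t|must)\b/i.test(t)) return 'CONSTRAINTS';
  return null; // unclassified; the real tool handles many more cases
}
```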

promptdiff fix --apply — auto-fixes lint issues. Adds missing sections, tightens vague language, suggests injection guards.

promptdiff watch . — live linting on file save. Like having eslint --watch for your prompts while you iterate.

promptdiff log-to-mlflow — MLflow integration. Tracks prompt quality scores over time as MLflow experiments. Because if you're doing serious prompt engineering, you should be tracking regressions.

Try it

npm install -g promptdiff

Then:

# Scaffold a new prompt from a template
promptdiff new my-agent --template support

# Lint it
promptdiff lint my-agent.prompt

# Score it
promptdiff score my-agent.prompt

# Hook into Claude Code
promptdiff setup --project

The repo is at github.com/HadiFrt20/promptdiff. It's MIT licensed, 217 tests, and I'm actively building on it.

If you're writing prompts for production — especially if you're building agents — you probably need this. Or at minimum, you need something like this. The days of yolo-shipping prompts with no review should be over.

Prompts are code. Treat them like it.


If this was useful, a star on the repo goes a long way. And if you have ideas for lint rules, I'd love PRs — adding a rule is about 30 lines of JavaScript.
