Andrew Cappelli
I kept breaking my AI features by accident, so I built unit tests for prompts

We test our code. Why don’t we test our AI?

When I started shipping LLM-powered features, I ran into a problem nobody warned me about: prompt drift.

I'd make a small tweak to a prompt, or swap to a newer model, and the outputs would silently change.

  • Wrong format
  • Different tone
  • Missing information

And I’d only find out a week later from a user report.

The fix in traditional software is obvious: unit tests.

But nobody had done the simple version for prompts.

So I built it.


Introducing prompt-ci

```shell
pip install prompt-ci
```

It works in three commands:

```shell
prompt-ci init      # create a config file
prompt-ci record    # run your prompts, save outputs as golden files
prompt-ci check     # compare current outputs to golden, fail if they drift
```

What it actually does

You define your prompts and test inputs in a YAML config:

```yaml
provider: anthropic
model: claude-haiku-4-5-20251001
threshold: 0.80

tests:
  - name: summarize_bullets
    prompt: "Summarize in exactly 3 bullet points: {{input}}"
    input: "Your article text here..."

  - name: sentiment_check
    prompt: "Reply with one word -> positive, negative, or neutral:"
    input: "I love this product!"
    threshold: 0.95
```

Run:

```shell
prompt-ci record
```

This saves outputs to .golden/ as JSON.

Commit that directory and it becomes your locked expected behavior.
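For illustration, a golden file might look something like this. The field names here are my guesses for the sake of the example, not prompt-ci's exact schema:

```json
{
  "name": "summarize_bullets",
  "prompt": "Summarize in exactly 3 bullet points: {{input}}",
  "input": "Your article text here...",
  "output": "- Point one\n- Point two\n- Point three",
  "model": "claude-haiku-4-5-20251001"
}
```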


Catch regressions in CI

On every PR:

```shell
prompt-ci check
```

This re-runs the prompts and scores how similar the new outputs are to the golden files.

If the score drops below your threshold:

  • ❌ Exit code 1
  • ❌ CI fails
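The pass/fail logic can be sketched in a few lines of Python. The tuple shape for results is hypothetical, not prompt-ci's real data model:

```python
def check(results, default_threshold=0.80):
    # `results` is a list of (name, score, per_test_threshold) tuples,
    # where per_test_threshold may be None to fall back to the default.
    exit_code = 0
    for name, score, threshold in results:
        limit = threshold if threshold is not None else default_threshold
        if score < limit:
            print(f"FAIL {name}  score={score:.2f}  threshold={limit:.2f}")
            exit_code = 1  # any failing test fails the whole run
    return exit_code  # caller hands this to sys.exit for CI
```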

Semantic similarity, not string matching

This is the important part.

Exact string matching would be useless: LLM outputs vary naturally, so a pure string diff misses the point entirely.

Instead, prompt-ci uses LLM-as-a-judge:

It sends both outputs to your model and asks it to rate semantic equivalence on a 0.0–1.0 scale.
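A minimal sketch of the idea, not prompt-ci's actual internals: build a judging prompt, send it through any model callable, and parse the reply into a clamped score.

```python
# Hypothetical judge prompt; prompt-ci's real wording may differ.
JUDGE_PROMPT = """Rate the semantic equivalence of these two texts \
on a scale from 0.0 to 1.0. Reply with only the number.

Text A:
{expected}

Text B:
{actual}
"""

def judge_score(expected: str, actual: str, ask_model) -> float:
    # ask_model is any callable that takes a prompt string and returns
    # the model's reply as a string (e.g. a thin API client wrapper).
    reply = ask_model(JUDGE_PROMPT.format(expected=expected, actual=actual))
    score = float(reply.strip())
    return max(0.0, min(1.0, score))  # clamp to the documented 0.0-1.0 range
```

Clamping matters because models occasionally reply with values outside the requested range.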

This catches real regressions:

```
FAIL summarize_bullets  score=0.61  threshold=0.80

Expected:
- Revenue grew 23% YoY
- Margins expanded to 18%
- Guidance raised for Q4

Actual:
Revenue increased significantly year over year,
with notable margin improvements...
```

Same facts, completely different format.

That’s a regression.


GitHub Actions in one step

```yaml
- name: Prompt regression tests
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: prompt-ci check
```
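For context, here's how that step might sit in a complete workflow file. The trigger, job name, and checkout/install steps are my choices, not something prompt-ci prescribes:

```yaml
name: prompt-tests
on: [pull_request]

jobs:
  prompts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install prompt-ci
      - name: Prompt regression tests
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: prompt-ci check
```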

Try it without an API key

Use:

```yaml
provider: mock
```

This enables a local dry run; no API key is needed.

It uses token overlap scoring instead.
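Token overlap is simple to picture. Here's one plausible version, Jaccard similarity over word sets; prompt-ci's exact formula may differ:

```python
def token_overlap(expected: str, actual: str) -> float:
    # Jaccard similarity: |intersection| / |union| of lowercased word sets.
    a = set(expected.lower().split())
    b = set(actual.lower().split())
    if not a and not b:
        return 1.0  # two empty outputs count as identical
    return len(a & b) / len(a | b)
```

It's cruder than a judge model (word order and paraphrase are invisible to it), but it's free, fast, and deterministic, which is exactly what you want for a dry run.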


Get it

```shell
pip install prompt-ci
```

GitHub:

https://github.com/Andrew-most-likely/prompt-ci


Final thought

If you're shipping anything with LLMs, you need this.

Curious what prompt testing workflows others are using: drop them in the comments.
