Andrew Cappelli
I kept breaking my AI features by accident, so I built unit tests for prompts

We test our code. Why don’t we test our AI?

When I started shipping LLM-powered features, I ran into a problem nobody warned me about: prompt drift.

I'd make a small tweak to a prompt, or swap to a newer model, and the outputs would silently change.

  • Wrong format
  • Different tone
  • Missing information

And I’d only find out a week later from a user report.

The fix in traditional software is obvious: unit tests.

But nobody had done the simple version for prompts.

So I built it.


Introducing prompt-ci

```shell
pip install prompt-ci
```

It works in three commands:

```shell
prompt-ci init      # create a config file
prompt-ci record    # run your prompts, save outputs as golden files
prompt-ci check     # compare current outputs to golden, fail if they drift
```

What it actually does

You define your prompts and test inputs in a YAML config:

```yaml
provider: anthropic
model: claude-haiku-4-5-20251001
threshold: 0.80

tests:
  - name: summarize_bullets
    prompt: "Summarize in exactly 3 bullet points: {{input}}"
    input: "Your article text here..."

  - name: sentiment_check
    prompt: "Reply with one word -> positive, negative, or neutral:"
    input: "I love this product!"
    threshold: 0.95
```

Run:

```shell
prompt-ci record
```

This saves outputs to .golden/ as JSON.

Commit that directory and it becomes your locked expected behavior.
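For illustration, a golden file might look something like this. The field names here are my guesses for the sake of the example, not prompt-ci's exact schema:

```json
{
  "name": "summarize_bullets",
  "prompt": "Summarize in exactly 3 bullet points: {{input}}",
  "input": "Your article text here...",
  "output": "- Point one\n- Point two\n- Point three",
  "model": "claude-haiku-4-5-20251001"
}
```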


Catch regressions in CI

On every PR:

```shell
prompt-ci check
```

This re-runs the prompts and scores how similar the new outputs are to the golden files.

If the score drops below your threshold:

  • ❌ Exit code 1
  • ❌ CI fails
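The pass/fail logic can be sketched in a few lines of Python. The tuple shape for results is hypothetical, not prompt-ci's real data model:

```python
def check(results, default_threshold=0.80):
    # `results` is a list of (name, score, per_test_threshold) tuples,
    # where per_test_threshold may be None to fall back to the default.
    exit_code = 0
    for name, score, threshold in results:
        limit = threshold if threshold is not None else default_threshold
        if score < limit:
            print(f"FAIL {name}  score={score:.2f}  threshold={limit:.2f}")
            exit_code = 1  # any failing test fails the whole run
    return exit_code  # caller hands this to sys.exit for CI
```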

Semantic similarity, not string matching

This is the important part.

Exact string matching would be useless: LLM outputs vary naturally, so a pure string diff misses the point entirely.

Instead, prompt-ci uses LLM-as-a-judge:

It sends both outputs to your model and asks it to rate semantic equivalence on a 0.0–1.0 scale.
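A minimal sketch of the idea, not prompt-ci's actual internals: build a judging prompt, send it through any model callable, and parse the reply into a clamped score.

```python
# Hypothetical judge prompt; prompt-ci's real wording may differ.
JUDGE_PROMPT = """Rate the semantic equivalence of these two texts \
on a scale from 0.0 to 1.0. Reply with only the number.

Text A:
{expected}

Text B:
{actual}
"""

def judge_score(expected: str, actual: str, ask_model) -> float:
    # ask_model is any callable that takes a prompt string and returns
    # the model's reply as a string (e.g. a thin API client wrapper).
    reply = ask_model(JUDGE_PROMPT.format(expected=expected, actual=actual))
    score = float(reply.strip())
    return max(0.0, min(1.0, score))  # clamp to the documented 0.0-1.0 range
```

Clamping matters because models occasionally reply with values outside the requested range.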

This catches real regressions:

```
FAIL summarize_bullets  score=0.61  threshold=0.80

Expected:
- Revenue grew 23% YoY
- Margins expanded to 18%
- Guidance raised for Q4

Actual:
Revenue increased significantly year over year,
with notable margin improvements...
```

Same facts, completely different format.

That’s a regression.


GitHub Actions in one step

```yaml
- name: Prompt regression tests
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: prompt-ci check
```
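For context, here's how that step might sit in a complete workflow file. The trigger, job name, and checkout/install steps are my choices, not something prompt-ci prescribes:

```yaml
name: prompt-tests
on: [pull_request]

jobs:
  prompts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install prompt-ci
      - name: Prompt regression tests
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: prompt-ci check
```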

Try it without an API key

Use:

```yaml
provider: mock
```

This enables a local dry run; no API key is needed.

It uses token overlap scoring instead.
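Token overlap is simple to picture. Here's one plausible version, Jaccard similarity over word sets; prompt-ci's exact formula may differ:

```python
def token_overlap(expected: str, actual: str) -> float:
    # Jaccard similarity: |intersection| / |union| of lowercased word sets.
    a = set(expected.lower().split())
    b = set(actual.lower().split())
    if not a and not b:
        return 1.0  # two empty outputs count as identical
    return len(a & b) / len(a | b)
```

It's cruder than a judge model (word order and paraphrase are invisible to it), but it's free, fast, and deterministic, which is exactly what you want for a dry run.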


Get it

```shell
pip install prompt-ci
```

GitHub:

https://github.com/Andrew-most-likely/prompt-ci


Final thought

If you're shipping anything with LLMs, you need this.

Curious what prompt testing workflows others are using: drop them in the comments.
