We test our code. Why don’t we test our AI?
When I started shipping LLM-powered features, I ran into a problem nobody warned me about: prompt drift.
I'd make a small tweak to a prompt, or swap to a newer model, and the outputs would silently change.
- Wrong format
- Different tone
- Missing information
And I’d only find out a week later from a user report.
The fix in traditional software is obvious: unit tests.
But nobody had done the simple version for prompts.
So I built it.
Introducing prompt-ci
pip install prompt-ci
It works in three commands:
prompt-ci init # create a config file
prompt-ci record # run your prompts, save outputs as golden files
prompt-ci check # compare current outputs to golden, fail if they drift
What it actually does
You define your prompts and test inputs in a YAML config:
provider: anthropic
model: claude-haiku-4-5-20251001
threshold: 0.80

tests:
  - name: summarize_bullets
    prompt: "Summarize in exactly 3 bullet points: {{input}}"
    input: "Your article text here..."
  - name: sentiment_check
    prompt: "Reply with one word -> positive, negative, or neutral:"
    input: "I love this product!"
    threshold: 0.95
Run:
prompt-ci record
This saves outputs to .golden/ as JSON.
Commit that directory -> it becomes your locked expected behavior.
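The post doesn't show the golden-file schema, but conceptually each entry just needs enough to replay the test and compare outputs. Something like this (field names are illustrative, not the tool's actual format):

```json
{
  "name": "sentiment_check",
  "model": "claude-haiku-4-5-20251001",
  "prompt": "Reply with one word -> positive, negative, or neutral:",
  "input": "I love this product!",
  "output": "positive"
}
```

Because it's plain JSON in your repo, a `git diff` on `.golden/` also doubles as a readable changelog of expected behavior.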
Catch regressions in CI
On every PR:
prompt-ci check
This re-runs the prompts and scores how similar the new outputs are to the golden files.
If the score drops below your threshold:
- ❌ Exit code 1
- ❌ CI fails
Semantic similarity, not string matching
This is the important part.
Exact string matching would be useless: LLM outputs vary naturally from run to run, so a pure string diff misses the point entirely.
Instead, prompt-ci uses LLM-as-a-judge:
It sends both outputs to your model and asks it to rate semantic equivalence on a 0.0–1.0 scale.
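Conceptually the judge step boils down to two pieces: a prompt that asks the model for a single number, and a parser that pulls the score out of the reply. A minimal Python sketch of that idea (function names and prompt wording are mine, not prompt-ci's actual internals):

```python
import re

def build_judge_prompt(expected: str, actual: str) -> str:
    """Ask a model to rate semantic equivalence on a 0.0-1.0 scale.

    Illustrative wording only; the real judge prompt may differ.
    """
    return (
        "Rate how semantically equivalent these two responses are, "
        "from 0.0 (unrelated) to 1.0 (same meaning and format). "
        "Reply with only the number.\n\n"
        f"Response A:\n{expected}\n\nResponse B:\n{actual}"
    )

def parse_score(reply: str) -> float:
    """Extract the first number from the judge's reply, clamped to [0, 1]."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return min(max(float(match.group()), 0.0), 1.0)
```

The clamp and the "reply with only the number" instruction matter in practice: judge models occasionally pad their answer with prose, and a strict parser would turn that into a spurious CI failure.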
This catches real regressions:
FAIL summarize_bullets score=0.61 threshold=0.80
Expected:
- Revenue grew 23% YoY
- Margins expanded to 18%
- Guidance raised for Q4
Actual:
Revenue increased significantly year over year,
with notable margin improvements...
Same facts, completely different format.
That’s a regression.
GitHub Actions in one step
- name: Prompt regression tests
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: prompt-ci check
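Dropped into a complete workflow file, that step might look like this (the workflow name, trigger, checkout, and install steps are my additions, not from the project docs):

```yaml
name: prompt-ci
on: pull_request

jobs:
  prompts:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install prompt-ci
      - name: Prompt regression tests
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: prompt-ci check
```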
Try it without an API key
Use:
provider: mock
This runs everything locally with no API key, scoring with token overlap instead of an LLM judge.
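Token overlap presumably means something like Jaccard similarity over word sets. A rough sketch of that idea, not necessarily prompt-ci's exact formula:

```python
def token_overlap_score(expected: str, actual: str) -> float:
    """Jaccard similarity over lowercase word sets: a crude stand-in
    for semantic scoring, but free and deterministic for dry runs."""
    a = set(expected.lower().split())
    b = set(actual.lower().split())
    if not a and not b:
        return 1.0  # two empty outputs count as identical
    return len(a & b) / len(a | b)
```

It's much weaker than an LLM judge (word order and meaning are ignored), but it's enough to smoke-test your config and CI wiring before spending API credits.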
Get it
pip install prompt-ci
GitHub:
https://github.com/Andrew-most-likely/prompt-ci
Final thought
If you're shipping anything with LLMs, you need this.
Curious what prompt testing workflows others are using? Drop them in the comments.