If you use prompts as part of your daily workflow (writing, coding, support, specs, code review), you’ve probably felt this: you change one line to “make it clearer”… and suddenly the output quality drops.
That’s prompt drift. It’s the same class of problem as a refactor that breaks behavior — except prompts usually don’t have tests.
This post is a practical, lightweight way to add regression tests to prompts so you can iterate quickly without losing the behavior you liked.
## What “prompt regression” looks like in real life
A few common drift patterns:
- You add more constraints (“be concise”, “use bullet points”) and the model starts skipping edge cases.
- You reorder instructions, and the tone becomes blunt or the structure changes.
- You add examples, and suddenly the output overfits the example.
- You “generalize” the prompt, and it stops handling the one tricky input that mattered.
In software we treat this as a solved problem: write tests for the behavior you care about. You can do the same for prompts.
## The core idea: Golden prompts + fixtures
A prompt regression test is:
1) a prompt you care about,
2) a set of representative inputs (fixtures), and
3) a checklist of expected properties.
You run the prompt against the fixtures whenever you change it.
Think of it like snapshot tests — but you don’t have to freeze every word. You can test for:
- required sections present
- format validity (JSON / Markdown headings)
- banned phrases not present
- key facts included
- action items listed
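For instance, for a prompt that must return JSON, three of these property checks fit in one small function. A sketch, where the field names and the banned phrase are placeholders for your own:

```javascript
// Property checks: assert on structure, not exact wording.
// Returns a list of problems; an empty list means the output passes.
function checkJsonOutput(output) {
  const problems = [];

  // Format validity: the whole response must parse as JSON.
  let parsed;
  try {
    parsed = JSON.parse(output);
  } catch {
    problems.push("output is not valid JSON");
  }

  // Required sections present (hypothetical field names).
  for (const key of ["summary", "risks", "action_items"]) {
    if (parsed && !(key in parsed)) problems.push(`missing field: ${key}`);
  }

  // Banned phrases not present.
  if (/as an ai language model/i.test(output)) {
    problems.push("contains banned phrase");
  }

  return problems;
}
```

Because the function reports *which* property broke, a failing run tells you what your edit changed instead of just that something changed.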
## Start with a tiny fixture set
Don’t over-engineer. Pick 5–10 fixtures that cover:
- the “happy path”
- the most common user input
- the nastiest edge case
- a short input and a long input
- one ambiguous input
If you only do one thing: capture the inputs that historically caused trouble.
## A concrete example: a code review prompt
Let’s say you have a prompt that reviews PRs and produces a consistent review.
You want the output to:
- start with a one-paragraph summary
- list risks (security, performance, correctness)
- propose 3–7 concrete action items
- include at least one “nice-to-have” suggestion
Here’s a simplified prompt skeleton:
```text
You are reviewing a pull request.

Input:
- PR title
- PR description
- Diff

Output format:
1) Summary (1 paragraph)
2) Risks (bullet list, group by category)
3) Action items (3-7 bullets, each actionable)
4) Nice-to-haves (1-3 bullets)

Rules:
- Do not invent files that are not in the diff.
- If information is missing, say what you would ask.
```
Now we need fixtures. Save them in a repo alongside your prompt:
```text
prompts/
  pr-review.prompt.md
fixtures/
  pr-review/
    01-small-change.md
    02-auth-edge-case.md
    03-massive-diff.md
```
Each fixture file can include the full input you’ll paste into the model.
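For example, the auth fixture might look like this (contents purely illustrative):

```markdown
<!-- fixtures/pr-review/02-auth-edge-case.md -->
PR title: Fix session handling in login flow
PR description: Extends session lifetime and removes the old lockout check.
Diff:
  src/auth/session.ts: SESSION_TTL raised from 15m to 24h
  src/auth/login.ts: failed-attempt counter removed
```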
## What to test: properties, not prose
If you snapshot the entire output verbatim, you’ll fight constant diffs. Instead, test properties.
Here’s a practical checklist that catches 80% of breakages:
1) Structure checks
- Contains headings: `Summary`, `Risks`, `Action items`, `Nice-to-haves`
- `Action items` has 3–7 bullets
- `Risks` contains at least 2 categories
2) Safety + hallucination checks
- Output does not mention files not present in diff
- Output does not claim code compiles/tests pass unless the fixture says so
3) Content checks (fixture-specific)
For your auth edge-case fixture, you might require:
- Mentions session expiration
- Mentions rate limiting OR brute force protection
- Mentions logging/auditing
These are tiny, targeted assertions.
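None of these assertions need a framework. Here is a minimal sketch of the three helpers the runner below relies on (`assertHas`, `assertHasAny`, `assertBulletCount`); the section parsing assumes the `1) Heading` output style from the prompt skeleton:

```javascript
// Minimal assertion helpers for prompt outputs.
// Throwing on failure is enough to fail a CI job.
function assertHas(output, heading) {
  if (!output.includes(heading)) {
    throw new Error(`missing section: ${heading}`);
  }
}

function assertHasAny(output, phrases) {
  if (!phrases.some((p) => output.includes(p))) {
    throw new Error(`expected one of: ${phrases.join(", ")}`);
  }
}

function assertBulletCount(output, section, { min, max }) {
  // Count bullets between this section heading and the next heading.
  // Assumes the section name appears first on its own heading line.
  const lines = output.split("\n");
  const start = lines.findIndex((l) => l.includes(section));
  if (start === -1) throw new Error(`missing section: ${section}`);
  let count = 0;
  for (const line of lines.slice(start + 1)) {
    if (/^\s*[-*•]/.test(line)) count += 1;
    else if (/^\s*(\d+\)|#)/.test(line)) break; // next section begins
  }
  if (count < min || count > max) {
    throw new Error(`${section}: ${count} bullets, expected ${min}-${max}`);
  }
}
```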
## A simple “prompt test runner” (no fancy eval framework required)
You can run this with a script and keep it in CI. Pseudocode:
```javascript
// prompt-test.js — run as ESM ("type": "module" in package.json)
// so top-level await works.
import fs from "node:fs";

const prompt = fs.readFileSync("prompts/pr-review.prompt.md", "utf8");
const fixtures = fs.readdirSync("fixtures/pr-review");

for (const f of fixtures) {
  const input = fs.readFileSync(`fixtures/pr-review/${f}`, "utf8");
  const output = await callModel({ prompt, input }); // your API wrapper

  // Structure checks that apply to every fixture
  assertHas(output, "Summary");
  assertHas(output, "Risks");
  assertHas(output, "Action items");
  assertBulletCount(output, "Action items", { min: 3, max: 7 });

  // Fixture-specific checks
  if (f.includes("auth")) {
    assertHasAny(output.toLowerCase(), ["rate limit", "brute force"]);
  }

  console.log(`✅ ${f}`);
}
```
Two important notes:
- Your `callModel` should use fixed settings (same model, same temperature) for comparability.
- Run tests a few times if you use temperature > 0. The goal is “reliably good,” not “occasionally perfect.”
## How to handle non-determinism
Prompts aren’t deterministic the way unit tests are. You have three options:
### Option A: Set temperature to 0 for tests

Great for formatting-heavy prompts (JSON, outlines, structured reviews).

### Option B: Run each fixture N times

If you keep some creativity (temperature 0.3–0.7), run the same fixture 3–5 times and require that at least 4 of 5 pass.

### Option C: Score instead of pass/fail

Give each output a score (0–10) based on your checklist and set a minimum average.

I usually start with A, then move to B if the prompt needs creativity.
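Option B needs only a thin wrapper around whatever single-run check you already have. A sketch, where `runOnce` stands in for “call the model, then run the assertions”:

```javascript
// Run a flaky (temperature > 0) prompt test N times and require a
// minimum pass rate instead of all-or-nothing.
async function passRate(runOnce, { runs = 5, required = 4 } = {}) {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    try {
      await runOnce(); // call the model and run your assertions
      passes += 1;
    } catch {
      // A single failure is fine; we care about the aggregate.
    }
  }
  if (passes < required) {
    throw new Error(`only ${passes}/${runs} runs passed (need ${required})`);
  }
  return passes;
}
```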
## The most useful workflow: “prompt diffs” in Git
Treat your prompt like code:
- Put it in a repo.
- Make small commits.
- Keep a changelog in the prompt header.
- Review diffs like you would any refactor.
Add a header block at the top:
```html
<!--
Version: 0.4
Last change: tighten action items; add missing-info rule
Known risks: may over-summarize on large diffs
-->
```
This sounds boring — and that’s the point. Boring means reliable.
## A quick template you can copy
If you want a starting point, here’s a minimal structure:
```markdown
# Purpose
What this prompt does in one sentence.

# Input
What you provide (and in what format).

# Output
Exact structure you want.

# Rules
- Things it must do
- Things it must never do

# Examples
1-2 examples that cover edge cases.
```
Then add:
- `fixtures/` (realistic inputs)
- `tests/` (a tiny script or checklist)
## Wrap-up
Prompt iteration is inevitable. Quality regressions don’t have to be.
If you capture a handful of fixtures and test for structure + a few key properties, you’ll get:
- faster prompt refactors
- fewer “why did it get worse?” moments
- confidence when you change models or settings
And the best part: once you have fixtures, they become your library of “what good looks like.”
If you try this, I’d love to hear what fixture surprised you the most.