If you’ve ever had an AI-assisted workflow that used to work and then suddenly started producing garbage, you’ve met the most annoying kind of bug: the prompt regression.
You didn’t change the code. (Or you did… but it was “just a wording tweak”.)
Now your agent:
- stops following your repo conventions
- hallucinates file paths
- forgets to include tests
- writes paragraphs instead of patches
In normal software, we solve this with regression tests.
So let’s do the obvious thing: treat your prompt like code and test it.
This post describes a practical pattern I use: a Prompt Regression Test that catches “small” prompt edits before they break your workflow.
## What is a prompt regression?
A regression is any change that unintentionally breaks previously correct behavior.
With prompts, regressions usually come from:
- Accidental ambiguity (you removed a constraint that mattered)
- Conflicting instructions (“be concise” + “explain your reasoning”)
- Hidden dependencies (the model relied on your example format)
- Tooling drift (your agent’s environment or tools changed)
The fix isn’t “never edit prompts”. The fix is make edits safe.
## The idea: a golden prompt + golden output
A minimal regression test has three parts:
- A prompt (the thing you’re editing)
- A fixture input (some representative context)
- An expected output (a “golden” snapshot)
Then you run your model in CI and compare the output.
If the output changes, you either:
- reject the prompt change (it broke behavior)
- or update the golden output intentionally (you meant to change behavior)
That’s it. You’ve created a “unit test” for a prompt.
## A concrete example: code changes should come as a patch
Let’s say you want an assistant that proposes changes in a diff-first format.
### 1) The prompt (versioned)

Create a file like `prompts/refactor_v1.md`:

````text
You are a careful senior engineer.

Task: refactor the provided code to improve readability without changing behavior.

Rules:
- Output MUST be a unified diff.
- Only change files that are mentioned.
- If you need clarification, ask questions *before* producing a diff.
- Include a short test plan after the diff.

Output format:
1) ```diff
   ...
   ```
2) ```text
   Test plan: ...
   ```
````
### 2) The fixture input

Fixtures are the “context bundle” you’ll keep stable.

Store one in `fixtures/refactor_small.json`:

```json
{
  "file": "src/date.ts",
  "content": "export function formatDate(d: Date){return d.toISOString().slice(0,10)}\n"
}
```
### 3) The golden output

Run your model once, review it manually, and save it as `goldens/refactor_small.diff`.

The golden output doesn’t have to be perfect. It has to be acceptable and stable.
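For the date-formatting fixture above, a golden might look something like this (illustrative; the real golden is whatever output you reviewed and accepted):

```diff
--- a/src/date.ts
+++ b/src/date.ts
@@ -1 +1,4 @@
-export function formatDate(d: Date){return d.toISOString().slice(0,10)}
+export function formatDate(d: Date): string {
+  // Format as YYYY-MM-DD (UTC).
+  return d.toISOString().slice(0, 10);
+}
```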
## A tiny test runner (Node.js)
Here’s a small runner that executes the prompt against the fixture and compares output.
```javascript
// prompt-test.js — run with `node prompt-test.js`
// (the `import` syntax requires ESM: set "type": "module" in package.json, or use .mjs)
import fs from "node:fs";
import crypto from "node:crypto";

// Normalize line endings and strip trailing whitespace so cosmetic drift
// doesn't fail the test.
function normalize(s) {
  return s
    .replace(/\r\n/g, "\n")
    .replace(/[ \t]+\n/g, "\n")
    .trim() + "\n";
}

const prompt = fs.readFileSync("prompts/refactor_v1.md", "utf8");
const fixture = JSON.parse(fs.readFileSync("fixtures/refactor_small.json", "utf8"));
const expected = fs.readFileSync("goldens/refactor_small.diff", "utf8");

// Replace this with your actual model call.
async function callModel({ prompt, fixture }) {
  // return await openai.responses.create(...)
  throw new Error("Implement callModel() for your provider");
}

const actual = await callModel({ prompt, fixture });

const a = normalize(actual);
const e = normalize(expected);

if (a !== e) {
  const hash = (x) => crypto.createHash("sha256").update(x).digest("hex").slice(0, 12);
  console.error("Prompt regression detected!");
  console.error("expected:", hash(e));
  console.error("actual:  ", hash(a));
  process.exit(1);
}

console.log("Prompt test passed");
```
That’s intentionally boring. Boring is good. It means you can run it everywhere.
## But models are stochastic — won’t this fail constantly?
Yes, if you let outputs vary.
To make prompt regression tests usable, you need to stabilize the run:
- Set `temperature` to `0` (or as low as your provider allows)
- Pin the model version in CI (don’t silently upgrade)
- Normalize whitespace (as above)
- Keep fixtures small and representative
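Concretely, the stubbed `callModel()` from the runner might be filled in like this. This is a sketch assuming an OpenAI-style chat SDK; the pinned model string, the `seed` parameter, and passing the client in as an argument (to keep the sketch self-contained) are all assumptions to adapt:

```javascript
// A sketch of callModel() for an OpenAI-style chat API.
// Assumptions: the SDK client is passed in, the model string is an exact
// pinned version, and the provider supports temperature/seed.
async function callModel({ prompt, fixture }, client) {
  const res = await client.chat.completions.create({
    model: "gpt-4o-2024-08-06", // pin an exact version, never "latest"
    temperature: 0,             // minimize sampling variance
    seed: 42,                   // if supported, further reduces drift
    messages: [
      { role: "system", content: prompt },
      { role: "user", content: JSON.stringify(fixture) },
    ],
  });
  return res.choices[0].message.content;
}
```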
If you still get drift, switch from “exact match” to a contract check.
## Contract checks (more robust)
Instead of matching the whole output, assert key properties:
- output contains exactly one fenced `diff` block
- the diff touches only allowed files
- includes a `Test plan:` section
- no prose outside the fenced blocks
These checks catch regressions that matter without being brittle.
## The checklist I actually use
When I add prompt tests to a repo, I start with this:
- One golden test per workflow (refactor, bugfix, PR summary, etc.)
- One “edge case” fixture (missing file, ambiguous task, multiple files)
- A contract test that validates structure (diff-only, JSON-only, etc.)
- Run it in CI on every PR
If a prompt edit breaks the tests, that’s a gift: you caught it before it hit production.
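For the “run it in CI on every PR” step, a minimal GitHub Actions workflow might look like this (a sketch; the workflow name, Node version, and secret name are assumptions, and the API key only matters once `callModel()` talks to a real provider):

```yaml
# .github/workflows/prompt-tests.yml
name: prompt-tests
on: pull_request

jobs:
  prompt-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Fails the PR if the golden or contract checks break.
      - run: node prompt-test.js
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```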
## A quick naming convention
Prompts are code. Treat them like it.
- `prompts/<workflow>_v1.md`
- `fixtures/<workflow>_<case>.json`
- `goldens/<workflow>_<case>.<ext>`
Then you can evolve intentionally:
- keep `..._v1` stable
- create `..._v2` when you want a behavioral change
- migrate fixtures gradually
## Where this pays off
Prompt regressions are subtle because they feel “non-deterministic”.
But the workflow you’re building is deterministic in the ways that matter:
- output shape
- file boundaries
- conventions
- safety constraints
Testing those constraints is the difference between a fun prototype and something you can rely on.
If you’re already writing tests for your code, you’re 80% of the way there.
Add tests for your prompts — and stop breaking your AI workflow by accident.