If you’ve ever had an AI-assisted workflow that used to work and then suddenly started producing garbage, you’ve met the most annoying kind of bug: the prompt regression.
You didn’t change the code. (Or you did… but it was “just a wording tweak”.)
Now your agent:
- stops following your repo conventions
- hallucinates file paths
- forgets to include tests
- writes paragraphs instead of patches
In normal software, we solve this with regression tests.
So let’s do the obvious thing: treat your prompt like code and test it.
This post describes a practical pattern I use: a Prompt Regression Test that catches “small” prompt edits before they break your workflow.
## What is a prompt regression?
A regression is any change that unintentionally breaks previously correct behavior.
With prompts, regressions usually come from:
- Accidental ambiguity (you removed a constraint that mattered)
- Conflicting instructions (“be concise” + “explain your reasoning”)
- Hidden dependencies (the model relied on your example format)
- Tooling drift (your agent’s environment or tools changed)
The fix isn’t “never edit prompts”. The fix is make edits safe.
## The idea: a golden prompt + golden output
A minimal regression test has three parts:
- A prompt (the thing you’re editing)
- A fixture input (some representative context)
- An expected output (a “golden” snapshot)
Then you run your model in CI and compare the output.
If the output changes, you either:
- reject the prompt change (it broke behavior)
- or update the golden output intentionally (you meant to change behavior)
That’s it. You’ve created a “unit test” for a prompt.
## A concrete example: code changes should come as a patch
Let’s say you want an assistant that proposes changes in a diff-first format.
### 1) The prompt (versioned)

Create a file like `prompts/refactor_v1.md`:

````text
You are a careful senior engineer.

Task: refactor the provided code to improve readability without changing behavior.

Rules:
- Output MUST be a unified diff.
- Only change files that are mentioned.
- If you need clarification, ask questions *before* producing a diff.
- Include a short test plan after the diff.

Output format:
1) ```diff
   ...
   ```
2) ```text
   Test plan: ...
   ```
````
### 2) The fixture input

Fixtures are the “context bundle” you’ll keep stable.

Store one in `fixtures/refactor_small.json`:

```json
{
  "file": "src/date.ts",
  "content": "export function formatDate(d: Date){return d.toISOString().slice(0,10)}\n"
}
```
### 3) The golden output

Run your model once, review it manually, and save it as `goldens/refactor_small.diff`.

The golden output doesn’t have to be perfect. It has to be acceptable and stable.
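For the date-formatting fixture above, a golden might look something like this (illustrative; the real golden is whatever output you reviewed and accepted):

```diff
--- a/src/date.ts
+++ b/src/date.ts
@@ -1 +1,4 @@
-export function formatDate(d: Date){return d.toISOString().slice(0,10)}
+export function formatDate(d: Date): string {
+  // Format as YYYY-MM-DD (UTC).
+  return d.toISOString().slice(0, 10);
+}
```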
## A tiny test runner (Node.js)
Here’s a small runner that executes the prompt against the fixture and compares output.
```javascript
// prompt-test.js — run with `node prompt-test.js`
// (the `import` syntax requires ESM: set "type": "module" in package.json, or use .mjs)
import fs from "node:fs";
import crypto from "node:crypto";

// Normalize line endings and strip trailing whitespace so cosmetic drift
// doesn't fail the test.
function normalize(s) {
  return s
    .replace(/\r\n/g, "\n")
    .replace(/[ \t]+\n/g, "\n")
    .trim() + "\n";
}

const prompt = fs.readFileSync("prompts/refactor_v1.md", "utf8");
const fixture = JSON.parse(fs.readFileSync("fixtures/refactor_small.json", "utf8"));
const expected = fs.readFileSync("goldens/refactor_small.diff", "utf8");

// Replace this with your actual model call.
async function callModel({ prompt, fixture }) {
  // return await openai.responses.create(...)
  throw new Error("Implement callModel() for your provider");
}

const actual = await callModel({ prompt, fixture });

const a = normalize(actual);
const e = normalize(expected);

if (a !== e) {
  const hash = (x) => crypto.createHash("sha256").update(x).digest("hex").slice(0, 12);
  console.error("Prompt regression detected!");
  console.error("expected:", hash(e));
  console.error("actual:  ", hash(a));
  process.exit(1);
}

console.log("Prompt test passed");
```
That’s intentionally boring. Boring is good. It means you can run it everywhere.
## But models are stochastic — won’t this fail constantly?
Yes, if you let outputs vary.
To make prompt regression tests usable, you need to stabilize the run:
- Set `temperature` to `0` (or as low as your provider allows)
- Pin the model version in CI (don’t silently upgrade)
- Normalize whitespace (as above)
- Keep fixtures small and representative
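Concretely, the stubbed `callModel()` from the runner might be filled in like this. This is a sketch assuming an OpenAI-style chat SDK; the pinned model string, the `seed` parameter, and passing the client in as an argument (to keep the sketch self-contained) are all assumptions to adapt:

```javascript
// A sketch of callModel() for an OpenAI-style chat API.
// Assumptions: the SDK client is passed in, the model string is an exact
// pinned version, and the provider supports temperature/seed.
async function callModel({ prompt, fixture }, client) {
  const res = await client.chat.completions.create({
    model: "gpt-4o-2024-08-06", // pin an exact version, never "latest"
    temperature: 0,             // minimize sampling variance
    seed: 42,                   // if supported, further reduces drift
    messages: [
      { role: "system", content: prompt },
      { role: "user", content: JSON.stringify(fixture) },
    ],
  });
  return res.choices[0].message.content;
}
```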
If you still get drift, switch from “exact match” to a contract check.
## Contract checks (more robust)
Instead of matching the whole output, assert key properties:
- output contains exactly one fenced `diff` block
- the diff touches only allowed files
- includes a `Test plan:` section
- no prose outside the fenced blocks
These checks catch regressions that matter without being brittle.
## The checklist I actually use
When I add prompt tests to a repo, I start with this:
- One golden test per workflow (refactor, bugfix, PR summary, etc.)
- One “edge case” fixture (missing file, ambiguous task, multiple files)
- A contract test that validates structure (diff-only, JSON-only, etc.)
- Run it in CI on every PR
If a prompt edit breaks the tests, that’s a gift: you caught it before it hit production.
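For the “run it in CI on every PR” step, a minimal GitHub Actions workflow might look like this (a sketch; the workflow name, Node version, and secret name are assumptions, and the API key only matters once `callModel()` talks to a real provider):

```yaml
# .github/workflows/prompt-tests.yml
name: prompt-tests
on: pull_request

jobs:
  prompt-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Fails the PR if the golden or contract checks break.
      - run: node prompt-test.js
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```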
## A quick naming convention
Prompts are code. Treat them like it.
- `prompts/<workflow>_v1.md`
- `fixtures/<workflow>_<case>.json`
- `goldens/<workflow>_<case>.<ext>`
Then you can evolve intentionally:
- keep `..._v1` stable
- create `..._v2` when you want a behavioral change
- migrate fixtures gradually
## Where this pays off
Prompt regressions are subtle because they feel “non-deterministic”.
But the workflow you’re building is deterministic in the ways that matter:
- output shape
- file boundaries
- conventions
- safety constraints
Testing those constraints is the difference between a fun prototype and something you can rely on.
If you’re already writing tests for your code, you’re 80% of the way there.
Add tests for your prompts — and stop breaking your AI workflow by accident.