Nova Elvaris

Posted on Feb 26

Prompt Regression Tests: Keep Your AI Workflow Stable as You Iterate


If you use prompts as part of your daily workflow (writing, coding, support, specs, code review), you’ve probably felt this: you change one line to “make it clearer”… and suddenly the output quality drops.

That’s prompt drift. It’s the same class of problem as a refactor that breaks behavior — except prompts usually don’t have tests.

This post describes a practical, lightweight way to add regression tests to prompts so you can iterate quickly without losing the behavior you liked.

What “prompt regression” looks like in real life

A few common drift patterns:

You add more constraints (“be concise”, “use bullet points”) and the model starts skipping edge cases.
You reorder instructions, and the tone becomes blunt or the structure changes.
You add examples, and suddenly the output overfits the example.
You “generalize” the prompt, and it stops handling the one tricky input that mattered.

In software we treat this as a solved problem: write tests for the behavior you care about. You can do the same for prompts.

The core idea: Golden prompts + fixtures

A prompt regression test is:

1) a prompt you care about,
2) a set of representative inputs (fixtures), and
3) a checklist of expected properties.

You run the prompt against the fixtures whenever you change it.

Think of it like snapshot tests — but you don’t have to freeze every word. You can test for:

required sections present
format validity (JSON / Markdown headings)
banned phrases not present
key facts included
action items listed
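
To make the idea of testing properties a bit more concrete, here is a minimal sketch. The function name `checkProperties` and the shape of its `spec` argument are invented for this example; they are not part of any library, and plain substring matching is only one possible way to check each property.

```javascript
// Hypothetical helper: returns a list of failed property checks (empty list = pass).
// Matching is case-insensitive substring matching, which is an assumption of this sketch.
function checkProperties(output, spec) {
  const failures = [];
  const lower = output.toLowerCase();

  for (const section of spec.requiredSections ?? []) {
    if (!lower.includes(section.toLowerCase())) {
      failures.push(`missing section: ${section}`);
    }
  }
  for (const phrase of spec.bannedPhrases ?? []) {
    if (lower.includes(phrase.toLowerCase())) {
      failures.push(`banned phrase present: ${phrase}`);
    }
  }
  for (const fact of spec.keyFacts ?? []) {
    if (!lower.includes(fact.toLowerCase())) {
      failures.push(`missing key fact: ${fact}`);
    }
  }
  if (spec.mustBeJson) {
    try {
      JSON.parse(output);
    } catch {
      failures.push("output is not valid JSON");
    }
  }
  return failures;
}

const sampleOutput = "Summary\nThe PR adds retries.\nAction items\n- Add a timeout";
const sampleFailures = checkProperties(sampleOutput, {
  requiredSections: ["Summary", "Action items"],
  bannedPhrases: ["as an AI"],
  keyFacts: ["retries"],
});
console.log(sampleFailures); // []
```

Returning a list of failures instead of throwing on the first one means a single run can tell you everything that drifted.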

Start with a tiny fixture set

Don’t over-engineer. Pick 5–10 fixtures that cover:

the “happy path”
the most common user input
the nastiest edge case
a short input and a long input
one ambiguous input

If you only do one thing: capture the inputs that historically caused trouble.

A concrete example: a code review prompt

Let’s say you have a prompt that reviews PRs and produces a consistent review.

You want the output to:

start with a one-paragraph summary
list risks (security, performance, correctness)
propose 3–7 concrete action items
include at least one “nice-to-have” suggestion

Here’s a simplified prompt skeleton:

You are reviewing a pull request.

Input:
- PR title
- PR description
- Diff

Output format:
1) Summary (1 paragraph)
2) Risks (bullet list, group by category)
3) Action items (3-7 bullets, each actionable)
4) Nice-to-haves (1-3 bullets)

Rules:
- Do not invent files that are not in the diff.
- If information is missing, say what you would ask.

Now we need fixtures. Save them in a repo alongside your prompt:

prompts/
  pr-review.prompt.md
fixtures/
  pr-review/
    01-small-change.md
    02-auth-edge-case.md
    03-massive-diff.md

Each fixture file can include the full input you’ll paste into the model.
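
One way to check that a fixture set actually covers the categories listed earlier is to tag each fixture and look for gaps. The sketch below is an assumption layered on top of the folder layout above: the tag names and the tags assigned to each file are hypothetical.

```javascript
// Coverage categories taken from the "tiny fixture set" list earlier in the post.
const REQUIRED_TAGS = ["happy-path", "common", "edge-case", "short", "long", "ambiguous"];

// Hypothetical manifest: the tags are assumed, based only on the file names.
const FIXTURES = [
  { file: "01-small-change.md", tags: ["happy-path", "short"] },
  { file: "02-auth-edge-case.md", tags: ["edge-case"] },
  { file: "03-massive-diff.md", tags: ["long"] },
];

// Returns the required tags that no fixture claims to cover.
function missingCoverage(fixtures, requiredTags = REQUIRED_TAGS) {
  const covered = new Set(fixtures.flatMap((fixture) => fixture.tags));
  return requiredTags.filter((tag) => !covered.has(tag));
}

console.log(missingCoverage(FIXTURES)); // [ 'common', 'ambiguous' ]
```

With these assumed tags, the three fixtures still lack a "most common input" case and an ambiguous one, which tells you what to add next.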

What to test: properties, not prose

If you snapshot the entire output verbatim, you’ll fight constant diffs. Instead, test properties.

Here’s a practical checklist that catches most breakages:

1) Structure checks

Contains headings: Summary, Risks, Action items, Nice-to-haves
Action items has 3–7 bullets
Risks contains at least 2 categories
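
The structure checks above could be sketched roughly like this. It assumes each heading sits on its own line (for example "3) Action items"), that bullets start with "-" or "*", and that a risk category counts as present when its name appears in the Risks section. All function names are made up for the example.

```javascript
const HEADINGS = ["Summary", "Risks", "Action items", "Nice-to-haves"];
const RISK_CATEGORIES = ["security", "performance", "correctness"];

// Splits the output into { heading: [body lines] }.
// Assumes a heading line looks like "Risks", "2) Risks", "2. Risks" or "## Risks".
function splitSections(output) {
  const sections = {};
  let current = null;
  for (const line of output.split("\n")) {
    const heading = HEADINGS.find((h) =>
      new RegExp(`^\\s*(?:\\d+[.)]|#+)?\\s*${h}\\s*:?\\s*$`, "i").test(line)
    );
    if (heading) {
      current = heading;
      sections[current] = [];
    } else if (current) {
      sections[current].push(line);
    }
  }
  return sections;
}

function countBullets(lines) {
  return lines.filter((line) => /^\s*[-*]\s+\S/.test(line)).length;
}

// Returns a list of structure problems (empty list = pass).
function checkStructure(output) {
  const sections = splitSections(output);
  const failures = [];
  for (const h of HEADINGS) {
    if (!(h in sections)) failures.push(`missing heading: ${h}`);
  }
  const actionCount = countBullets(sections["Action items"] ?? []);
  if (actionCount < 3 || actionCount > 7) {
    failures.push(`Action items has ${actionCount} bullets, expected 3-7`);
  }
  const risks = (sections["Risks"] ?? []).join("\n").toLowerCase();
  const categories = RISK_CATEGORIES.filter((c) => risks.includes(c));
  if (categories.length < 2) {
    failures.push(`Risks covers ${categories.length} categories, expected at least 2`);
  }
  return failures;
}

const structureSample = [
  "1) Summary",
  "Adds token refresh to the login flow.",
  "2) Risks",
  "- Security: refresh tokens are logged in plain text",
  "- Correctness: expiry is compared in local time",
  "3) Action items",
  "- Redact tokens in logs",
  "- Compare expiry in UTC",
  "- Add a test for expired sessions",
  "4) Nice-to-haves",
  "- Extract the retry helper",
].join("\n");
console.log(checkStructure(structureSample)); // []
```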

2) Safety + hallucination checks

Output does not mention files not present in diff
Output does not claim code compiles/tests pass unless the fixture says so
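
For the first of these checks, a rough heuristic is to compare file names mentioned in the review against the files named in the diff headers. The sketch below assumes the fixture contains a unified diff with "--- a/..." and "+++ b/..." lines, and the path-detection regex is deliberately simple: it will also flag dotted names such as "Node.js", so treat hits as candidates to inspect rather than proof of a hallucination.

```javascript
// Collects file paths from unified-diff headers such as "+++ b/src/auth.js".
function filesInDiff(diff) {
  const files = new Set();
  for (const line of diff.split("\n")) {
    const match = line.match(/^(?:\+\+\+|---) [ab]\/(.+)$/);
    if (match) files.add(match[1].trim());
  }
  return files;
}

// Finds tokens that look like file paths: a name followed by a 2-5 letter extension.
function filesMentioned(review) {
  const matches = review.match(/[\w/-]+(?:\.[\w-]+)*\.[A-Za-z]{2,5}\b/g) ?? [];
  return new Set(matches);
}

// Returns mentioned paths that match no file in the diff (by full path or trailing path).
function hallucinatedFiles(diff, review) {
  const known = [...filesInDiff(diff)];
  return [...filesMentioned(review)].filter(
    (mentioned) => !known.some((k) => k === mentioned || k.endsWith(`/${mentioned}`))
  );
}

const sampleDiff = [
  "--- a/src/auth/session.js",
  "+++ b/src/auth/session.js",
  "@@ -10,6 +10,7 @@",
  "+  const expiresAt = Date.now() + ttl;",
].join("\n");
const sampleReview =
  "session.js now sets expiresAt, but src/auth/tokens.js still ignores it.";
console.log(hallucinatedFiles(sampleDiff, sampleReview)); // [ 'src/auth/tokens.js' ]
```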

3) Content checks (fixture-specific)

For your auth edge-case fixture, you might require:

Mentions session expiration
Mentions rate limiting OR brute force protection
Mentions logging/auditing

These are tiny, targeted assertions.
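
One possible way to keep these fixture-specific assertions in a single place is a small rule table keyed by fixture file name. The table format and the helper `checkContent` are hypothetical, and the extra synonym "session expiry" is an assumption added to make the match less brittle.

```javascript
// Hypothetical rule table: fixture file name -> required content.
// Each inner array is an OR-group; every group must be satisfied.
const CONTENT_RULES = {
  "02-auth-edge-case.md": [
    ["session expiration", "session expiry"], // second term is an assumed synonym
    ["rate limit", "brute force"],
    ["logging", "audit"],
  ],
};

// Returns one message per OR-group that the output fails to mention.
function checkContent(fixtureName, output) {
  const lower = output.toLowerCase();
  const groups = CONTENT_RULES[fixtureName] ?? [];
  return groups
    .filter((group) => !group.some((term) => lower.includes(term)))
    .map((group) => `missing one of: ${group.join(" | ")}`);
}

const authReview =
  "Session expiration is not enforced. Add rate limiting and audit logs.";
console.log(checkContent("02-auth-edge-case.md", authReview)); // []
```

Fixtures without an entry in the table simply get no content checks, so you only write rules where they matter.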

A simple “prompt test runner” (no fancy eval framework required)

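You can run this with a small Node script and keep it in CI. The assertion helpers are defined inline; callModel is the one piece you supply for your provider:

// prompt-test.js (run as an ES module so top-level await works)
import assert from "node:assert/strict";
import fs from "node:fs";
import path from "node:path";
// Hypothetical local module: wrap your provider's API and pin model + temperature there.
import { callModel } from "./call-model.js";

const FIXTURE_DIR = "fixtures/pr-review";
const prompt = fs.readFileSync("prompts/pr-review.prompt.md", "utf8");
const fixtures = fs.readdirSync(FIXTURE_DIR).filter((f) => f.endsWith(".md"));

// Plain substring checks; the failure message names what was missing.
const assertHas = (text, needle) =>
  assert.ok(text.includes(needle), `missing "${needle}"`);
const assertHasAny = (text, needles) =>
  assert.ok(needles.some((n) => text.includes(n)), `missing any of: ${needles.join(", ")}`);

// Counts "-" or "*" bullets between the heading and the next numbered heading.
function assertBulletCount(text, heading, { min, max }) {
  const section = text.slice(text.indexOf(heading)).split(/\n\s*\d+\)/)[0];
  const count = (section.match(/^\s*[-*]\s/gm) ?? []).length;
  assert.ok(count >= min && count <= max, `${heading}: ${count} bullets, expected ${min}-${max}`);
}

for (const f of fixtures) {
  const input = fs.readFileSync(path.join(FIXTURE_DIR, f), "utf8");
  const output = await callModel({ prompt, input });

  for (const heading of ["Summary", "Risks", "Action items", "Nice-to-haves"]) {
    assertHas(output, heading);
  }
  assertBulletCount(output, "Action items", { min: 3, max: 7 });

  // fixture-specific checks
  if (f.includes("auth")) {
    assertHasAny(output.toLowerCase(), ["rate limit", "brute force"]);
  }

  console.log(`✅ ${f}`);
}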

Two important notes:

Your callModel should use fixed settings (same model, same temperature) for comparability.
Run tests a few times if you use temperature > 0. The goal is “reliably good,” not “occasionally perfect.”
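
As a sketch of the first note, you could pin the settings in one frozen object and build callModel around it. Everything here is an assumption: `sendRequest` stands for whatever function talks to your provider, the request field names (`system`, `user`) are illustrative, and the model name is a placeholder.

```javascript
// Frozen so a test run cannot change the settings by accident.
// "your-model-name" is a placeholder, not a real model identifier.
const TEST_SETTINGS = Object.freeze({ model: "your-model-name", temperature: 0 });

// `sendRequest` is a hypothetical function that calls your provider.
// Every call made through the returned function uses the same pinned settings.
function makeCallModel(sendRequest, settings = TEST_SETTINGS) {
  return ({ prompt, input }) =>
    sendRequest({ ...settings, system: prompt, user: input });
}

// Stand-in sender for demonstration; a real one would call your provider's API.
const echoSender = (request) => `${request.model} @ ${request.temperature}`;
const pinnedCallModel = makeCallModel(echoSender);
console.log(pinnedCallModel({ prompt: "Review this PR.", input: "diff goes here" }));
// prints "your-model-name @ 0"
```

Injecting the sender also lets you run the test script against canned outputs without spending tokens.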

How to handle non-determinism

Prompts aren’t deterministic the way unit tests are. You have three options:

Option A: Set temperature to 0 for tests

Great for formatting-heavy prompts (JSON, outlines, structured reviews).

Option B: Run each fixture N times

If you keep some creativity (temperature 0.3–0.7), run each fixture several times and require most runs to pass, for example 4 out of 5.
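
A minimal sketch of the pass-threshold logic, assuming you have already collected the outputs from repeated runs of one fixture. The helper name `passesMajority` and the example check are hypothetical.

```javascript
// Decides whether a fixture passes, given the outputs of several runs.
// `check` is any function that returns true when a single output is acceptable.
function passesMajority(outputs, check, minPasses) {
  const passes = outputs.filter((output) => check(output)).length;
  return { passes, runs: outputs.length, ok: passes >= minPasses };
}

// Canned outputs standing in for five model runs of the same fixture.
const fiveRuns = [
  "Summary ... Action items ...",
  "Summary ... Action items ...",
  "Summary only",
  "Summary ... Action items ...",
  "Summary ... Action items ...",
];
const hasActionItems = (output) => output.includes("Action items");
console.log(passesMajority(fiveRuns, hasActionItems, 4));
// { passes: 4, runs: 5, ok: true }
```

Keeping the judging separate from the sampling means the same function works whether the outputs come from live calls or from saved transcripts.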

Option C: Score instead of pass/fail

Give each output a score (0–10) based on your checklist and set a minimum average.
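
Scoring could look something like the sketch below. The checklist items, their weights, and the minimum average of 7 are all invented for illustration; the weights are chosen to sum to 10 so the score lands in the 0–10 range.

```javascript
// Hypothetical weighted checklist; weights sum to 10.
const CHECKLIST = [
  { name: "has summary", weight: 2, test: (o) => /summary/i.test(o) },
  { name: "has risks", weight: 2, test: (o) => /risks/i.test(o) },
  { name: "has action items", weight: 3, test: (o) => /action items/i.test(o) },
  { name: "no claim that tests pass", weight: 3, test: (o) => !/tests pass/i.test(o) },
];

// Sums the weights of the checklist items an output satisfies.
function score(output) {
  return CHECKLIST.reduce(
    (sum, item) => sum + (item.test(output) ? item.weight : 0),
    0
  );
}

function averageScore(outputs) {
  return outputs.reduce((sum, o) => sum + score(o), 0) / outputs.length;
}

const scoredOutputs = [
  "Summary ... Risks ... Action items ...", // scores 10
  "Summary ... Action items ... All tests pass.", // scores 5
];
const MIN_AVERAGE = 7; // assumed threshold
const average = averageScore(scoredOutputs);
console.log(average, average >= MIN_AVERAGE); // 7.5 true
```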

I usually start with A, then move to B if the prompt needs creativity.

The most useful workflow: “prompt diffs” in Git

Treat your prompt like code:

Put it in a repo.
Make small commits.
Keep a changelog in the prompt header.
Review diffs like you would any refactor.

Add a header block at the top:

<!--
Version: 0.4
Last change: tighten action items; add missing-info rule
Known risks: may over-summarize on large diffs
-->
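
If you want your test script to log which prompt version produced a result, the header can be read back with a few lines of code. This is only a sketch: it assumes the header is the first thing in the file and that each line inside it is a simple "Key: value" pair, as in the block above.

```javascript
// Parses a leading <!-- ... --> header into an object of key/value strings.
// Returns an empty object when the prompt does not start with a header.
function parsePromptHeader(promptText) {
  const match = promptText.match(/^<!--\s*\n([\s\S]*?)\n-->/);
  if (!match) return {};
  const meta = {};
  for (const line of match[1].split("\n")) {
    const idx = line.indexOf(":");
    if (idx === -1) continue; // skip lines that are not "Key: value"
    meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return meta;
}

const promptText = [
  "<!--",
  "Version: 0.4",
  "Last change: tighten action items; add missing-info rule",
  "Known risks: may over-summarize on large diffs",
  "-->",
  "You are reviewing a pull request.",
].join("\n");
console.log(parsePromptHeader(promptText).Version); // 0.4
```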

This sounds boring — and that’s the point. Boring means reliable.

A quick template you can copy

If you want a starting point, here’s a minimal structure:

# Purpose
What this prompt does in one sentence.

# Input
What you provide (and in what format).

# Output
Exact structure you want.

# Rules
- Things it must do
- Things it must never do

# Examples
1-2 examples that cover edge cases.

Then add:

fixtures/ (realistic inputs)
tests/ (a tiny script or checklist)

Wrap-up

Prompt iteration is inevitable. Quality regressions don’t have to be.

If you capture a handful of fixtures and test for structure + a few key properties, you’ll get:

faster prompt refactors
fewer “why did it get worse?” moments
confidence when you change models or settings

And the best part: once you have fixtures, they become your library of “what good looks like.”

If you try this, I’d love to hear what fixture surprised you the most.
