Nova
Prompt Regression Tests: Keep Your AI Workflow Stable as You Iterate

#ai

If you use prompts as part of your daily workflow (writing, coding, support, specs, code review), you’ve probably felt this: you change one line to “make it clearer”… and suddenly the output quality drops.

That’s prompt drift. It’s the same class of problem as a refactor that breaks behavior — except prompts usually don’t have tests.

This post describes a practical, lightweight way to add regression tests to your prompts, so you can iterate quickly without losing the behavior you liked.


What “prompt regression” looks like in real life

A few common drift patterns:

  • You add more constraints (“be concise”, “use bullet points”) and the model starts skipping edge cases.
  • You reorder instructions, and the tone becomes blunt or the structure changes.
  • You add examples, and suddenly the output overfits the example.
  • You “generalize” the prompt, and it stops handling the one tricky input that mattered.

In software we treat this as a solved problem: write tests for the behavior you care about. You can do the same for prompts.


The core idea: Golden prompts + fixtures

A prompt regression test is:

1) a prompt you care about,
2) a set of representative inputs (fixtures), and
3) a checklist of expected properties.

You run the prompt against the fixtures whenever you change it.

Think of it like snapshot tests — but you don’t have to freeze every word. You can test for:

  • required sections present
  • format validity (JSON / Markdown headings)
  • banned phrases not present
  • key facts included
  • action items listed
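
To make a few of these properties concrete, here is a minimal sketch of a checker that returns a list of failures instead of throwing. The function name checkProperties and its option names are my own placeholders, not part of any library, and the banned phrase is only an illustration.

```javascript
// Returns a list of human-readable failures; an empty list means the output passed.
// The option names (requiredSections, bannedPhrases, expectJson) are illustrative.
function checkProperties(output, options = {}) {
  const { requiredSections = [], bannedPhrases = [], expectJson = false } = options;
  const failures = [];
  const lower = output.toLowerCase();

  for (const section of requiredSections) {
    if (!lower.includes(section.toLowerCase())) {
      failures.push(`missing section: ${section}`);
    }
  }
  for (const phrase of bannedPhrases) {
    if (lower.includes(phrase.toLowerCase())) {
      failures.push(`banned phrase present: ${phrase}`);
    }
  }
  if (expectJson) {
    try {
      JSON.parse(output);
    } catch {
      failures.push("output is not valid JSON");
    }
  }
  return failures;
}

// Example: one required section is missing and one banned phrase slipped in.
const sampleOutput = "Summary\nAs an AI language model, I think this looks fine.";
console.log(
  checkProperties(sampleOutput, {
    requiredSections: ["Summary", "Risks"],
    bannedPhrases: ["as an AI language model"],
  })
);
```

Collecting failures rather than stopping at the first one tends to make a broken prompt easier to diagnose, because you see every property that regressed in one run.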

Start with a tiny fixture set

Don’t over-engineer. Pick 5–10 fixtures that cover:

  • the “happy path”
  • the most common user input
  • the nastiest edge case
  • a short input and a long input
  • one ambiguous input

If you only do one thing: capture the inputs that historically caused trouble.


A concrete example: a code review prompt

Let’s say you have a prompt that reviews PRs and produces a consistent review.

You want the output to:

  • start with a one-paragraph summary
  • list risks (security, performance, correctness)
  • propose 3–7 concrete action items
  • include at least one “nice-to-have” suggestion

Here’s a simplified prompt skeleton:

You are reviewing a pull request.

Input:
- PR title
- PR description
- Diff

Output format:
1) Summary (1 paragraph)
2) Risks (bullet list, group by category)
3) Action items (3-7 bullets, each actionable)
4) Nice-to-haves (1-3 bullets)

Rules:
- Do not invent files that are not in the diff.
- If information is missing, say what you would ask.

Now we need fixtures. Save them in a repo alongside your prompt:

prompts/
  pr-review.prompt.md
fixtures/
  pr-review/
    01-small-change.md
    02-auth-edge-case.md
    03-massive-diff.md

Each fixture file holds the full input (PR title, PR description, and diff) that you’ll send to the model.


What to test: properties, not prose

If you snapshot the entire output verbatim, you’ll fight constant diffs. Instead, test properties.

Here’s a practical checklist that, in my experience, catches most breakages:

1) Structure checks

  • Contains headings: Summary, Risks, Action items, Nice-to-haves
  • Action items has 3–7 bullets
  • Risks contains at least 2 categories

2) Safety + hallucination checks

  • Output does not mention files not present in diff
  • Output does not claim code compiles/tests pass unless the fixture says so
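
The first of these checks can be roughly automated. The sketch below assumes the fixture contains a unified diff with "--- a/" and "+++ b/" lines, and it treats any token ending in one of a few hard-coded extensions as a file mention. Both function names and the extension list are my own assumptions; expect false positives and tune the pattern for your repo.

```javascript
// Collects file paths from "--- a/<path>" and "+++ b/<path>" lines of a unified diff.
function filesInDiff(diff) {
  const files = new Set();
  for (const line of diff.split("\n")) {
    const match = /^(?:\+\+\+ b\/|--- a\/)(\S+)/.exec(line);
    if (match) files.add(match[1]);
  }
  return files;
}

// Heuristic: a "file mention" is any token ending in one of these extensions.
const PATH_PATTERN = /[\w./-]+\.(?:js|ts|py|go|rs|java|json|md|yml)\b/g;

// Returns file-like tokens in the output that match no path in the diff.
// A bare file name such as "session.js" matches "src/auth/session.js".
function unknownFilesMentioned(output, diff) {
  const known = [...filesInDiff(diff)];
  const mentioned = output.match(PATH_PATTERN) ?? [];
  return [...new Set(mentioned)].filter(
    (m) => !known.some((k) => k === m || k.endsWith(`/${m}`))
  );
}

const diff = [
  "--- a/src/auth/session.js",
  "+++ b/src/auth/session.js",
  "@@ -1,3 +1,4 @@",
  "+const TTL_MS = 15 * 60 * 1000;",
].join("\n");
const review = "session.js sets a TTL, but src/db/users.py never checks it.";
console.log(unknownFilesMentioned(review, diff));
```

An empty result does not prove the review is hallucination-free; it only means no unknown file names were spotted by this pattern.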

3) Content checks (fixture-specific)

For your auth edge-case fixture, you might require:

  • Mentions session expiration
  • Mentions rate limiting OR brute force protection
  • Mentions logging/auditing

These are tiny, targeted assertions.
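
One way to keep fixture-specific assertions out of the runner’s control flow is a small lookup table. This is a sketch under my own naming (contentChecks, missingContent); the phrases are simple substrings taken from the checklist above, so real outputs may need more synonyms.

```javascript
// Each fixture maps to a list of groups. Every group must match, and a group
// matches when any one of its phrases appears in the output (case-insensitive).
const contentChecks = {
  "02-auth-edge-case.md": [
    ["session expiration"],
    ["rate limit", "brute force"],
    ["logging", "auditing"],
  ],
};

// Returns the groups that found no match; an empty list means all content checks passed.
function missingContent(fixtureName, output) {
  const lower = output.toLowerCase();
  const groups = contentChecks[fixtureName] ?? [];
  return groups.filter((group) => !group.some((phrase) => lower.includes(phrase)));
}

const reviewText = "Add rate limiting and verify session expiration on refresh.";
console.log(missingContent("02-auth-edge-case.md", reviewText));
```

Adding a new fixture then means adding one entry to the table rather than another if-branch in the runner.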


A simple “prompt test runner” (no fancy eval framework required)

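You can run this with a script and keep it in CI. A minimal sketch, where callModel is a placeholder for your provider’s API call:

// prompt-test.js (ES module: top-level await needs "type": "module" or .mjs)
import assert from "node:assert/strict";
import fs from "node:fs";

// Placeholder: replace with your provider's API call, using fixed settings.
async function callModel({ prompt, input }) {
  throw new Error("callModel is not implemented");
}

const assertHas = (output, text) =>
  assert.ok(output.includes(text), `missing "${text}"`);

const assertHasAny = (output, options) =>
  assert.ok(options.some((o) => output.includes(o)), `missing any of: ${options.join(", ")}`);

// Counts bullet lines between `heading` and the next numbered or "#" heading.
function assertBulletCount(output, heading, { min, max }) {
  const lines = output.split("\n");
  const start = lines.findIndex((l) => l.includes(heading));
  let count = 0;
  for (const line of lines.slice(start + 1)) {
    if (/^\s*(\d+\)|#)/.test(line)) break;
    if (/^\s*[-*•]\s/.test(line)) count++;
  }
  assert.ok(count >= min && count <= max, `${heading}: ${count} bullets, expected ${min}-${max}`);
}

const prompt = fs.readFileSync("prompts/pr-review.prompt.md", "utf8");
const fixtures = fs.readdirSync("fixtures/pr-review").filter((f) => f.endsWith(".md")).sort();

for (const f of fixtures) {
  const input = fs.readFileSync(`fixtures/pr-review/${f}`, "utf8");
  const output = await callModel({ prompt, input });

  for (const h of ["Summary", "Risks", "Action items", "Nice-to-haves"]) assertHas(output, h);
  assertBulletCount(output, "Action items", { min: 3, max: 7 });

  // fixture-specific checks
  if (f.includes("auth")) {
    assertHasAny(output.toLowerCase(), ["rate limit", "brute force"]);
  }

  console.log(`✅ ${f}`);
}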

Two important notes:

  • Your callModel should use fixed settings (same model, same temperature) for comparability.
  • Run tests a few times if you use temperature > 0. The goal is “reliably good,” not “occasionally perfect.”

How to handle non-determinism

Prompts aren’t deterministic the way unit tests are. You have three options:

Option A: Set temperature to 0 for tests

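Great for formatting-heavy prompts (JSON, outlines, structured reviews). Some hosted models still vary slightly at temperature 0, so treat this as reducing variance rather than guaranteeing identical output.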

Option B: Run each fixture N times

If you keep some creativity (temperature 0.3–0.7), run the same fixture several times and require most runs to pass, for example at least 4 of 5.
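
A rough sketch of that repeat-and-threshold idea follows. Here runOnce is a hypothetical function you supply: it calls the model once for a fixture and resolves to true when every assertion holds. The names countPasses and passesReliably, and the defaults of 5 runs with 4 required passes, are my own choices.

```javascript
// Runs an async check `runs` times in sequence and counts how many runs passed.
async function countPasses(runOnce, runs) {
  let passes = 0;
  for (let i = 0; i < runs; i++) {
    try {
      if (await runOnce()) passes++;
    } catch {
      // A thrown assertion counts as a failed run.
    }
  }
  return passes;
}

// Reports whether enough runs passed to call the fixture "reliably good".
async function passesReliably(runOnce, { runs = 5, minPasses = 4 } = {}) {
  const passes = await countPasses(runOnce, runs);
  return { passes, runs, ok: passes >= minPasses };
}

// Example with a stubbed check that fails on its third run.
const stubbedOutcomes = [true, true, false, true, true];
let call = 0;
passesReliably(async () => stubbedOutcomes[call++]).then((result) => {
  console.log(result);
});
```

Running sequentially keeps the sketch simple; if your provider’s rate limits allow it, the runs could be issued in parallel instead.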

Option C: Score instead of pass/fail

Give each output a score (0–10) based on your checklist and set a minimum average.
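
As a sketch of scoring, the function below turns a weighted checklist into a 0–10 score and compares the average across several outputs with a minimum. The checklist shape ({ name, weight, test }) and the function names are assumptions for illustration, not an established format.

```javascript
// Scores one output from 0 to 10 as the weighted share of checklist items it passes.
function scoreOutput(output, checklist) {
  const total = checklist.reduce((sum, item) => sum + item.weight, 0);
  const earned = checklist
    .filter((item) => item.test(output))
    .reduce((sum, item) => sum + item.weight, 0);
  return total === 0 ? 0 : (10 * earned) / total;
}

// Averages the scores of several outputs and compares against a minimum.
function meetsAverage(outputs, checklist, minAverage) {
  const scores = outputs.map((o) => scoreOutput(o, checklist));
  const average =
    scores.length === 0 ? 0 : scores.reduce((a, b) => a + b, 0) / scores.length;
  return { scores, average, ok: average >= minAverage };
}

// Hypothetical checklist: the summary matters twice as much as the other sections.
const checklist = [
  { name: "has summary", weight: 2, test: (o) => o.includes("Summary") },
  { name: "has risks", weight: 1, test: (o) => o.includes("Risks") },
  { name: "has action items", weight: 1, test: (o) => o.includes("Action items") },
];
const outputs = ["Summary\nRisks", "Summary\nRisks\nAction items"];
console.log(meetsAverage(outputs, checklist, 8));
```

Weights let one important property outvote several cosmetic ones, which a plain pass/fail count cannot express.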

I usually start with A, then move to B if the prompt needs creativity.


The most useful workflow: “prompt diffs” in Git

Treat your prompt like code:

  • Put it in a repo.
  • Make small commits.
  • Keep a changelog in the prompt header.
  • Review diffs like you would any refactor.

Add a header block at the top:

<!--
Version: 0.4
Last change: tighten action items; add missing-info rule
Known risks: may over-summarize on large diffs
-->

This sounds boring — and that’s the point. Boring means reliable.
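
If you want test logs to record which prompt version produced a result, the header can be read back with a few lines. This sketch assumes the prompt file starts with an HTML comment of "Key: value" lines like the one above; parsePromptHeader is a name I made up.

```javascript
// Parses a leading <!-- ... --> block of "Key: value" lines into an object.
// Returns null when the prompt does not start with such a block.
function parsePromptHeader(promptText) {
  const match = /^\s*<!--\s*([\s\S]*?)-->/.exec(promptText);
  if (!match) return null;
  const header = {};
  for (const line of match[1].split("\n")) {
    const sep = line.indexOf(":");
    if (sep === -1) continue; // skip blank or malformed lines
    header[line.slice(0, sep).trim()] = line.slice(sep + 1).trim();
  }
  return header;
}

const promptText = [
  "<!--",
  "Version: 0.4",
  "Last change: tighten action items; add missing-info rule",
  "Known risks: may over-summarize on large diffs",
  "-->",
  "You are reviewing a pull request.",
].join("\n");
console.log(parsePromptHeader(promptText));
```

Printing the parsed version next to each test result makes it easier to tell later which prompt revision a failing log came from.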


A quick template you can copy

If you want a starting point, here’s a minimal structure:

# Purpose
What this prompt does in one sentence.

# Input
What you provide (and in what format).

# Output
Exact structure you want.

# Rules
- Things it must do
- Things it must never do

# Examples
1-2 examples that cover edge cases.

Then add:

  • fixtures/ (realistic inputs)
  • tests/ (a tiny script or checklist)

Wrap-up

Prompt iteration is inevitable. Quality regressions don’t have to be.

If you capture a handful of fixtures and test for structure + a few key properties, you’ll get:

  • faster prompt refactors
  • fewer “why did it get worse?” moments
  • confidence when you change models or settings

And the best part: once you have fixtures, they become your library of “what good looks like.”

If you try this, I’d love to hear what fixture surprised you the most.
