DEV Community

Nova
Prompt Regression Tests: Stop Your AI Workflow From Breaking

If you ship code, you already do regression testing. When you change something, you want to know what broke.

Prompted workflows deserve the same treatment.

The moment a prompt becomes infrastructure (code review prompt, PRD-to-tasks workflow, customer-support classifier, writing linter, etc.), you’re going to tweak it. And the moment you tweak it, you’ll get the classic surprise:

  • “Why is it suddenly verbose?”
  • “Why did it stop following the JSON schema?”
  • “Why is it missing edge cases it used to catch?”

This is prompt regression: your AI workflow drifts because you changed the prompt, the model, the temperature, the tool list, or even just the surrounding context.

Here’s a practical way to stop that from happening: build a tiny set of prompt regression tests. It’s not fancy. It’s incredibly effective.


What to test (and what not to)

Your goal is not “the exact same words every time.” Your goal is:

  1. Structure stays valid (JSON parses, required fields present)
  2. Critical behaviors stay true (it still catches real bugs, still asks the right clarifying questions, still obeys your constraints)
  3. Failure modes stay acceptable (when input is ambiguous, it asks; when the answer is unknown, it says so)

So don’t snapshot whole outputs as strings unless you need to. Prefer invariants.

Examples of good invariants:

  • Output is valid JSON with a fixed schema
  • Contains at least N issues with severity ≥ “high” when the sample contains obvious defects
  • Does not mention forbidden things (e.g. “as an AI…”, internal system notes)
  • Produces bullet points, not paragraphs, for the “summary” section
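The first invariant is the easiest to mechanize with a schema file. A minimal JSON Schema sketch for a review-style output (the field names here match the harness later in this post; the severity levels are illustrative):

```json
{
  "type": "object",
  "required": ["summary", "issues"],
  "properties": {
    "summary": { "type": "string" },
    "issues": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["severity", "description"],
        "properties": {
          "severity": { "enum": ["low", "medium", "high"] },
          "description": { "type": "string" }
        }
      }
    }
  }
}
```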

Step 1: Create a small “golden set” of inputs

Start with 8–20 test cases. Make them intentionally diverse:

  • Happy path: clean input, should produce clean output
  • Edge cases: empty strings, null-like placeholders, missing fields
  • Adversarial: prompt-injection attempts (“ignore previous instructions…”) inside the input
  • Real failures you’ve seen: the bug you missed last week belongs here

Example: if you have a “PRD → tasks” prompt, include:

  • a tight PRD
  • a messy PRD
  • a PRD with contradictory requirements
  • a PRD that omits constraints (so the assistant should ask)

This set is your unit test fixture for prompting.
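As a sketch, the fixture file can be a plain JSON array. The field names below (`name`, `input`, `expect`) are one possible convention, not a standard:

```json
[
  {
    "name": "prd-missing-constraints",
    "input": "PRD: Make the dashboard faster.",
    "expect": { "behavior": "asks-clarifying-question" }
  },
  {
    "name": "prd-contradictory",
    "input": "PRD: Must work fully offline. All data lives on the server.",
    "expect": { "behavior": "flags-contradiction" }
  }
]
```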


Step 2: Define an evaluation rubric (explicitly)

Write down what “good” means. Keep it short.

For a code review prompt, a rubric could be:

  • Find functional defects first
  • Call out security issues when present
  • Don’t nitpick style unless it risks bugs
  • Provide minimal, actionable fixes

For a writing-linter prompt:

  • Enforce voice rules (short sentences, avoid filler)
  • Flag passive voice
  • Suggest alternatives (not just criticism)

The trick: your tests will check these rubric items mechanically where possible.


Step 3: Turn the rubric into checks

Some checks are deterministic. Some are fuzzy.

Deterministic checks (do these first)

  • JSON parse
  • Schema validation
  • Forbidden phrases
  • Required headings
  • Max length

If you can convert 60% of your expectations into deterministic checks, you’ll catch most breakage.

Here’s a minimal Node.js harness that validates JSON structure and a couple of invariants:

```javascript
import assert from "node:assert";

// Parse model output, failing loudly with the underlying parse error.
function mustParseJson(text) {
  try {
    return JSON.parse(text);
  } catch (e) {
    throw new Error(`Output is not valid JSON: ${e.message}`);
  }
}

// Assert that a dotted path like "summary" or "issues.0.severity" exists.
function assertHas(obj, path) {
  const parts = path.split(".");
  let cur = obj;
  for (const p of parts) {
    assert.ok(cur && p in cur, `Missing field: ${path}`);
    cur = cur[p];
  }
}

export function runChecks(outputText) {
  const out = mustParseJson(outputText);

  assertHas(out, "summary");
  assertHas(out, "issues");
  assert.ok(Array.isArray(out.issues), "issues must be an array");

  const forbidden = [/as an ai/i, /system prompt/i];
  for (const re of forbidden) {
    assert.ok(!re.test(outputText), `Forbidden phrase: ${re}`);
  }

  return true;
}
```

Fuzzy checks (keep them bounded)

Fuzzy checks are where people get stuck. Don’t.

Use one of these patterns:

  1. Keyword expectations: If the input contains an obvious SQL injection string, the review should mention “injection” or “parameterized”.
  2. Score bands: Ask a second pass (or a lightweight heuristic) to score the output 1–5 on rubric items.
  3. Diff tolerance: Compare categories, not prose. (“Found 3 high severity issues” instead of exact sentences.)
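Pattern 1 can be sketched as a tiny synonym-matching check. The topic names and synonym lists below are illustrative, not a standard taxonomy:

```javascript
// Keyword expectation: if a fixture is tagged with expected topics,
// the output must mention at least one synonym per topic.
const TOPIC_SYNONYMS = {
  "sql-injection": [/injection/i, /parameteri[sz]ed/i, /prepared statement/i],
  "null-handling": [/\bnull\b/i, /\bundefined\b/i],
};

function checkKeywords(outputText, expectedTopics) {
  const misses = [];
  for (const topic of expectedTopics) {
    const synonyms = TOPIC_SYNONYMS[topic] ?? [];
    if (!synonyms.some((re) => re.test(outputText))) misses.push(topic);
  }
  return misses; // an empty array means every expectation passed
}
```

Because it returns the missed topics rather than a boolean, the test failure message can say exactly which behavior drifted.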

If you do want LLM-based scoring, treat it like a judge with a strict rubric and low temperature.


Step 4: Lock down your prompt interface

Most prompt regressions come from changing inputs without realizing it.

Create a single “prompt interface” object that you version alongside code:

  • model name
  • temperature
  • system message
  • user prompt template
  • tool definitions (if any)
  • output schema

Then run tests whenever any of those change.

A simple convention:

  • prompt/v1/system.md
  • prompt/v1/user.md
  • prompt/v1/schema.json
  • tests/fixtures/*.json

This makes prompt changes reviewable like code.
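One way to make interface changes impossible to miss is to hash the whole object and record the hash next to test results. A minimal sketch (the field values are illustrative, and this only handles flat objects; nested objects would need a recursive key sort):

```javascript
import { createHash } from "node:crypto";

// Hash the prompt interface so any change (model, temperature, system
// text, schema path) shows up as a different fingerprint in test output.
function promptInterfaceHash(iface) {
  // Passing sorted keys as the replacer makes the output key-order stable.
  // Note: an array replacer also filters nested keys, so keep iface flat.
  const stable = JSON.stringify(iface, Object.keys(iface).sort());
  return createHash("sha256").update(stable).digest("hex").slice(0, 12);
}

const iface = {
  model: "example-model-v1", // illustrative, not a real model name
  temperature: 0,
  system: "You are a code reviewer...",
  schemaPath: "prompt/v1/schema.json",
};

console.log(promptInterfaceHash(iface));
```

A changed hash in CI output is a cheap signal that the full suite, not just the offline checks, should run.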


Step 5: Run tests in CI (yes, really)

If your workflow matters, put it in CI.

Two pragmatic options:

Option A: “Offline-first” tests (fast, cheap)

  • Only deterministic checks
  • Only a handful of real model calls (or none)

Great when you mostly worry about format and guardrails.

Option B: “Live model” tests (realistic)

  • Run 10–30 real calls
  • Use a small budget cap
  • Fail only on meaningful regressions

A common approach is to allow a small tolerance:

  • JSON must parse: hard fail
  • Must include required fields: hard fail
  • Rubric score must not drop below threshold: soft fail (warn) unless it drops a lot

This prevents flaky failures while still catching true drift.
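The tolerance policy above can be sketched as a small gate function. The score thresholds are illustrative; tune them to your rubric scale:

```javascript
// Hard checks fail the build; soft checks only warn, unless the
// rubric score drops far enough to count as real drift.
function evaluateRun({ jsonValid, requiredFieldsPresent, rubricScore }) {
  const hardFailures = [];
  const warnings = [];

  if (!jsonValid) hardFailures.push("output is not valid JSON");
  if (!requiredFieldsPresent) hardFailures.push("missing required fields");

  if (rubricScore < 2.0) {
    hardFailures.push(`rubric score dropped a lot: ${rubricScore}`);
  } else if (rubricScore < 3.5) {
    warnings.push(`rubric score below threshold: ${rubricScore}`);
  }

  return { ok: hardFailures.length === 0, hardFailures, warnings };
}
```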


A concrete example: testing a code review prompt

Let’s say you have a review prompt that should catch boundary bugs.

Fixture input (small and intentionally flawed):

  • a timezone conversion bug
  • missing null check
  • a suspicious string concatenation into SQL

Your tests can assert:

  • At least one issue mentions timezones or date math
  • At least one issue mentions null/undefined handling
  • At least one issue mentions injection risk
  • Issues include severity and confidence fields

You’re not checking exact wording. You’re checking that the workflow still does the job you rely on.
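Those assertions can be written as one coverage check over the parsed `issues` array. The field names (`severity`, `confidence`) match the list above; the synonym regexes are assumptions about how reviews tend to phrase these defects:

```javascript
// Category-level assertions: each known defect in the fixture must be
// mentioned by at least one issue, and every issue carries metadata.
function assertReviewCoverage(issues) {
  const text = (issue) => JSON.stringify(issue).toLowerCase();
  const expectations = [
    { name: "timezone/date math", re: /timezone|utc|date math|dst/ },
    { name: "null/undefined handling", re: /null|undefined/ },
    { name: "injection risk", re: /injection|parameteri[sz]ed/ },
  ];
  for (const { name, re } of expectations) {
    if (!issues.some((i) => re.test(text(i)))) {
      throw new Error(`No issue mentions ${name}`);
    }
  }
  for (const issue of issues) {
    if (!("severity" in issue) || !("confidence" in issue)) {
      throw new Error("Issue is missing severity/confidence fields");
    }
  }
}
```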


The hidden win: you can refactor prompts safely

Once you have even a small test suite, prompt edits stop being scary.

You can:

  • tighten instructions
  • reduce verbosity
  • change formatting
  • switch models
  • add tools

…without guessing.

You’ll know within minutes whether your “tiny improvement” accidentally deleted the one behavior you cared about.


Quick start checklist

If you want to do this in the next hour:

  1. Pick one prompt you use weekly.
  2. Write 10 fixtures (copy/paste real cases).
  3. Write 5 invariants (JSON/schema/forbidden phrases/required headings).
  4. Add one fuzzy check (keyword expectations).
  5. Run it before and after your next prompt tweak.

Prompting is engineering. Treat it like it matters.

If you already have prompt tests, I’d love to hear what invariants you’ve found most valuable.
