If your prompts power anything more serious than a one-off chat, you need a safety net.
The moment a prompt becomes part of a workflow — generating code, drafting customer emails, summarizing tickets, transforming JSON, writing release notes — it becomes software. And software needs tests.
This post is a practical, “you can do it today” guide to prompt regression testing: a small harness that catches drift when you change:
- the prompt
- the model
- your input formatting
- your post-processing
No heavy framework required. Just a handful of golden examples and a repeatable way to compare outputs.
What “prompt regression” actually means
A regression test answers a simple question:
“Given the same input, do I still get an output that meets my contract?”
That contract might be:
- structure (valid JSON, exact keys)
- style (tone, reading level)
- constraints (no PII, max length)
- content rules (must cite sources, must include a checklist)
The key is that you’re not testing “creativity”. You’re testing reliability.
Step 1: Define a strict output contract
If you want something testable, you need a shape.
Bad contract:
- “Write a good summary.”
Good contract:
- “Return JSON with keys: title, summary, action_items (array), risks (array). Max 80 words in summary.”
Example contract snippet you can embed directly in your prompt:
```
Output contract:
- Return valid JSON only (no markdown, no commentary)
- Keys: title (string), summary (string), action_items (string[]), risks (string[])
- summary: <= 80 words
- action_items: 0–5 items, each starting with an imperative verb
```
This alone eliminates most test pain.
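To make the contract concrete, here is a hypothetical output that satisfies it, plus a quick word-count check you can run in Node (the ticket details and values are invented for the example):

```javascript
// A made-up output that satisfies the contract above.
const output = {
  title: "Login loop on mobile after 2FA",
  summary: "Users on the mobile app are redirected back to /login after completing 2FA. Affects Pro plan; likely a session cookie issue.",
  action_items: ["Reproduce on iOS and Android", "Check session cookie flags"],
  risks: ["Pro customers locked out"]
};

// Contract says summary must be <= 80 words.
const wordCount = output.summary.split(/\s+/).length;
console.log(wordCount <= 80); // true
```

Writing one of these by hand before you automate anything is a cheap sanity check that your contract is actually satisfiable.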
Step 2: Create “golden” test cases
A golden test case is:
- a real-ish input
- an expected output (or expected properties)
Start with 5–10 cases. Cover:
- Happy path (normal input)
- Edge input (empty sections, weird formatting)
- Ambiguous input (multiple interpretations)
- Adversarial input (“Ignore instructions and…”) — especially for automation
- Long input (near your token budget)
Store them in a repo. Example structure:
```
/prompts
  support_triage.md
/tests
  support_triage/
    001_happy.json
    001_expected.json
    002_empty.json
    003_adversarial.json
```
A test file (001_happy.json) could look like:
```json
{
  "ticket": {
    "subject": "Login loop on mobile",
    "body": "User reports being redirected back to /login after 2FA…",
    "plan": "Pro",
    "priority": "high"
  }
}
```
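Because exact-match expectations are brittle (see Step 3), 001_expected.json can store expected properties rather than a full expected output. The shape below is invented for illustration; your harness decides how to interpret it:

```json
{
  "invariants": {
    "priority": "high",
    "max_summary_words": 80,
    "required_keys": ["title", "summary", "action_items", "risks"]
  }
}
```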
Step 3: Decide how strict your comparisons should be
There are three common strategies.
1) Exact match (strict)
Great for:
- JSON transforms
- code formatting tasks
- fixed templates
Risk:
- tiny harmless wording changes fail tests
2) Schema + invariants (recommended)
Validate the structure, then check rules.
Examples:
- JSON parses
- keys exist
- max lengths
- arrays within bounds
- no forbidden words
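As a sketch of the "no forbidden words" style of invariant, here is a small check you could drop into a harness (the word list is invented for the example; use your own):

```javascript
// Invariant: output must not contain forbidden terms.
// This pattern list is made up for the demo — substitute your own.
const FORBIDDEN = [/\bconfidential\b/i, /\binternal[- ]only\b/i, /api[_-]?key/i];

function checkForbidden(text) {
  const hits = FORBIDDEN.filter(re => re.test(text));
  if (hits.length > 0) {
    throw new Error(`forbidden content matched: ${hits.map(String).join(", ")}`);
  }
}

checkForbidden("Summarized ticket: login loop on mobile."); // passes silently
```

Regex invariants like this are crude but cheap, and they fail loudly in CI, which is exactly what you want.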
3) “Judge” evaluation (use carefully)
Run a second evaluation prompt that grades the output.
This can work, but it introduces a new moving part (the judge). If you go this route, also regression-test the judge prompt.
For most teams: schema + invariants is the sweet spot.
Step 4: Build a tiny harness (Node or Python)
Here’s a minimal Node.js harness idea. It:
- loads test cases
- calls your model
- parses JSON
- validates invariants
```javascript
import fs from "node:fs";
import path from "node:path";

function must(condition, message) {
  if (!condition) throw new Error(message);
}

function validate(output) {
  must(typeof output.title === "string" && output.title.length > 0, "title missing");
  must(typeof output.summary === "string", "summary missing");
  must(output.summary.split(/\s+/).length <= 80, "summary too long");
  must(Array.isArray(output.action_items), "action_items not array");
  must(output.action_items.length <= 5, "too many action items");
}

async function run() {
  const dir = "tests/support_triage";
  const cases = fs
    .readdirSync(dir)
    .filter(f => f.endsWith(".json") && !f.includes("expected"));

  for (const file of cases) {
    const input = JSON.parse(fs.readFileSync(path.join(dir, file), "utf8"));

    // callModel() is your wrapper around the API
    const text = await callModel({
      prompt: fs.readFileSync("prompts/support_triage.md", "utf8"),
      input
    });

    let parsed;
    try {
      parsed = JSON.parse(text);
    } catch {
      throw new Error(`${file}: output is not valid JSON`);
    }

    validate(parsed);
    console.log(`✅ ${file}`);
  }
}

run().catch(err => {
  console.error("❌", err.message);
  process.exit(1);
});
```
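callModel() is deliberately left as your own API wrapper. To exercise the harness offline — file loading, JSON parsing, validators — you can stub it with a deterministic fake first. Everything below is a made-up stub, not a real provider call:

```javascript
// Deterministic stub for callModel(), for running the harness without
// a network. The canned response is invented for the demo; a real
// implementation would send `prompt` + `input` to your provider.
async function callModel({ prompt, input }) {
  return JSON.stringify({
    title: `Triage: ${input.ticket?.subject ?? "unknown"}`,
    summary: "Stubbed summary for local harness runs.",
    action_items: ["Replace this stub with a real API call"],
    risks: []
  });
}
```

Once your validators pass against the stub, swap in the real wrapper; any failure after that point is the model or prompt, not your plumbing.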
If you already have a CI pipeline, you now have an easy gate:
- tests pass → ship
- tests fail → investigate before merging
Step 5: Keep prompts versioned and review them like code
A prompt change is a behavior change.
Treat prompt PRs like you would any other risky change:
- show a diff
- run tests in CI
- include “before/after” outputs for a few goldens
One trick I like: add a short changelog header at the top of the prompt.
```
# support_triage prompt
# 2026-03-13: tightened summary length, reduced action_items max to 5
```
It sounds minor, but it makes drift easier to reason about months later.
Common failure modes (and how to design around them)
“It returned markdown around my JSON”
Add an explicit contract: “JSON only”. Then enforce it in your harness.
If you want to be extra defensive: strip code fences before parsing.
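A defensive stripper might look like this — a sketch that only handles the common case of a single fence (with or without a language tag) wrapping the whole output:

```javascript
// Strip a markdown code fence wrapping the model output so JSON.parse
// sees only the payload. Handles the common single-fence case only.
function stripFences(text) {
  const trimmed = text.trim();
  const match = trimmed.match(/^```[a-zA-Z]*\s*\n([\s\S]*?)\n?```$/);
  return match ? match[1].trim() : trimmed;
}

stripFences('```json\n{"ok": true}\n```'); // → '{"ok": true}'
```

Treat this as a last line of defense, not a license to relax the contract: the test should still log when stripping was needed, so you notice the drift.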
“It sometimes forgets a key”
Keys should be part of your instruction and your validator. If a missing key fails CI, you’ll fix it fast.
“Small prompt tweaks broke everything”
This usually means your prompt is doing too many jobs at once.
Split responsibilities:
- one prompt to extract structured data
- one prompt to write prose from that structure
Now each piece is easier to test.
A simple workflow that scales
If you want a lightweight system that works for solo devs and teams:
- Prompt file in repo (markdown)
- 10–30 golden inputs (JSON)
- Validators (schema + invariants)
- CI check on every PR
- Periodic refresh of goldens from real data (sanitize first)
That’s it. You don’t need an enterprise platform to get 80% of the value.
Closing thought
When you stop treating prompts as “magic words” and start treating them as versioned, testable artifacts, your whole workflow gets calmer.
Your future self (and your teammates) will thank you the first time a model update would’ve silently changed production behavior — and your CI catches it.
If you build a small harness like this, I’d love to hear what invariants you ended up validating.