If your prompts power anything more serious than a one-off chat, you need a safety net.
The moment a prompt becomes part of a workflow — generating code, drafting customer emails, summarizing tickets, transforming JSON, writing release notes — it becomes software. And software needs tests.
This post is a practical, “you can do it today” guide to prompt regression testing: a small harness that catches drift when you change:
- the prompt
- the model
- your input formatting
- your post-processing
No heavy framework required. Just a handful of golden examples and a repeatable way to compare outputs.
What “prompt regression” actually means
A regression test answers a simple question:
“Given the same input, do I still get an output that meets my contract?”
That contract might be:
- structure (valid JSON, exact keys)
- style (tone, reading level)
- constraints (no PII, max length)
- content rules (must cite sources, must include a checklist)
The key is that you’re not testing “creativity”. You’re testing reliability.
Step 1: Define a strict output contract
If you want something testable, you need a shape.
Bad contract:
- “Write a good summary.”
Good contract:
- “Return JSON with keys: title, summary, action_items (array), risks (array). Max 80 words in summary.”
Example contract snippet you can embed directly in your prompt:
```
Output contract:
- Return valid JSON only (no markdown, no commentary)
- Keys: title (string), summary (string), action_items (string[]), risks (string[])
- summary: <= 80 words
- action_items: 0–5 items, each starting with an imperative verb
```
This alone eliminates most test pain.
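To make the contract concrete, here is a hypothetical output that satisfies it, plus a quick word-count check you can run in Node (the ticket details and values are invented for the example):

```javascript
// A made-up output that satisfies the contract above.
const output = {
  title: "Login loop on mobile after 2FA",
  summary: "Users on the mobile app are redirected back to /login after completing 2FA. Affects Pro plan; likely a session cookie issue.",
  action_items: ["Reproduce on iOS and Android", "Check session cookie flags"],
  risks: ["Pro customers locked out"]
};

// Contract says summary must be <= 80 words.
const wordCount = output.summary.split(/\s+/).length;
console.log(wordCount <= 80); // true
```

Writing one of these by hand before you automate anything is a cheap sanity check that your contract is actually satisfiable.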
Step 2: Create “golden” test cases
A golden test case is:
- a real-ish input
- an expected output (or expected properties)
Start with 5–10 cases. Cover:
- Happy path (normal input)
- Edge input (empty sections, weird formatting)
- Ambiguous input (multiple interpretations)
- Adversarial input (“Ignore instructions and…”) — especially for automation
- Long input (near your token budget)
Store them in a repo. Example structure:
```
/prompts
  support_triage.md
/tests
  support_triage/
    001_happy.json
    001_expected.json
    002_empty.json
    003_adversarial.json
```
A test file (001_happy.json) could look like:
```json
{
  "ticket": {
    "subject": "Login loop on mobile",
    "body": "User reports being redirected back to /login after 2FA…",
    "plan": "Pro",
    "priority": "high"
  }
}
```
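Because exact-match expectations are brittle (see Step 3), 001_expected.json can store expected properties rather than a full expected output. The shape below is invented for illustration; your harness decides how to interpret it:

```json
{
  "invariants": {
    "priority": "high",
    "max_summary_words": 80,
    "required_keys": ["title", "summary", "action_items", "risks"]
  }
}
```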
Step 3: Decide how strict your comparisons should be
There are three common strategies.
1) Exact match (strict)
Great for:
- JSON transforms
- code formatting tasks
- fixed templates
Risk:
- tiny harmless wording changes fail tests
2) Schema + invariants (recommended)
Validate the structure, then check rules.
Examples:
- JSON parses
- keys exist
- max lengths
- arrays within bounds
- no forbidden words
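As a sketch of the "no forbidden words" style of invariant, here is a small check you could drop into a harness (the word list is invented for the example; use your own):

```javascript
// Invariant: output must not contain forbidden terms.
// This pattern list is made up for the demo — substitute your own.
const FORBIDDEN = [/\bconfidential\b/i, /\binternal[- ]only\b/i, /api[_-]?key/i];

function checkForbidden(text) {
  const hits = FORBIDDEN.filter(re => re.test(text));
  if (hits.length > 0) {
    throw new Error(`forbidden content matched: ${hits.map(String).join(", ")}`);
  }
}

checkForbidden("Summarized ticket: login loop on mobile."); // passes silently
```

Regex invariants like this are crude but cheap, and they fail loudly in CI, which is exactly what you want.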
3) “Judge” evaluation (use carefully)
Run a second evaluation prompt that grades the output.
This can work, but it introduces a new moving part (the judge). If you go this route, also regression-test the judge prompt.
For most teams: schema + invariants is the sweet spot.
Step 4: Build a tiny harness (Node or Python)
Here’s a minimal Node.js harness idea. It:
- loads test cases
- calls your model
- parses JSON
- validates invariants
```javascript
import fs from "node:fs";
import path from "node:path";

function must(condition, message) {
  if (!condition) throw new Error(message);
}

function validate(output) {
  must(typeof output.title === "string" && output.title.length > 0, "title missing");
  must(typeof output.summary === "string", "summary missing");
  must(output.summary.split(/\s+/).length <= 80, "summary too long");
  must(Array.isArray(output.action_items), "action_items not array");
  must(output.action_items.length <= 5, "too many action items");
}

async function run() {
  const dir = "tests/support_triage";
  const cases = fs
    .readdirSync(dir)
    .filter(f => f.endsWith(".json") && !f.includes("expected"));

  for (const file of cases) {
    const input = JSON.parse(fs.readFileSync(path.join(dir, file), "utf8"));

    // callModel() is your wrapper around the API
    const text = await callModel({
      prompt: fs.readFileSync("prompts/support_triage.md", "utf8"),
      input
    });

    let parsed;
    try {
      parsed = JSON.parse(text);
    } catch {
      throw new Error(`${file}: output is not valid JSON`);
    }

    validate(parsed);
    console.log(`✅ ${file}`);
  }
}

run().catch(err => {
  console.error("❌", err.message);
  process.exit(1);
});
```
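callModel() is deliberately left as your own API wrapper. To exercise the harness offline — file loading, JSON parsing, validators — you can stub it with a deterministic fake first. Everything below is a made-up stub, not a real provider call:

```javascript
// Deterministic stub for callModel(), for running the harness without
// a network. The canned response is invented for the demo; a real
// implementation would send `prompt` + `input` to your provider.
async function callModel({ prompt, input }) {
  return JSON.stringify({
    title: `Triage: ${input.ticket?.subject ?? "unknown"}`,
    summary: "Stubbed summary for local harness runs.",
    action_items: ["Replace this stub with a real API call"],
    risks: []
  });
}
```

Once your validators pass against the stub, swap in the real wrapper; any failure after that point is the model or prompt, not your plumbing.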
If you already have a CI pipeline, you now have an easy gate:
- tests pass → ship
- tests fail → investigate before merging
Step 5: Keep prompts versioned and review them like code
A prompt change is a behavior change.
Treat prompt PRs like you would any other risky change:
- show a diff
- run tests in CI
- include “before/after” outputs for a few goldens
One trick I like: add a short changelog header at the top of the prompt.
```
# support_triage prompt
# 2026-03-13: tightened summary length, reduced action_items max to 5
```
It sounds minor, but it makes drift easier to reason about months later.
Common failure modes (and how to design around them)
“It returned markdown around my JSON”
Add an explicit contract: “JSON only”. Then enforce it in your harness.
If you want to be extra defensive: strip code fences before parsing.
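A defensive stripper might look like this — a sketch that only handles the common case of a single fence (with or without a language tag) wrapping the whole output:

```javascript
// Strip a markdown code fence wrapping the model output so JSON.parse
// sees only the payload. Handles the common single-fence case only.
function stripFences(text) {
  const trimmed = text.trim();
  const match = trimmed.match(/^```[a-zA-Z]*\s*\n([\s\S]*?)\n?```$/);
  return match ? match[1].trim() : trimmed;
}

stripFences('```json\n{"ok": true}\n```'); // → '{"ok": true}'
```

Treat this as a last line of defense, not a license to relax the contract: the test should still log when stripping was needed, so you notice the drift.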
“It sometimes forgets a key”
Keys should be part of your instruction and your validator. If a missing key fails CI, you’ll fix it fast.
“Small prompt tweaks broke everything”
This usually means your prompt is doing too many jobs at once.
Split responsibilities:
- one prompt to extract structured data
- one prompt to write prose from that structure
Now each piece is easier to test.
A simple workflow that scales
If you want a lightweight system that works for solo devs and teams:
- Prompt file in repo (markdown)
- 10–30 golden inputs (JSON)
- Validators (schema + invariants)
- CI check on every PR
- Periodic refresh of goldens from real data (sanitize first)
That’s it. You don’t need an enterprise platform to get 80% of the value.
Closing thought
When you stop treating prompts as “magic words” and start treating them as versioned, testable artifacts, your whole workflow gets calmer.
Your future self (and your teammates) will thank you the first time a model update would’ve silently changed production behavior — and your CI catches it.
If you build a small harness like this, I’d love to hear what invariants you ended up validating.