If you treat prompts as “just text,” you’ll keep shipping the same kind of breakages:
- A small wording change makes outputs longer/shorter.
- A new requirement causes the model to ignore an old one.
- You “improve” a prompt for one case and quietly regress three others.
In code, we solved this with unit tests, fixtures, and diffs. You can do the same for prompts.
This post is a practical pattern I use: a prompt harness—a tiny test runner that executes a prompt against a set of example inputs and compares the output to expectations.
## What a prompt harness is (in one sentence)
A prompt harness is a repeatable way to run:
- the same prompt
- against a known set of inputs (fixtures)
- under the same settings
- and evaluate the outputs with checks (assertions, rubrics, or golden-file diffs).
You’re building a safety net so you can iterate quickly without guessing.
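The steps above can be sketched as a loop (pseudocode; the names are illustrative, and a concrete runner appears later in this post):

```
for each fixture in fixtures:
    output = run_model(render(prompt, fixture), settings)
    evaluate(output, expectations[fixture])
```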
## When this is worth it
Use a harness when your prompt is:
- part of a workflow (triage, summarization, tagging, refactoring, doc generation)
- used by more than one person
- run on more than one kind of input
- something you expect to iterate on
If it’s a one-off question, skip it. If it’s infrastructure, test it.
## The minimal setup
Create a folder like this:
```
/prompts
  summarize_ticket.md
/fixtures
  ticket_short.json
  ticket_messy.json
  ticket_angry_customer.json
/expected
  ticket_short.md
  ticket_messy.md
  ticket_angry_customer.md
/harness
  run.mjs
  checks.mjs
```
The concept:
- `prompts/` contains the prompt template you’re actually using.
- `fixtures/` contains representative inputs.
- `expected/` contains “golden” outputs (or a scoring config).
- `harness/` runs the matrix: each fixture through the prompt.
## A concrete example: “support ticket summarizer”
Here’s a prompt template (`prompts/summarize_ticket.md`):
You are an assistant that turns raw support tickets into a triage-friendly summary.
Return Markdown with exactly these sections:
## Summary
(1–2 sentences)
## Impact
(one of: low | medium | high)
## Suspected root cause
(bullets, include uncertainties)
## Next actions
(3–6 bullets, ordered)
Constraints:
- Do not include private customer data (emails, phone numbers)
- If information is missing, say what you need
Ticket:
{{TICKET_JSON}}
Notice the key testability trick: explicit structure. It’s much easier to assert “the output contains ## Next actions and 3–6 bullets” than to assert “it feels good.”
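The only templating the harness needs is placeholder substitution for `{{TICKET_JSON}}`. A minimal `renderTemplate` could look like this (the name matches the helper the runner imports from `utils.mjs`, but this implementation is a sketch of my own, not the article’s canonical one):

```javascript
// Minimal {{NAME}} substitution. Throws on missing variables so a typo in
// the template fails loudly instead of sending "{{TICKET_JSON}}" to the model.
export function renderTemplate(template, vars) {
  return template.replace(/\{\{(\w+)\}\}/g, (_, name) => {
    if (!(name in vars)) throw new Error(`Missing template variable: ${name}`);
    return String(vars[name]);
  });
}
```

Failing loudly on a missing variable is deliberate: a silently unfilled placeholder is exactly the kind of regression a harness exists to catch.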
## Fixture inputs (pick the weird ones on purpose)
A good fixture set is not “average.” It’s the edge cases that break you in production:
- missing fields
- long rambly text
- conflicting signals
- angry tone
- a ticket that contains PII you want removed
Example fixture (`fixtures/ticket_messy.json`):
```json
{
  "id": "SUP-18422",
  "subject": "Invoice export fails sometimes",
  "body": "Hey, it fails w/ 500. Started after last update. Stacktrace pasted below... also my email is jane.doe@example.com.",
  "environment": "EU-West",
  "attachments": ["stacktrace.txt"],
  "reported_by": "Jane Doe"
}
```
## Checks: assertions + rubrics (not vibes)
Golden files are great for stable formats, but sometimes you want flexible checks.
Start with these cheap, high-signal assertions:
- Structure checks: required headings present.
- Length checks: Summary is <= 2 sentences.
- Classification checks: Impact is one of the allowed values.
- Safety checks: no emails/phone numbers.
In code (`harness/checks.mjs`):

```js
import assert from "node:assert/strict";

// Return the body of a "## Heading" section, up to the next "## " heading.
function section(markdown, heading) {
  const body = markdown.split(heading)[1] ?? "";
  const next = body.indexOf("\n## ");
  return next === -1 ? body : body.slice(0, next);
}

export function checks(markdown) {
  assert(markdown.includes("## Summary"), "missing ## Summary");
  assert(markdown.includes("## Next actions"), "missing ## Next actions");
  assert(/## Impact\s+(low|medium|high)\b/i.test(markdown), "invalid Impact value");
  assert(!/\b\S+@\S+\b/.test(markdown), "output contains an email address");
  const nextActions = section(markdown, "## Next actions");
  const bullets = nextActions.split("\n").filter(l => l.trim().startsWith("- "));
  assert(bullets.length >= 3 && bullets.length <= 6, "Next actions needs 3–6 bullets");
}
```
This already catches the most common regressions.
## Golden files: use diffs when format matters
For structured Markdown outputs, I like golden files because diffs are readable.
Workflow:
- Run the harness → write outputs to `out/`.
- If changes are intentional → copy them to `expected/`.
- If changes are accidental → revert the prompt change.
That’s the same muscle memory as snapshot testing.
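That workflow can be wrapped in one small helper. This is a sketch of one way to do it, not the exact logic in `run.mjs`; the `compareGolden` name and the opt-in `update` flag are my assumptions:

```javascript
import fs from "node:fs";

// Golden-file step: always write the latest output to out/ for inspection;
// only overwrite expected/ when the caller explicitly opts in.
export function compareGolden(name, actual, { update = false } = {}) {
  fs.mkdirSync("out", { recursive: true });
  fs.mkdirSync("expected", { recursive: true });
  fs.writeFileSync(`out/${name}`, actual);
  const expectedPath = `expected/${name}`;
  if (update || !fs.existsSync(expectedPath)) {
    fs.writeFileSync(expectedPath, actual); // intentionally accept the new output
    return true;
  }
  return fs.readFileSync(expectedPath, "utf8") === actual;
}
```

When it returns `false`, a plain `diff expected/x.md out/x.md` shows you exactly what the prompt change did.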
## Determinism: make runs comparable
You won’t get perfect determinism from a model, but you can get “stable enough”:
- keep temperature low (0–0.3)
- fix your system prompt + template
- avoid instructions like “be creative”
- normalize whitespace before comparing
Even if the wording shifts slightly, your checks should stay stable.
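Whitespace normalization is cheap to implement. Here is one possible `normalize()` (the helper `utils.mjs` is assumed to export; the exact rules are my choice):

```javascript
// Normalize line endings, strip trailing whitespace, collapse runs of blank
// lines, and end with a single newline, so cosmetic drift in the model's
// output doesn't trip golden-file diffs.
export function normalize(text) {
  return text
    .replace(/\r\n/g, "\n")        // CRLF -> LF
    .replace(/[ \t]+$/gm, "")      // strip trailing spaces/tabs per line
    .replace(/\n{3,}/g, "\n\n")    // collapse 3+ newlines to one blank line
    .trim() + "\n";
}
```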
## A tiny Node runner (good enough for CI)
Here’s a minimal runner (`harness/run.mjs`), conceptually:
```js
import fs from "node:fs";
import path from "node:path";
import { checks } from "./checks.mjs";
import { renderTemplate, callModel, normalize } from "./utils.mjs";

const prompt = fs.readFileSync("prompts/summarize_ticket.md", "utf8");
const fixturesDir = "fixtures";
const expectedDir = "expected";

const files = fs.readdirSync(fixturesDir).filter(f => f.endsWith(".json"));
let failed = 0;

for (const file of files) {
  const input = fs.readFileSync(path.join(fixturesDir, file), "utf8");
  const rendered = renderTemplate(prompt, { TICKET_JSON: input });
  const output = await callModel(rendered, { temperature: 0.2 });
  const md = normalize(output);

  // Assertions
  try {
    checks(md);
  } catch (e) {
    console.error(`❌ ${file}: ${e.message}`);
    failed++;
  }

  // Optional golden diff
  const expectedPath = path.join(expectedDir, file.replace(".json", ".md"));
  if (fs.existsSync(expectedPath)) {
    const expected = normalize(fs.readFileSync(expectedPath, "utf8"));
    if (expected !== md) {
      console.error(`⚠️ ${file}: output differs from expected`);
      failed++;
    }
  }
}

process.exit(failed ? 1 : 0);
```
You can run this locally, and you can run it in CI. The point isn’t perfection—it’s fast feedback.
## The real win: safe iteration
Once you have a harness, you can refactor prompts aggressively:
- tighten structure
- add new requirements
- improve tone
- handle new edge cases
…and you’ll know immediately what you broke.
If you want to level this up, add:
- a “stress fixture” with very long input
- a cost/time budget per run
- a rubric scorer that grades (0–2) for key qualities
- a `--update` flag that updates golden files intentionally
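A rubric scorer is the fuzziest of these, so here is a sketch of the shape it might take: ask a second model call to grade the output 0–2 on a few qualities and average the result. The rubric names, the prompt wording, and the `callModel` dependency are all illustrative assumptions:

```javascript
// Grade an output 0–2 on each rubric quality via a second model call, then
// average. callModel is injected so it can be mocked in tests.
export async function rubricScore(output, callModel) {
  const qualities = ["clarity", "actionability", "safety"];
  const prompt =
    `Grade the following summary from 0 to 2 on each of: ${qualities.join(", ")}.\n` +
    `Reply with JSON only, e.g. {"clarity": 2, "actionability": 1, "safety": 2}.\n\n` +
    output;
  const scores = JSON.parse(await callModel(prompt, { temperature: 0 }));
  return qualities.reduce((sum, q) => sum + scores[q], 0) / qualities.length;
}
```

Because the grader is itself a model, treat scores as trend signals (did this change make things worse on average?) rather than hard pass/fail assertions.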
## A simple rule of thumb
If a prompt matters enough to version, it matters enough to test.
Build a harness once. Then treat your prompts like code: small changes, quick feedback, and confident shipping.