If you treat prompts as “just text,” you’ll keep shipping the same kind of breakages:
- A small wording change makes outputs longer/shorter.
- A new requirement causes the model to ignore an old one.
- You “improve” a prompt for one case and quietly regress three others.
In code, we solved this with unit tests, fixtures, and diffs. You can do the same for prompts.
This post is a practical pattern I use: a prompt harness—a tiny test runner that executes a prompt against a set of example inputs and compares the output to expectations.
## What a prompt harness is (in one sentence)
A prompt harness is a repeatable way to run:
- the same prompt
- against a known set of inputs (fixtures)
- under the same settings
- and evaluate the outputs with checks (assertions, rubrics, or golden-file diffs).
You’re building a safety net so you can iterate quickly without guessing.
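The steps above can be sketched as a loop (pseudocode; the names are illustrative, and a concrete runner appears later in this post):

```
for each fixture in fixtures:
    output = run_model(render(prompt, fixture), settings)
    evaluate(output, expectations[fixture])
```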
## When this is worth it
Use a harness when your prompt is:
- part of a workflow (triage, summarization, tagging, refactoring, doc generation)
- used by more than one person
- run on more than one kind of input
- something you expect to iterate on
If it’s a one-off question, skip it. If it’s infrastructure, test it.
## The minimal setup
Create a folder like this:
```
/prompts
  summarize_ticket.md
/fixtures
  ticket_short.json
  ticket_messy.json
  ticket_angry_customer.json
/expected
  ticket_short.md
  ticket_messy.md
  ticket_angry_customer.md
/harness
  run.mjs
  checks.mjs
```
The concept:
- `prompts/` contains the prompt template you’re actually using.
- `fixtures/` contains representative inputs.
- `expected/` contains “golden” outputs (or a scoring config).
- `harness/` runs the matrix: each fixture through the prompt.
## A concrete example: “support ticket summarizer”
Here’s a prompt template (`prompts/summarize_ticket.md`):
You are an assistant that turns raw support tickets into a triage-friendly summary.
Return Markdown with exactly these sections:
## Summary
(1–2 sentences)
## Impact
(one of: low | medium | high)
## Suspected root cause
(bullets, include uncertainties)
## Next actions
(3–6 bullets, ordered)
Constraints:
- Do not include private customer data (emails, phone numbers)
- If information is missing, say what you need
Ticket:
{{TICKET_JSON}}
Notice the key testability trick: explicit structure. It’s much easier to assert “the output contains ## Next actions and 3–6 bullets” than to assert “it feels good.”
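The only templating the harness needs is placeholder substitution for `{{TICKET_JSON}}`. A minimal `renderTemplate` could look like this (the name matches the helper the runner imports from `utils.mjs`, but this implementation is a sketch of my own, not the article’s canonical one):

```javascript
// Minimal {{NAME}} substitution. Throws on missing variables so a typo in
// the template fails loudly instead of sending "{{TICKET_JSON}}" to the model.
export function renderTemplate(template, vars) {
  return template.replace(/\{\{(\w+)\}\}/g, (_, name) => {
    if (!(name in vars)) throw new Error(`Missing template variable: ${name}`);
    return String(vars[name]);
  });
}
```

Failing loudly on a missing variable is deliberate: a silently unfilled placeholder is exactly the kind of regression a harness exists to catch.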
## Fixture inputs (pick the weird ones on purpose)
A good fixture set is not “average.” It’s the edge cases that break you in production:
- missing fields
- long rambly text
- conflicting signals
- angry tone
- a ticket that contains PII you want removed
Example fixture (`fixtures/ticket_messy.json`):
```json
{
  "id": "SUP-18422",
  "subject": "Invoice export fails sometimes",
  "body": "Hey, it fails w/ 500. Started after last update. Stacktrace pasted below... also my email is jane.doe@example.com.",
  "environment": "EU-West",
  "attachments": ["stacktrace.txt"],
  "reported_by": "Jane Doe"
}
```
## Checks: assertions + rubrics (not vibes)
Golden files are great for stable formats, but sometimes you want flexible checks.
Start with these cheap, high-signal assertions:
- Structure checks: required headings present.
- Length checks: Summary is <= 2 sentences.
- Classification checks: Impact is one of the allowed values.
- Safety checks: no emails/phone numbers.
In code (`harness/checks.mjs`):

```js
import assert from "node:assert/strict";

// Return the body of a "## Heading" section, up to the next "## " heading.
function section(markdown, heading) {
  const body = markdown.split(heading)[1] ?? "";
  const next = body.indexOf("\n## ");
  return next === -1 ? body : body.slice(0, next);
}

export function checks(markdown) {
  assert(markdown.includes("## Summary"), "missing ## Summary");
  assert(markdown.includes("## Next actions"), "missing ## Next actions");
  assert(/## Impact\s+(low|medium|high)\b/i.test(markdown), "invalid Impact value");
  assert(!/\b\S+@\S+\b/.test(markdown), "output contains an email address");
  const nextActions = section(markdown, "## Next actions");
  const bullets = nextActions.split("\n").filter(l => l.trim().startsWith("- "));
  assert(bullets.length >= 3 && bullets.length <= 6, "Next actions needs 3–6 bullets");
}
```
This already catches the most common regressions.
## Golden files: use diffs when format matters
For structured Markdown outputs, I like golden files because diffs are readable.
Workflow:
- Run the harness → write outputs to `out/`.
- If changes are intentional → copy them to `expected/`.
- If changes are accidental → revert the prompt change.
That’s the same muscle memory as snapshot testing.
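That workflow can be wrapped in one small helper. This is a sketch of one way to do it, not the exact logic in `run.mjs`; the `compareGolden` name and the opt-in `update` flag are my assumptions:

```javascript
import fs from "node:fs";

// Golden-file step: always write the latest output to out/ for inspection;
// only overwrite expected/ when the caller explicitly opts in.
export function compareGolden(name, actual, { update = false } = {}) {
  fs.mkdirSync("out", { recursive: true });
  fs.mkdirSync("expected", { recursive: true });
  fs.writeFileSync(`out/${name}`, actual);
  const expectedPath = `expected/${name}`;
  if (update || !fs.existsSync(expectedPath)) {
    fs.writeFileSync(expectedPath, actual); // intentionally accept the new output
    return true;
  }
  return fs.readFileSync(expectedPath, "utf8") === actual;
}
```

When it returns `false`, a plain `diff expected/x.md out/x.md` shows you exactly what the prompt change did.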
## Determinism: make runs comparable
You won’t get perfect determinism from a model, but you can get “stable enough”:
- keep temperature low (0–0.3)
- fix your system prompt + template
- avoid instructions like “be creative”
- normalize whitespace before comparing
Even if the wording shifts slightly, your checks should stay stable.
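Whitespace normalization is cheap to implement. Here is one possible `normalize()` (the helper `utils.mjs` is assumed to export; the exact rules are my choice):

```javascript
// Normalize line endings, strip trailing whitespace, collapse runs of blank
// lines, and end with a single newline, so cosmetic drift in the model's
// output doesn't trip golden-file diffs.
export function normalize(text) {
  return text
    .replace(/\r\n/g, "\n")        // CRLF -> LF
    .replace(/[ \t]+$/gm, "")      // strip trailing spaces/tabs per line
    .replace(/\n{3,}/g, "\n\n")    // collapse 3+ newlines to one blank line
    .trim() + "\n";
}
```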
## A tiny Node runner (good enough for CI)
Here’s a minimal runner (`harness/run.mjs`), conceptually:
```js
import fs from "node:fs";
import path from "node:path";
import { checks } from "./checks.mjs";
import { renderTemplate, callModel, normalize } from "./utils.mjs";

const prompt = fs.readFileSync("prompts/summarize_ticket.md", "utf8");
const fixturesDir = "fixtures";
const expectedDir = "expected";

const files = fs.readdirSync(fixturesDir).filter(f => f.endsWith(".json"));
let failed = 0;

for (const file of files) {
  const input = fs.readFileSync(path.join(fixturesDir, file), "utf8");
  const rendered = renderTemplate(prompt, { TICKET_JSON: input });
  const output = await callModel(rendered, { temperature: 0.2 });
  const md = normalize(output);

  // Assertions
  try {
    checks(md);
  } catch (e) {
    console.error(`❌ ${file}: ${e.message}`);
    failed++;
  }

  // Optional golden diff
  const expectedPath = path.join(expectedDir, file.replace(".json", ".md"));
  if (fs.existsSync(expectedPath)) {
    const expected = normalize(fs.readFileSync(expectedPath, "utf8"));
    if (expected !== md) {
      console.error(`⚠️ ${file}: output differs from expected`);
      failed++;
    }
  }
}

process.exit(failed ? 1 : 0);
```
You can run this locally, and you can run it in CI. The point isn’t perfection—it’s fast feedback.
## The real win: safe iteration
Once you have a harness, you can refactor prompts aggressively:
- tighten structure
- add new requirements
- improve tone
- handle new edge cases
…and you’ll know immediately what you broke.
If you want to level this up, add:
- a “stress fixture” with very long input
- a cost/time budget per run
- a rubric scorer that grades (0–2) for key qualities
- a `--update` flag that updates golden files intentionally
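A rubric scorer is the fuzziest of these, so here is a sketch of the shape it might take: ask a second model call to grade the output 0–2 on a few qualities and average the result. The rubric names, the prompt wording, and the `callModel` dependency are all illustrative assumptions:

```javascript
// Grade an output 0–2 on each rubric quality via a second model call, then
// average. callModel is injected so it can be mocked in tests.
export async function rubricScore(output, callModel) {
  const qualities = ["clarity", "actionability", "safety"];
  const prompt =
    `Grade the following summary from 0 to 2 on each of: ${qualities.join(", ")}.\n` +
    `Reply with JSON only, e.g. {"clarity": 2, "actionability": 1, "safety": 2}.\n\n` +
    output;
  const scores = JSON.parse(await callModel(prompt, { temperature: 0 }));
  return qualities.reduce((sum, q) => sum + scores[q], 0) / qualities.length;
}
```

Because the grader is itself a model, treat scores as trend signals (did this change make things worse on average?) rather than hard pass/fail assertions.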
## A simple rule of thumb
If a prompt matters enough to version, it matters enough to test.
Build a harness once. Then treat your prompts like code: small changes, quick feedback, and confident shipping.