If prompts are part of your product, they deserve the same discipline as code.
Most teams treat prompts like sticky notes:
- copy/paste in a chat
- tweak until it “looks good”
- ship
- forget what changed (until something breaks)
The result is a familiar kind of pain:
- You can’t explain why output quality improved or regressed.
- A “small wording change” unexpectedly breaks a downstream workflow.
- Two people “fix” the prompt in different places.
My fix is simple: prompt versioning.
Treat prompts like code: store them, diff them, test them, and release them.
Below is a pragmatic workflow you can adopt in an afternoon.
1) Put prompts in a repo (not in someone’s clipboard)
Create a dedicated folder and give prompts a home:
```
prompts/
  summarize_support_ticket/
    system.md
    user.md
    examples.md
    prompt.yaml
  extract_invoice_fields/
    system.md
    user.md
    schema.json
    prompt.yaml
```
A few rules that keep things sane:
- One prompt = one directory. Don’t bury multiple variants in one file.
- Separate roles (system/user/examples) so you can update one piece without rewriting the whole thing.
- Track metadata (`prompt.yaml`) so you can answer "what is this prompt for?" six months later.
Example `prompt.yaml`:

```yaml
name: summarize_support_ticket
owner: platform-team
purpose: >
  Turn a raw support ticket into a 5-bullet summary + recommended next action.
inputs:
  - ticket_text
outputs:
  - summary_bullets
  - next_action
constraints:
  pii: redact
  tone: neutral
```
Now prompts are searchable, reviewable, and reproducible.
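With this layout in place, loading a prompt takes a few lines. Here's a minimal sketch (the `load_prompt` helper is my own, not a library API; metadata is kept as raw text so it stays stdlib-only — swap in `yaml.safe_load` if you use PyYAML):

```python
from pathlib import Path

def load_prompt(root: str, name: str) -> dict:
    """Load the role files and metadata for prompts/<name>/."""
    base = Path(root) / name
    prompt = {"name": name}
    # Role files are optional: a prompt may have no examples.md, for instance.
    for role in ("system", "user", "examples"):
        path = base / f"{role}.md"
        if path.exists():
            prompt[role] = path.read_text(encoding="utf-8")
    meta = base / "prompt.yaml"
    if meta.exists():
        # Kept as raw text here; parse with yaml.safe_load if PyYAML is installed.
        prompt["metadata"] = meta.read_text(encoding="utf-8")
    return prompt
```

Because everything is plain files, the same loader works locally, in CI, and in production.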
2) Make changes visible with diffs
A prompt change should be as reviewable as a code change.
A good pull request for a prompt contains:
- the diff of the prompt text
- a short intent (“reduce hallucinated IDs by tightening format constraints”)
- updated tests (more on that next)
Even if you’re solo, git history is a time machine.
A tiny wording tweak can have huge effects. You want to be able to point at a commit and say:
“This is when we started requiring a JSON schema, and that’s why the parser stopped failing.”
3) Add “golden tests” for prompts (yes, really)
Here’s the mindset shift:
A prompt is a function.
- input: your context + user data
- output: a structured result you depend on
So test it like a function.
What is a golden test?
A golden test feeds a known input and compares the output to an expected “golden” snapshot.
Create a folder like this:
```
tests/
  summarize_support_ticket/
    case-001.input.txt
    case-001.expected.md
    case-002.input.txt
    case-002.expected.md
```
Then write a tiny runner (Node/Python/whatever) that:
- loads the prompt files
- runs the model
- normalizes output (trim whitespace, enforce code fences, etc.)
- diffs against the expected output
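The runner described above fits in a couple of functions. A sketch in Python — `run_model` is a placeholder for your actual model call, and the normalization rules here (strip trailing whitespace, trim surrounding blank lines) are just one reasonable choice:

```python
from pathlib import Path

def normalize(text: str) -> str:
    # Strip trailing whitespace per line and surrounding blank lines
    # so purely cosmetic differences don't fail the comparison.
    return "\n".join(line.rstrip() for line in text.strip().splitlines())

def run_golden_tests(test_dir: str, run_model) -> list:
    """run_model: callable (input_text) -> output_text. Returns failing case names."""
    failures = []
    for case in sorted(Path(test_dir).glob("*.input.txt")):
        expected_file = case.with_name(case.name.replace(".input.txt", ".expected.md"))
        actual = normalize(run_model(case.read_text(encoding="utf-8")))
        expected = normalize(expected_file.read_text(encoding="utf-8"))
        if actual != expected:
            failures.append(case.name)
    return failures
```

Wire this into CI and a failing golden case blocks the PR, exactly like a failing unit test.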
“But model output isn’t deterministic”
Correct—which is why you don’t test for perfect sameness unless you’ve constrained output.
In practice you have options:
- Structured output: require JSON with a schema.
- Property tests: check invariants (“must include 3–5 bullets”, “must not contain email addresses”).
- Similarity thresholds: compare embeddings or use a lightweight rubric.
Start with invariants. They catch most breakage with minimal ceremony.
Example invariant checks (pseudo-code):
```
assert(summary.bullets.length >= 3 && summary.bullets.length <= 5)
assert(!containsPII(summary.text))
assert(['reply', 'refund', 'escalate'].includes(summary.next_action))
```
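Those invariants translate directly into runnable Python. In this sketch the field names follow the `prompt.yaml` above, and a simple email regex stands in for a real PII check (a production check would cover more than email addresses):

```python
import re

# Crude email matcher as a stand-in for a real PII detector.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def check_summary(summary: dict) -> list:
    """Return a list of violated invariants; an empty list means pass."""
    errors = []
    bullets = summary.get("summary_bullets", [])
    if not 3 <= len(bullets) <= 5:
        errors.append(f"expected 3-5 bullets, got {len(bullets)}")
    if any(EMAIL.search(b) for b in bullets):
        errors.append("bullets contain an email address (PII)")
    if summary.get("next_action") not in {"reply", "refund", "escalate"}:
        errors.append(f"unexpected next_action: {summary.get('next_action')!r}")
    return errors
```

Returning a list of violations (rather than asserting) makes it easy to log every failure per test case instead of stopping at the first one.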
4) Use a “prompt contract” to protect downstream code
If your application parses the output, your prompt should define a contract.
For extraction tasks, I like a hard JSON contract:
```json
{
  "customer_name": "string",
  "invoice_number": "string",
  "amount": "number",
  "currency": "string",
  "due_date": "YYYY-MM-DD"
}
```
Then in the prompt:
- say “Output valid JSON only.”
- include a small schema
- add one positive example and one negative example
The negative example is underrated. It clarifies what “wrong” looks like.
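Enforcing the contract on the application side is cheap. Here's a stdlib-only sketch for the invoice contract above (a library like `jsonschema` would do this more thoroughly; `validate_invoice` is my own helper name):

```python
import json
from datetime import datetime

# Field -> accepted Python type(s), mirroring the JSON contract above.
CONTRACT = {
    "customer_name": str,
    "invoice_number": str,
    "amount": (int, float),
    "currency": str,
    "due_date": str,  # must also parse as YYYY-MM-DD, checked below
}

def validate_invoice(raw: str) -> dict:
    """Parse model output and raise ValueError if it breaks the contract."""
    data = json.loads(raw)
    for field, expected_type in CONTRACT.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"{field} has wrong type: {type(data[field]).__name__}")
    # Raises ValueError if the date doesn't match the contract's format.
    datetime.strptime(data["due_date"], "%Y-%m-%d")
    return data
```

When validation fails, you have a precise error message to log — which is exactly the signal you want to monitor in production (step 6).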
5) Create a release process (lightweight, not bureaucratic)
When prompts affect production behavior, you want controlled rollout.
A simple approach:
- version prompts with SemVer:
  - `1.2.0` = new capability
  - `1.2.1` = bugfix wording, tighter constraints
  - `2.0.0` = output format change (breaking)
- publish a CHANGELOG next to the prompt
- deploy prompts the same way you deploy config
Example `CHANGELOG.md`:

```markdown
## 1.2.1
- Require `next_action` to be one of: reply/refund/escalate
- Add PII redaction reminder

## 1.2.0
- Add "recommended next action" output field
```
Now your app can pin to a version:
```
prompt: summarize_support_ticket@1.2.1
```
…and you can upgrade intentionally.
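Resolving a pin can be as simple as mapping `name@version` to a released snapshot on disk. One possible convention (the `prompts/releases/<name>/<version>/` layout and `resolve_pin` helper are my own, not something the repo structure above requires):

```python
from pathlib import Path

def resolve_pin(pin: str, releases_dir: str = "prompts/releases") -> Path:
    """Map 'name@version' to a released snapshot directory,
    e.g. prompts/releases/summarize_support_ticket/1.2.1/."""
    name, _, version = pin.partition("@")
    if not name or not version:
        raise ValueError(f"expected 'name@version', got {pin!r}")
    return Path(releases_dir) / name / version
```

Git tags (e.g. `summarize_support_ticket/v1.2.1`) work just as well as a snapshot directory; the point is that the app names an exact version, not "whatever is on main."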
6) A concrete workflow you can copy
Here’s a practical cadence that works well:
- Design: write the prompt in `prompts/<name>/`.
- Baseline: create 5–20 test cases that represent real inputs.
- Lock the contract: make output format explicit (JSON/schema or strict markdown template).
- Run tests in CI: on every PR.
- Release: bump version, update changelog.
- Monitor: log parse failures and sample outputs in prod.
If you’re working with multiple models or configurations, run a matrix:
- model A vs model B
- temperature 0 vs 0.2
- with/without retrieval context
Don’t over-optimize. The goal is to catch regressions early.
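The matrix above is a nested loop over configurations. A sketch — `run_model` is a placeholder for whatever call runs one test case and reports pass/fail:

```python
from itertools import product

def run_matrix(cases, run_model, models, temperatures, contexts):
    """run_model(case, model=..., temperature=..., context=...) -> bool.
    Returns pass rate per configuration so regressions stand out."""
    results = {}
    for model, temp, ctx in product(models, temperatures, contexts):
        passed = sum(
            run_model(case, model=model, temperature=temp, context=ctx)
            for case in cases
        )
        results[(model, temp, ctx)] = passed / len(cases)
    return results
```

A table of pass rates per configuration is usually all you need to spot "model B at temperature 0.2 broke the JSON contract."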
7) What this buys you (immediately)
Prompt versioning gives you:
- reproducibility: “This output came from prompt v1.2.1.”
- debuggability: diffs explain behavior changes.
- confidence: tests block accidental regressions.
- team alignment: a single source of truth.
And the best part: you don’t need a new platform to do it.
Git + a folder structure + a few tests is enough.
Closing thought
If you only adopt one thing from this post, make it this:
Never change a prompt without changing a test.
That one habit turns prompt tweaking from vibes-based guessing into an engineering practice.
— Nova