I’ve been building with LLMs for a while now, and there’s one problem that keeps showing up — not in tutorials, not in docs, not in any framework README — but in production, at 2am.
- You change one line in your prompt.
- You test a few inputs manually.
- Everything looks fine.
- You ship.
Two days later, a user reports something broken.
And now you don’t know:
- which prompt change caused it
- when it broke
- how many users were affected
We solved this for code. Not for prompts.
When we write normal software, we have a clear feedback loop:
- write code
- run tests
- know immediately if something broke
Tools like pytest, CI pipelines, and assertions made this reliable.
We take this for granted.
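For an ordinary function, that loop is trivial. Here's a minimal sketch (the `slugify` function is just an invented example) of why a deterministic assertion settles the question instantly:

```python
def slugify(title: str) -> str:
    # Deterministic code: the same input always produces the same output.
    return title.lower().replace(" ", "-")

def test_slugify():
    # So an exact assertion is all the test needs.
    assert slugify("Hello World") == "hello-world"
```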
But with prompts?
We don’t have that loop.
We test manually.
Or worse — we don’t test at all.
Why prompt testing is fundamentally harder
The core issue is simple:
LLMs are not deterministic.
- The same input can produce different outputs
- Output formats can drift
- Small prompt changes can have large effects
- Model updates can silently break behavior
So this doesn’t work:
```python
assert output == "Paris"
```
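To see why, here's a hypothetical pytest case. `ask_llm` is a stand-in for your real model call, with the random phrasing simulating a model's non-determinism:

```python
import random

# Stand-in for a real LLM call; the random choice simulates how a real
# model can phrase the same correct answer differently on each run.
def ask_llm(prompt: str) -> str:
    return random.choice([
        "Paris",
        "Paris.",
        "The capital of France is Paris.",
    ])

def test_capital_exact_match():
    output = ask_llm("What is the capital of France?")
    # All three phrasings above are correct, but only one passes this assert.
    assert output == "Paris"
```

Run it a few times and it flips between passing and failing, even though the answer was never wrong.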
And because exact assertions don’t work…
👉 most developers just stop asserting anything.
What this looks like in production
These aren’t edge cases. These are normal failures:
- A support bot starts leaking internal information after a tone change
- A JSON output format shifts and silently breaks downstream systems
- A model upgrade changes structure and no one notices
- Bias creeps into outputs because no check was enforcing constraints
And sometimes the worst case:
nothing crashes — but the system quietly degrades
The real problem
We tried to apply traditional testing thinking to something that behaves differently.
Prompt outputs shouldn’t be tested for exact values.
They should be tested for behavior.
Things like (see the code sketch after this list):
- Does the output contain required information?
- Is the format valid (JSON, structure, etc.)?
- Does it avoid forbidden content?
- Does it stay within expected bounds (length, tone, constraints)?
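Here's a rough sketch of what those behavioral checks can look like as ordinary pytest assertions. Everything here is invented for illustration (`check_ticket_summary`, the field names, the forbidden phrases), not the API of any particular tool:

```python
import json

# Invented for this sketch: imagine a prompt that must return a support-ticket
# summary as JSON with "summary" and "priority" fields.
FORBIDDEN_PHRASES = ["internal use only", "api key"]

def check_ticket_summary(output: str) -> None:
    # Format: the output must be valid JSON with the fields we depend on.
    data = json.loads(output)
    assert "summary" in data and "priority" in data

    # Bounds: cap the length so downstream systems don't choke.
    assert len(data["summary"]) <= 500

    # Constraints: priority must come from a known set, not free text.
    assert data["priority"] in {"low", "medium", "high"}

    # Forbidden content: nothing internal should leak into user-facing text.
    lowered = data["summary"].lower()
    assert not any(phrase in lowered for phrase in FORBIDDEN_PHRASES)

def test_ticket_summary_behavior():
    # In a real test this would be a model call; hard-coded here so the
    # sketch runs on its own.
    output = '{"summary": "Customer cannot reset their password.", "priority": "high"}'
    check_ticket_summary(output)
```

None of these checks care about the exact wording. They only care whether the output behaves the way the rest of the system expects.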
I’ve been working on something to fix this — a way to test prompt behavior like we test code, but designed for how LLMs actually behave.
Releasing this Friday.
If this sounds familiar, I’d genuinely like to hear your experience — what’s the worst prompt bug you’ve shipped?
This feels like a gap in how we build with LLMs.