I’ve been building with LLMs for a while now, and there’s one problem that keeps showing up — not in tutorials, not in docs, not in any framework README — but in production, at 2am.
- You change one line in your prompt.
- You test a few inputs manually.
- Everything looks fine.
- You ship.
Two days later, a user reports something broken.
And now you don’t know:
- which prompt change caused it
- when it broke
- how many users were affected
We solved this for code. Not for prompts.
When we write normal software, we have a clear feedback loop:
- write code
- run tests
- know immediately if something broke
Tools like pytest, CI pipelines, and assertions made this reliable.
We take this for granted.
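For an ordinary function, that loop is trivial. Here's a minimal sketch (the `slugify` function is just an invented example) of why a deterministic assertion settles the question instantly:

```python
def slugify(title: str) -> str:
    # Deterministic code: the same input always produces the same output.
    return title.lower().replace(" ", "-")

def test_slugify():
    # So an exact assertion is all the test needs.
    assert slugify("Hello World") == "hello-world"
```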
But with prompts?
We don’t have that loop.
We test manually.
Or worse — we don’t test at all.
Why prompt testing is fundamentally harder
The core issue is simple:
LLMs are not deterministic.
- The same input can produce different outputs
- Output formats can drift
- Small prompt changes can have large effects
- Model updates can silently break behavior
So this doesn’t work:
```python
assert output == "Paris"
```
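To see why, here's a hypothetical pytest case. `ask_llm` is a stand-in for your real model call, with the random phrasing simulating a model's non-determinism:

```python
import random

# Stand-in for a real LLM call; the random choice simulates how a real
# model can phrase the same correct answer differently on each run.
def ask_llm(prompt: str) -> str:
    return random.choice([
        "Paris",
        "Paris.",
        "The capital of France is Paris.",
    ])

def test_capital_exact_match():
    output = ask_llm("What is the capital of France?")
    # All three phrasings above are correct, but only one passes this assert.
    assert output == "Paris"
```

Run it a few times and it flips between passing and failing, even though the answer was never wrong.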
And because exact assertions don’t work…
👉 most developers just stop asserting anything.
What this looks like in production
These aren’t edge cases. These are normal failures:
- A support bot starts leaking internal information after a tone change
- A JSON output format shifts and silently breaks downstream systems
- A model upgrade changes structure and no one notices
- Bias creeps into outputs because no check was enforcing constraints
And sometimes the worst case:
nothing crashes — but the system quietly degrades
The real problem
We tried to apply traditional testing thinking to something that behaves differently.
Prompt outputs shouldn’t be tested for exact values.
They should be tested for behavior.
Things like (see the code sketch after this list):
- Does the output contain required information?
- Is the format valid (JSON, structure, etc.)?
- Does it avoid forbidden content?
- Does it stay within expected bounds (length, tone, constraints)?
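Here's a rough sketch of what those behavioral checks can look like as ordinary pytest assertions. Everything here is invented for illustration (`check_ticket_summary`, the field names, the forbidden phrases), not the API of any particular tool:

```python
import json

# Invented for this sketch: imagine a prompt that must return a support-ticket
# summary as JSON with "summary" and "priority" fields.
FORBIDDEN_PHRASES = ["internal use only", "api key"]

def check_ticket_summary(output: str) -> None:
    # Format: the output must be valid JSON with the fields we depend on.
    data = json.loads(output)
    assert "summary" in data and "priority" in data

    # Bounds: cap the length so downstream systems don't choke.
    assert len(data["summary"]) <= 500

    # Constraints: priority must come from a known set, not free text.
    assert data["priority"] in {"low", "medium", "high"}

    # Forbidden content: nothing internal should leak into user-facing text.
    lowered = data["summary"].lower()
    assert not any(phrase in lowered for phrase in FORBIDDEN_PHRASES)

def test_ticket_summary_behavior():
    # In a real test this would be a model call; hard-coded here so the
    # sketch runs on its own.
    output = '{"summary": "Customer cannot reset their password.", "priority": "high"}'
    check_ticket_summary(output)
```

None of these checks care about the exact wording. They only care whether the output behaves the way the rest of the system expects.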
I’ve been working on something to fix this — a way to test prompt behavior like we test code, but designed for how LLMs actually behave.
Releasing this Friday.
If this sounds familiar, I’d genuinely like to hear your experience — what’s the worst prompt bug you’ve shipped?
This feels like a gap in how we build with LLMs.