Rohan Rajesh Khairnar
You can’t test prompts like code - and it’s breaking real systems

I’ve been building with LLMs for a while now, and there’s one problem that keeps showing up — not in tutorials, not in docs, not in any framework README — but in production, at 2am.

  • You change one line in your prompt.
  • You test a few inputs manually.
  • Everything looks fine.
  • You ship.

Two days later, a user reports something broken.

And now you don’t know:

  • which prompt change caused it
  • when it broke
  • how many users were affected

We solved this for code. Not for prompts.

When we write normal software, we have a clear feedback loop:

  • write code
  • run tests
  • know immediately if something broke

Tools like pytest, CI pipelines, and assertions made this reliable.
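That loop can be sketched in a few lines. This is a minimal, illustrative example (the function and its lookup table are made up for the sketch) of why exact assertions work for ordinary code: the function is deterministic, so the test either passes or fails immediately.

```python
# Deterministic code: same input, same output, every time.
def capital_of(country: str) -> str:
    lookup = {"France": "Paris", "Japan": "Tokyo"}
    return lookup[country]

def test_capital_of():
    # An exact-value assertion is reliable here, because nothing drifts.
    assert capital_of("France") == "Paris"

test_capital_of()
print("tests passed")
```

Run it under pytest (or directly, as here) and you know instantly whether the change broke anything. That instant, binary signal is exactly what's missing for prompts.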

We take this for granted.

But with prompts?

We don’t have that loop.

We test manually.
Or worse — we don’t test at all.


Why prompt testing is fundamentally harder

The core issue is simple:

LLMs are not deterministic.

  • The same input can produce different outputs
  • Output formats can drift
  • Small prompt changes can have large effects
  • Model updates can silently break behavior

So this doesn’t work:

```python
assert output == "Paris"
```

And because exact assertions don’t work…

👉 most developers just stop asserting anything.


What this looks like in production

These aren’t edge cases. These are normal failures:

  • A support bot starts leaking internal information after a tone change
  • A JSON output format shifts and silently breaks downstream systems
  • A model upgrade changes structure and no one notices
  • Bias creeps into outputs because no rule was enforcing constraints

And sometimes the worst case:

nothing crashes — but the system quietly degrades


The real problem

We tried to apply traditional testing thinking to something that behaves differently.

Prompt outputs shouldn’t be tested for exact values.

They should be tested for behavior.

Things like:

  • Does the output contain required information?
  • Is the format valid (JSON, structure, etc.)?
  • Does it avoid forbidden content?
  • Does it stay within expected bounds (length, tone, constraints)?
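The checks above can be expressed as ordinary code. Here's a minimal sketch of what behavioral assertions might look like; the function name `check_output` and its parameters are my own illustration, not an existing library:

```python
import json

def check_output(output: str,
                 required: list[str],
                 forbidden: list[str],
                 max_length: int) -> list[str]:
    """Return a list of behavioral failures instead of asserting exact text.

    Hypothetical sketch: checks format validity, required/forbidden
    content, and length bounds rather than an exact string match.
    """
    failures = []

    # 1. Format: the output must parse as JSON
    try:
        json.loads(output)
    except json.JSONDecodeError:
        failures.append("output is not valid JSON")

    # 2. Required information must appear somewhere in the output
    for term in required:
        if term.lower() not in output.lower():
            failures.append(f"missing required term: {term}")

    # 3. Forbidden content must be absent
    for term in forbidden:
        if term.lower() in output.lower():
            failures.append(f"contains forbidden term: {term}")

    # 4. Bounds: stay within an expected length
    if len(output) > max_length:
        failures.append(f"output exceeds {max_length} characters")

    return failures

# The behavior passes even though the exact wording can vary run to run.
out = '{"answer": "The capital of France is Paris."}'
print(check_output(out, required=["Paris"], forbidden=["internal"], max_length=200))
```

The key design choice: the test asserts properties of the output, not its exact value, so it survives the natural variation between LLM runs while still catching format drift, leaks, and constraint violations.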

I’ve been working on something to fix this — a way to test prompt behavior like we test code, but designed for how LLMs actually behave.

Releasing this Friday.

If this sounds familiar, I’d genuinely like to hear your experience — what’s the worst prompt bug you’ve shipped?

This feels like a gap in how we build with LLMs.
