DEV Community

Vladimir


How do you test your LLM agents before shipping changes?

Genuinely curious how other engineers are handling this.

Every time I change a prompt, swap a model, or tweak a tool, I struggle to get a reliable answer to a simple question: did the agent get better or worse overall?

The challenge I keep hitting is that aggregate metrics (average success rate, total tokens) usually look fine, but specific task types silently break. The easy tasks improve, masking the regressions on the hard ones. By the time someone notices, it's already in production.
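To make that failure mode concrete, here's a toy illustration (every number and task name is invented) of how an aggregate success rate can stay perfectly flat while a specific task type regresses:

```python
# Hypothetical eval results: 1 = task succeeded, 0 = task failed.
# All numbers are made up for illustration.
baseline = {"easy_lookup": [1] * 90 + [0] * 10,   # 90% success
            "multi_step": [1] * 60 + [0] * 40}    # 60% success
candidate = {"easy_lookup": [1] * 98 + [0] * 2,   # easy tasks improved
             "multi_step": [1] * 52 + [0] * 48}   # hard tasks regressed

def rate(runs):
    return sum(runs) / len(runs)

for name, runs in (("baseline", baseline), ("candidate", candidate)):
    overall = rate([r for task in runs.values() for r in task])
    per_task = {k: round(rate(v), 2) for k, v in runs.items()}
    print(f"{name}: overall={overall:.2f} per_task={per_task}")
```

Both versions land at 0.75 overall, so a dashboard showing only the aggregate looks clean even though multi_step dropped from 0.60 to 0.52.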

Here’s what I tried before landing on something that actually worked:

  • LLM-as-judge scoring: Too inconsistent between runs. Hard to tell if a score change was real or just statistical noise.
  • Manual spot-checking: Useful early on, but didn't scale past ~10 task types.
  • Comparing trace-level metrics statistically: Looking at distributions of tokens, duration, and cost per specific task turned out to be the most reliable signal, enough so that I built my own tooling around it.
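The third approach can be sketched with a stdlib-only permutation test per task type. Everything here is a hypothetical illustration (the task names, token counts, and 0.05 threshold are invented, and a permutation test on mean tokens is just one possible statistic), not my actual tooling:

```python
import random

def permutation_pvalue(a, b, n_iter=10_000, seed=0):
    """Approximate two-sided permutation test on the difference of means.

    Under the null hypothesis that `a` and `b` come from the same
    distribution, shuffling labels should produce mean differences at
    least as large as the observed one about p of the time.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / n_iter

# Hypothetical per-task token counts from two agent versions.
baseline_tokens = {"easy_lookup": [410, 395, 420, 405, 398, 415],
                   "multi_step": [1200, 1180, 1250, 1210, 1190, 1230]}
candidate_tokens = {"easy_lookup": [408, 398, 418, 404, 400, 412],
                    "multi_step": [1900, 1850, 1950, 1880, 1920, 1890]}

for task in baseline_tokens:
    p = permutation_pvalue(baseline_tokens[task], candidate_tokens[task])
    flag = "REGRESSION?" if p < 0.05 else "ok"
    print(f"{task}: p={p:.3f} {flag}")
```

The same idea extends to duration and cost distributions; the point is that the comparison happens per task type, so a blow-up in multi_step token usage can't hide behind stable easy_lookup numbers.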

What does your testing setup look like? Do you have CI gates that block deploys on agent regressions, or is it mostly manual review?
