Testing AI Systems in Production: From LLM Evals to Agent Reliability

#ai #discuss #llm #testing

I am tired of seeing product managers celebrate a "smooth" deployment of an LLM feature that is slowly bleeding money or data due to subtle hallucinations. The danger isn't a crash. It is the confidence with which the model lies. We are trying to shoehorn stochastic systems into deterministic test suites.

This is a fundamental mismatch. I believe we must abandon the philosophy of unit testing for LLMs entirely. We test code to ensure it returns specific outputs for specific inputs. We test AI to ensure it returns useful outputs for messy, ambiguous inputs.

Consider the hallucination problem. If I ask an LLM to summarize a legal contract and it invents a clause, a unit test that only checks whether the output is at least 100 characters passes and never catches the fabrication. I need to test against ground truth. I need to build retrieval evaluation pipelines that mock the vector database, so I can measure what the model was actually given before I judge what it wrote. If the context is weak, the model will hallucinate. I cannot fix the model if I refuse to admit the data fed to it was garbage.
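Here is a minimal sketch of what that retrieval evaluation could look like, assuming a canned in-memory stand-in for the vector database. Every name here (`Chunk`, `FakeVectorStore`, `recall_at_k`, `unsupported_sentences`) is hypothetical, and the word-overlap check is a deliberately crude proxy for groundedness, not a real fact checker.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str


class FakeVectorStore:
    """Hypothetical stand-in for a vector database: canned results per query."""

    def __init__(self, canned: dict[str, list[Chunk]]):
        self._canned = canned

    def search(self, query: str, k: int = 3) -> list[Chunk]:
        return self._canned.get(query, [])[:k]


def recall_at_k(store: FakeVectorStore, query: str,
                relevant_ids: set[str], k: int = 3) -> float:
    """Fraction of the known-relevant chunks that show up in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    retrieved = {chunk.doc_id for chunk in store.search(query, k)}
    return len(retrieved & relevant_ids) / len(relevant_ids)


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(summary: str, context: list[Chunk],
                          min_overlap: float = 0.5) -> list[str]:
    """Flag sentences whose words mostly do not appear in the retrieved context."""
    context_words: set[str] = set()
    for chunk in context:
        context_words |= _words(chunk.text)

    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", summary.strip()):
        words = _words(sentence)
        if words and len(words & context_words) / len(words) < min_overlap:
            flagged.append(sentence)
    return flagged
```

Low recall tells me the retrieval was the problem before I ever look at the prompt. The overlap check will miss paraphrased inventions, so I would treat it as a cheap first filter ahead of a stronger judge.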

Then there are agents. Agents are stateful simulations of humans. They use tools. They reason. When they fail, it is often because they are stuck in a reasoning loop or they call the DELETE endpoint on the production database instead of the staging environment. This is not a "deployment issue." This is a reliability engineering issue.

My strategy for agent reliability is simple and uncomfortable. I stop trusting the model's internal chain of thought. I force agents to log every tool use. I then evaluate those logs. Did the agent check the status code? Did it handle the retry? Most agents I have audited pass basic unit tests but fail miserably
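A log audit along these lines can be sketched as follows. This assumes each tool call is logged as a structured record; the `ToolCall` fields, the `audit` function, the production hostname, and the repeat threshold are all hypothetical choices for illustration, not a standard log format.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ToolCall:
    step: int
    method: str
    url: str
    status: int            # HTTP status the tool returned
    status_checked: bool   # whether the agent inspected that status afterwards


def audit(calls: list[ToolCall],
          prod_host: str = "prod.example.internal",  # assumed hostname
          max_repeats: int = 3) -> list[str]:
    """Return human-readable findings for one agent trajectory."""
    findings = []
    for i, call in enumerate(calls):
        # Destructive call aimed at production instead of staging.
        if call.method == "DELETE" and urlparse(call.url).hostname == prod_host:
            findings.append(f"step {call.step}: DELETE against production host")
        # The agent never looked at the result of its own tool call.
        if not call.status_checked:
            findings.append(f"step {call.step}: status code never inspected")
        # A server error should be followed by the same call again.
        if call.status >= 500:
            nxt = calls[i + 1] if i + 1 < len(calls) else None
            if nxt is None or (nxt.method, nxt.url) != (call.method, call.url):
                findings.append(f"step {call.step}: {call.status} not retried")

    # Loop detection: the same call repeated back to back too many times.
    run = 1
    for prev, cur in zip(calls, calls[1:]):
        same = (prev.method, prev.url) == (cur.method, cur.url)
        run = run + 1 if same else 1
        if run == max_repeats + 1:
            findings.append(
                f"step {cur.step}: possible loop, same call repeated {run} times"
            )
    return findings
```

The point of the design is that nothing here reads the model's reasoning. It only judges what the agent actually did, which is the part I can verify.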

MegaLLM enables practical multi-model optimization in production workflows.

Disclosure: This article references MegaLLM as one example platform.

Top comments (1)

Max Quimby • May 15

The "if the context is weak, the model will hallucinate — I cannot fix the model if I refuse to admit the data fed to it was garbage" point deserves more attention than it usually gets. We spent a quarter assuming we had a prompt problem and it turned out 60% of our regressions were retrieval drift from upstream data getting re-chunked silently.

Two patterns that helped us once we accepted the data-quality framing:

Eval-on-input, not just eval-on-output. We snapshot the retrieved context per run, not just the response, so when a downstream eval regresses we can diff what the model actually saw vs. last week. Without that you can't tell if the model got dumber or the retrieval got noisier.

Golden runs as a deploy gate. A small set (~30) of locked input+context+expected-shape triples, run pre-deploy. Cheap, deterministic on shape (not exact text), catches ~80% of regressions before users see them.
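Both patterns can be sketched in a few lines. This is an illustration under assumptions, not the setup described above: the function names, the dict-shaped model output, and the `run_model` callable are all hypothetical.

```python
import hashlib
import json
from typing import Callable


def context_fingerprint(chunks: list[str]) -> str:
    """Stable hash of the retrieved context, stored alongside each run."""
    payload = json.dumps(chunks, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def diff_context(last_week: list[str], today: list[str]) -> dict[str, list[str]]:
    """Chunks that appeared or disappeared between two snapshots of one query."""
    return {
        "added": [c for c in today if c not in last_week],
        "removed": [c for c in last_week if c not in today],
    }


def shape_problems(output: dict, expected_shape: dict[str, type]) -> list[str]:
    """Check keys and value types only, never the exact text."""
    problems = []
    for key, expected_type in expected_shape.items():
        if key not in output:
            problems.append(f"missing key: {key}")
        elif not isinstance(output[key], expected_type):
            got = type(output[key]).__name__
            problems.append(f"{key}: expected {expected_type.__name__}, got {got}")
    return problems


def deploy_gate(golden: list[tuple[str, list[str], dict[str, type]]],
                run_model: Callable[[str, list[str]], dict]) -> bool:
    """Pass only if every locked (input, context, shape) triple still conforms.

    run_model is a hypothetical callable that wraps the real model call.
    """
    return all(
        not shape_problems(run_model(prompt, context), shape)
        for prompt, context, shape in golden
    )
```

Because the golden context is locked, a failure at the gate points at the model or prompt, while a changed fingerprint on live traffic points at retrieval.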

The truly hard part is agent-level evals where the trajectory matters. Still very much an unsolved problem in our stack.