Last month, I watched a senior engineer on my team stare at his screen in disbelief. Our AI coding assistant had just wiped out three tables in our staging database. He'd explicitly told it to "analyze the schema but make no changes." The AI's response after the fact? "I apologize for the catastrophic error. I misinterpreted your instruction."
This wasn't a bug in the traditional sense. Run the same prompt again with the same context, and the AI behaves perfectly. The failure was nondeterministic, which means it could happen again at any time, and we'd never know when or why.
If you're shipping LLM applications, this is your new reality. Your chatbot will occasionally hallucinate facts with complete confidence. Your RAG system will cite sources that don't exist. Your AI agents will ignore direct instructions in ways that feel almost willful. And unlike traditional software that fails the same way every time, LLM failures are maddeningly inconsistent.
The testing strategies we've relied on for decades don't work here. You can't write a unit test that expects one exact string, because an LLM rarely phrases its answer the same way twice. You need a fundamentally different approach.
TL;DR
Testing LLM applications requires semantic evaluation instead of exact matching because outputs are nondeterministic. Build a layered testing strategy starting with 25-30 test cases covering critical paths, use multiple evaluation methods (rule-based, semantic similarity, LLM-as-judge), and create a feedback loop where production failures become test cases. The goal isn't perfect coverage but systematic detection of failures before users encounter them.
What Most Teams Do
Most teams I talk to fall into one of two camps when it comes to testing their LLM applications.
The first group doesn't test at all. They treat the LLM like a black box that "just works" because it's from OpenAI or Anthropic. They manually spot-check outputs during development, maybe have a product manager review responses before launch, and then ship. When things break in production, they add a note to their prompt and hope that fixes it. This feels pragmatic because testing nondeterministic outputs seems impossible, so why bother?
The second group tries to apply traditional testing frameworks and gets stuck immediately. They write tests that check for exact string matches, which fail constantly because the LLM phrases things differently each time. Or they try to test "vibes" without defining what good actually means, leading to tests that pass when they shouldn't and fail when they shouldn't. After a few weeks of flaky tests that nobody trusts, the tests get disabled and they're back to manual spot-checking.
Both approaches feel reasonable in the moment. Testing something that gives different answers every time does seem impossible. And traditional testing frameworks are what we know, so of course we try to adapt them first. The problem is that neither approach actually catches the failures that matter before users see them.
I know this because we did both. When we first started building LLM features, we shipped without tests. We caught obvious problems during development, but users kept finding edge cases we hadn't considered. A customer support bot that was helpful 95% of the time would occasionally tell users to "just Google it" when it couldn't find an answer in our knowledge base. A document summarization feature would sometimes include information that wasn't in the source document at all.
We tried adding traditional tests. They failed constantly for reasons that didn't matter (the LLM used "however" instead of "but") and passed when they should have failed (the summary was coherent but factually wrong). After two sprints of fighting with our test suite, we disabled most of the tests and went back to manual review.
The core issue is that traditional testing assumes deterministic behavior. You define an input, specify the expected output, and verify they match. With LLMs, you need to define what "good" means semantically, not syntactically. "The summary should capture the key points" is testable, but it requires a different evaluation approach than "the summary should be exactly this string."
What We Did Instead
We rebuilt our testing strategy from scratch, starting with the assumption that we'd never get the same output twice. Instead of testing for exact matches, we tested whether outputs met our semantic criteria: Is this factually correct? Is the tone appropriate? Does it cite real sources? Does it refuse to answer prohibited questions?
The first step was building a test dataset when we had almost nothing to work with. We started small: 25 test cases covering our core workflows. Ten represented typical user queries we expected to see often. Ten were edge cases we knew would be tricky (ambiguous questions, requests outside our scope, attempts to manipulate the system). Five were adversarial examples designed to break things (prompt injection attempts, requests for prohibited information, queries that might cause hallucinations).
We sourced these test cases from four places. Production data gave us real user queries and actual failures we'd seen. We anonymized the PII and added them to our test suite. Domain expert input came from structured sessions with our product team and customer success team. We asked: "What's the weirdest question someone might ask? What would break this? What must we get right?" Synthetic generation helped us scale beyond what we could manually create. We used an LLM to generate diverse test inputs: "Generate 50 customer support questions for a SaaS product, including 10 that are intentionally confusing and 10 that are outside our product scope." Finally, adversarial examples forced us to think about boundaries: prompt injection patterns, extreme edge cases, mixed languages, unusually long inputs.
Each test case included four components: the input (user query or conversation context), expected behavior (what constitutes a "good" response, defined semantically), evaluation criteria (how we'd measure quality), and metadata (tags like "edge case" or "regression" to organize our suite).
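The four-part structure can be sketched as a small data class. This is an illustrative shape, not our exact schema; the field names and example values are assumptions:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the four-part test case described above.
@dataclass
class LLMTestCase:
    input: str                      # user query or conversation context
    expected_behavior: str          # semantic definition of a "good" response
    evaluation_criteria: list[str]  # how quality will be measured
    metadata: dict = field(default_factory=dict)  # tags like "edge case"

case = LLMTestCase(
    input="How do I reset my password?",
    expected_behavior="Gives the reset steps and cites the help article",
    evaluation_criteria=["rule: cites >= 1 source", "judge: helpfulness >= 4/5"],
    metadata={"tags": ["typical", "auth"]},
)
```

Keeping test cases as structured data like this is what makes them easy to version in Git and filter by tag later.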
The messy middle was figuring out how to actually evaluate nondeterministic outputs. We landed on three evaluation methods that we combined based on what we were testing:
Rule-based checks handled hard constraints. Does the output contain PII? Is it under 500 characters? Does it cite at least one source? These were fast, deterministic, and caught obvious violations. For example, our medical advice responses had to include a disclaimer to consult a doctor. Rule-based checking verified that every single time.
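A rule-based layer is just plain assertions over the output string. A minimal sketch, assuming a simple SSN pattern as the stand-in PII check and an illustrative disclaimer phrase:

```python
import re

# Hedged sketch of hard-constraint checks. The disclaimer wording and the
# single PII pattern are illustrative assumptions, not a complete PII scan.
DISCLAIMER = "consult a doctor"
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def rule_checks(output: str) -> dict:
    return {
        "no_ssn": SSN_PATTERN.search(output) is None,
        "under_500_chars": len(output) <= 500,
        "has_disclaimer": DISCLAIMER in output.lower(),
    }

result = rule_checks("Rest and fluids may help, but please consult a doctor.")
```

Because these checks are deterministic, they never flake: a violation fails the test every single run.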
Semantic similarity worked when there was a "right answer" but exact wording didn't matter. We used embedding-based similarity scores to check if the output meant roughly the same thing as an expected response. This worked well for summarization tasks where we cared about capturing key points, not specific phrasing.
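The shape of the check looks like this. In production you would embed both texts with a real embedding model and compare vectors; the bag-of-words cosine below is a self-contained stand-in for illustration:

```python
import math
from collections import Counter

# Stand-in for embedding-based similarity: a real system would call an
# embedding model here. Bag-of-words cosine just shows the check's shape.
def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similar_enough(output: str, reference: str, threshold: float = 0.6) -> bool:
    return cosine(Counter(output.lower().split()),
                  Counter(reference.lower().split())) >= threshold

ok = similar_enough(
    "The report covers revenue growth and churn",
    "the report covers churn and revenue growth",
)
```

The threshold is the knob that matters: too low and factually wrong outputs slip through, too high and valid rephrasings fail.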
LLM-as-a-judge handled subjective criteria like tone, helpfulness, or appropriateness. We gave another LLM the input, output, and evaluation criteria, and it scored the response. For customer support responses, we asked: "On a scale of 1-5, how professional and helpful is this response?" This caught tone drift and unhelpful responses that were technically correct but useless.
No single method caught everything, so we layered them. Our customer support bot tests used rule-based checks to verify no PII leaked, semantic similarity to verify factual correctness, and LLM-as-a-judge to verify tone and helpfulness.
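Layering can be as simple as running every check and failing the case if any layer fails. A minimal sketch where each layer is a callable; the lambda bodies are trivial stand-ins for the real rule checks and judge scoring:

```python
# Hedged sketch: each layer returns pass/fail, and a failure in any
# layer fails the whole test case. The checks here are toy stand-ins.
def evaluate(output: str, layers: dict) -> dict:
    results = {name: check(output) for name, check in layers.items()}
    results["passed"] = all(results.values())
    return results

report = evaluate(
    "Please see article KB-12 for the reset steps.",
    {
        "no_pii": lambda o: "@" not in o,          # stand-in rule check
        "cites_source": lambda o: "KB-" in o,      # stand-in rule check
        "professional": lambda o: "lol" not in o,  # stand-in for judge score
    },
)
```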
We organized testing into four layers, each catching different failure types:
Unit tests verified individual LLM calls. We tested single prompt-response pairs to ensure our system prompt, context formatting, and basic instructions worked. These were fast and caught obvious problems early. When we updated our system prompt to be more concise, unit tests immediately showed us that responses became less helpful.
Functional tests verified complete workflows end-to-end. We tested multi-turn conversations, full RAG pipelines, and agent processes that made multiple LLM calls. This caught integration issues and emergent behaviors we didn't see in unit tests. For example, our RAG system's retrieval looked fine in isolation, but functional tests showed it sometimes ignored contradictory evidence in the retrieved context.
Regression tests ran our full test suite after any change. We established a baseline, made our change (updated prompt, swapped models, modified retrieval logic), ran tests again, compared results, and investigated discrepancies. This caught subtle degradations we'd never notice manually. When we switched from GPT-4 to GPT-4-turbo, regression tests showed that response quality stayed the same but latency improved by 40%.
Production monitoring evaluated real outputs in real-time or batches. We used simple heuristics (response length, keyword presence) for real-time evaluation and more sophisticated LLM-as-a-judge scoring for batch evaluation. This caught failures our test suite missed and fed them back as new test cases.
The infrastructure piece was critical. We needed to version control our test data (stored in Git, reviewed in pull requests), store test results over time (to understand trends and debug regressions), and make results visible to the team (shared dashboard that everyone checked before deploying). Without this infrastructure, tests became something we ran occasionally and ignored when they failed.
The Framework
Here's the mental model that made LLM testing click for our team:
Traditional software testing asks: "Does this output exactly match what I expect?" LLM testing asks: "Does this output meet my semantic criteria for quality, safety, and appropriateness?"
Think of it as testing the behavior, not the implementation. You wouldn't test an image compression algorithm by checking if every pixel matches a reference image. You'd test if the compressed image is visually similar, under a certain file size, and doesn't introduce artifacts. LLM testing works the same way.
The framework has three core principles:
Define "good" semantically, not syntactically. Instead of "the response should be exactly this string," define what makes a response good for your use case. "The summary should capture all key points from the source document." "The customer support response should be professional, helpful, and cite at least one knowledge base article." "The code generation should be syntactically valid and solve the stated problem."
Layer multiple evaluation methods. Use rule-based checks for hard constraints (no PII, required disclaimers, length limits). Use semantic similarity when there's a right answer but phrasing doesn't matter (summarization, fact extraction). Use LLM-as-a-judge for subjective quality (tone, helpfulness, appropriateness). Combine them based on what matters for your specific test.
Build a feedback loop from production to tests. Production failures are your best test cases because they represent real usage and real failure modes. When something breaks in production, add it to your test suite so it never breaks that way again. Your test suite should grow organically as you learn how your system actually fails.
Here's how this looks in practice:
For a RAG system, define semantic criteria like "the response should be grounded in retrieved context" and "citations should point to real sources that actually contain the cited information." Use rule-based checks to verify citations are formatted correctly. Use semantic similarity to verify the response content matches the retrieved context. Use LLM-as-a-judge to verify the response directly answers the question. Test the full pipeline: retrieval → context formatting → generation → citation verification.
For a chatbot, define criteria like "responses should maintain consistent tone across the conversation" and "the bot should handle follow-up questions by referencing previous context." Use rule-based checks to verify no prohibited topics are discussed. Use LLM-as-a-judge to score tone consistency across multiple turns. Test conversation sequences, not just individual responses, because context handling failures only emerge over multiple turns.
For an AI agent, define criteria like "the agent should select appropriate tools for the task" and "the agent should complete the task without getting stuck in loops." Use rule-based checks to verify the agent doesn't exceed token budgets or make unnecessary API calls. Use semantic similarity to verify the final output matches the expected result. Test complete task sequences from start to finish because tool selection issues only emerge when you see the full workflow.
The key insight: You're not testing the LLM itself (OpenAI already did that). You're testing what you built on top of it: your prompts, your context formatting, your retrieval logic, your conversation state management, your tool selection logic. Each of these can fail independently or in combination, and testing catches those failures before users do.
Your Checklist
Here's what to do tomorrow to start testing your LLM application systematically:
Start with 25-30 test cases covering critical paths. Ten typical queries, ten edge cases, five adversarial examples. Don't try to achieve perfect coverage on day one. Twenty-five test cases you actually run regularly beat 500 test cases that sit unused. Source them from production data (if you have it), domain expert input, synthetic generation, and adversarial examples.
Define semantic evaluation criteria for each test. What makes a response "good" for your use case? Be specific. "Professional tone" is vague. "Uses formal language, avoids slang, maintains helpful demeanor even when declining requests" is testable. Write these criteria down. They become your evaluation rubric.
Combine evaluation methods based on what you're testing. Use rule-based checks for hard constraints (no PII, required disclaimers, length limits, format requirements). Use semantic similarity when there's a right answer but phrasing varies (summarization, fact extraction, translation). Use LLM-as-a-judge for subjective quality (tone, helpfulness, appropriateness, style). Layer them so multiple methods evaluate each test case.
Version control your test data like code. Store test cases in Git. Review changes in pull requests. Track modifications over time. This prevents test data from becoming an untraceable mess and makes it easy to understand why tests pass or fail after changes. Treat test cases as documentation of expected behavior.
Store test results over time to track trends. Record the actual outputs, scores, and evaluation results. This historical data helps you understand if the system is improving or degrading, debug regressions (what changed between the version that worked and this one?), and build intuition about your system's behavior patterns.
Build a feedback loop from production to tests. When something breaks in production, add it to your test suite. When users complain about a response, add that scenario as a test case. When you find an edge case you hadn't considered, add it to your test suite. Your test suite should grow organically based on real failures, not theoretical scenarios.
Assign clear ownership of the test suite. Someone needs to own writing new tests, reviewing results, and keeping the suite relevant. Without clear ownership, testing becomes something everyone assumes someone else is doing. Make it part of someone's job description.
Integrate testing into your deployment process. Run tests in CI/CD. Review results before deployment. Make test failures block releases just like traditional tests. If tests fail but you deploy anyway, you've trained your team that tests don't matter.
Set realistic thresholds for passing tests. Not every response needs to score 100%. Understand what "good enough" looks like for your use case. An 85% accuracy rate might be excellent for a low-stakes chatbot and unacceptable for a medical advice system. Set thresholds based on the risk and impact of failures.
Start with unit tests, then add layers. Begin by testing individual LLM calls (unit tests) for your critical prompts. Add functional tests for key workflows once unit tests are stable. Layer in regression testing when you start making regular changes. Add production monitoring when you have real users. You don't need all four layers on day one.
Closing
That database deletion incident I mentioned at the start? It happened because we didn't have functional tests covering our AI agent's tool selection logic. Our unit tests verified that individual tool calls worked correctly. But we never tested the full workflow where the agent had to choose between multiple tools and decide when to stop.
The agent misinterpreted "analyze but don't modify" as "analyze by running modification commands with a dry-run flag." Except our database management tool didn't have a dry-run flag. The agent didn't know that, so it ran the commands anyway.
We caught it in staging, not production, because we'd finally built that feedback loop from production to tests. A user had reported a similar (but less destructive) issue the week before. We added it as a test case. The test failed on the new agent version, which blocked deployment until we fixed the underlying issue.
That's the difference between hoping your LLM application works and knowing it works. Not perfect knowledge, because perfect is impossible with nondeterministic systems. But systematic knowledge: you've tested the critical paths, you catch regressions before deployment, you monitor production for what tests missed, and you feed those failures back into your test suite.
The teams shipping LLM applications confidently aren't the ones with perfect test coverage. They're the ones with systematic processes that catch problems before users do. They've accepted that their test suite will never be "done" because new failure modes emerge constantly. They've built infrastructure that makes testing part of their deployment process, not a nice-to-have that gets skipped under deadline pressure.
The next time your AI does something unexpected, you'll have a test case ready to catch it before it reaches production. And that's the point: not to prevent every possible failure, but to build a system that learns from failures and prevents them from happening twice.