AI is everywhere right now. Everyone is talking about AI testing—more specifically, testing LLMs. We keep hearing that it’s a completely different world, that QA engineers need to reskill, that the main issue is non-determinism, and that testing such systems is nearly impossible. Or that classical testing methods simply don’t apply.
In reality, that's not quite the case. The topic is simply overcomplicated.
Long before AI, we already had systems that behaved non-deterministically. If you’ve worked in AdTech, you know exactly what this means. Whether a banner was shown or not, and to whom, depended on hundreds of parameters and real-time data. The same input did not necessarily produce the same output; in fact, it regularly produced different results.
And somehow, we still tested those systems.
In such systems, logic was verified at a lower level, in isolated environments, through unit and integration tests, with clearly defined boundary values. We tested at the edges: with these inputs, the result must be A; with slightly different inputs, the system transitions into another state. The fact that there was more variation above those boundaries was not surprising; it was expected. That’s how the system was supposed to behave.
The exact same principle applies to LLM-based systems. But first, let’s put things into perspective.
What Are You Actually Testing?
In most cases, you are testing an AI-powered application or system—not the model itself.
You only need to test the model in two cases:
- If your company is building its own model
- If you are actually training or fine-tuning the model with your own data (labeling, human-in-the-loop review, etc.)
In all other cases, you do not need to test the model. Yes, that’s a strict statement—intentionally.
If you’re using OpenAI, Anthropic, or any other cloud model, you are testing the integration. If you’re using an open-source model locally, you are still testing whether it works correctly within your system. That is integration testing. This is a third-party dependency, and the same rules apply as with any other integration.
Separate your application testing from third-party behavior. Mock the model. Build an emulator. Ensure your tests are fast and stable. Your goal is to find when your application fails—not to retest OpenAI.
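To make that concrete, here is a minimal sketch in Python of what such a test can look like, with the model client replaced by a mock. The `SupportBot` class, its `llm_client`, and the `complete` method are hypothetical names used for illustration, not a specific vendor SDK.

```python
# Minimal sketch: test the application around the model, not the model itself.
# SupportBot, llm_client and complete() are hypothetical names for illustration.
from unittest.mock import Mock


class SupportBot:
    def __init__(self, llm_client):
        self.llm_client = llm_client

    def answer(self, question: str) -> str:
        reply = self.llm_client.complete(prompt=question)
        # The application logic we actually want to test: guardrails and formatting.
        if not reply or len(reply) > 2000:
            return "Sorry, I can't help with that right now."
        return reply.strip()


def test_bot_handles_empty_model_reply():
    fake_llm = Mock()
    fake_llm.complete.return_value = ""  # emulate a degenerate model response
    bot = SupportBot(llm_client=fake_llm)
    assert bot.answer("Where is my order?") == "Sorry, I can't help with that right now."


def test_bot_trims_whitespace_from_reply():
    fake_llm = Mock()
    fake_llm.complete.return_value = "  Your order ships tomorrow.  "
    bot = SupportBot(llm_client=fake_llm)
    assert bot.answer("Where is my order?") == "Your order ships tomorrow."
```

Tests like these are fast and deterministic, and when they fail, they fail because of your code, not because of the provider.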
What About Model Updates?
Of course, you need control. Just like with any integration.
If the model version changes, you need to verify that your use case hasn’t been broken. But this is not model testing in the fundamental sense. It’s regression testing: checking whether the integration still works and whether behavior has changed in any critical way.
A small regression dataset, a few key scenarios, a sanity check—and move on. We are not testing the model architecture. We are verifying that our system still works.
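As one possible shape for that, here is a sketch of a tiny regression suite. The example cases, the substring checks, and the injected `ask_model` callable are all assumptions for illustration; in a real setup, `ask_model` would call your actual integration.

```python
# Minimal sketch of a regression sanity check to run after a model or version update.
# The cases and the substring checks are illustrative assumptions, not a standard.
REGRESSION_CASES = [
    {"prompt": "Cancel my subscription", "must_contain": "cancel"},
    {"prompt": "What is your refund policy?", "must_contain": "refund"},
]


def run_regression_suite(ask_model) -> bool:
    """ask_model is the real integration call (SDK, HTTP client, etc.), injected here."""
    failures = []
    for case in REGRESSION_CASES:
        reply = ask_model(case["prompt"]).lower()
        if case["must_contain"] not in reply:
            failures.append(case["prompt"])
    for prompt in failures:
        print(f"Regression failure: {prompt!r}")
    return not failures
```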
The Interesting Part — When You Train the Model Yourself
If you’re using an open-source model and fine-tuning it with your own data, then you’re no longer testing the base model—you’re testing your training. In other words, you’re verifying whether you trained the model correctly.
And here, things are much simpler than they appear.
The goal of testing remains exactly the same as always: find where the system fails.
In the case of LLMs, that means:
- The model gets stuck
- It produces nonsense
- It starts hallucinating
- It becomes unstable with certain inputs
Where does this happen? Most often—at the edges.
Where data is scarce. Where classes overlap. Where context is ambiguous. That’s where the model behaves more randomly. And that’s expected.
A Practical Testing Approach
The method is simple (a minimal sketch follows the list):
- Take a boundary case
- Run it hundreds of times
- Measure the percentage of correct responses
- If the percentage meets your defined threshold, the test passes
- If not, the test fails
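Here is a minimal sketch of that loop. The `classify` callable is a hypothetical wrapper around your fine-tuned model, and both the run count and the 0.85 threshold are placeholders you would tune per case.

```python
# Minimal sketch of a statistical stability test for a boundary case.
# classify() is a hypothetical callable around the fine-tuned model;
# n and the 0.85 threshold are illustrative, not recommendations.
def pass_rate(classify, prompt: str, expected: str, n: int = 200) -> float:
    """Run the same case n times and return the share of correct answers."""
    correct = sum(1 for _ in range(n) if classify(prompt) == expected)
    return correct / n


def test_boundary_case(classify):
    # A deliberately ambiguous input where some variation is expected.
    rate = pass_rate(classify, "Does 'bank' mean a financial institution in: 'I sat by the bank'?", "no")
    assert rate >= 0.85, f"Boundary case stability too low: {rate:.0%}"
```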
The fix? Not code changes. You need more or better-quality training data.
Now take clear, non-boundary cases where the answer should almost always be correct. Run them hundreds of times again. The expected result is close to 99–100%.
If you see significant variation here, you have a problem with training and/or data quality. This is not rocket science—it’s basic statistical stability testing.
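Reusing the hypothetical `pass_rate` helper from the sketch above, the clear cases simply get a stricter threshold:

```python
def test_clear_case(classify):
    # Reuses pass_rate() from the previous sketch.
    # A clear, non-boundary input: expect near-perfect stability (99-100%).
    rate = pass_rate(classify, "Does 'bank' mean a financial institution in: 'I opened a bank account'?", "yes")
    assert rate >= 0.99, f"Clear case should be almost always correct, got {rate:.0%}"
```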
When Everything Becomes Part of a System
Once the AI model becomes part of a larger system, additional challenges appear—performance, monitoring, real-world data, unexpected inputs. But this is just system-level testing—operational testing.
- Does the system work under real conditions?
- Does it handle load?
- Do fallback mechanisms work?
- Security, and so on
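For example, the fallback check can be as simple as emulating a failing model call and verifying that the application degrades gracefully. The `answer_with_fallback` helper, the exception types, and the canned reply below are illustrative assumptions.

```python
# Minimal sketch of one operational check: does a fallback kick in when the model call fails?
def answer_with_fallback(call_model, question: str) -> str:
    try:
        return call_model(question)
    except (TimeoutError, ConnectionError):
        # Degrade gracefully instead of surfacing an infrastructure error.
        return "Sorry, I can't help with that right now."


def test_fallback_when_model_times_out():
    def flaky_model(_question: str) -> str:
        raise TimeoutError("model did not respond")

    assert answer_with_fallback(flaky_model, "Where is my order?").startswith("Sorry")
```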
Fundamentally, this is no different from testing any other complex system.
Final Thought
AI testing is not magic.
LLM non-determinism is not a revolution in testing, nor is it a problem. We have already tested complex, probabilistic systems before.
The key is to clearly separate:
- where you are testing your application
- where you are testing integration
- and where you are actually testing your data
Everything else is the same: the same principles, the same goals, the same discipline.
Of course, if you are building your own AI model from scratch—that’s a different topic.
I’ll leave that for another time.